Matplotlib: Revisiting Text/Font Handling

To kick things off for the final report, here’s a meme as a nudge towards the previous blogs.

About Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations, and it has become the de facto plotting library for Python.

Much of the implementation behind its font manager is inspired by W3C compliant algorithms, allowing users to interact with font properties like font-size, font-weight, font-family, etc.

However, the way Matplotlib handled fonts and general text layout was not ideal, which is what Summer 2021 was all about.

By “not ideal”, I do not mean that the library has design flaws, but that the design was engineered in the early 2000s, and is now outdated.

(..more on this later)

About the Project

(PS: here’s the link to my GSoC proposal, if you’re interested)

Overall, the project was divided into two major subgoals:

  1. Font Subsetting
  2. Font Fallback

But before we take each of them on, we should get an idea of some basic font terminology (there is a lot of it, and it is understandably confusing).

The PR: Clarify/Improve docs on family-names vs generic-families brings about a bit of clarity about some of these terms. The next section has a linked PR which also explains the types of fonts and how that is relevant to Matplotlib.

Font Subsetting

An easy-to-read guide on Fonts and Matplotlib was created with PR: [Doc] Font Types and Font Subsetting, which is currently live at Matplotlib’s DevDocs.

Taking an excerpt from one of my previous blogs (and the doc):

Fonts can be considered as a collection of these glyphs, so ultimately the goal of subsetting is to find out which glyphs are required for a certain array of characters, and embed only those within the output.

PDF, PS/EPS and SVG output formats are special in that the text within them can stay editable, i.e., one can copy or search the text in such a document (e.g., in a PDF file).
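To make the idea in the excerpt concrete, here is a minimal sketch of "find the glyphs a string needs". It is not Matplotlib's code: `TOY_CMAP`, its glyph ids, and `glyphs_needed` are all made up for illustration, whereas a real font stores this mapping in its character-map table.

```python
# Toy character map: code point -> glyph id. The ids below are invented for
# illustration; a real font keeps this mapping in its "cmap" table.
TOY_CMAP = {ord("H"): 43, ord("e"): 72, ord("l"): 79, ord("o"): 82, ord("!"): 4}

NOTDEF_GLYPH = 0  # glyph 0 is conventionally ".notdef" (what shows up as tofu)


def glyphs_needed(text, cmap):
    """Return the sorted glyph ids required to draw `text`."""
    needed = {NOTDEF_GLYPH}  # keep .notdef so missing characters still draw
    for char in text:
        needed.add(cmap.get(ord(char), NOTDEF_GLYPH))
    return sorted(needed)


subset = glyphs_needed("Hello!", TOY_CMAP)
print(subset)  # [0, 4, 43, 72, 79, 82]
```

Only those six glyphs would have to be embedded in the output, instead of every glyph the font contains.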

Matplotlib and Subsetting

The PDF, PS/EPS and SVG backends already supported font subsetting, but only for some font types. In other words, before Summer ‘21, Matplotlib could generate Type 3 subsets for the PDF and PS/EPS backends, but it could not generate Type 42 / TrueType subsets.

With PR: Type42 subsetting in PS/PDF merged in, users can expect their PDF/PS/EPS documents to contain only a subset of the glyphs from the original fonts.
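For readers who want to try this, a small sketch of requesting Type 42 output follows. It relies on the `pdf.fonttype` and `ps.fonttype` rcParams (Type 3 is the default); whether the embedded font actually ends up subsetted depends on the installed Matplotlib version including the PR above, which this sketch simply assumes.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Ask the PDF and PS backends for Type 42 (TrueType) embedding instead of the
# default Type 3.
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42

fig, ax = plt.subplots()
ax.set_title("Only the glyphs used here need to be embedded")

# Write the PDF into memory; passing a file name works the same way.
pdf_bytes = io.BytesIO()
fig.savefig(pdf_bytes, format="pdf")
plt.close(fig)

print(len(pdf_bytes.getvalue()), "bytes of PDF written")
```

Comparing the file size with and without a large (for example CJK) font in use is a quick, informal way to see the effect of subsetting.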

This is especially beneficial for people who wish to use commercial (or CJK) fonts. Licenses for many fonts require subsetting, so that the fonts can’t be trivially copied out of the files Matplotlib generates.

Font Fallback

Matplotlib was designed to work with a single font at runtime. A user could specify a font.family, which was supposed to correspond to CSS properties, but that was only used to find a single font present on the user’s system.

Once that font was found (and one almost always is, since Matplotlib ships with a set of default fonts), all of the user’s text was rendered through that font alone, which produced “tofu” whenever a character was missing from it.
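The "one family list in, one font out" behaviour can be seen with the public `findfont` function. In this sketch "No Such Font" is a deliberately missing, made-up family name, and "DejaVu Sans" is used because it is among the fonts Matplotlib ships with; the exact path printed will differ between systems.

```python
import os

from matplotlib.font_manager import FontProperties, findfont

# A CSS-like list of families. "No Such Font" is a hypothetical name that
# should not exist on any system; "DejaVu Sans" ships with Matplotlib.
props = FontProperties(family=["No Such Font", "DejaVu Sans"], weight="normal")

# findfont() resolves the whole list to exactly one font file path.
path = findfont(props)
print(os.path.basename(path))
```

However long the family list is, the result is a single file, so any character outside that file has nowhere else to be looked up.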


It might seem like an outdated approach to text rendering, now that we have concepts like font fallback, but these concepts weren’t widely discussed in the early 2000s. Even getting a single font to work was considered a hard engineering problem.

This was primarily because of the lack of any standardization for representation of fonts (Adobe had their own font representation, and so did Apple, Microsoft, etc.)

Previous (notice Tofus) VS After (CJK font as fallback)

To migrate from a font-first approach to a text-first approach, there are multiple steps involved:

Parsing the whole font family

The very first (and crucial!) step is to get to a point where we have multiple font paths (ideally individual font files for the whole family). That is achieved with either:

Quoting one of my previous blogs:

Don’t break, a lot at stake!

My first approach was to change the existing public findfont API to return multiple file paths. Since Matplotlib has a huge user base, there’s a high chance that this would break a good number of people’s workflows:

Family-parsing flowchart: First PR (left), Second PR (right)
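One possible way to respect the "don't break" concern is to leave the single-path function exactly as it is and add a separate helper that returns several paths. The sketch below only illustrates that idea: `find_font`, `find_fonts`, the `INSTALLED` registry and its paths are hypothetical names and are not Matplotlib's actual API.

```python
# Toy registry standing in for the fonts installed on a system
# (family name -> file path). All entries are made up.
INSTALLED = {
    "DejaVu Sans": "/fonts/DejaVuSans.ttf",
    "Noto Sans CJK JP": "/fonts/NotoSansCJKjp-Regular.otf",
}
DEFAULT_FAMILY = "DejaVu Sans"


def find_font(families):
    """Existing-style function: always returns exactly one path (unchanged)."""
    for family in families:
        if family in INSTALLED:
            return INSTALLED[family]
    return INSTALLED[DEFAULT_FAMILY]


def find_fonts(families):
    """New, separate helper: every matching path, in priority order."""
    paths = [INSTALLED[f] for f in families if f in INSTALLED]
    # Always include the default so there is at least one usable font.
    if INSTALLED[DEFAULT_FAMILY] not in paths:
        paths.append(INSTALLED[DEFAULT_FAMILY])
    return paths


families = ["Missing Family", "Noto Sans CJK JP", "DejaVu Sans"]
print(find_font(families))   # one path, as callers already expect
print(find_fonts(families))  # a list of paths for the fallback machinery
```

Existing callers of the single-path function keep getting a string, while new code that understands fallback can opt in to the list.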

FT2Font Overhaul

Once we get a list of font paths, we need to change the internal representation of a “font”. Matplotlib has a utility called FT2Font, which is written in C++ and exposed to Python as an extension through wrappers; it is used throughout the backends. For all intents and purposes, it used to mean: FT2Font === SingleFont (if you’re interested, here’s a meme about how FT2Font was named!)

But that is not the case anymore; here’s a flowchart to explain what happens now:

Flowchart: Font-Fallback Algorithm

With PR: Implement Font-Fallback in Matplotlib, every FT2Font object has a std::vector<FT2Font *> fallback_list, which is used for filling the parent cache, as can be seen in the self-explanatory flowchart.

For simplicity, only one type of cache (character -> FT2Font) is shown, whereas the actual implementation has two types of caches: the one shown above, and another one for glyphs (glyph_id -> FT2Font).

Note: Only the parent’s APIs are used in some backends, so for each of the individual public functions like load_glyph, load_char, get_kerning, etc., we find the FT2Font object which has that glyph from the parent FT2Font cache!

Multi-Font embedding in PDF/PS/EPS

Now that we have multiple fonts to render a string, we also need to embed them for those special backends (PDF, PS/EPS, etc.). This was done with some patches to specific backends:

With this, one could create a PDF or a PS/EPS document with multiple fonts which are embedded (and subsetted!).
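Conceptually, embedding several subsetted fonts means tracking which characters each font ended up drawing. The sketch below shows only that bookkeeping, under my own assumptions: `chars_per_font` and the toy font names are hypothetical, and the real backends work with FT2Font objects rather than plain sets.

```python
from collections import defaultdict


def chars_per_font(text, fonts):
    """Group the characters of `text` by the first font that supports them.

    `fonts` is an ordered list of (name, supported_characters) pairs; the
    first entry is the primary font and the rest are fallbacks.
    """
    used = defaultdict(set)
    primary_name = fonts[0][0]
    for char in text:
        for name, supported in fonts:
            if char in supported:
                used[name].add(char)
                break
        else:
            # No font has it: charge it to the primary font (drawn as tofu).
            used[primary_name].add(char)
    return {name: "".join(sorted(chars)) for name, chars in used.items()}


fonts = [("ToyLatin", set("abcdefghijklmnopqrstuvwxyz ")), ("ToyCJK", set("你好"))]
print(chars_per_font("hello 你好", fonts))
# {'ToyLatin': ' ehlo', 'ToyCJK': '你好'}
```

Each entry in the result corresponds to one embedded font, and its characters are the input to that font's subset.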

Conclusion

From small contributions to eventually working on a core module of such a huge library, the road was not what I had imagined, and I learnt a lot while designing solutions to these problems.

The work I did would eventually end up affecting every single Matplotlib user.

…since all plots will work their way through the new codepath!

I think that single statement is worth the whole GSoC project.

Pull Request Statistics

For the sake of statistics (and to make GSoC sound a bit less intimidating), here’s a list of contributions I made to Matplotlib before Summer ‘21, most of which are only a few lines of diff:

| Created At   | PR Title                                                             | Diff            | Status |
|--------------|----------------------------------------------------------------------|-----------------|--------|
| Nov 2, 2020  | Expand ScalarMappable.set_array to accept array-like inputs          | +28 −4          | MERGED |
| Nov 8, 2020  | Add overset and underset support for mathtext                        | +71 −0          | MERGED |
| Nov 14, 2020 | Strictly increasing check with test coverage for streamplot grid     | +54 −2          | MERGED |
| Jan 11, 2021 | WIP: Add support to edit subplot configurations via textbox          | +51 −11         | DRAFT  |
| Jan 18, 2021 | Fix over/under mathtext symbols                                      | +7,459 −4,169   | MERGED |
| Feb 11, 2021 | Add overset/underset whatsnew entry                                  | +28 −17         | MERGED |
| May 15, 2021 | Warn user when mathtext font is used for ticks                       | +28 −0          | MERGED |

Here’s a list of PRs I opened during Summer'21:

Acknowledgements

From learning about software engineering fundamentals from Tom to learning about nitty-gritty details about font representations from Jouni;

From learning through Antony’s patches and pointers to receiving amazing feedback on these blogs from Hannah, it has been an adventure! 💯

Special Mentions: Frank, Srijan and Atharva for their helping hands!

And lastly, you, the reader; if you’ve been following my previous blogs, or if you’ve landed at this one directly, I thank you nevertheless. (one last meme, I promise!)

I know I speak for every developer out there, when I say it means a lot when you choose to look at their journey or their work product; it could as well be a tiny website, or it could be as big as designing a complete library!


I’m grateful to Matplotlib (under the parent organisation: NumFOCUS), and of course, Google Summer of Code for this incredible learning opportunity.

Farewell, reader! :’)

Consider contributing to Matplotlib (and Open Source in general) ❤️

NOTE: This blog post is also available at my personal website.