GSoC'21: Final Report

Matplotlib: Revisiting Text/Font Handling

To kick things off for the final report, here’s a meme to nudge you towards the previous blogs.

About Matplotlib#

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations, and it has become the de facto plotting library for Python.

Much of the implementation behind its font manager is inspired by W3C-compliant algorithms, allowing users to interact with font properties like font-size, font-weight, font-family, etc.

However, the way Matplotlib handled fonts and general text layout was not ideal, which is what Summer 2021 was all about.#

By “not ideal”, I do not mean that the library has design flaws, but that the design was engineered in the early 2000s, and is now outdated.

(...more on this later)

About the Project#

(PS: here’s the link to my GSoC proposal, if you’re interested)

Overall, the project was divided into two major subgoals:

Font Subsetting
Font Fallback

But before we take each of them on, we should cover some basic font terminology (there is a lot of it, and it is understandably confusing).

The PR: Clarify/Improve docs on family-names vs generic-families brings about a bit of clarity about some of these terms. The next section has a linked PR which also explains the types of fonts and how that is relevant to Matplotlib.

Font Subsetting#

An easy-to-read guide on Fonts and Matplotlib was created with PR: [Doc] Font Types and Font Subsetting, which is currently live at Matplotlib’s DevDocs.

Taking an excerpt from one of my previous blogs (and the doc):

Fonts can be considered as a collection of these glyphs, so ultimately the goal of subsetting is to find out which glyphs are required for a certain array of characters, and embed only those within the output.

PDF, PS/EPS and SVG output formats are special in that the text within them can be editable, i.e., one can copy or search text in a document (e.g., a PDF file) if the text is editable.
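To make the glyph-selection idea from the excerpt concrete, here is a minimal sketch in plain Python. It is a toy model, not Matplotlib's subsetting code: the "font" is just a dictionary from code point to a made-up glyph name, and `subset_font` is a hypothetical helper introduced only for illustration.

```python
# Toy model: a "font" maps Unicode code points to glyph names.
# The glyph names below are invented for this example.
TOY_FONT = {
    ord("H"): "glyph_H",
    ord("e"): "glyph_e",
    ord("l"): "glyph_l",
    ord("o"): "glyph_o",
    ord("W"): "glyph_W",
    ord("r"): "glyph_r",
    ord("d"): "glyph_d",
    ord(" "): "glyph_space",
}


def subset_font(font, text):
    """Return a new mapping that keeps only the glyphs needed to draw `text`."""
    needed = {ord(ch) for ch in text}
    # Characters the font does not cover are simply skipped in this sketch.
    return {cp: glyph for cp, glyph in font.items() if cp in needed}


subset = subset_font(TOY_FONT, "Hello")
print(sorted(subset.values()))  # ['glyph_H', 'glyph_e', 'glyph_l', 'glyph_o']
```

The full toy font has eight glyphs, but only four need to be embedded to draw "Hello". A real subsetter also has to carry over the font tables those glyphs depend on, which this sketch ignores.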

Matplotlib and Subsetting#

The PDF, PS/EPS and SVG backends already supported font subsetting, but only for a few font types. That is, before Summer ‘21, Matplotlib could generate Type 3 subsets for the PDF and PS/EPS backends, but it could not generate Type 42 / TrueType subsets.

With PR: Type42 subsetting in PS/PDF merged in, users can expect their PDF/PS/EPS documents to contain subsetted glyphs from the original fonts.

This is especially beneficial for people who wish to use commercial (or CJK) fonts. Licenses for many fonts require subsetting such that they can’t be trivially copied from the output files generated from Matplotlib.

Font Fallback#

Matplotlib was designed to work with a single font at runtime. A user could specify a font.family, which was supposed to correspond to CSS properties, but that was only used to find a single font present on the user’s system.

Once that font was found (and one almost always is, since Matplotlib ships with a set of default fonts), all of the user’s text was rendered through that font alone, which produced “tofu” wherever the font lacked a character.

It might seem like an outdated approach to text rendering, now that we have concepts like font fallback, but these concepts weren’t widely discussed in the early 2000s. Even getting a single font to work was considered a hard engineering problem.
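The single-font behaviour described above can be sketched as follows. This is an illustrative model only: `INSTALLED`, `pick_single_font` and `render` are hypothetical names, the character coverage is deliberately trimmed to ASCII, and a white square stands in for the tofu box.

```python
import string

DEFAULT_FAMILY = "DejaVu Sans"  # the default font bundled with Matplotlib
TOFU = "\u25a1"  # white square, used here as a stand-in for the tofu box

# Hypothetical installed fonts: family name -> characters it can draw.
# The coverage is trimmed to ASCII purely to keep the example small.
INSTALLED = {
    "DejaVu Sans": set(string.ascii_letters + string.digits + " "),
}


def pick_single_font(families):
    """Return the first requested family that is installed, else the default."""
    for family in families:
        if family in INSTALLED:
            return family
    return DEFAULT_FAMILY


def render(text, family):
    """Draw every character with one font; missing characters become tofu."""
    coverage = INSTALLED[family]
    return "".join(ch if ch in coverage else TOFU for ch in text)


family = pick_single_font(["Some Missing Font", "DejaVu Sans"])
print(render("Hi 你好", family))  # Hi □□
```

Only the first match is ever used, so anything later in the requested list cannot help with the characters the chosen font lacks.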

This was primarily because of the lack of any standardization for representation of fonts (Adobe had their own font representation, and so did Apple, Microsoft, etc.)

Previous (notice Tofus) VS After (CJK font as fallback)

To migrate from a font-first approach to a text-first approach, there are multiple steps involved:

Parsing the whole font family#

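The very first (and crucial!) step is to get to a point where we have multiple font paths (ideally individual font files for the whole family). That is achieved with either of these PRs:

[with findfont diff] Parsing all families in font_manager
[without findfont diff] Parsing all families in font_manager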

Quoting one of my previous blogs:

Don’t break, a lot at stake!

My first approach was to change the existing public findfont API to incorporate multiple filepaths. Since Matplotlib has a huge userbase, there was a high chance that this would break many people’s workflows:

[Figure: family-parsing flowchart. First PR (left), Second PR (right)]
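The "don't break the public API" idea can be illustrated with a small sketch. Everything here is hypothetical: `find_font` merely stands in for a public single-path lookup (it does not share a signature with Matplotlib's findfont), `find_fonts_by_family` is an invented name for a new multi-path helper, and the font paths are made up.

```python
# Hypothetical family name -> font file path table (paths are made up).
INSTALLED_FONTS = {
    "DejaVu Sans": "/fonts/DejaVuSans.ttf",
    "Noto Sans CJK JP": "/fonts/NotoSansCJKjp-Regular.otf",
}
DEFAULT_FAMILY = "DejaVu Sans"


def find_fonts_by_family(families):
    """New helper: one path per installed family, in the requested order."""
    paths = [INSTALLED_FONTS[f] for f in families if f in INSTALLED_FONTS]
    # Fall back to the default font when nothing requested is installed.
    return paths or [INSTALLED_FONTS[DEFAULT_FAMILY]]


def find_font(families):
    """Existing-style entry point: callers still get exactly one path."""
    return find_fonts_by_family(families)[0]


print(find_font(["DejaVu Sans", "Noto Sans CJK JP"]))
print(find_fonts_by_family(["DejaVu Sans", "Noto Sans CJK JP"]))
```

Keeping the single-path function's return type unchanged means existing callers are unaffected, while new code that wants fallback can ask for the whole list.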

FT2Font Overhaul#

Once we get a list of font paths, we need to change the internal representation of a “font”. Matplotlib has a utility called FT2Font, which is written in C++ and exposed to Python through a wrapper extension, which in turn is used throughout the backends. For all intents and purposes, it used to mean: FT2Font === SingleFont (if you’re interested, here’s a meme about how FT2Font was named!)

But that is no longer the case; here’s a flowchart to explain what happens now:

[Figure: Font-Fallback Algorithm flowchart]

With PR: Implement Font-Fallback in Matplotlib, every FT2Font object has a std::vector<FT2Font *> fallback_list, which is used for filling the parent cache, as can be seen in the self-explanatory flowchart.

For simplicity, only one type of cache (character -> FT2Font) is shown, whereas the actual implementation has two types of caches: the one shown above, and another for glyphs (glyph_id -> FT2Font).

Note: Only the parent’s APIs are used in some backends, so for each of the individual public functions like load_glyph, load_char, get_kerning, etc., we find the FT2Font object which has that glyph from the parent FT2Font cache!
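The parent-cache lookup described above can be modelled in a few lines of Python. This is a rough sketch, not the C++ implementation: `ToyFont`, `font_for_char` and `char_cache` are hypothetical names, only the character cache is modelled, and attributing an uncovered character to the parent (which would then draw tofu) is an assumption made for this example.

```python
class ToyFont:
    """Python stand-in for an FT2Font-like object (all names are hypothetical)."""

    def __init__(self, name, chars, fallbacks=()):
        self.name = name
        self.chars = set(chars)           # characters this font file covers
        self.fallbacks = list(fallbacks)  # plays the role of the fallback list
        self.char_cache = {}              # character -> ToyFont that covers it

    def font_for_char(self, ch):
        """Find the font that can draw `ch`, filling the parent's cache."""
        if ch in self.char_cache:
            return self.char_cache[ch]
        # Try the parent first, then each fallback in order.
        for font in [self, *self.fallbacks]:
            if ch in font.chars:
                self.char_cache[ch] = font
                return font
        # Assumption: if nobody covers the character, the parent handles it.
        self.char_cache[ch] = self
        return self

    def load_char(self, ch):
        """Parent-level API: delegate to whichever font owns the character."""
        return f"{self.font_for_char(ch).name}:{ch}"


cjk = ToyFont("CJK", "你好")
parent = ToyFont("Latin", "abc", fallbacks=[cjk])
print(parent.load_char("a"))   # Latin:a
print(parent.load_char("你"))  # CJK:你
```

Callers only ever talk to `parent`, which mirrors the note above: the backends keep using the parent's functions, and the cache decides which underlying font actually serves each character.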

Multi-Font embedding in PDF/PS/EPS#

Now that we have multiple fonts to render a string, we also need to embed them for those special backends (i.e., PDF/PS, etc.). This was done with some patches to specific backends:

Implement multi-font embedding for PDF Backend
Implement multi-font embedding for PS Backend

With this, one could create a PDF or a PS/EPS document with multiple fonts which are embedded (and subsetted!).
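One way to picture what multi-font embedding needs is to work out, for each font, which characters of the string it actually draws, so that each font can be subsetted and embedded separately. The sketch below is an assumption about the general shape of the problem, not the backend code: `chars_per_font` is a hypothetical helper, and fonts are modelled as (name, coverage) pairs.

```python
def chars_per_font(text, fonts):
    """Map each font name to the characters of `text` it has to embed.

    `fonts` is an ordered list of (name, coverage) pairs; the first font
    that covers a character wins, mimicking a fallback order.
    """
    used = {}
    for ch in text:
        for name, coverage in fonts:
            if ch in coverage:
                used.setdefault(name, set()).add(ch)
                break  # stop at the first font that covers this character
    return used


fonts = [
    ("Latin", set("Helo ")),  # hypothetical coverage, kept tiny on purpose
    ("CJK", set("世界")),
]
result = chars_per_font("Hello 世界", fonts)
print(sorted(result))  # ['CJK', 'Latin']
```

Each entry of `result` is the input a per-font subsetting step would need, so the output document embeds two small subsets instead of two full fonts.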

Conclusion#

From small contributions to eventually working on a core module of such a huge library, the road was not what I had imagined, and I learnt a lot while designing solutions to these problems.

The work I did would eventually end up affecting every single Matplotlib user.#

…since all plots will work their way through the new codepath!

I think that single statement is worth the whole GSoC project.

Pull Request Statistics#

For the sake of statistics (and to make GSoC sound a bit less intimidating), here’s a list of contributions I made to Matplotlib before Summer ‘21, most of which are only a few lines of diff:

Created At	PR Title	Diff	Status
Nov 2, 2020	Expand ScalarMappable.set_array to accept array-like inputs	(+28 −4)	MERGED
Nov 8, 2020	Add overset and underset support for mathtext	(+71 −0)	MERGED
Nov 14, 2020	Strictly increasing check with test coverage for streamplot grid	(+54 −2)	MERGED
Jan 11, 2021	WIP: Add support to edit subplot configurations via textbox	(+51 −11)	DRAFT
Jan 18, 2021	Fix over/under mathtext symbols	(+7,459 −4,169)	MERGED
Feb 11, 2021	Add overset/underset whatsnew entry	(+28 −17)	MERGED
May 15, 2021	Warn user when mathtext font is used for ticks	(+28 −0)	MERGED

Here’s a list of PRs I opened during Summer'21:

[Status: ✅] Clarify/Improve docs on family-names vs generic-families
[Status: ✅] Add parse_math in Text and default it False for TextBox
[Status: ✅] Type42 subsetting in PS/PDF
[Status: ✅] [Doc] Font Types and Font Subsetting
[Status: 🚧] [with findfont diff] Parsing all families in font_manager
[Status: 🚧] [without findfont diff] Parsing all families in font_manager
[Status: 🚧] Implement Font-Fallback in Matplotlib
[Status: 🚧] Implement multi-font embedding for PDF Backend
[Status: 🚧] Implement multi-font embedding for PS Backend

Acknowledgements#

From learning about software engineering fundamentals from Tom to learning about nitty-gritty details about font representations from Jouni;

From learning through Antony’s patches and pointers to receiving amazing feedback on these blogs from Hannah, it has been an adventure! 💯

Special Mentions: Frank, Srijan and Atharva for their helping hands!

And lastly, you, the reader; if you’ve been following my previous blogs, or if you’ve landed at this one directly, I thank you nevertheless. (one last meme, I promise!)

I know I speak for every developer out there, when I say it means a lot when you choose to look at their journey or their work product; it could as well be a tiny website, or it could be as big as designing a complete library!

I’m grateful to Matplotlib (under the parent organisation: NumFOCUS), and of course, Google Summer of Code for this incredible learning opportunity.

Farewell, reader! :’)

Consider contributing to Matplotlib (and Open Source in general) ❤️