JFYI: matplotlib image differ tests #1

Open

jankatins opened this issue Apr 2, 2016 · 4 comments

@jankatins

This is mainly JFYI because it came up on Twitter: matplotlib has a similar system in place for unit testing their images. It is also used in downstream packages like seaborn. The system is based on comparing raster images: it compares the rasterized output of the svg, tiff and ps backends to a baseline png which is included in the repo. Rasterization is done with Ghostscript. I suspect that the rasterize step is there because SVGs can produce the same visual result but have different internal representations (e.g. when plotting a point and a line, AFAIK the XML can contain point -> line as well as line -> point).

The workflow is:

  • write a test case with a name in a test file
  • run once -> fails due to missing baseline images and produces a png image "result_images/testfile/name.png"
  • compare that image with the image you expect
  • if it looks fine: copy the output to the baseline directory
  • run again -> the baseline image is found and the plot is compared by drawing it on three backends, saving the results (png+ps+svg), rasterizing svg+ps, and comparing the rasterized images to the baseline image (a rough sketch of that step follows below)
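
For illustration, here is a minimal sketch of what that rasterize-and-compare step could look like. This is not matplotlib's actual code; the Ghostscript flags, the PIL/numpy RMS metric and the function names are assumptions made for the example:

# Hypothetical sketch of the "rasterize, then compare pixels" step.
import subprocess
import numpy as np
from PIL import Image

def rasterize_with_ghostscript(ps_or_pdf_path, png_path, dpi=100):
    # Render a ps/pdf file to a PNG via Ghostscript (flags are an assumption).
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
         f"-r{dpi}", f"-sOutputFile={png_path}", ps_or_pdf_path],
        check=True,
    )

def rms_difference(baseline_png, result_png):
    # Root-mean-square pixel difference of the two rasters; 0 means identical.
    expected = np.asarray(Image.open(baseline_png).convert("RGB"), dtype=float)
    actual = np.asarray(Image.open(result_png).convert("RGB"), dtype=float)
    return np.sqrt(np.mean((expected - actual) ** 2))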

From my experience with this:

  • The tests should try very hard to make the installed fonts the same on all test systems (e.g. Bitstream Vera or something else that can be expected to be available on dev machines and on travis/...; remove any fallbacks in the config; matplotlib actually ships a font inside the package to have a reliable default)
  • The outputs are not always completely the same on different systems (e.g. different antialiasing strategies on linux/windows) -> matplotlib has a tolerance parameter for the comparison, but recently tried very hard to get all tolerances down to zero and was almost successful (it got worse again when automatic windows tests were introduced).
  • mpl usually removes any text from a plot before it is drawn (a parameter to the comparison function), so different text rendering of axis labels on different systems is not a source of failures...
  • If the tolerance is not zero, it's probably best to build plots which look ugly, e.g. by increasing the size of printed dots, because small dots can end up in totally different positions than expected without the difference being caught, due to the tolerance...
  • To reproduce errors on travis/appveyor it's nice if the code spits out a directory which contains the images (+ baseline + diff + an html page with the images placed side by side for visual inspection), so it can be uploaded (travis) or saved as an artifact (appveyor); a sketch of such a report follows below
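
As an illustration of that last point, a small sketch of writing such a side-by-side report (the function name and file layout are made up for this example):

# Hypothetical sketch: collect baseline, result and diff images into one
# directory together with an HTML page showing them side by side, so the
# directory can be uploaded from CI for visual inspection.
import shutil
from pathlib import Path

def write_failure_report(out_dir, baseline_png, result_png, diff_png):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cells = []
    for label, src in [("baseline", baseline_png),
                       ("result", result_png),
                       ("diff", diff_png)]:
        dest = out / f"{label}.png"
        shutil.copy(src, dest)
        cells.append(f"<td><h3>{label}</h3><img src='{dest.name}'></td>")
    (out / "index.html").write_text(f"<table><tr>{''.join(cells)}</tr></table>")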

A test looks like this:

import matplotlib.pyplot as plt
from matplotlib.testing.decorators import image_comparison

@image_comparison(baseline_images=['log_scales'], remove_text=True)
def test_log_scales():
    ax = plt.subplot(122, yscale='log', xscale='symlog')

    ax.axvline(24.1)
    ax.axhline(24.1)

-> this tests all three image formats (no extensions=['png'] given), has a tolerance of 0 (no tol=x given), and removes the text. baseline_images is a list because you can have multiple plots in one test (which is IMO not a nice feature...). A variant with explicit parameters is sketched below.
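
For contrast, a hedged sketch of a test that restricts the formats and allows a small tolerance (the test name, data and values are made up for the example; extensions and tol are real parameters of the decorator):

# Hypothetical variant: restrict to the png backend and allow a small tolerance.
@image_comparison(baseline_images=['big_dots'], extensions=['png'],
                  tol=0.5, remove_text=True)
def test_big_dots():
    ax = plt.subplot(111)
    ax.plot([1, 2, 3], [3, 1, 2], 'o', markersize=20)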

The main part is here: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/testing/compare.py#L268 (mpl is BSD-licensed)
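
That decorator sits on top of matplotlib.testing.compare.compare_images, which can also be called directly; a minimal sketch (the file paths are made up):

# compare_images returns None if the images match within tol, otherwise an
# error message describing the difference (a dict when in_decorator=True).
from matplotlib.testing.compare import compare_images

result = compare_images('baseline/log_scales.png',
                        'result_images/log_scales.png', tol=0)
if result is not None:
    print(result)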

CC: @hrbrmstr because twitter... :-)

@hrbrmstr

hrbrmstr commented Apr 3, 2016

Nice. Man, I wish there were some other way on both Python and R to not use legacy Linux font libs (i.e. a nice, modern, cross-platform font lib that supports OTF would be epic)

@lionel-
Member

lionel- commented Apr 3, 2016

Thanks for your insights Jan.

My main goal with the initial release of vdiffr is to offer a convenient UI for writing visual tests with testthat and managing failed cases with a workflow based on a Shiny app:

[screenshot: vdiffr Shiny app]

I chose to compare SVG files mainly for convenience. As good as svglite is, it does not offer a completely accurate rendition of R plots. But in most cases, complete accuracy is not necessary for the purpose of testing regressions. I wrote vdiffr with ggplot2 extensions in mind, which are more oriented towards data exploration than creating graphics for publication. The advantage of SVG is that I don't have to deal with tolerance.

It's certainly possible to add backends though. I like how you apply different testing strategies in one go.

@jimhester
Member

@JanSchulz Winston's vtest uses ImageMagick to compare raster images with a tolerance threshold, which seems to be closer to what you had in mind. See https://github.com/hadley/ggplot2/tree/master/visual_test for usage in ggplot2.
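
For reference, a rough sketch of driving ImageMagick's compare tool from Python (the AE metric and the threshold handling are assumptions for the example, not vtest's actual settings):

# Hypothetical sketch: ImageMagick's `compare` writes a difference image and
# prints the chosen metric (AE = count of differing pixels) to stderr.
import subprocess

def images_match(baseline_png, result_png, diff_png, max_differing_pixels=0):
    proc = subprocess.run(
        ["compare", "-metric", "AE", baseline_png, result_png, diff_png],
        capture_output=True, text=True,
    )
    differing_pixels = float(proc.stderr.split()[0])
    return differing_pixels <= max_differing_pixels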

@clauswilke

This is an old issue, but since it's still open I'll add my two cents: I have found the comparison of SVGs extremely valuable. The one thing I can do with SVGs that I can't do with raster images is diff the new image against the old one and hunt down exactly what has changed. I do this regularly, in particular when I don't see a difference visually but vdiffr tells me the images aren't the same. I find it helpful to understand why vdiffr thinks the images are different and what in the code changed to cause those differences. With raster images, you're mostly flying blind.

Example: this is a case where the visual tests failed because changes in the calculation of axis tick locations resulted in slightly different locations for the ticks and labels.
tidyverse/ggplot2@51c6d53#diff-c75903e4bd3c74e786f3b2825a1a804f
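
A minimal sketch of that kind of SVG diffing (just a plain-text unified diff of the two files; this is not vdiffr's own mechanism):

# Hypothetical sketch: SVG output is plain text, so a unified diff of the
# baseline against the new output often points straight at what changed
# (e.g. shifted tick positions or reordered elements).
import difflib

def diff_svgs(baseline_svg, new_svg):
    with open(baseline_svg) as old_file, open(new_svg) as new_file:
        return "".join(difflib.unified_diff(
            old_file.readlines(), new_file.readlines(),
            fromfile=baseline_svg, tofile=new_svg))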
