Scraper regression tester? #8

estaub · 2019-10-12T19:05:34Z

estaub
Oct 12, 2019

In working on scrapers I often find that they could use some serious refactoring, but I'm too scared of breaking something to take it on. It seems like some kind of regression testing framework might help.

I'm thinking of a framework that would run a current subscraper, then run the new subscraper, and compare for unexpected differences. By "subscraper" I mean e.g. a bill scraper, vote scraper, people scraper, or whatever.

This would obviously be for manual use during development, not for automated runs such as Travis CI.

Does such a thing exist already somewhere?

This may belong better in pupa; I think it makes sense to rough it out here, first, though.

showerst · 2019-10-15T13:56:52Z

showerst
Oct 15, 2019
Collaborator

I don't think anything like this exists right now -- there are sort of two different use cases:

1. Refactoring the existing code without adding/removing. -- One way to do this is to have both versions output JSON files with --scrape and then use a script to compare the outputs. It's slightly hampered by Python's JSON encoding not being deterministic (it often outputs objects with keys in a different order from run to run), but I think they may have fixed this in 3.7? I usually do this by hand by side-by-siding them in VS Code, but if you ran them through a predictable JSON serializer you could just MD5 hash 'em. Even without that, you could write a script that parses both and walks the keys. I believe GH has some code to alter the JSON outputter to always emit keys in the same order, I'll look for that.
2. Adding new bits -- Similar to 1, except we couldn't just hash; it would have to be a 'walk the keys and highlight differences' approach.
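A rough sketch of both ideas, for illustration only: `canonical_digest` hashes a key-sorted JSON encoding (the standard library's `json.dumps` accepts `sort_keys=True`), and `walk_diff` walks two parsed objects and reports where they differ. The function names, the `MISSING` marker, and the path notation are made up for this example and are not part of pupa or any existing tool. MD5 is used here only as a fingerprint, not for security.

```python
import hashlib
import json

# Hypothetical marker for "this key or list item exists on one side only".
MISSING = "<missing>"


def canonical_digest(obj):
    """Hash an object via a key-sorted, whitespace-free JSON encoding."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def walk_diff(old, new, path=""):
    """Yield (path, old_value, new_value) for each place the inputs differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Visit the union of keys in sorted order so output is stable.
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else str(key)
            if key not in old:
                yield child, MISSING, new[key]
            elif key not in new:
                yield child, old[key], MISSING
            else:
                yield from walk_diff(old[key], new[key], child)
    elif isinstance(old, list) and isinstance(new, list):
        # Lists are compared position by position; order matters here.
        for i in range(max(len(old), len(new))):
            child = f"{path}[{i}]"
            if i >= len(old):
                yield child, MISSING, new[i]
            elif i >= len(new):
                yield child, old[i], MISSING
            else:
                yield from walk_diff(old[i], new[i], child)
    elif old != new:
        yield path, old, new
```

For case 1, matching digests mean the two outputs are identical up to key order. For case 2, the tuples from `walk_diff` can be printed and reviewed by hand. Note that lists are compared by position, so a scraper that emits the same items in a different order would show up as a difference.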


estaub · 2019-10-15T17:09:51Z

estaub
Oct 15, 2019
Author

@showerst Thanks. I've done similar manual things, but of course I'm looking to scale up.

I was thinking of two parsing passes.
A first pass would build up a mapping of pseudo-ids to filenames for the files in each fileset.
The two sets of pseudo-ids would be diffed and any differences reported.

Using the maps from the first pass, a second pass would diff each pair of pseudo-id-matching files and report differences, ignoring timestamps.

Diffing could be done by something like deepdiff.

Does that make sense?
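The two passes described above might look roughly like this. It is a sketch under stated assumptions, not an existing tool: each run is assumed to leave a flat directory of `*.json` files, the fields in `KEY_FIELDS` are assumed to identify a record across runs, and the fields in `IGNORED_FIELDS` are assumed to change on every run. All of those names are hypothetical and would need adjusting to the real output. Plain equality stands in for deepdiff in the second pass.

```python
import json
from pathlib import Path

# Hypothetical field names; adjust to what the scraper output really contains.
KEY_FIELDS = ("legislative_session", "identifier")
IGNORED_FIELDS = frozenset({"_id", "created_at", "updated_at"})


def record_key(record, key_fields=KEY_FIELDS):
    """Build a stable, sortable string key from the identifying fields."""
    return json.dumps({f: record.get(f) for f in key_fields}, sort_keys=True)


def build_index(directory, key_fields=KEY_FIELDS):
    """Pass 1: map each record's key to the file that holds it."""
    index = {}
    for path in sorted(Path(directory).glob("*.json")):
        record = json.loads(path.read_text())
        # If two files share a key, the later file silently wins.
        index[record_key(record, key_fields)] = path
    return index


def strip_ignored(value, ignored=IGNORED_FIELDS):
    """Recursively drop fields that are expected to change between runs."""
    if isinstance(value, dict):
        return {
            k: strip_ignored(v, ignored)
            for k, v in value.items()
            if k not in ignored
        }
    if isinstance(value, list):
        return [strip_ignored(v, ignored) for v in value]
    return value


def compare_filesets(old_dir, new_dir):
    """Report keys unique to either side, then keys whose content differs."""
    old_index = build_index(old_dir)
    new_index = build_index(new_dir)
    report = {
        "only_in_old": sorted(set(old_index) - set(new_index)),
        "only_in_new": sorted(set(new_index) - set(old_index)),
        "changed": [],
    }
    # Pass 2: compare each matched pair with the ignored fields removed.
    for key in sorted(set(old_index) & set(new_index)):
        old = strip_ignored(json.loads(old_index[key].read_text()))
        new = strip_ignored(json.loads(new_index[key].read_text()))
        if old != new:
            report["changed"].append(key)
    return report
```

Swapping the `old != new` check for deepdiff or a hand-written key walker would show what changed in each pair, not only that something did.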


jamesturk · 2020-07-10T01:47:09Z

jamesturk
Jul 10, 2020
Maintainer

hoping to integrate this as part of openstates/issues#85

