1058 filter invalid json files #1066
… into 1058-filter-invalid
Did you do crawls for the individual spiders? I would be interested to know whether the invalid JSON files are a small or large proportion. (You can do a sample of 100 or so.)
From #964
I assume you are okay with closing the issue and leaving those limits for now.
- Rename check_json_format to validate_json
- Rename invalid_format to invalid_json, as invalid format can mean e.g. using XML instead of JSON
- Move invalid_json to KingfisherFileItem, so that it is not available on FileError
- Re-order documentation, class attributes and item fields to match processing order
…to log the key used for deduplication in the case of invalid JSON
a55017d to 8e495ae
8e495ae to 66540c1
- use the spider's logger to log what we dropped (like in LogFormatter)
- I think we also wanted to update the spider's stats with the number of invalid JSON items.
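The two suggestions above could be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: the function name `drop_invalid_json`, the item keys `file_name` and `data`, and the stats key `invalid_json_count` are all hypothetical. A `Counter` stands in for the spider's stats, and a standard `logging` logger stands in for `spider.logger`; in a real Scrapy component the counter would presumably be updated with `crawler.stats.inc_value(...)` instead.

```python
import json
import logging
from collections import Counter

# Stand-in for spider.logger (assumption: the real code logs via the spider).
logger = logging.getLogger("example_spider")


def drop_invalid_json(items, stats, log=logger):
    """Yield only the items whose data parses as JSON; log and count the rest."""
    for item in items:
        try:
            json.loads(item["data"])
        except ValueError as e:  # JSONDecodeError and UnicodeDecodeError both subclass ValueError
            # Log what was dropped, so that it shows up in the crawl log.
            log.warning("Dropped invalid JSON: %s (%s)", item["file_name"], e)
            # Hypothetical stats key; with Scrapy this would be a stats collector call.
            stats["invalid_json_count"] += 1
            continue
        yield item


# Made-up sample items: one valid, one truncated.
sample_items = [
    {"file_name": "a.json", "data": b'{"ocid": "example-1"}'},
    {"file_name": "b.json", "data": b'{"ocid": '},
]
sample_stats = Counter()
kept = list(drop_invalid_json(sample_items, sample_stats))
```

Counting in the same place as logging keeps the two in step, so the number in the stats matches the number of warnings in the log.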
ab7d02a to c047bbd
I tested it with two Nigerian states: one dropped 1 file, and the other dropped 2. I could try with all the affected spiders if you want, or do it in the registry directly.
This approach fails when the data is compressed, because the middleware runs before the data is decompressed, e.g. with …
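To illustrate the ordering problem described above, here is a small standalone sketch using only the standard library. It assumes gzip compression purely for the sake of the example, and `is_valid_json` is a hypothetical helper, not a function from this PR. Validating the compressed bytes reports a perfectly good file as invalid, while validating after decompression gives the expected answer.

```python
import gzip
import json


def is_valid_json(data: bytes) -> bool:
    """Return True if the bytes parse as JSON (hypothetical helper)."""
    try:
        json.loads(data)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False
    return True


# Made-up payload, compressed the way an archive response might be.
payload = json.dumps({"ocid": "example-1"}).encode()
compressed = gzip.compress(payload)

# Validating before decompression wrongly rejects the file.
valid_before = is_valid_json(compressed)
# Validating after decompression accepts it.
valid_after = is_valid_json(gzip.decompress(compressed))
```

So whichever component validates the JSON needs to run after the component that decompresses the data.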
Sure, please check the registry logs when the time comes.
Aha, for Croatia we had …
We can add ValidateJSONMiddleware to the …
… into 1058-filter-invalid
…gfisher-collect into 1058-filter-invalid
And also replace …
I tested locally the spiders that are small and fast enough, with these results:
nigeria_anambra_state: 23/169 (13.6% missing) (one ocid per file)
@jpmckinney I've included the test. Do you think we are OK to merge this now?
I did a bit of refactoring. Should be good now.
closes #1058
closes #1036
closes #964
closes #963
closes #957
closes #886
closes #876
closes #645