Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1058 filter invalid json files #1066

Merged
merged 28 commits into from
Apr 11, 2024
Merged

1058 filter invalid json files #1066

merged 28 commits into from
Apr 11, 2024

Conversation

yolile
Copy link
Member

@yolile yolile commented Apr 10, 2024

closes #1058
closes #1036
closes #964
closes #963
closes #957
closes #886
closes #876
closes #645

@yolile yolile requested a review from jpmckinney April 10, 2024 21:11
@jpmckinney
Copy link
Member

Did you do crawls for the individual spiders? I would be interested to know if the invalid JSON are a small or large proportion. (You can do a sample of 100 or something.)

@jpmckinney
Copy link
Member

From #964

With f8d47ca this problem didn't happen again. However, as the publisher is working on a better approach to generate the files, I will keep this issue open so we can remove the download limitations from the spider when the new solution is implemented.

I assume you are okay with closing the issue and leaving those limits for now.

- Rename check_json_format to validate_json
- Rename invalid_format to invalid_json, as invalid format can mean e.g. using XML instead of JSON
- Move invalid_json to KingfisherFileItem, so that it is not available on FileError
- Re-order documentation, class attributes and item fields to match processing order
…to log the key used for deduplication in the case of invalid JSON
@jpmckinney jpmckinney force-pushed the 1058-filter-invalid branch 2 times, most recently from a55017d to 8e495ae Compare April 10, 2024 23:43
Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • use the spider's logger to log what we dropped (like in LogFormatter)
  • I think we also wanted to update the spider's stats with the number of invalid JSON items.

jpmckinney
jpmckinney previously approved these changes Apr 11, 2024
@jpmckinney jpmckinney mentioned this pull request Apr 11, 2024
@yolile
Copy link
Member Author

yolile commented Apr 11, 2024

Did you do crawls for the individual spiders?

I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly

@yolile yolile requested a review from jpmckinney April 11, 2024 17:33
@yolile
Copy link
Member Author

yolile commented Apr 11, 2024

This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia.

@jpmckinney
Copy link
Member

I tested it with two Nigerian states, one dropped 1 file, and the other 2. I could try with all the affected spiders if you want, or do it in the registry directly

Sure, please check the registry logs when the time comes.

This approach fails when the data is compressed, as the middleware runs before the data is decompressed, e.g. with croatia.

Aha, for Croatia we had ReadDataMiddleware read the data (if it survived that long without being read). Fixed in d6e8a7f (similar to AddPackageMiddleware)

@jpmckinney
Copy link
Member

We can add ValidateJSONMiddleware to the test_bytes_or_file test to cover this.

@yolile
Copy link
Member Author

yolile commented Apr 11, 2024

Aha, for Croatia we had ReadDataMiddleware read the data

And also replace item.data with data.read(), otherwise, we get empty JSON files.

I tested locally the spiders that are small and faster enough with these results:

nigeria_anambra_state: 23/169 (13,6% missing) (one ocid per file)
nigeria_osun_state: 2/164 (1% missing) (one ocid per file)
nigeria_enugu_state: 1/7 (14% missing) (one year per file, 2020 invalid, but 2023 and 2024 without releases)
croatia: 9/65 (13,8% missing) (one month per file)
costa_rica_poder_judicial_releases: 1306/11991 (10% missing) (one ocid per file)
pakistan_ppra_api is down
panama_dgcp_records: too slow/big

@yolile
Copy link
Member Author

yolile commented Apr 11, 2024

@jpmckinney I've included the test, do you think we are ok to merge this now?

Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a bit of refactoring. Should be good now.

@yolile yolile merged commit e366664 into main Apr 11, 2024
10 checks passed
@yolile yolile deleted the 1058-filter-invalid branch April 11, 2024 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment