Filter out invalid and incomplete JSON #1058
Comments
I'm wondering if we want this middleware to be run after |
Yes, that sounds best. |
Hmm, I guess it would be hard to set a number. Ideally, we could have a general percentage, e.g. 50% for all spiders, but we can only do that when the spider is complete (and already closed). Maybe the data registry should check for |
Yeah, we can do #531 instead (and the related issues in the data registry). |
Some publications have invalid JSON, but their spiders don't require any deserialization (the JSON stays as bytes). For these, we don't need to filter them out in Kingfisher Collect, as Kingfisher Process can handle invalid JSON. I think the condition to stay as bytes is:
So this new middleware can just |
Ah, going back to:
|
Re: my last two comments, the behavior is opt-in in #1066, which is also fine, as it spares some deserialization and reserialization in cases where e.g. we set |
If we can filter these out, then we can include more publications in the registry where this issue occurs in a small subset of all available files.
Typically, filtering is done in the item pipeline. However, spider middlewares run prior to the item pipeline, and we parse the JSON in these middlewares. (In some cases, we parse the JSON in the spider, but only when we have to in order to create URLs.)
Maybe we mark the item with a first middleware, and the other middlewares are skipped if that mark is present. The pipeline could then drop these marked items, and log the total.
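A minimal sketch of that mark-then-drop idea, under several assumptions: items are plain dicts with a `data` key holding bytes or text, and the marker key `invalid_json` and all three class names are hypothetical, not taken from Kingfisher Collect. The `DropItem` import falls back to a local stand-in so the sketch runs without Scrapy installed.

```python
import json

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass

INVALID_JSON = "invalid_json"  # hypothetical marker key


class ValidateJSONMiddleware:
    """Hypothetical first spider middleware: marks items whose data does not parse."""

    def process_spider_output(self, response, result, spider=None):
        for item in result:
            if isinstance(item, dict) and isinstance(item.get("data"), (bytes, str)):
                try:
                    json.loads(item["data"])
                except ValueError:  # includes JSONDecodeError and UnicodeDecodeError
                    item[INVALID_JSON] = True
            yield item


class LaterMiddleware:
    """Hypothetical later middleware: passes marked items through untouched."""

    def process_spider_output(self, response, result, spider=None):
        for item in result:
            if isinstance(item, dict) and not item.get(INVALID_JSON):
                item["data"] = json.loads(item["data"])
            yield item


class DropInvalidJSONPipeline:
    """Hypothetical item pipeline: drops marked items and keeps a running total."""

    def __init__(self):
        self.dropped = 0

    def process_item(self, item, spider=None):
        if item.get(INVALID_JSON):
            self.dropped += 1
            raise DropItem("Invalid or incomplete JSON")
        return item
```

In a real spider, the counter would more likely live in Scrapy's stats collector than on the pipeline instance, so that the total appears in the crawl's final stats dump.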
Related to #1055, we may want to set a threshold such that the spider closes with a different reason if the threshold for invalid JSON files is exceeded.
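As a rough illustration of the threshold check, here is a small helper that picks a close reason from two counters. The function name, the reason string `invalid_json_threshold_exceeded` and the default threshold of 0.5 are all assumptions for illustration. How and when the reason would actually be applied to a running spider is left open.

```python
def close_reason(invalid_count, total_count, threshold=0.5, default="finished"):
    """Return a hypothetical close reason based on the share of invalid JSON files."""
    # Guard against division by zero when no files were seen at all.
    if total_count and invalid_count / total_count > threshold:
        return "invalid_json_threshold_exceeded"  # hypothetical reason string
    return default
```

For example, `close_reason(6, 10)` returns the hypothetical reason, while `close_reason(5, 10)` returns `"finished"` because the ratio does not exceed the threshold.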