Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter out invalid and incomplete JSON #1058

Closed
jpmckinney opened this issue Feb 13, 2024 · 7 comments · Fixed by #1066
Closed

Filter out invalid and incomplete JSON #1058

jpmckinney opened this issue Feb 13, 2024 · 7 comments · Fixed by #1066
Assignees
Labels
framework-items Relating to how we process items
Milestone

Comments

@jpmckinney
Copy link
Member

If we can filter these out, then we can include more publications in the registry where this issue occurs in a small subset of all available files.

Typically, filtering is done in the item pipeline. However, spider middlewares run prior to the item pipeline, and we parse the JSON in these middlewares. (In some cases, we parse the JSON in the spider, but only when we have to in order to create URLs.)

Maybe we mark the item with a first middleware, and the other middlewares are skipped if that mark is present. The pipeline could then drop these marked items, and log the total.

Related to #1055, we maybe want to set a threshold such that the spider closes with a different reason if the threshold for invalid JSON files is exceeded.

@jpmckinney jpmckinney added the framework-items Relating to how we process items label Feb 13, 2024
@jpmckinney jpmckinney added this to the Priority milestone Feb 13, 2024
@yolile
Copy link
Member

yolile commented Apr 8, 2024

I wondering if we want this middleware to be run after ConcatenatedJSONMiddleware and LineDelimitedMiddleware, in case only one item of a file is invalid

@jpmckinney
Copy link
Member Author

Yes, that sounds best.

@yolile
Copy link
Member

yolile commented Apr 8, 2024

Related to #1055, we maybe want to set a threshold such that the spider closes with a different reason if the threshold for invalid JSON files is exceeded.

Hmm, I guess it would be hard to set a number, ideally, we could have a general percentage, e.g. 50% for all spiders, but we can only do that when the spider is complete (and already closed). Maybe the data registry should check for item_dropped_count and item_scraped_count statistics and if the dropped count is more than 50%, then do not continue processing the collection

@jpmckinney
Copy link
Member Author

Yeah, we can do #531 instead (and the related issues in the data registry).

@jpmckinney
Copy link
Member Author

jpmckinney commented Apr 10, 2024

Some publications have invalid JSON, but their spiders don't require any deserialization (the JSON stays as bytes). For these, we don't need to filter them out in Kingfisher Collect, as Kingfisher Process can handle invalid JSON. I think the condition to stay as bytes is:

not spider.concatenated_json and (see next comment) not spider.root_path and item.data_type not in ('release', 'record') and not getattr(spider, 'resize_package', False)

So this new middleware can just yield item and continue if already deserialized (isinstance(data, (dict, list))) or the above.

@jpmckinney
Copy link
Member Author

Ah, going back to:

I wondering if we want this middleware to be run after ConcatenatedJSONMiddleware and LineDelimitedMiddleware, in case only one item of a file is invalid

ConcatenatedJSONMiddleware can yield items up to the invalid one, but I don't think ijson can continue past some invalid JSON text (in some scenarios, we can probably find where good data starts again, but I don't think there's a universal solution). So, for this issue, maybe we handle the exception in ConcatenatedJSONMiddleware (or, maybe we open a new issue, and for now require that concatenated JSON must be valid).

@jpmckinney
Copy link
Member Author

Re: my last two comments, the behavior is opt-in in #1066, which is also fine, as it spares some deserialization and reserialization in cases where e.g. we set root_path but we know the JSON is always valid. This is a good approach, as invalid JSON should be rare.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
framework-items Relating to how we process items
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants