Large-scale analysis of mmda failures #206
Comments
Fails if PDF contains an empty page (e.g.,
|
Fails if PDF has non pdfplumber rows (e.g.,
Note that this case happens a lot with PDFs that are scans, contain no text, etc.; in other words, documents that are definitely not academic papers. |
Fails on corrupted files (e.g.,
We should fail safe on this, maybe return empty doc? |
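A minimal sketch of the fail-safe idea, assuming the parser is just a callable that takes a path. `parse_or_empty`, `empty_doc`, and the dict layout are hypothetical names for illustration, not mmda's actual API:

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def empty_doc() -> Dict[str, Any]:
    # Hypothetical stand-in for an "empty document"; the real structure
    # would depend on what downstream code expects.
    return {"symbols": "", "pages": [], "tokens": [], "rows": []}


def parse_or_empty(
    pdf_path: str, parse_fn: Callable[[str], Dict[str, Any]]
) -> Dict[str, Any]:
    """Run parse_fn on pdf_path; on any error, log it and return an empty doc."""
    try:
        return parse_fn(pdf_path)
    except Exception as exc:  # deliberately broad: corrupted files fail in many ways
        logger.warning("Failed to parse %s: %s", pdf_path, exc)
        return empty_doc()
```

Catching `Exception` broadly is a trade-off: it keeps a batch run alive, but the logged message is then the only record of why a file failed.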
There's a doc with 11,000+ pages, and rasterization hangs for a very long time: @kyleclo mentioned looking at ways to parse just the metadata and skip very large PDFs. |
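One way the "skip if very large" suggestion could look, as a rough sketch. `parse_if_small`, `count_pages`, and the `MAX_PAGES` threshold are all assumptions for illustration; the page count would need to come from something cheap (for example, reading only the page list) rather than from rasterizing:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Assumed cutoff, not a value from this thread; tune it for the corpus.
MAX_PAGES = 500


def parse_if_small(
    pdf_path: str,
    count_pages: Callable[[str], int],
    parse_fn: Callable[[str], T],
    max_pages: int = MAX_PAGES,
) -> Optional[T]:
    """Parse pdf_path only if its page count is at most max_pages.

    count_pages is a hypothetical cheap page counter and parse_fn is the
    expensive step (parsing plus rasterization). Returns None when the
    document is skipped.
    """
    if count_pages(pdf_path) > max_pages:
        return None
    return parse_fn(pdf_path)
```

Returning `None` for skipped files keeps them distinguishable from files that parsed to an empty document.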
Another pdfplumber row error on document |
Word predictor failed on @kyleclo mentioned maybe we don't run word predictor.
It is a scan of an old doc, maybe something to do with weird characters? |
Another error in word predictor (
PDF looks ok? |
Final stats on the sample I ran on
All PDFs that failed
|
Using this issue to document failures from a run over ~700 PDFs.