Put multi-line cell content into a single cell #23
Comments
James, could you clarify how Tabula "chokes" on that Newfoundland and Labrador PDF? Do you mean that Tabula outputs multi-line cells on different lines, where copy-paste properly includes them on the same line? Or does Tabula completely fail to output anything? Thanks!
Sure, @jeremybmerrill. Using Tabula from git HEAD, with that PDF, Tabula either:
When copy-and-pasting that PDF, the first error never occurs, and the second error occurs less frequently. When it does occur, the rows are split into new rows that appear below, and never above, making it easier to write a script to clean the CSV. If I do not select the table precisely (and just put a square around the whole page), the CSV is much worse than copy-and-pasting. "Autodetect Tables" gives JavaScript errors, so I couldn't test it.
I've put a gist to compare the two CSVs here: https://gist.github.com/jpmckinney/6921697 The copy-and-paste method creates more empty rows, but those are very easy to clean in post-processing. It adds an extra space at the end of each cell, but that is also very easy to clean. The copy-and-paste method is nearly perfect in this case, whereas Tabula requires careful post-processing.
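For reference, the two cleanup steps mentioned above (dropping empty rows and trimming the extra space at the end of each cell) could look roughly like this. It is only a sketch: the function names `clean_rows` and `clean_csv_text` are made up for illustration and are not part of Tabula or copy_paste_pdf.

```python
import csv
import io


def clean_rows(rows):
    """Strip surrounding whitespace from each cell and drop rows that end up empty."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Keep the row only if at least one cell still has content.
        if any(cells):
            cleaned.append(cells)
    return cleaned


def clean_csv_text(text):
    """Apply clean_rows to CSV text and return the cleaned CSV text."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(clean_rows(reader))
    return out.getvalue()
```

This only handles the "easy" part of the cleanup; it does nothing about cells that were split across rows.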
Thanks for the additional details, @jpmckinney! By #1 do you mean like where Tabula puts "Classifications Appeal Board" a line below "Jean Myrick" on row 1 of page 2? Just want to be clear. :) Assuming so, that's a bug we're aware of (though I can't find the issue here...). It's definitely a big one. You're right that the post-processing to combine cells is tough. I've written quite a few bespoke scripts to deal with that output for a production project that uses tabula-extractor. They're a pain in the ass, I know... I'm sure the guy who inherited that project from me would be ecstatic if we solved it. But we're not quite there yet, algorithmically. I think our approach would be to use the line elements on the page to group text elements with different y-locations into a single cell. This may have to wait until #16 is finished, because the more line detection we do via computer vision (our current approach for detecting tables, as opposed to cells), the slower Tabula will be. Another approach (ignoring lines) might be to use some sort of heuristic that looks at the distance between a cell and the closest non-empty one above it. If the distance is relatively greater, it might be a new row; if it's less, it might be a continuation of the previous cell. This might get gross, though -- and only be successful some of the time. Would love to hear your input and we appreciate the test file.
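To make the distance heuristic concrete, here is a rough sketch. It is not what Tabula does. It assumes each extracted text line carries a `top` y-coordinate that increases down the page and the same number of columns, and it uses the midpoint between the smallest and largest vertical gap as the cutoff, which is an arbitrary choice for illustration. `TextLine`, `merge_continuation_lines` and `tolerance` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    top: float   # y-coordinate of the text line, increasing down the page
    cells: list  # one string per column, "" where the column is empty


def merge_continuation_lines(lines, tolerance=1.0):
    """Merge lines that sit close below the previous line into that row."""
    if not lines:
        return []
    lines = sorted(lines, key=lambda line: line.top)
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    # Evenly spaced lines give no signal, so treat every line as its own row.
    if not gaps or max(gaps) - min(gaps) <= tolerance:
        return [list(line.cells) for line in lines]
    # Assumed cutoff: halfway between the smallest and largest gap.
    threshold = (min(gaps) + max(gaps)) / 2
    rows = [list(lines[0].cells)]
    for gap, line in zip(gaps, lines[1:]):
        if gap > threshold:
            # Relatively large gap: start a new row.
            rows.append(list(line.cells))
        else:
            # Relatively small gap: append the text to the previous row's cells.
            previous = rows[-1]
            for i, text in enumerate(line.cells):
                if text:
                    previous[i] = f"{previous[i]} {text}".strip()
    return rows
```

With made-up coordinates, a line holding "Jean Myrick" at `top=100` followed by a line holding only "Classifications Appeal Board" at `top=112` would be merged into one row, as long as the next real row starts noticeably further down. As noted above, this will misfire on tables where row spacing and line spacing are similar.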
Yup, that's what I mean by point 1. Yeah, #16 seems to be the solution. I was originally going to hack together a script to find rectangles, tesseract each rectangle, and recompose a table, when I discovered that copy-pasting magically worked (it helps that the PDF was exported from Excel). #16 sounds much more robust!
Great. Because PDF is such a shit format, different PDF generators generate radically different structures that represent similar-looking PDFs. Getting perfect coverage is definitely a goal, though there's obviously work still to be done. You might be intrigued by the [dochive](https://github.com/raleighpublicrecord/dochive) project. I don't know if it's still active, but I know that their aim was to do just what you were thinking for scanned PDFs -- create a template system or use CV to find rectangles, tesseract their contents, and export that as a CSV. I just renamed the issue, I hope you don't mind.
Out of the test data, the files that don't copy-paste from Preview to Excel cleanly are:
On the other hand, Tabula chokes on some PDFs that copy-paste just fine! For example, page 2 of this PDF from this website.
I cooked up an AppleScript to do bulk copy-pasting in these cases: https://github.com/opennorth/copy_paste_pdf
I'm not sure why Tabula has trouble with the linked PDF, but maybe it will be another useful test case.