This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Put multi-line cell content into a single cell #23

Open
jpmckinney opened this issue Oct 10, 2013 · 6 comments

Comments

@jpmckinney

Out of the test data, the files that don't copy-paste from Preview to Excel cleanly are:

  • bo_page24.pdf
  • gre.pdf
  • vertical_rulings_bug.pdf

On the other hand, Tabula chokes on some PDFs that copy-paste just fine! For example, page 2 of this PDF from this website.

I cooked up an AppleScript to do bulk copy-pasting in these cases: https://github.com/opennorth/copy_paste_pdf

I'm not sure why Tabula has trouble with the linked PDF, but maybe it will be another useful test case.

@jeremybmerrill
Member

James,

Could you clarify how Tabula "chokes" on that Newfoundland and Labrador PDF? Do you mean that Tabula outputs multi-line cells on different lines, where copy-paste properly includes them on the same line? Or does Tabula completely fail to output anything?

Thanks!

@jpmckinney
Author

Sure, @jeremybmerrill. Using Tabula from git HEAD, with that PDF, Tabula either:

  1. Moves a cell that should be in the same row as other cells into a new row below
  2. Splits the content of one or more cells in a given row onto multiple rows (either above or below the row with the rest of the cells)

When copy-and-pasting that PDF, the first error never occurs, and the second error occurs less frequently. When it does occur, the rows are split into new rows that appear below, and never above, making it easier to write a script to clean the CSV.
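As a rough illustration of the kind of cleanup script this makes possible, here is a minimal Python sketch. It assumes (this is not stated anywhere in the thread) that a continuation row can be recognized by an empty first column, and that its non-empty cells belong to the nearest row above. The function name `merge_continuation_rows` and the sample rows are made up for the example.

```python
def merge_continuation_rows(rows, key_column=0):
    """Fold rows whose key column is empty into the nearest row above.

    Assumption: a split-off row always appears below its parent row
    and has an empty cell in `key_column`.
    """
    merged = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely blank
        key = cells[key_column] if key_column < len(cells) else ""
        if merged and not key:
            previous = merged[-1]
            # Pad the parent row if the continuation row is wider.
            previous.extend([""] * (len(cells) - len(previous)))
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append(cells)
    return merged


# Hypothetical sample rows, not taken from the actual PDF output.
sample = [
    ["Jean Myrick", "Classifications"],
    ["", "Appeal Board"],
    ["A. Example", "Example Board"],
]
for row in merge_continuation_rows(sample):
    print(row)
```

This only works because the split rows land below their parent. If they could land above as well, the script would have no way to tell which neighbour a fragment belongs to.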

If I do not select the table precisely (and just put a square around the whole page), the CSV is much worse than copy-and-pasting. "Autodetect Tables" gives JavaScript errors, so I couldn't test it.

@jpmckinney
Author

I've put a gist to compare the two CSVs here: https://gist.github.com/jpmckinney/6921697

The copy-and-paste method creates more empty rows, but those are very easy to clean in post-processing. It adds an extra space at the end of each cell, but that is also very easy to clean. The copy-and-paste method is nearly perfect in this case, whereas Tabula requires careful post-processing.
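For reference, the two easy cleanups mentioned above (empty rows and trailing spaces) could look something like this in Python. This is only a sketch using the standard `csv` module. The function name `clean_pasted_csv`, the header names, and the sample text are invented for the example and are not taken from the gist.

```python
import csv
import io


def clean_pasted_csv(text):
    """Strip whitespace from every cell and drop rows that end up empty."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if any(cells):  # keep only rows with at least one non-empty cell
            rows.append(cells)
    return rows


# Hypothetical pasted output: trailing spaces plus an empty row.
sample = "Name ,Board \n,\nJean Myrick ,Classifications Appeal Board \n"
for row in clean_pasted_csv(sample):
    print(row)
```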

@jeremybmerrill
Member

Thanks for the additional details, @jpmckinney!

By #1, do you mean cases like where Tabula puts "Classifications Appeal Board" a line below "Jean Myrick" on row 1 of page 2? Just want to be clear. :)

Assuming so, that's a bug we're aware of (though I can't find the issue here...). It's definitely a big one.

You're right that the post-processing to combine cells is tough. I've written quite a few bespoke scripts to deal with that output for a production project that uses tabula-extractor. They're a pain in the ass, I know... I'm sure the guy who inherited that project from me would be ecstatic if we solved it. But we're not quite there yet, algorithmically.

I think our approach would be to use the line elements on the page to group text elements with different y-locations into a single cell. This may have to wait until #16 is finished, because the more line detection we do via computer vision (our current approach for detecting tables, as opposed to cells), the slower Tabula will be.
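To make the idea concrete, here is a minimal Python sketch of grouping text by ruling lines. It is not Tabula's code. It assumes a single column, a coordinate system where y grows downward, text elements given as `(y, text)` tuples, and horizontal rulings that have already been detected and are passed in as a list of y-positions. All names and numbers are hypothetical.

```python
from bisect import bisect_right


def group_by_rulings(text_elements, ruling_ys):
    """Merge (y, text) elements that fall between the same two rulings.

    Assumption: every element between two consecutive horizontal
    rulings belongs to the same cell.
    """
    rulings = sorted(ruling_ys)
    bands = {}
    for y, text in sorted(text_elements):
        band = bisect_right(rulings, y)  # number of rulings at or above y
        bands.setdefault(band, []).append(text)
    return [" ".join(bands[band]) for band in sorted(bands)]


# Hypothetical positions: rulings at y=100, 140, 160.
elements = [(110, "Classifications"), (125, "Appeal Board"), (150, "Second cell")]
print(group_by_rulings(elements, [100, 140, 160]))
```

The grouping itself is cheap. The expensive part is finding the rulings in the first place, which is the dependency on #16.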

Another approach (ignoring lines) might be to use some sort of heuristic that looks at the distance between a cell and the closest non-empty one above it. If the distance is relatively large, it might be a new row; if it's small, it might be a continuation of the previous cell. This might get gross, though -- and only be successful some of the time.
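One possible shape for that heuristic, as a Python sketch rather than anything Tabula actually does: treat the text in one column as `(top, height, text)` lines with y growing downward, and start a new cell whenever the blank space above a line is larger than some fraction of the previous line's height. The `factor` value of 0.6 is an arbitrary assumption, and all names here are hypothetical.

```python
def split_cells_by_gap(lines, factor=0.6):
    """Group (top, height, text) lines of one column into cells.

    A line starts a new cell when the blank space above it exceeds
    `factor` times the height of the previous line; otherwise it is
    treated as a continuation of the previous cell.
    """
    cells = []
    previous = None  # (top, height) of the line above
    for top, height, text in sorted(lines):
        starts_new_cell = True
        if previous is not None:
            prev_top, prev_height = previous
            gap = top - (prev_top + prev_height)
            starts_new_cell = gap > factor * prev_height
        if starts_new_cell:
            cells.append([text])
        else:
            cells[-1].append(text)
        previous = (top, height)
    return [" ".join(cell) for cell in cells]


# Hypothetical layout: two tightly spaced lines, then a larger gap.
lines = [(100, 10, "Classifications"), (112, 10, "Appeal Board"), (140, 10, "Second cell")]
print(split_cells_by_gap(lines))
```

A fixed factor like this will misfire on tables with tight row spacing or generous line spacing, which is the "only successful some of the time" problem.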

Would love to hear your input and we appreciate the test file.

@jpmckinney
Author

Yup, that's what I mean by point 1.

Yeah, #16 seems to be the solution. I was originally going to hack together a script to find rectangles, tesseract each rectangle, and recompose a table, when I discovered that copy-pasting magically worked (it helps that the PDF was exported from Excel). #16 sounds much more robust!

@jeremybmerrill
Member

Great. Because PDF is such a shit format, different PDF generators generate radically different structures that represent similar-looking PDFs. Getting perfect coverage is definitely a goal, though there's obviously work still to be done.

You might be intrigued by the dochive project (https://github.com/raleighpublicrecord/dochive). I don't know if it's still active, but I know that their aim was to do just what you were thinking for scanned PDFs -- create a template system or use CV to find rectangles, tesseract their contents, and export that as a CSV.

I just renamed the issue, I hope you don't mind.
