Only horizontal lines, columns too close to each other #81
Comments
Awesome, @abelsonlive! Thanks for the report.
Our current
There isn't much space between columns on that PDF (e.g., row 9 on the first page), so the extractor can't reliably detect the column widths. That's exactly why there's an option for specifying column positions.
@jazzido: Can you give a brief description of what this new technique is? Or, what this picture is showing?
@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of the glyphs and then analyze the resulting profiles (green and red curves) to segment the area of interest. In our implementation, we place a row (column) separator wherever there is a change of slope in the horizontal (vertical) profile. There is some code in
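To make the projection idea concrete, here is a minimal Python sketch. It is not the tabula-extractor code: the `Glyph` type and the `projection_profiles` and `separators` functions are hypothetical names invented for illustration, and instead of detecting slope changes it uses a simpler variant that puts a separator in the middle of each interior run where the profile drops to zero.

```python
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    """Hypothetical glyph bounding box in integer page units."""
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def projection_profiles(
    glyphs: List[Glyph], page_w: int, page_h: int
) -> Tuple[List[int], List[int]]:
    """Return (vertical, horizontal) projection profiles.

    vertical[x]   = sum of heights of the glyphs that cover column x
    horizontal[y] = sum of widths of the glyphs that cover row y
    """
    vertical = [0] * page_w
    horizontal = [0] * page_h
    for g in glyphs:
        for x in range(max(g.x, 0), min(g.x + g.w, page_w)):
            vertical[x] += g.h
        for y in range(max(g.y, 0), min(g.y + g.h, page_h)):
            horizontal[y] += g.w
    return vertical, horizontal


def separators(profile: List[int], min_gap: int = 1) -> List[int]:
    """Midpoints of interior zero runs that are at least min_gap long.

    Simplification (assumption): the comment above describes placing
    separators at slope changes; this sketch only looks for empty gaps.
    """
    seps: List[int] = []
    start = None  # index where the current zero run began
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            # Ignore the leading margin (start == 0) and gaps that are too short.
            if start > 0 and i - start >= min_gap:
                seps.append((start + i - 1) // 2)
            start = None
    return seps  # a trailing zero run (page margin) is never closed, so it is skipped


# Two columns and two rows of 4x2 glyphs.
glyphs = [Glyph(0, 0, 4, 2), Glyph(10, 0, 4, 2), Glyph(0, 5, 4, 2), Glyph(10, 5, 4, 2)]
vertical, horizontal = projection_profiles(glyphs, page_w=14, page_h=7)
column_seps = separators(vertical)
row_seps = separators(horizontal)
```

With a larger `min_gap`, narrow gaps are rejected. That is roughly what happens when columns sit too close together: the gap between them is no longer distinguishable from ordinary spacing.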
While using tabula-extractor to parse this PDF (pages 1-151), I ran into some interesting issues:
python script. I imagine there might be a way to add a flag in tabula-extractor for a general "merge-down" or "merge-up" post-processing step. It could probably follow the logic of this function:
I also had to run this multiple times to catch those instances in which there were 3 or more lines in an individual cell.