This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Only horizontal lines, columns too close to each other #81

Open
abelsonlive opened this issue May 19, 2014 · 4 comments

Comments

@abelsonlive

While using tabula-extractor to parse this PDF (pages 1 - 151), I ran into some interesting issues:

  1. While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.
  2. The extraction performed much, much better when specifying the pixel positions of each column.
  3. I eventually merged the multi-line cells using a Python script. I imagine there might be a way to add a flag in tabula-extractor for a general "merge-down" or "merge-up" post-processing step. It could probably follow the logic of this function:
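def merge(data, id_col='id', direction='down'):
  # `data` is a list of dicts, one per extracted row; it is modified in place.
  # Deleting a row while enumerating skips the row that follows it, so a
  # cell spanning 3 or more lines needs more than one pass.
  for i, row in enumerate(data):
    # every row should have an `id_col`,
    # if it doesn't then it means we're
    # at a multiline cell
    if row[id_col] == '':
      # determine merge index based off of `direction` arg
      if direction == 'down':
        merge_idx = i + 1
      elif direction == 'up':
        merge_idx = i - 1
      else:
        raise ValueError("direction must be 'down' or 'up'")
      # nothing to merge into at the edges of the table
      if not 0 <= merge_idx < len(data):
        continue
      # find non-empty cells to merge
      merge_keys = [k for k, v in row.items() if v != '']
      for k in merge_keys:
        # merge multi-line cells, keeping the text in reading order
        if direction == 'down':
          pair = (data[i][k], data[merge_idx][k])
        else:
          pair = (data[merge_idx][k], data[i][k])
        data[merge_idx][k] = ('%s %s' % pair).strip()
      # delete the row we merged, once all of its cells are copied
      del data[i]
  return data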

I also had to run this multiple times to catch those instances in which there were 3 or more lines in an individual cell.
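
The repeated passes could probably be avoided by building a new list instead of deleting rows in place. The following is only a sketch of that idea, not something tabula-extractor provides: the name `merge_rows` and the handling of orphan fragments (continuation rows with no neighbour to merge into are kept unchanged) are assumptions made for illustration.

```python
def merge_rows(rows, id_col='id', direction='down'):
    """Fold continuation rows (empty `id_col`) into a neighbouring full row.

    Returns a new list of new dicts; the input rows are not modified.
    """
    if direction not in ('down', 'up'):
        raise ValueError("direction must be 'down' or 'up'")

    def join(first, second):
        # Concatenate two cell texts in reading order, skipping empty ones.
        return ' '.join(part for part in (first, second) if part != '')

    merged = []
    pending = []  # 'down' only: fragments waiting for the next full row
    for row in rows:
        if row[id_col] != '':
            full = dict(row)
            # 'down': the waiting fragments' text precedes this row's text,
            # so prepend them from the nearest fragment to the farthest.
            for frag in reversed(pending):
                for key, value in frag.items():
                    full[key] = join(value, full.get(key, ''))
            pending = []
            merged.append(full)
        elif direction == 'down':
            pending.append(row)
        elif merged:
            # 'up': append the fragment's text to the previous row
            for key, value in row.items():
                merged[-1][key] = join(merged[-1].get(key, ''), value)
        else:
            # assumption: an 'up' fragment with nothing above it is kept as-is
            merged.append(dict(row))
    # assumption: 'down' fragments with nothing below them are kept as-is
    merged.extend(dict(frag) for frag in pending)
    return merged
```

With this shape, a cell that spans any number of lines is merged in a single call, because every fragment is folded into one accumulated row rather than into its immediate neighbour.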

@jazzido
Contributor

jazzido commented May 19, 2014

Awesome, @abelsonlive! Thanks for the report.

> While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.

Our current spreadsheet algorithm takes ruling lines into account only if they form a full grid. We should also consider the case where there are only horizontal ruling lines.

> The extraction performed much, much better when specifying the pixel positions of each column.

There isn't much space between columns in that PDF (e.g. row 9 on the first page), so the extractor can't reliably detect the widths of the columns. That's exactly why there's an option for specifying columns' positions.
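
To illustrate why explicit positions help, here is a minimal sketch of assigning the text chunks of one row to columns once the separator positions are known. It is not tabula-extractor's code: the `(x, text)` chunk tuples, the function name `split_into_columns` and the coordinates are all hypothetical.

```python
from bisect import bisect_right


def split_into_columns(chunks, boundaries):
    """Assign the (x, text) chunks of one row to columns.

    `boundaries` holds the x positions of the separators between columns,
    so n boundaries always produce n + 1 cells.
    """
    boundaries = sorted(boundaries)
    cells = [[] for _ in range(len(boundaries) + 1)]
    for x, text in sorted(chunks):
        # bisect_right counts how many separators lie at or left of x,
        # which is the index of the column the chunk falls into
        cells[bisect_right(boundaries, x)].append(text)
    return [' '.join(cell) for cell in cells]


# Two chunks only 1.5 units apart still land in different columns,
# because the separator at 63.5 is given rather than inferred.
row = [(10.0, 'John'), (38.0, 'Smith'), (62.5, '42'), (64.0, 'NY')]
print(split_into_columns(row, [60.0, 63.5]))
```

With fixed separators the result no longer depends on how wide the gaps between neighbouring chunks happen to be, which is the part that is unreliable in tightly packed tables.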

@jazzido changed the title from "Edge Case" to "Only horizontal lines, columns too close to each other" on May 19, 2014
@jazzido
Contributor

jazzido commented Jun 9, 2014

I'm playing with a new technique for segmenting tables. This case is successfully handled with no parameter tweaking at all:

[Image: m27-1]

@lukehsiao
Contributor

@jazzido: Can you give a brief description of what this new technique is? Or, what this picture is showing?

@jazzido
Contributor

jazzido commented Sep 30, 2015

@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of the glyphs and then analyze the resulting profiles (green and red curves) to segment the area of interest. In our implementation, we place a row (column) separator wherever there is a change of slope in the horizontal (vertical) profile.
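
A rough sketch of the projection idea, for illustration only. It is not the tabula-java code: the glyph boxes are assumed to be `(left, top, width, height)` tuples in integer page units, the function names are made up, and separators are placed in the middle of empty gaps in the profile, which is a simpler rule than the slope-change rule described above.

```python
def projection_profile(boxes, axis, length):
    """Sum glyph extents along one axis.

    axis=0: vertical projection, one value per x position (sum of heights).
    axis=1: horizontal projection, one value per y position (sum of widths).
    """
    profile = [0] * length
    for left, top, width, height in boxes:
        if axis == 0:
            start, span, weight = left, width, height
        else:
            start, span, weight = top, height, width
        for pos in range(max(start, 0), min(start + span, length)):
            profile[pos] += weight
    return profile


def gap_separators(profile):
    """Return the midpoint of each run of zeros between two non-empty runs."""
    separators = []
    gap_start = None
    seen_content = False
    for pos, value in enumerate(profile):
        if value == 0:
            if seen_content and gap_start is None:
                gap_start = pos
        else:
            if gap_start is not None:
                # the gap covers gap_start .. pos - 1
                separators.append((gap_start + pos - 1) // 2)
                gap_start = None
            seen_content = True
    return separators


# Four glyphs laid out as a 2 x 2 table on a 10 x 8 page.
glyphs = [(0, 0, 3, 2), (6, 0, 3, 2), (0, 5, 3, 2), (6, 5, 3, 2)]
column_seps = gap_separators(projection_profile(glyphs, axis=0, length=10))
row_seps = gap_separators(projection_profile(glyphs, axis=1, length=8))
```

The gap rule only works when cells are separated by fully empty strips; analysing the slope of the profile, as described above, is what allows separators to be found where the profile never drops to zero.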

There is some code in tabula-java that implements this, but it's not integrated with the extraction algorithms that we currently use.
