extraction without defining vertical lines #907

88arvin · 2023-06-16T12:15:46Z

I have attached a sample PDF that I want to convert into a Pandas DataFrame. I know this can be done by explicitly defining the vertical lines. However, I would like to know whether there is another way, because the same columns are positioned differently on each page.
sample.pdf

samkit-jain · 2023-06-16T13:14:23Z

Collaborator

Hi @88arvin, I appreciate your interest in the library. The PDF you provided appears to be a scanned PDF, so I am unable to analyze it properly. I assume this is because you redacted sensitive information from it and that you still have access to the text-based PDF. Have you considered using the text strategy for vertical lines? If that does not give the desired output, then instead of defining the vertical lines manually you could write a function to identify them automatically: find all the x-coordinates that no text overlaps and use those as your vertical lines. This may produce some false positives, which you can filter out.
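
A rough sketch of that idea, not tested against the attached file: merge the horizontal extents of all words and treat every remaining gap as a candidate vertical line. The function name `find_empty_gaps` and the `min_gap` threshold are made up for illustration. The input is assumed to be a list of dicts with `x0` and `x1` keys, such as the one returned by `page.extract_words()`.

```python
def find_empty_gaps(words, min_gap=5.0):
    """Return the x midpoints of vertical strips that no word overlaps."""
    # Merge the horizontal extents of all words into disjoint intervals.
    spans = sorted((w["x0"], w["x1"]) for w in words)
    merged = []
    for x0, x1 in spans:
        if merged and x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    # Each sufficiently wide gap between neighbouring intervals is a candidate.
    gaps = []
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        if right_start - left_end >= min_gap:
            gaps.append((left_end + right_start) / 2)
    return gaps


# Hand-written word boxes standing in for page.extract_words() output.
words = [
    {"x0": 10, "x1": 50}, {"x0": 12, "x1": 40},
    {"x0": 100, "x1": 150}, {"x0": 152, "x1": 160},
    {"x0": 200, "x1": 240},
]
print(find_empty_gaps(words))
```

The returned positions could then be passed as `explicit_vertical_lines` in the table settings. Gaps that happen to line up inside a single column are the false positives mentioned above, so `min_gap` would likely need tuning per document.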

0 replies

88arvin · 2023-06-16T16:22:44Z

Author

I have tried
extract_tables(table_settings={"keep_blank_chars": True, "vertical_strategy": "text", "horizontal_strategy": "text"})
but the problem is that new columns get created because the column positions vary from page to page. I want all balances under the Balance column, all particulars under the Particulars column, and so on.

sample(1).pdf

0 replies

cmdlineluser · 2023-06-18T16:51:22Z

If you can use the column names, you could draw a line at the start of each column, apart from Vr.Type, where you can draw the line at the end of the column instead.

I've used page.bbox[-2] - 70 to draw the last line here, but you could use another marker, e.g. the right-most Cr. or Dr. perhaps.

You can use .search() to find the column names and take their x0/x1 positions as the vertical lines.

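import pdfplumber
import pandas as pd

# assumed: the text-based sample attached above
pdf = pdfplumber.open('sample(1).pdf')

columns = 'Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance'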

# x0 = start, x1 = end
borders = dict.fromkeys(columns, 'x0')
borders['Vr.Type'] = 'x1'

rows = []

for page in pdf.pages:
   vlines = [ 
      page.search(column, regex=False)[0][position] for column, position in borders.items() 
   ]  + [ page.bbox[-2] - 70 ]
   
   table = page.extract_table(
      dict(explicit_vertical_lines=vlines, horizontal_strategy='text')
   )
   
   # skip blank line and column names
   rows.extend(table[3:])

# drop any rows with empty `Vr.Date`
df = pd.DataFrame(rows, columns=columns).mask(lambda df: df['Vr.Date'] == '').dropna(subset='Vr.Date')

Result:

      Vr.Date  Vr.No Vr.Type                                        Particulars       Dr.Amt    Cr.Amt                    Balance
1   01-Apr-15                                                   Opening Balance               24274200                   24274200
3   17-Aug-15    4IV       R                           BEING EXCHANGE RATE DIFF            0    979400               25253600 Cr.
5   17-Aug-15  16 BP       R                                           BEINGTRF      6530000         Q               18723600 Cr.
7   07-Sep-15  58 BP       C             280000,$,@66.87, B.Ref-, Inv- BEINGTRF     18723600         0                      0 Dr.
9   07-Sep-15  58 BP       C            0,$,@0, B.Ref-, Inv- BEING INTERST PAID            0         0                      0 Dr.
11  20-Oct-15   3 JV       C  610,$,@65.22, B.Ref-, Inv- Being the amountvid...            0     39784                  39784 Cr.
13  20-Oct-15  20 BP       C  610,$,@65.22, B.Ref-, Inv- BEING IMPORTPYMNET ...        39784         0                      O Dr.
15  01-Apr-16                                                   Opening Balance               27614207                   27614207
17  13-May-16  38 BP       R  Chq. No. 085900 BILL NO.133, 132, 139, ,138, 1...   2640 6576          0   1207 631 Cr.            
19  25-Jun-16  50 BP       R  Chq. No. 086352 BEING COMMISSION RETURNED TO P...       132696         0                1074935 Cr.
21  19-Aug-16   1 BP       R  Chq. No. 086866 Being Payment against bill no....      1074935         0                      0 Dr.

15 replies

88arvin Jun 20, 2023
Author

sample 3.pdf

cmdlineluser Jun 20, 2023

I guess you mean page 2.

For pages like that, it may be better to use .extract_text() e.g.

import pdfplumber
import re
import pandas as pd

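pdf = pdfplumber.open('sample 3.pdf')  # assumed: the file attached above

page = pdf.pages[1]  # page 2 (zero-based index)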

text = page.extract_text(layout=True, keep_blank_chars=True, x_density=1)

df = pd.DataFrame(
   [ 
      re.split(r'\s{3,}', line.strip()) 
      for line in text.splitlines() if line.strip() 
   ]
)

            0      1     2                                                  3            4            5               6
0        2709   None  None                                               None         None         None            None
1   21-Dec-14   53SE     I  $ SLML/GJ/53/2014-15 B.L No - 594/22-DEC-14  R...   47778930.8            0   1684108498Dr.
2   28-Dec-14   54SE     I  $ SLML/GJ/54/2014-15 B.L No - 605/29-DEC-14  R...   54498405.2            0   1738606903Dr.
3   06-Jan-15   55SE     I  $ SLML/GJ/55/2014-15 B.L No - 07/07-JAN-15  Re...  41537048.75            0   1780143952Dr.
4   12-Jan-15   56SE     I  $ SLML/GJ/56/2014-15 B.L No - 15/13-JAN-15  Re...  65721149.75            0   1845865102Dr.
5   19-Jan-15   57SE     I  $ SLML/GJ/57/2014-15 B.L No - 28/20-JAN-15  Re...   49151688.6            0   1895016790Dr.
6   27-Jan-15   58SE     I  $ SLML/GJ/58/2014-15 B.L No - 50/28-JAN-15  Re...   54240753.3            0   1949257544Dr.
7   02-Feb-15    7BR     C          12074480,$,@61.79, B.Ref-, Inv- BEING TRF            0    746082119   1203175425Dr.
8   03-Feb-15   59SE     I  $ SLML/GJ/59/2014-15 B.L No - 59/04-FEB-15  Re...   40834872.4            0   1244010297Dr.
9   04-Feb-15    8BR     C            5881430,$,@61.6, B.Ref-, Inv- BEING TRF            0    362296088  881714209.1Dr.
10  05-Feb-15    3SE     I               $ KNP003 B.L No - 1/05-FEB-15  Rem:-  794179317.2            0   1675893526Dr.
11  10-Feb-15    4SE     I               $ KNP004 B.L No - 1/10-FEB-15  Rem:-  386956109.5            0   2062849636Dr.
12  15-Feb-15   60SE     I  $ SLML/GJ/60/2014-15 B.L No - 82/16-FEB-15  Re...  65600148.65            0   2128449784Dr.
13  22-Feb-15   61SE     I  $ SLML/GJ/61/2014-15 B.L No - 98/23-FEB-15  Re...     52560591            0   2181010375Dr.
14  02-Mar-15   62SE     I  $ SLML/GJ/62/2014-15 B.L No - 115/03-MAR-15  R...   64308338.4            0   2245318714Dr.
15  10-Mar-15   63SE     I  $ SLML/GJ/63/2014-15 B.L No - 135/11-MAR-15  R...   49494631.2            0   2294813345Dr.
16  16-Mar-15   64SE     I  $ SLML/GJ/64/2014-15 B.L No - 149/17-MAR-15  R...     72011214            0   2366824559Dr.
17  22-Mar-15   65SE     I  $ SLML/GJ/65/2014-15 B.L No - 161/23-MAR-15  R...     44686376            0   2411510935Dr.
18  29-Mar-15   66SE     I  $ SLML/GJ/66/14-15 B.L No - 176/30-MAR-15  Rem...     78216224            0   2489727159Dr.
19  31-Mar-15  423JV     R  EXCH FLUC ON DIAMOND INTERNATIONAL TRADING TOW...   79478282.6            0   2569205442Dr.
20  31-Mar-15  423JV     R  EXCH FLUC ON DIAMOND INTERNATIONAL TRADING TOW...   4756051.04            0   2573961493Dr.
21  31-Mar-15  425JV     R  DIAMOND INTERNATION TRADING IMPORT & EXPORT AC...            0    990000000   1583961493Dr.
22  31-Mar-15  425JV     R  DIAMOND INTERNATION TRADING IMPORT & EXPORT AC...            0  299544170.7   1284417322Dr.
23  31-Mar-15  448JV     R  louis dreyfus commodities asia pte ltd balance...            0  969760618.6  314656703.4Dr.

You could clean up the Dr. or Cr. from the last column.
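
One possible way to do that clean-up, as a minimal sketch: split each balance cell into a numeric amount and a Dr/Cr side. The helper name `split_balance` and the regex are illustrative assumptions, and cells that do not look like a balance are returned as `(None, None)` so they can be checked by hand.

```python
import re

# Amount, optional whitespace, optional "Dr."/"Cr." suffix.
BALANCE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*([CD]r)?\.?\s*$')


def split_balance(value):
    """Split a cell such as '1684108498Dr.' into (1684108498.0, 'Dr')."""
    match = BALANCE_RE.match(value or '')
    if match is None:
        return None, None  # leave unparseable cells for manual review
    amount, side = match.groups()
    return float(amount), side


for cell in ['1684108498Dr.', '881714209.1Dr.', '25253600 Cr.', '24274200']:
    print(cell, '->', split_balance(cell))
```

With the DataFrame above, something like `df[6].map(split_balance)` should apply it to the last column.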

88arvin Jun 20, 2023
Author

Do you mean I have to split these kinds of pages into a separate PDF?
Couldn't we run your previous code in a try block, fall back to this code in the except block, and then merge both results?

88arvin Jun 20, 2023
Author

columns = 'Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance'

borders = dict.fromkeys(columns, 'x0')
borders['Vr.Type'] = 'x1'

data = []
with pdfplumber.open(pdf) as pdf:
    for page in pdf.pages:
        try:
            name = page.extract_text_lines(return_chars=False)[1]['text']
            first_col = page.search(columns[0], regex=False)[0]
            bbox = first_col['x0'], first_col['top'], *page.bbox[-2:]
            crop = page.crop(bbox)
            vlines = [crop.search(column, regex=False)[0][position] for column, position in borders.items()] + [ max(page.search('Dr.',regex=False), key=lambda word: word['x1'])['x1'] ]
            table = crop.extract_table(dict(explicit_vertical_lines=vlines, horizontal_strategy='text'))
        except Exception:  # header lookup failed: fall back to splitting plain text
            text = page.extract_text(layout=True, keep_blank_chars=True, x_density=1)
            table = [re.split(r'\s{3,}', line.strip()) for line in text.splitlines() if line.strip()]
        data.extend(table)
    
    df = pd.DataFrame(data)

This runs, but the result is not what I want. It creates 9 columns, so when I pass columns I get ValueError: 7 columns passed, passed data had 9 columns

cmdlineluser Jun 20, 2023

It works on sample.3.pdf without error.

You can try to locate which page is creating 9 columns and see what's different about it.

You could check len(pd.DataFrame(table).columns) inside the for loop for example.
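
A pandas-free variant of the same check, as a sketch: a DataFrame built from a list of rows has as many columns as its longest row, so counting cells per row gives the same number. The helper names here are made up for illustration, and `tables` is assumed to hold one extracted table per page, in page order.

```python
def widest_row(table):
    """Number of cells in the longest row of a table (a list of lists)."""
    return max((len(row) for row in table), default=0)


def pages_with_unexpected_width(tables, expected=7):
    """Return (page_number, width) for every table whose width differs."""
    report = []
    for page_number, table in enumerate(tables, start=1):
        width = widest_row(table)
        if width != expected:
            report.append((page_number, width))
    return report


# Toy data: page 1 is fine, page 2 has a 9-cell row, page 3 is empty.
tables = [[['a'] * 7, ['b'] * 7], [['a'] * 9], []]
print(pages_with_unexpected_width(tables))
```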

88arvin · 2023-06-20T15:01:18Z

Author

Yes, the code works fine on sample 3. You will see the difference in this file.
Also, how can I add the name column to the DataFrame?
sample4.pdf

I'm really grateful to you! Your help is greatly appreciated. You're truly amazing!

2 replies

cmdlineluser Jun 20, 2023

There are so many variations in this PDF which make it really awkward.

Perhaps it makes sense to use some of the content itself to help find the columns.

Here we try to isolate the date/type columns and the balance columns and draw lines either side of them.

So for example:

min(page.search(rf'{DR}{CR}{BALANCE}'), key=itemgetter('x0'))

min() on the x0 value means we want the "left most" match on the page. (max() on x1 = "right most")
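
To see that selection in isolation, here is a toy example with hand-written dicts standing in for the results of page.search(). The coordinates are invented purely for illustration.

```python
from operator import itemgetter

# Invented stand-ins for the match dicts returned by page.search().
matches = [
    {"text": "24274200",     "x0": 410.0, "x1": 455.0},
    {"text": "979400",       "x0": 421.5, "x1": 455.0},
    {"text": "25253600 Cr.", "x0": 480.0, "x1": 540.2},
]

leftmost = min(matches, key=itemgetter("x0"))   # smallest x0 = left-most match
rightmost = max(matches, key=itemgetter("x1"))  # largest x1 = right-most match

print(leftmost["x0"], rightmost["x1"])
```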

import pdfplumber
from operator import itemgetter

pdf = pdfplumber.open('sample4.pdf')

page = pdf.pages[1]

DATE    = r'(?im)^\d{2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-\d{2}'
DR      = r'\s+\d+(?:[.]\d+)?'
CR      = r'\s+\d+(?:[.]\d+)?'
BALANCE = r'\s+\d+(?:[.]\d+)?\s*[CD]r[.]\n'
VR_NO   = r'\d+\s*[A-Z]+'
VR_TYPE = r'[A-Z]'

positions = [
   min(page.search(rf'{DR}{CR}{BALANCE}'),                key=itemgetter('x0')),
   min(page.search(rf'{DR}(?={CR}{BALANCE})'),            key=itemgetter('x0')),
   min(page.search(rf'{BALANCE}'),                        key=itemgetter('x0')),
   max(page.search(rf'{DATE}\s+{VR_NO}\s+{VR_TYPE}'),     key=itemgetter('x1')),
   max(page.search(rf'{DATE}(?=\s+{VR_NO}\s+{VR_TYPE})'), key=itemgetter('x1')),
   max(page.search(rf'{DATE}\s+{VR_NO}(?=\s+{VR_TYPE})'), key=itemgetter('x1')),
]

vlines = set().union(*[[pos['x0'], pos['x1']] for pos in positions])

im = page.to_image(300)

im.draw_vlines(vlines, stroke_width=3)

im.save('lines.png')

We only need to find the date/type/balance columns, as those lines automatically leave the Particulars column in between.

So if there are no headers, you can fall back to using these vlines, e.g.

table = page.extract_table(dict(
   explicit_vertical_lines = vlines, 
   horizontal_strategy = 'text'
))

You could also try using the dates for the horizontal lines instead of text if you wanted to:

# don't forget to add the bottom of the last line otherwise we miss a row
dates  = page.search(DATE)
hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ]

It seems to work for the samples you've provided from a quick test.
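
As a small illustration of why DATE only picks up the first column: the (?im) flags make the match case-insensitive and let ^ anchor at the start of every line, so dates embedded in Particulars are skipped. The snippet below uses plain re on made-up text modelled on the rows above, on the assumption that page.search() applies the pattern to the page text in much the same way.

```python
import re

DATE = r'(?im)^\d{2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-\d{2}'

# Made-up page text, loosely modelled on rows shown earlier in the thread.
text = (
    "2709\n"
    "21-Dec-14   53SE  I  $ SLML/GJ/53/2014-15 B.L No - 594/22-DEC-14\n"
    "28-Dec-14   54SE  I  $ SLML/GJ/54/2014-15 B.L No - 605/29-DEC-14\n"
)

# Only dates at the start of a line match; '22-DEC-14' mid-line does not.
row_dates = [m.group(0) for m in re.finditer(DATE, text)]
print(row_dates)
```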

cmdlineluser Jun 21, 2023

It probably makes sense to put each approach in its own function.

So you could do something like this:

import pdfplumber
import pandas as pd
from   operator import itemgetter

def extract_table_by_regex(page):
    DATE = r'(?im)^\d{2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-\d{2}'
    DR = r'\s+\d+(?:[.]\d+)?'
    CR = r'\s+\d+(?:[.]\d+)?'
    BALANCE = r'\s+\d+(?:[.]\d+)?\s*[CD]r[.]\n'
    VR_NO = r'\d+\s*[A-Z]+'
    VR_TYPE = r'[A-Z]'

    positions = [
       min(page.search(rf'{DR}{CR}{BALANCE}'),                key=itemgetter('x0')),
       min(page.search(rf'{DR}(?={CR}{BALANCE})'),            key=itemgetter('x0')),
       min(page.search(rf'{BALANCE}'),                        key=itemgetter('x0')),
       max(page.search(rf'{DATE}\s+{VR_NO}\s+{VR_TYPE}'),     key=itemgetter('x1')),
       max(page.search(rf'{DATE}(?=\s+{VR_NO}\s+{VR_TYPE})'), key=itemgetter('x1')),
       max(page.search(rf'{DATE}\s+{VR_NO}(?=\s+{VR_TYPE})'), key=itemgetter('x1')),
    ]

    vlines = set().union(*[[pos['x0'], pos['x1']] for pos in positions])
    dates  = page.search(DATE) # also add last bottom line
    hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ] 

    table = page.extract_table(dict(
       explicit_vertical_lines   = vlines,
       explicit_horizontal_lines = hlines,
    ))

    return table

def extract_table_by_column_names(page):
    positions = dict.fromkeys(columns, 'x0')
    positions['Vr.Type'] = 'x1'

    first_col = page.search(columns[0], regex=False)[0]
    
    name = page.extract_text_lines(return_chars=False)[1]['text']

    bbox = first_col['x0'], first_col['top'], *page.bbox[-2:]
    crop = page.crop(bbox)

    # right-most Dr. on page (used to try match end of last column)
    right = max(page.search('Dr.',regex=False), key=itemgetter('x1'))['x1']

    vlines = [
       crop.search(column, regex=False)[0][position] 
       for column, position in positions.items()
    ] + [ right ]

    table = crop.extract_table(dict(
        explicit_vertical_lines = vlines, 
        horizontal_strategy     = 'text',
    ))

    return name, table[1:] # skip header


columns = ['Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance']

pdf = 'Downloads/sample4.pdf'
name = None
data = []

with pdfplumber.open(pdf) as pdf:
    for page in pdf.pages:
        try:
            name, table = extract_table_by_column_names(page)
        except Exception:  # column-name approach failed: fall back to the regex approach
            table = extract_table_by_regex(page)
        data.extend([name] + row for row in table)

df = pd.DataFrame(data, columns=['Name'] + columns)

Some of the balance columns don't have a space before Cr. but that's easy to fix up.
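
One way that fix-up might look, as a sketch: insert a single space between the amount and a trailing Cr./Dr. when it is missing. The helper name `normalise_balance` is an assumption for illustration, and empty or None cells are passed through untouched.

```python
import re


def normalise_balance(value):
    """Ensure exactly one space between the amount and a trailing Cr./Dr."""
    if not value:
        return value  # keep None and empty strings as they are
    return re.sub(r'(\d)\s*([CD]r\.)$', r'\1 \2', value)


for cell in ['32924745Cr.', '2199802 Cr.', '0 Dr.', '24274200']:
    print(repr(cell), '->', repr(normalise_balance(cell)))
```

Applied to the DataFrame, this could be something like `df['Balance'] = df['Balance'].map(normalise_balance)`.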

    Name    Vr.Date  Vr.No Vr.Type                                        Particulars   Dr.Amt  Cr.Amt      Balance
0   None  08-Dec-14   31BP       R  Chq. No. 078265 AGST BILL NO-4718,4736,4717,47...  1760706       0  32924745Cr.
1   None  08-Dec-14   35BP       R    Chq. No. 078269 AGST BILL NO-4760,DT-27.11.2014   624234       0  32300511Cr.
2   None  09-Dec-14   18PV       D   P.B.No:-4897,BDT:-05-DEC-14,Truck No:-HR63A/1922        0  356310  32656821Cr.
3   None  09-Dec-14   18PV       D    P.B.No:-4897,BDT:-05-DEC-14, Weight Short D.N ,     1110       0  32655711Cr.
4   None  09-Dec-14   23BP       R  Chq. No. 078305 AGST BILL NO-4765,4788,4764,DT...  1248022       0  31407689Cr.
..   ...        ...    ...     ...                                                ...      ...     ...          ...
90  None  28-Nov-14  45 BP       R  Chq. No. 078049 agst bill no :- 2302,2298 dt-1...   993127       0  2199802 Cr.
91  None  05-Dec-14  12 BP       R  Chq. No. 078196 agst bill no-2319,2329,2328,,d...  1105368       0  1094434 Cr.
92  None  05-Dec-14  16 BP       R  Chq. No. 078200 agst bill no-2334,2335,,dt-23....  1028148       0    66286 Cr.
93  None  22-Dec-14  11 BP       R  Chq. No. 078464 BILL NO- 2319,2329,2328, CD RE...    34344       0    31942 Cr.
94  None  27-Dec-14  22 BP       R  Chq. No. 078585 bill no- 2334,2335, date 23.11...    31942       0        0 Dr.

[95 rows x 8 columns]

88arvin · 2023-06-21T07:37:57Z

Author

Yes, this PDF is really complicated. Now I am getting a new error.

37.pdf

14 replies

88arvin Jun 24, 2023
Author

Your point is valid. There will also be a problem if there are values in the Balance columns that don't have Dr. or Cr. at the end.
ValueError: min() arg is an empty sequence

cmdlineluser Jun 24, 2023

Indeed. In fact, none of the current regexes match anything on page 1 of min.pdf.

This is why using the column names was the initial approach as it's much less prone to error.
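
For reference, that ValueError comes from calling min() on the empty list that page.search() returns when nothing matches. A hedged sketch of one way to make this explicit is the default argument of min(), which lets the caller decide what to do instead of crashing. The helper name `leftmost` is made up for illustration.

```python
from operator import itemgetter


def leftmost(matches):
    """Left-most match by x0, or None when there are no matches."""
    return min(matches, key=itemgetter("x0"), default=None)


print(leftmost([]))                          # no matches, no exception
print(leftmost([{"x0": 5.0}, {"x0": 2.0}]))  # the dict with the smallest x0
```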

This is the latest version of the code I've tested and it seems to work on all the previous PDF samples you've provided:

import pdfplumber
import pandas as pd
from   operator import itemgetter

DATE = r'(?im)^\d{2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-\d{2}'
DR = r'\s+\d+(?:[.]\d+)?'
CR = r'\s+\d+(?:[.]\d+)?'
BALANCE = r'\s+\d+(?:[.]\d+)?\s*[CD]r[.]\s*\n?'
VR_NO = r'\d+\s*[A-Z]+'
VR_TYPE = r'[A-Z]'

CR_OR_DR = r'[CD]r.(?!Amt)\n?'
NUMBER = r'\d'

def extract_table_by_regex(page):
    positions = [
       min(page.search(rf'{DR}{CR}{BALANCE}'),                key=itemgetter('x0')),
       min(page.search(rf'{DR}(?={CR}{BALANCE})'),            key=itemgetter('x0')),
       min(page.search(rf'{BALANCE}'),                        key=itemgetter('x0')),
       min(page.search(rf'{DATE}\s+{VR_NO}\s+{VR_TYPE}'),     key=itemgetter('x0')),
       min(page.search(rf'{DATE}(?=\s+{VR_NO}\s+{VR_TYPE})'), key=itemgetter('x0')),
       min(page.search(rf'{DATE}\s+{VR_NO}(?=\s+{VR_TYPE})'), key=itemgetter('x0')),
    ]

    vlines = set().union(*[[pos['x0'], pos['x1']] for pos in positions])
    dates  = page.search(DATE) # also add last bottom line
    hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ]

    table = page.extract_table(dict(
       explicit_vertical_lines = vlines,
       explicit_horizontal_lines = hlines,
       intersection_tolerance = 1
    ))

    return table

def extract_table_by_column_names(page):
    positions = dict.fromkeys(columns, 'x0')
    positions['Vr.Type'] = 'x1'

    first_col = page.search(columns[0], regex=False)[0]
    
    name = page.extract_text_lines(return_chars=False)[1]['text']

    bbox = first_col['x0'], first_col['top'], *page.bbox[-2:]
    crop = page.crop(bbox)

    # Look for Cr. Dr. in balance column, else look for right-most number
    right = page.search(CR_OR_DR) or page.search(NUMBER)
    right = max(right, key=itemgetter('x1'))['x1']

    vlines = [
       crop.search(column, regex=False)[0][position] 
       for column, position in positions.items()
    ] + [ right ]

    dates  = page.search(DATE) # also add last bottom line
    hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ]

    table = crop.extract_table(dict(
       explicit_vertical_lines = vlines, 
       explicit_horizontal_lines = hlines,
       intersection_tolerance = 1
    ))

    return name, table


columns = ['Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance']

pdf = 'Downloads/min.pdf'
name = None
data = []

with pdfplumber.open(pdf) as pdf:
    for page in pdf.pages:
        try:
           name, table = extract_table_by_column_names(page)
        except Exception:  # column-name approach failed: fall back to the regex approach
           table = extract_table_by_regex(page)
        data.extend([name] + row for row in table)

df = pd.DataFrame(data, columns=['Name'] + columns)

88arvin Jun 26, 2023
Author

It works perfectly, but it stopped at this file:

1-2.pdf

cmdlineluser Jun 26, 2023

It probably makes sense to crop in the regex function too; it looks like some blank lines above/below are being picked up and adding None values.

import pdfplumber
import pandas as pd
from   operator import itemgetter

DATE = r'(?im)^\d{2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-\d{2}'
DR = r'\s+\d+(?:[.]\d+)?'
CR = r'\s+\d+(?:[.]\d+)?'
BALANCE = r'\s+\d+(?:[.]\d+)?\s*[CD]r[.]\s*\n?'
VR_NO = r'\d+\s*[A-Z]+'
VR_TYPE = r'[A-Z]'

CR_OR_DR = r'[CD]r.(?!Amt)\n?'
NUMBER = r'\d'

def extract_table_by_regex(page):
    positions = [
        min(page.search(rf'{DR}{CR}{BALANCE}'),                key=itemgetter('x0')),
        min(page.search(rf'{DR}(?={CR}{BALANCE})'),            key=itemgetter('x0')),
        min(page.search(rf'{BALANCE}'),                        key=itemgetter('x0')),
        min(page.search(rf'{DATE}\s+{VR_NO}\s+{VR_TYPE}'),     key=itemgetter('x0')),
        min(page.search(rf'{DATE}(?=\s+{VR_NO}\s+{VR_TYPE})'), key=itemgetter('x0')),
        min(page.search(rf'{DATE}\s+{VR_NO}(?=\s+{VR_TYPE})'), key=itemgetter('x0')),
    ]

    dates = page.search(DATE) 

    vlines = set().union(*[[pos['x0'], pos['x1']] for pos in positions])
    hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ]

    bbox = min(vlines), dates[0]['top'], max(vlines), dates[-1]['bottom']
    crop = page.crop(bbox)

    table = crop.extract_table(dict(
       explicit_vertical_lines   = vlines,
       explicit_horizontal_lines = hlines,
       intersection_tolerance = 1
    ))

    return table

def extract_table_by_column_names(page):
    positions = dict.fromkeys(columns, 'x0')
    positions['Vr.Type'] = 'x1'

    dates = page.search(DATE) 

    first_col = page.search(columns[0], regex=False)[0]

    name = page.extract_text_lines(return_chars=False)[1]['text']

    # Look for Cr. Dr. in balance column, else look for right-most number
    right = page.search(CR_OR_DR) or page.search(NUMBER)
    right = max(right, key=itemgetter('x1'))['x1']

    bbox = first_col['x0'], first_col['top'], right, dates[-1]['bottom']
    crop = page.crop(bbox)

    vlines = [
       crop.search(column, regex=False)[0][position] 
       for column, position in positions.items()
    ] + [ right ]

    hlines = [ date['top'] for date in dates ] + [ dates[-1]['bottom'] ]

    table = crop.extract_table(dict(
       explicit_vertical_lines = vlines, 
       explicit_horizontal_lines = hlines,
       intersection_tolerance = 1
    ))

    return name, table


columns = [
   'Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance'
]

pdf = 'Downloads/1-2.pdf'
name = None
data = []

with pdfplumber.open(pdf) as pdf:
    for page in pdf.pages:
        try:
           name, table = extract_table_by_column_names(page)
        except Exception:  # column-name approach failed: fall back to the regex approach
           table = extract_table_by_regex(page)
        data.extend([name] + row for row in table)

df = pd.DataFrame(data, columns=['Name'] + columns)
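
As an optional complement to cropping, rows that still come back entirely empty could be filtered out before building the DataFrame. This is only a sketch: the helper name `drop_empty_rows` is an assumption, and it treats a row as empty when every cell is None or whitespace.

```python
def drop_empty_rows(table):
    """Remove rows whose cells are all None or blank strings."""
    return [
        row for row in table
        if any(cell is not None and cell.strip() for cell in row)
    ]


table = [[None, None], ['08-Dec-14', None], ['', ' ']]
print(drop_empty_rows(table))
```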

88arvin Jul 4, 2023
Author

I appreciate your support and assistance.
It is working pretty well, although certain areas still require manual fixes due to a PDF bug.
Thank you for your efforts.
