extraction without defining vertical lines #907
Replies: 5 comments 31 replies
-
Hi @88arvin, I appreciate your interest in the library. The PDF you provided appears to be a scanned PDF, so I am unable to analyse it properly. I assume this is because you redacted sensitive information from it, and that you still have access to the original text PDF. Have you considered using the |
-
I have tried |
-
If you can use the column names, you could draw lines at the start of each column, apart from Vr.Type, where the code below uses the end of the name (x1) instead:

    import pandas as pd
    import pdfplumber

    columns = 'Vr.Date', 'Vr.No', 'Vr.Type', 'Particulars', 'Dr.Amt', 'Cr.Amt', 'Balance'
    # x0 = start, x1 = end
    borders = dict.fromkeys(columns, 'x0')
    borders['Vr.Type'] = 'x1'

    rows = []
    with pdfplumber.open('sample.pdf') as pdf:
        for page in pdf.pages:
            # x-coordinate of each column name, plus a final line near the right page edge
            vlines = [
                page.search(column, regex=False)[0][position]
                for column, position in borders.items()
            ] + [page.bbox[-2] - 70]
            table = page.extract_table(
                dict(explicit_vertical_lines=vlines, horizontal_strategy='text')
            )
            # skip blank line and column names
            rows.extend(table[3:])

    # drop any rows with empty `Vr.Date`
    df = (
        pd.DataFrame(rows, columns=columns)
        .mask(lambda df: df['Vr.Date'] == '')
        .dropna(subset='Vr.Date')
    )

Result:
|
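To make the line-placement step easier to follow, here is a minimal sketch of the same idea in plain Python, with no PDF involved. The helper name `vertical_lines` and the coordinates are made up for illustration; the dictionaries only imitate the `x0`/`x1` keys found in the matches returned by `page.search`. The sketch also raises a descriptive error when a column name is missing, since indexing an empty result with `[0]` would otherwise fail with an `IndexError`.

```python
# Hypothetical helper (not part of pdfplumber): turn header matches into
# x-coordinates suitable for `explicit_vertical_lines`.
def vertical_lines(hits, borders, right_edge):
    lines = []
    for column, side in borders.items():
        matches = hits.get(column, [])
        if not matches:
            # A column name that is absent from the page would otherwise
            # surface as an IndexError on `[0]`.
            raise ValueError(f"column header {column!r} not found on this page")
        # Use the first match and pick its left (x0) or right (x1) edge.
        lines.append(matches[0][side])
    # Close the last column with a line near the right side of the page.
    lines.append(right_edge)
    return lines


# Made-up coordinates standing in for the results of page.search(...).
hits = {
    "Vr.Date": [{"x0": 20.0, "x1": 55.0}],
    "Vr.No": [{"x0": 80.0, "x1": 105.0}],
    "Vr.Type": [{"x0": 130.0, "x1": 165.0}],
}
borders = {"Vr.Date": "x0", "Vr.No": "x0", "Vr.Type": "x1"}

vlines = vertical_lines(hits, borders, right_edge=525.0)
print(vlines)  # [20.0, 80.0, 165.0, 525.0]
```

Because the positions are looked up on every page, the lines follow the headers even when the columns shift from page to page.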
-
Yes. The code works fine on
|
-
Yes, this PDF is really complicated. Now I am getting a new error. |
-
I have attached a sample PDF that I want to convert into a pandas DataFrame. I know this can be done by explicitly defining the vertical lines. However, I would like to know whether there is another way to do it, because the same columns are positioned differently on each page.
sample.pdf