[BUG] Extracted text from table is reversed when text is styled with an underline #557

Closed
lavens opened this issue Jun 12, 2024 · 5 comments


lavens commented Jun 12, 2024

Description

When I extract text from a PDF that contains a table whose cell content is underlined, the lines of text within each cell come out in reverse order. Once the underline formatting is removed, the text is extracted in the expected order.

Expected Behaviour

I am able to extract text from a table with the order of text preserved regardless of the formatting applied.

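Actual Behaviour

When the cell text is underlined, the lines of text within each cell are extracted in reverse order.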

Steps to reproduce the behaviour:

  1. Construct a PdfReader from the attached pdf
  2. Get the first page from the reader and construct an extractor with it
  3. Output the result of ExtractText()

Attachments

Without Underline Formatting.pdf
With Underline Formatting.pdf

	pdfReader, f, err := model.NewPdfReaderFromFile(filePath, nil)
	if err != nil {
		return "", fmt.Errorf("failed to create pdf reader: %w", err)
	}
	defer f.Close()

	// Get the first page (page numbers are 1-based).
	page, err := pdfReader.GetPage(1)
	if err != nil {
		return "", fmt.Errorf("failed to get first page: %w", err)
	}

	ex, err := extractor.New(page)
	if err != nil {
		return "", fmt.Errorf("failed to create extractor: %w", err)
	}

	pageText, err := ex.ExtractText()
	if err != nil {
		return "", fmt.Errorf("failed to extract text: %w", err)
	}
	fmt.Printf("Extracted text: %s\n", pageText)

Output with underline formatting:

EXHIBIT A

SCHEDULE OF PURCHASERS

Initial Closing Date: March 30, 2023

Purchaser Investment Amount Preferred Shares
No. of Series Seed
[email protected]
Investor Fund 1, L.P. $2,000,000 571,428
[email protected]
Investor Fund 2, L.P. $1,000,000 285,714
[email protected]
Investor Fund 2-B, L.P. $200,000 57,142
[email protected]
Private Investor 1 $75,000 21,428
[email protected]
Private Investor 2 $100,000 28,571
[email protected]
Private Investor 3 $95,000 27,142
[email protected]
Private Investor 4 $95,000 27,142
[email protected]
Private Investor 5 $200,000 57,142
[email protected]
Private Investor 6 $50,000 14,285
[email protected]
Private Investor 7 $75,000 21,428
[email protected]
Private Investor 8 $110,000 31,428

TOTALS: $4,000,000 1,142,850

Output without underline formatting:

EXHIBIT A

SCHEDULE OF PURCHASERS

Initial Closing Date: March 30, 2023

Purchaser Investment Amount No. of Series Seed Preferred
Shares
Investor Fund 1, L.P.
[email protected] $2,000,000 571,428
Investor Fund 2, L.P.
[email protected] $1,000,000 285,714
Investor Fund 2-B, L.P.
[email protected] $200,000 57,142
Private Investor 1
[email protected] $75,000 21,428
Private Investor 2
[email protected] $100,000 28,571
Private Investor 3
[email protected] $95,000 27,142
Private Investor 4
[email protected] $95,000 27,142
Private Investor 5
[email protected] $200,000 57,142
Private Investor 6
[email protected] $50,000 14,285
Private Investor 7
[email protected] $75,000 21,428
Private Investor 8
[email protected] $110,000 31,428

TOTALS: $4,000,000 1,142,850
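
For anyone who wants to turn the two outputs above into a quick regression check, here is a minimal sketch. It only compares the position of two fragments in the extracted text; `appearsBefore` is a hypothetical helper written for illustration and is not part of UniPDF. The sample strings are the table header lines copied from the outputs above.

```go
package main

import (
	"fmt"
	"strings"
)

// appearsBefore is a hypothetical helper (not a UniPDF API). It reports
// whether first occurs in text before second, and returns false if either
// fragment is missing.
func appearsBefore(text, first, second string) bool {
	i := strings.Index(text, first)
	j := strings.Index(text, second)
	return i >= 0 && j >= 0 && i < j
}

func main() {
	// Header lines copied from the two extraction outputs in this issue.
	withUnderline := "Purchaser Investment Amount Preferred Shares\nNo. of Series Seed\n"
	withoutUnderline := "Purchaser Investment Amount No. of Series Seed Preferred\nShares\n"

	// In the correct order, "No. of Series Seed" comes before "Preferred".
	fmt.Println(appearsBefore(withUnderline, "No. of Series Seed", "Preferred"))    // false
	fmt.Println(appearsBefore(withoutUnderline, "No. of Series Seed", "Preferred")) // true
}
```

In a real test, the text passed to the helper would be the result of ExtractText() on the attached PDF rather than a hard-coded string.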

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

@kellemNegasi

Hi @lavens, thank you for reporting this issue and including the sample file and code. We were able to reproduce it and have created a ticket to look into it. We will post an update as soon as we figure this out.


lavens commented Jul 19, 2024

Adding an update to this: we found that the problem is not exclusive to cells with underlined text. It also affects cells where a double newline separates blocks of text.

I've included an example PDF with two table rows. The first row contains a cell with a double newline, and the second row does not. The extracted text for the cell with a double newline in the first row is reversed in the same way as in the underlined-text case described above.

Table with double newline cell.pdf

EXHIBIT A

INITIAL STOCK ISSUANCE TABLE

Name | Shares and Price | Amount and Form of Consideration | Vesting Schedule
Michelle Smith | 5,500,000 shares of Voting Common Stock at $0.0001 per share | Cash as described in the form of Common Stock Purchase Agreement as previously presented to the board having a value of at least $550.00 | In the event of a Change of Control, 100% of the Vesting Shares shall vest as described in the Common Stock Purchase Agreement. 100% of the Common Shares are subject to vesting (the “Vesting Shares”). 25% of the Vesting Shares shall vest on July 1, 2021 and 1/48th of the Vesting Shares shall vest monthly thereafter.
Derek Saunders | 4,500,000 shares of Voting Common Stock at $0.0001 per share | Cash as described in the form of Common Stock Purchase Agreement as previously presented to the board having a value of at least $450.00 | 100% of the Common Shares are subject to vesting (the “Vesting Shares”). 25% of the Vesting Shares shall vest on July 1, 2021 and 1/48th of the Vesting Shares shall vest monthly thereafter. In the event of a Change of Control, 100% of the Vesting Shares shall vest as described in the Common Stock Purchase Agreement.

cc @kellemNegasi

@kellemNegasi

Hi @lavens, thank you for the update and the additional information. A fix for the previous case has been merged, and it also fixes this case. We will post an update on this ticket as soon as it is released so that you can try it out.

@kellemNegasi

A fix for this issue has been released.
Hi @lavens, can you try out the new version of UniPDF (v3.61.0)?
