Very long titles when converting to markdown #158

Open
Fianax opened this issue Sep 30, 2024 · 8 comments
Labels
bug Something isn't working fix developed

Comments

@Fianax
Fianax commented Sep 30, 2024

I have PDFs with titles that span two or more lines, and when the PDF is converted to Markdown, the titles are cut off (because the PDF itself breaks them across lines).

I attach the original pdf and the generated markdown file:

prueba_indices_enormes.pdf
prueba_indices_enormes_new_markdown.md

The content of the PDF is made up; what matters is the result it gives for the section titles.

You can see that, when a title is very long and the PDF itself splits it across several lines, a small gap is inserted and the new line has no '#' to indicate that it is part of the section title.

Is this normal, or is it an error in the Markdown conversion?


I'm using pdf4llm version 0.0.9:

import pdf4llm

# Convert the attached test PDF to Markdown, with no page margins ignored
md_text = pdf4llm.to_markdown(
    doc='temp/prueba_indices_enormes.pdf',
    margins=0,
)
JorjMcKie added the "bug" (Something isn't working) and "fix developed" labels on Sep 30, 2024
@JorjMcKie
Contributor

The current logic already detects when multiple lines with the same header-level font size follow each other.

But it does not yet always remove all line breaks when joining the header text fragments.
The fix now ensures this.
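
The joining step described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual pymupdf4llm code: the `Line` class, `header_prefix`, `lines_to_markdown`, and the `header_sizes` list are hypothetical names invented for this example, and real text extraction has to deal with spans, fonts, and layout that are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line (hypothetical structure for this sketch)."""
    text: str
    font_size: float


def header_prefix(font_size, header_sizes):
    """Return '# ', '## ', ... for a header font size, or '' for body text.

    header_sizes is assumed to be sorted from largest to smallest, so the
    first entry maps to a level-1 header.
    """
    for level, size in enumerate(header_sizes, start=1):
        if font_size == size:
            return "#" * level + " "
    return ""


def lines_to_markdown(lines, header_sizes):
    """Emit Markdown, merging consecutive lines of the same header level."""
    out = []
    prev_prefix = None
    for line in lines:
        prefix = header_prefix(line.font_size, header_sizes)
        # Collapse every run of whitespace, including embedded line breaks.
        text = " ".join(line.text.split())
        if prefix and prefix == prev_prefix:
            # Same header level as the previous line: continue that header
            # instead of starting a new, unprefixed line.
            out[-1] = out[-1] + " " + text
        else:
            out.append(prefix + text)
        prev_prefix = prefix
    return "\n\n".join(out)


lines = [
    Line("A very long section title that the PDF", 18.0),
    Line("wraps onto a second line\n", 18.0),
    Line("Body text.", 11.0),
]
print(lines_to_markdown(lines, header_sizes=[18.0, 14.0]))
```

With this approach, a title wrapped over two lines ends up as a single `# ...` line, which is the behavior the reporter expected.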

@JorjMcKie
Contributor

Thanks for reporting this.
This bug was present in your package version. In the future please make sure to confirm bugs with the current version.
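
One way to see which version is installed before reporting is sketched below, using only the standard library. The helper name `installed_version` is made up for this example; the package names are the two mentioned in this thread. Upgrading with `pip install -U pymupdf4llm` and re-running the failing case is then a quick way to confirm the bug against the current release.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the installed version of both package names used in this thread.
for name in ("pymupdf4llm", "pdf4llm"):
    print(name, installed_version(name) or "not installed")
```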

@Fianax
Author

Fianax commented Sep 30, 2024

The current logic already detects when multiple lines with the same header-level font size follow each other.

But it does not yet always remove all line breaks when joining the header text fragments. The fix now ensures this.

So it's already solved?

Thanks for the quick answer.

@Fianax
Author

Fianax commented Sep 30, 2024

Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.

Is version 0.0.9 not the latest version of pdf4llm?

I thought it was, because the pdf4llm page said it was the latest.

Sorry for the confusion.

@JorjMcKie
Contributor

Ah ok, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.

BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

@JorjMcKie
Contributor

That was a good point of yours though. I will make sure that the versions coincide in the future.

@Fianax
Author

Fianax commented Sep 30, 2024

Ah ok, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.

BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

Okay.

Thank you very much for the help and for explaining 'fix developed'.

I will wait for the fix.

@Fianax
Author

Fianax commented Sep 30, 2024

That was a good point of yours though. I will make sure that the versions coincide in the future.

Thank you for keeping the package “alive”.
