Very long titles when converting to markdown #158

Fianax · 2024-09-30T15:25:16Z

I have PDFs with titles that span two or more lines. When a PDF is converted to Markdown, those titles are cut off at the line break (because the title is split across lines in the PDF).

I attach the original pdf and the generated markdown file:

prueba_indices_enormes.pdf
prueba_indices_enormes_new_markdown.md

The content of the PDF is made up; what matters is the result it gives for the section titles.

You can see that when a section title is very long and the PDF itself splits it across several lines, a small gap is inserted and the new line has no '#' to mark it as part of the section title.

Is this normal, or is it an error in the Markdown conversion?

I'm using pdf4llm==0.0.9.

import pdf4llm

md_text = pdf4llm.to_markdown(
    doc='temp/prueba_indices_enormes.pdf',
    margins=0,
)

JorjMcKie · 2024-09-30T16:17:51Z

The current logic already detects when multiple lines with the same header-level font size follow each other.

But it does not yet always remove all line breaks when joining the header text fragments.
The fix now ensures this.
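
To illustrate the idea described above, here is a minimal sketch of joining consecutive lines that share a header level and stripping the line breaks between them. It is not the pymupdf4llm implementation: the `Line` class, the `lines_to_markdown` function, and the `size_to_level` mapping from font size to header level are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one extracted text line."""
    text: str
    font_size: float


def lines_to_markdown(lines, size_to_level):
    """Render lines as Markdown, merging consecutive same-level header lines.

    size_to_level maps a font size to a header level (1 = '#', 2 = '##', ...).
    Font sizes missing from the mapping are treated as body text (level 0).
    """
    blocks = []  # list of (level, text) tuples
    for line in lines:
        # Collapse embedded line breaks and repeated whitespace.
        text = " ".join(line.text.split())
        if not text:
            continue
        level = size_to_level.get(line.font_size, 0)
        if level > 0 and blocks and blocks[-1][0] == level:
            # Same header level as the previous block: treat as a continuation.
            blocks[-1] = (level, blocks[-1][1] + " " + text)
        else:
            blocks.append((level, text))
    return "\n\n".join(
        ("#" * level + " " + text) if level else text
        for level, text in blocks
    )
```

For example, two lines at a font size mapped to level 2, followed by a body line, would produce a single `##` heading and one paragraph. Note that this simple rule would also merge two genuinely separate headings of the same level if they follow each other directly; a real implementation would likely need extra checks for that case.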

JorjMcKie · 2024-09-30T16:19:37Z

Thanks for reporting this.
This bug was present in your package version. In the future please make sure to confirm bugs with the current version.
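
One way to check which version is installed before reporting a bug is to read it from the package metadata. This is a small sketch using only the Python standard library; the helper name `installed_version` is made up for this example, and the two package names are the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the installed version of each package name mentioned in this issue.
for name in ("pymupdf4llm", "pdf4llm"):
    print(name, installed_version(name) or "not installed")
```

The printed version can then be compared with the latest release listed on PyPI, and `pip install -U pymupdf4llm` upgrades to it if needed.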

Fianax · 2024-09-30T16:36:27Z

> The current logic already detects when multiple lines with the same header-level font size follow each other.
>
> But it does not yet always remove all line breaks when joining the header text fragments. The fix now ensures this.

So it's already solved?

Thanks for the quick answer.

Fianax · 2024-09-30T16:38:12Z

> Thanks for reporting this. This bug was present in your package version. In the future please make sure to confirm bugs with the current version.

Isn't 0.0.9 the latest version of pdf4llm?

I thought it was, because the pdf4llm page said it was the latest.

Sorry for the confusion.

JorjMcKie · 2024-09-30T16:40:18Z

Ah OK, I see. That other repo is just an alias of pymupdf4llm and is therefore automatically current.

BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

JorjMcKie · 2024-09-30T16:42:53Z

That was a good point of yours though. I will make sure that the versions coincide in the future.

Fianax · 2024-09-30T16:43:52Z

> Ah ok, I see. That other repo is just an alias of pymupdf4llm and therefore automatically is current.
>
> BTW "fix developed" means that I have a fix locally. It is not yet published on PyPI.

Okay, thank you very much for the help and for explaining "fix developed".

I will wait for the fix.

Fianax · 2024-09-30T16:44:24Z

> That was a good point of yours though. I will make sure that the versions coincide in the future.

Thanks to you for keeping the package "alive".

JorjMcKie added the "bug" and "fix developed" labels · Sep 30, 2024
