portions of strings getting cut off with "..." #384

BCorbeek · 2022-12-22T16:35:41Z

Hi,
Tika has worked great for me for a while when parsing PDFs, but I realised recently that paragraphs longer than 240 characters or so (including spaces) are getting cut off/truncated. Is there any way to increase the substring size that is output by parser.from_file()?

Here's an example of my output:

5.8 abcd some words here, the sentence ends now
6.1 xyz a few words here, this is also fine
6.2 This paragraph happents to be more than 200 characters long, but gets cut off at around 240 characters. I need all the characters/words to be included – not excluded, so I can run functions on the output. Right now the regular expressions are not  running on the text foll…

The above issue with item 6.2 is what I'm struggling to figure out - I haven't found any way to change the maximum string length that's output.

from tika import parser

parsed = parser.from_file(file)
parsed["content"][:-1]

Adding [:-1] to be explicit doesn't work; I believe that affects the string as a whole, not the substrings.
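For reference, `[:-1]` on a Python string only drops the final character of the whole string and has no effect on individual lines. Below is a minimal sketch for listing which lines of the extracted text look cut off, assuming the truncated paragraphs end in "…" or "..." as in the example above. The helper name `find_truncated_lines` is hypothetical and not part of tika-python.

```python
def find_truncated_lines(content):
    """Return (line_number, length, line) for every line ending in an ellipsis."""
    hits = []
    for number, line in enumerate(content.splitlines(), start=1):
        stripped = line.rstrip()
        # Assumption: truncation shows up as a trailing "…" or "...".
        if stripped.endswith("…") or stripped.endswith("..."):
            hits.append((number, len(stripped), stripped))
    return hits


sample = "6.1 xyz a few words here, this is also fine\n6.2 the text foll…\n"

# Slicing the whole string only removes its last character (here, the newline).
assert sample[:-1] == sample.rstrip("\n")

for number, length, line in find_truncated_lines(sample):
    print(number, length, line)
```

With tika-python, `content` would be `parsed["content"]` from the snippet above. The reported lengths show whether the cut-off really happens at a fixed position.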

Any help would be greatly appreciated!


chrismattmann · 2022-12-31T21:01:50Z

Thanks @BCorbeek, this sounds like an issue in the actual backend server library; it's possible that the text is being truncated there. I am about to release the last 1.24.x branch release for tika-python and then prepare a release that upgrades to using Tika 2.6.x (which will obviously break back compat amongst other things). Can you try asking on [email protected] to see if it's a configuration setting in the Tika server? CC'ing @tballison, who will most likely know.

tballison · 2023-01-03T15:26:42Z

This is puzzling from the backend perspective... I haven't seen this in straight tika-server, and I can't think of how this would happen. There is a default max write length, but that just truncates the string. It doesn't do anything on a per paragraph level and nothing nearly as short as 240 characters.

If you curl the file against tika-server (2.6.0, say), do you get the same behavior?
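A rough Python equivalent of that curl check, as a sketch only: it assumes a tika-server instance is already running locally on the default port 9998, and it uses the server's `PUT /tika` endpoint with `Accept: text/plain`. The helper names `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request


def build_tika_request(pdf_bytes, server="http://localhost:9998"):
    """Build a PUT request for tika-server's /tika endpoint (plain-text output)."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(path, server="http://localhost:9998"):
    """Send the file at `path` to a running tika-server and return its text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text` on the problem PDF bypasses tika-python entirely. If the "…" truncation still appears in the returned text, the Python wrapper is probably not the cause.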

do-me · 2023-05-02T16:43:59Z

Are there any updates on this issue?
@BCorbeek did you investigate any further?

For anyone parsing pdf to text, losing text (particularly without knowing it) is probably the worst that can happen.
In the jungle of finding the right pdf parser - from what I heard - Tika seemed to work best, so it would be a pity if there was some bug cutting off paragraphs.

tballison · 2023-05-02T16:57:59Z

Can't make progress without user input. :(

Can you curl the file against tika-server directly and see if you get the same behavior?

Can you share an example file with me offline? tallison [AT] apache [DOT] org

Is there a chance that what you're seeing is truncated bookmarks (see the Stack Overflow question "content of li tags truncated")? I recently found that some PDFs truncate their own bookmarks.

do-me · 2023-05-02T17:16:37Z

That's something I definitely wouldn't expect!

Just created a test pdf with LibreOffice but everything seems ok:

truncation_test_tika.pdf

Content:

5.8 abcd some words here, the sentence ends now
6.1 xyz a few words here, this is also fine
6.2 This paragraph happents to be more than 200 characters long, but gets cut off at around 240 characters. I need all the characters/words to be included – not excluded, so I can run functions on the output. Right now the regular expressions are not  running on the text foll (just repeating the text) This paragraph happents to be more than 200 characters long, but gets cut off at around 240 characters. I need all the characters/words to be included – not excluded, so I can run functions on the output. Right now the regular expressions are not  running on the text foll (just repeating the text)

parsed = parser.from_file("truncation_test_tika.pdf")
parsed["content"]

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n5.8 abcd some words here, the sentence ends now\n6.1 xyz a few words here, this is also fine\n6.2 This paragraph happents to be more than 200 characters long, but gets cut off at around 240 \ncharacters. I need all the characters/words to be included – not excluded, so I can run functions on \nthe output. Right now the regular expressions are not  running on the text foll (just repeating the \ntext) This paragraph happents to be more than 200 characters long, but gets cut off at around 240 \ncharacters. I need all the characters/words to be included – not excluded, so I can run functions on \nthe output. Right now the regular expressions are not  running on the text foll (just repeating the \ntext) \n\n\n

I am currently running some tests and comparisons between Tika, pdfminer, pdfplumber, PyMuPDF. If I encounter any kind of truncation I'll let you know.

do-me · 2023-05-02T17:22:15Z

On second thought, this test might not be representative, as there is a myriad of export options available. Also, it seems as if LibreOffice always adds line breaks (see "240 \ncharacters" in the output above), so each line is always shorter than 240 characters.

I agree that @BCorbeek's PDF would be really helpful here.
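To check whether a test file exercises long lines at all, one can measure the line lengths in the extracted text. This is a minimal sketch; the helper name `line_lengths` is hypothetical, and the sample string is a shortened excerpt of the output above.

```python
def line_lengths(content):
    """Return the length of every non-empty line in the extracted text."""
    return [len(line) for line in content.splitlines() if line.strip()]


sample = (
    "\n\n6.1 xyz a few words here, this is also fine\n"
    "6.2 This paragraph gets cut off at around 240 \n"
    "characters. I need all the characters/words to be included\n"
)

lengths = line_lengths(sample)
print("lines:", len(lengths), "longest:", max(lengths, default=0))
```

If the longest line stays well below 240 characters, the PDF's hard line breaks keep every line short, and the file cannot reproduce a truncation that only affects long unbroken paragraphs.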

chrismattmann added this to the tika-next milestone Dec 31, 2022

chrismattmann self-assigned this Dec 31, 2022

chrismattmann added the bug, enhancement, help wanted, and question labels Dec 31, 2022
