portions of strings getting cut off with "..." #384
Thanks @BCorbeek, this sounds like an issue in the actual backend server library; it's possible that the text is being truncated there. I am about to release the last 1.24.x branch release for tika-python and then prepare a release that upgrades to Tika 2.6.x (which will obviously break backward compatibility, among other things). Can you try asking on
This is puzzling from the backend perspective... I haven't seen this in plain tika-server, and I can't think of how it would happen. There is a default max write length, but that just truncates the string as a whole. It doesn't do anything at the per-paragraph level, and the limit is nowhere near as short as 240 characters. If you curl the file against tika-server (2.6.0, say), do you get the same behavior?
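For anyone who would rather run that check from Python than from curl, here is a minimal sketch using only the standard library. It assumes a tika-server instance is already running locally on port 9998; the URL and the helper names `build_tika_request` and `extract_text` are illustrative assumptions, not part of tika-python.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request that asks tika-server for plain-text output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("example.pdf")` (a hypothetical file name) bypasses tika-python entirely, so if the "..." truncation still shows up in the result, the client library is probably not the cause.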
Are there any updates on this issue? For anyone parsing PDF to text, losing text (particularly without knowing it) is probably the worst thing that can happen.
Can't make progress without user input. :( Can you curl the file against tika-server directly and see if you get the same behavior? Can you share an example file with me offline? tallison [AT] apache [DOT] org. Is there a chance that what you're seeing is truncated bookmarks, as in the Stack Overflow question "content of li tags truncated"? I recently found that some PDFs truncate their own bookmarks.
That's something I definitely wouldn't expect! I just created a test PDF with LibreOffice, but everything seems OK:
from tika import parser
parsed = parser.from_file("truncation_test_tika.pdf")
parsed["content"]
I am currently running some tests and comparisons between Tika, pdfminer, pdfplumber, and PyMuPDF. If I encounter any kind of truncation, I'll let you know.
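One way to make such a comparison less manual is to diff the word sequences produced by two extractors and list what one of them dropped. The following is only a sketch under that assumption; the helper name `missing_words` is hypothetical, and it takes plain strings, so it does not depend on how each library is called.

```python
import difflib


def missing_words(reference_text, candidate_text):
    """Return runs of words from reference_text that have no match in candidate_text."""
    ref_words = reference_text.split()
    cand_words = candidate_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" = words only in the reference; "replace" = words that differ.
        if tag in ("delete", "replace"):
            missing.append(" ".join(ref_words[i1:i2]))
    return missing
```

Passing the output of one extractor as `reference_text` and the Tika output as `candidate_text` would surface passages that Tika dropped or altered, which is easier to spot than by reading both outputs side by side.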
On second thought, this test might not be representative, as there is a myriad of export options available. Also, it seems as if LibreOffice always adds line breaks ( I agree that @BCorbeek's PDF would be really helpful here.
Hi,
Tika has worked great for me for a while for parsing PDFs, but I realised recently that paragraphs longer than about 240 characters (including spaces) are getting cut off. Is there any way to increase the maximum length of the substrings output by parser.from_file()?
Here's an example of my output:
The above issue with item 6.2 is what I'm struggling to figure out - I haven't found any way to change the maximum string length that's output.
from tika import parser
parsed = parser.from_file(file)  # file: path to the PDF
parsed["content"][:-1]
Adding [:-1] to be explicit doesn't work; I believe it affects the string as a whole, not the substrings.
Any help would be greatly appreciated!
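On the slicing point: `[:-1]` on a Python string returns everything except its final character, so it cannot restore text that is already missing from the parsed content. A small sketch of how one might at least locate the affected lines follows. The helper name `find_truncated_lines` and the sample text are made up for illustration; they are not real Tika output.

```python
def find_truncated_lines(content, markers=("...", "\u2026")):
    """Return (line_number, line) pairs for lines that end with an ellipsis marker."""
    hits = []
    for number, line in enumerate(content.splitlines(), start=1):
        stripped = line.rstrip()
        if stripped.endswith(markers):
            hits.append((number, stripped))
    return hits


# Made-up sample text standing in for parsed["content"].
sample = (
    "6.1 Short clause.\n"
    "6.2 A long clause that stops mid-sent...\n"
    "6.3 Another clause."
)

# Slicing with [:-1] only drops the last character of the whole string.
print(sample[:-1].endswith("Another clause"))
print(find_truncated_lines(sample))
```

Running this over the real `parsed["content"]` would show how many lines end in "..." and where, which may help narrow down whether the cut happens at a fixed length.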