Document Parsers #22
Comments
I have just discovered this plugin and have the same question. I also use DMSF but have had no success getting full-text search to work with Xapian.
The nice things about integrating more directly into Elastic Search are:
I do not think that the integration would be difficult and I would certainly be interested in co-funding a project to do it if others were also interested. Ping me if this also interests you.
I would also be interested in this module being updated to include document content searches - I actually thought it did tbh...
This plugin requires the Mapper Attachments Type for Elasticsearch. "The mapper attachments plugin adds the attachment type to Elasticsearch using Apache Tika." So you can already search by attachment content. However, there is a restriction on the size of attachments; currently it is 1 MB. I would like to add some settings (including the indexed size of attachments) to the plugin configuration, but I have no time for now. See #31.
OK - So I just ran a test where I created a new document in 3 formats - .txt, .docx and .pdf. Each file contained a known string. I uploaded these documents into a project in the 'Documents', 'Files' and 'DMSF' tabs. All files were very small (way less than 1 MB).
When I then searched for the string, it only returned the .txt version - and only within the 'Documents' folder.
Do I need to do some extra config somewhere to search in the DMSF upload location and/or the Files location? (Like others, I'm using DMSF and would like to continue doing so, but had no success with Xapian - we're running on a Windows platform.)
Also, do I need to do some config to search in .docx and .pdf files? And what about other formats?
Thanks - Steve
I've been doing more digging on this. I have set up a fresh Redmine database and added 3 issues to it, each with an attachment containing the same known string (qwerty1). Two are in .txt files, one is in a .doc file. I then reindexed and checked the content of the index itself using the Chrome 'Sense' plugin.
The query I ran was:
POST redmineapp_production/_search?
And the result:
{
So, in all 3 issues the attachment has been set up and the 'file' content set. Yet if I then search for the known string - 'qwerty1' - in issues, Redmine only returns issues 1 and 3, not issue 2. If I try the search before indexing then nothing is found, so I'm happy that it is the ElasticSearch plugin finding the results. I've also confirmed that the 'Hello World' example at https://github.com/elastic/elasticsearch-mapper-attachments works as expected.
So, I believe everything in the index itself is correctly populated, but the actual search is failing to locate the content. Interestingly, when I run this query (again within 'Sense'):
POST redmineapp_production/_search?
I only get the 2 .txt file attachment items returned, which would seem to indicate that while an index entry for the .doc file has been created, the actual search for the known string fails at that level too.
Does anyone have any ideas on this at all? If any more information is needed to help work it out, please do let me know!
Thanks - Steve
Interesting. Testing with https://www.base64decode.org/ shows that "cXdlcnR5MQ==\n" decodes to qwerty1. Does this imply that the base64 encoding for the .doc file content is incorrect? Or am I entirely missing the point somewhere?
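The same check can be done locally instead of through a website. This is a minimal sketch, not code from the plugin: the helper names are hypothetical, and the only value taken from the thread is the encoded string quoted above. It uses Ruby's core `pack("m")` / `unpack1("m")`, which is what the standard library's `Base64.encode64` / `Base64.decode64` wrap. The trailing `\n` in the indexed value is what `pack("m")` produces, so it is not in itself a sign of a broken encoding.

```ruby
# Round-trip check for a base64 value copied out of the index.
# Helper names are hypothetical, for illustration only.

# Decode a base64 string; the lenient "m" directive ignores the trailing newline.
def decode_indexed_value(encoded)
  encoded.unpack1("m")
end

# Encode raw content the way Base64.encode64 would (adds a trailing newline).
def encode_for_index(raw)
  [raw].pack("m")
end

puts decode_indexed_value("cXdlcnR5MQ==\n")   # value quoted in the comment above
puts encode_for_index("qwerty1").inspect
```

If the value stored for the .doc attachment decodes to something other than readable text, the encoding step itself may still be fine; it may simply mean that the raw file bytes were encoded.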
I suspect that what's actually happening here is that the Tika conversion from binary content to something readable is not happening - so effectively it's the binary content which has been base64 encoded. The reason I suspect this is that I also saw similarly peculiar results for .docx and .pdf files, so did some more digging....
def file
and changed this (as a temporary fudge only!) to:
def file
This gives a very rough-and-ready output to screen, during indexing, of the content which is about to be encoded. Sure enough, the .txt file content is just 'qwerty1' but the other formats - .doc, .docx and .pdf - all look like binary content, while a .jpg shows as 'unsupported'. To me this indicates that in general things are working as expected, with the exception of the Tika functionality.
I downloaded tika-app-1.10.jar and ran it on the .doc file using:
java -jar tika-app-1.10.jar F:\BitNami\redmine-2.6.7-1\apps\redmine\htdocs\files\2015\10\151023163538_testdoc1.doc
I get output of:
So this (or maybe a subset of it) is the sort of content I would expect to be passed into the base64 encoding. Perhaps just the content from qwerty1 could reasonably be expected to be placed into the encoding for the file content..?
So, it would appear that Tika is fully capable of converting the content to readable form prior to encoding, but in the indexing process it's just not happening. I've also checked that the plugin is installed and in place (using the Chrome Sense plugin):
{
As you can see, I'm only running the one node and both of the required plugins are in place.
Can anyone please give me some guidance as to what I may be missing here?
Thanks - Steve
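One way to confirm what kind of content ended up in a base64 field is to decode it and look at its leading bytes. The sketch below is an assumption-laden illustration, not plugin code: the helper and constant names are hypothetical, and it only recognises the well-known signatures of the formats mentioned in this thread (OLE2 for legacy .doc, ZIP for .docx, `%PDF` for PDF). Whether binary content at this stage is actually a fault depends on where extraction is meant to happen; the description quoted earlier in the thread ("adds the attachment type to Elasticsearch using Apache Tika") suggests Tika runs inside Elasticsearch, in which case sending the raw file base64-encoded may be the expected input. That is worth checking against the mapper-attachments README.

```ruby
# Classify a base64 value from the index by the leading bytes of its decoded content.
# All names here are hypothetical; only the file signatures are standard.

FILE_SIGNATURES = {
  "OLE2 container (legacy .doc)" => "\xD0\xCF\x11\xE0".b,
  "ZIP container (.docx)"        => "PK\x03\x04".b,
  "PDF"                          => "%PDF".b
}.freeze

def sniff_indexed_value(encoded)
  raw = encoded.unpack1("m") # binary (ASCII-8BIT) string
  FILE_SIGNATURES.each do |label, signature|
    return label if raw.start_with?(signature)
  end
  # Crude text test: valid UTF-8 with no NUL bytes.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? && !text.include?("\x00") ? "looks like text" : "unknown binary"
end

puts sniff_indexed_value("cXdlcnR5MQ==\n") # the .txt value quoted in this thread
```

Running this over the 'file' value of each indexed attachment would show whether the .doc, .docx and .pdf entries hold raw file bytes or already-extracted text.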
Hi Steve,
Looks like you are getting close. I hope to have some time early next week to contribute to this. I'm not an ElasticSearch expert but I do have an interest in getting this working for our documents and PDFs. It does look like Tika is capable of parsing, so I guess the question is whether Tika is running properly within the 'Mapper Attachments Type' and whether that output is actually being stored in Elastic so that it can be found (perhaps it's being misfiled somehow). I'll also see if I can get someone with a bit more Elastic Search experience to lend a hand.
Regards
Thanks Chris - Any help would be very much appreciated; I've already spent way more time on this than I wanted to, and I'm running out of ideas for what to do next; perhaps I'll try to work out a debug session somehow...
@nodecarter - I'd love to hear any thoughts or input you have on this based upon my previous comments - you seem to be 'the main man' on this!
Regards - Steve
OK - I've spent a further few days on this - but still with no good end result. I've grabbed the source for the AttachmentMapper ElasticSearch plugin and built it, then used that to try to work out what's going on. (I've not managed to get a full debugger working unfortunately, so it's a slow job using simple logging.)
If I try a straightforward document parse in AttachmentMapper.java (part of the ElasticSearch mapper-attachments plugin) from within the routine:
File file = new File("test1.doc");
(or the same with a PDF document) then it works fine, returning the content of the document as I'd expect. So, once again this suggests that Tika (which I've also built from source now but cannot directly debug) is working as necessary, and that it's something about the content being passed to it from the Redmine ElasticSearch plugin that's not working as expected.
@carukc - Chris - Are you able to get one of your Elasticsearch bods onto this? I've burned about as much time as I possibly can on this before I have to abandon it and move on to other projects.
Thanks - Steve
Hello, Cool plugin, I look forward to trying it out! This is a question rather than an issue.
Have you considered using the Elastic Search / Tika attachment plugins to parse the contents of documents for inclusion in Search?
I'm currently using dmsf for Documents (I definitely recommend trying it). It uses Xapian to provide some full text search capabilities but I would love to be able to run more complex queries and to restrict searches of documents to the owner of a document, or at the very least to members of a particular project. From what I have read, Elastic itself can be configured to keep things private.
Thanks
Chris