Document Parsers #22

Open
carukc opened this issue Feb 24, 2015 · 11 comments

Comments

@carukc

carukc commented Feb 24, 2015

Hello, Cool plugin, I look forward to trying it out! This is a question rather than an issue.

Have you considered using the Elastic Search / Tika attachment plugins to parse the contents of documents for inclusion in Search?

I'm currently using DMSF for documents (I definitely recommend trying it). It uses Xapian to provide some full-text search capabilities, but I would love to be able to run more complex queries and to restrict document searches to the owner of a document, or at the very least to members of a particular project. From what I have read, Elastic itself can be configured to keep things private.

Thanks
Chris

@espinosa

I have just discovered this plugin and have the same question. I also use DMSF but have had no success getting full-text search to work with Xapian.

@carukc
Author

carukc commented Apr 18, 2015

The nice things about integrating more directly into Elastic Search are:

  • Documents can be private such that search results only include docs that the user has access to
  • Elastic Search supports a broader range of doc types
  • one can incorporate items from other systems in the Elastic Search search results. In other words, several systems can hook into the Elastic Search server.
  • Elastic Search will scale better
  • etc.

I do not think that the integrations would be difficult and I would certainly be interested in co-funding a project to do the integration if others were also interested. Ping me if this also interests you.

@SteveDavis

I would also be interested in this module being updated to include document content searches - I actually thought it did tbh...

@nodecarter
Contributor

This plugin requires the Mapper Attachments Type for Elasticsearch. "The mapper attachments plugin adds the attachment type to Elasticsearch using Apache Tika." So you can already search by attachment content.

But there is a restriction on the size of attachments. Currently it is 1 MB. I would like to add some settings (including the indexed size of attachments) to the plugin configuration, but I have no time for now. See #31.

@SteveDavis

OK - So I just ran a test where I created a new document in 3 formats - .txt, .docx and .pdf. The documents contained a known string in each file. I uploaded these documents into a project in the 'Documents', 'Files' and 'DMSF' tabs. All files were very small (way less than 1MB)

When I then searched for the string, it only returned the .txt file version - and only within the 'Documents' folder.

Do I need to do some extra config somewhere to do searching in the DMSF upload location and/or the Files location? (Like others I'm using DMSF and would like to continue doing so, but had no success with Xapian - we're running on a Windows platform)

Also, do I need to do some config to search in .docx and .pdf files? and what about other formats?

Thanks - Steve

@SteveDavis

I've been doing more digging on this. I have got a fresh Redmine Database and added 3 issues to it, each with an attachment containing the same known string (qwerty1). Two are in .txt files, one is in a .doc file. I have then reindexed and checked the content in the index itself using the Chrome 'Sense' plugin. The query I ran was:

POST redmineapp_production/_search?
{
  "query": { "match": { "_type": "issue" } },
  "_source": ["title", "attachments"]
}

And the result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "1",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.txt",
              "file": "cXdlcnR5MQ==\n",
              "content_type": "text/plain",
              "created_on": "2015-10-23T15:31:43Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "6dbd0fe19c9a301c4708287780df41a2",
              "description": "",
              "filesize": 7,
              "author_id": 1
            }
          ],
          "title": "Bug #1 (New): test issue 1"
        }
      },
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "2",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.doc",
              "file": "0M8R4KGx\n",
              "content_type": "application/msword",
              "created_on": "2015-10-23T15:35:38Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "4bf0320483081d85bfb19d1c74ac08d9",
              "description": "",
              "filesize": 22016,
              "author_id": 1
            }
          ],
          "title": "Bug #2 (New): test issue 2"
        }
      },
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "3",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.txt",
              "file": "cXdlcnR5MQ==\n",
              "content_type": "text/plain",
              "created_on": "2015-10-23T15:43:46Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "6dbd0fe19c9a301c4708287780df41a2",
              "description": "",
              "filesize": 7,
              "author_id": 1
            }
          ],
          "title": "Bug #3 (New): Test Issue 3"
        }
      }
    ]
  }
}

So, in all 3 issues the attachment has been set up and the 'file' content set. Yet if I then look for the known string - 'qwerty1' - in issues, Redmine only returns back issues 1 and 3, not issue 2.

If I try the search before indexing then nothing is found, so I'm happy that it is the ElasticSearch plugin finding the results.

I've also confirmed the 'Hello World' example on this link: https://github.com/elastic/elasticsearch-mapper-attachments works as expected.

So, I believe everything in the index itself is correctly populated, but the actual search is failing to locate the content.
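
One quick way to test that belief is to compare each attachment's reported `filesize` with the length of its base64-decoded `file` value. The following is only a sketch: the hash copies just the relevant fields from the result above, and `truncated_attachments` is a hypothetical helper name, not part of the plugin.

```ruby
require 'base64'

# Relevant fields copied from the search result above.
ATTACHMENTS = [
  { filename: 'testdoc1.txt', file: "cXdlcnR5MQ==\n", filesize: 7 },
  { filename: 'testdoc1.doc', file: "0M8R4KGx\n",     filesize: 22016 }
].freeze

# Hypothetical helper: returns the attachments whose decoded payload is
# shorter than the file size recorded for them.
def truncated_attachments(attachments)
  attachments.select do |a|
    Base64.decode64(a[:file]).bytesize < a[:filesize]
  end
end

truncated_attachments(ATTACHMENTS).each do |a|
  decoded = Base64.decode64(a[:file]).bytesize
  puts "#{a[:filename]}: #{decoded} of #{a[:filesize]} bytes indexed"
end
```

With the values above, only testdoc1.doc is flagged: its `file` field holds 6 bytes while `filesize` says 22016, which suggests the .doc content never reached the index in full.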

Interestingly, when I run this query (again within 'sense'):

POST redmineapp_production/_search?
{
  "query": {
    "bool": {
      "must": [
        { "match": { "_type": "issue" } },
        { "match": { "attachments.file": "qwerty1" } }
      ]
    }
  }
}

I only get the 2 .txt file attachment items returned, which would seem to indicate that while an index entry for the '.doc' file has been created, the actual search for the known string fails at that level too.

Does anyone have any ideas on this at all? If there's any more information that'd be needed to assist in working it out - please do let me know!

Thanks - Steve

@SteveDavis

Interesting. Testing using: https://www.base64decode.org/ shows

"cXdlcnR5MQ==\n" decodes to qwerty1
but when I input "0M8R4KGx\n" it returns nothing

Does this imply that the base64 encoding for the .doc file content is incorrect? Or am I entirely missing the point somewhere?
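
A short Ruby check may help answer this. It is a sketch using only the two strings from the index; the `OLE2_SIGNATURE` constant is added here for comparison and is the well-known 8-byte signature that begins legacy Office files such as .doc.

```ruby
require 'base64'

# Signature that begins legacy Office (OLE2) files such as .doc.
OLE2_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

text    = Base64.decode64("cXdlcnR5MQ==\n")
decoded = Base64.decode64("0M8R4KGx\n")

puts text                                 # the .txt payload as plain text
puts decoded.bytesize                     # number of bytes stored for the .doc
puts decoded.unpack('H*').first           # hex dump of those bytes
puts OLE2_SIGNATURE.start_with?(decoded)  # do they match the signature's start?
```

The .doc string is valid base64, but it decodes to only six non-printable bytes (d0 cf 11 e0 a1 b1), which is why an online decoder appears to show nothing. Those are the first six bytes of the OLE2 signature, so the stored data is the raw start of the file, cut off just before the signature's 0x1A byte.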

@SteveDavis

I suspect that what's actually happening here is the tika conversion from binary content to something readable is not happening - so effectively it's the binary content which has been base64 encoded. The reason I suspect this is because I also saw similarly peculiar results for .docx and .pdf files, so did some more digging....
I located the point at which the base64 encoding is performed (in attachment_serializer.rb):

def file
  content = supported? ? File.read(object.diskfile) : UNSUPPORTED
  Base64.encode64(content)
end

and changed this (as a temporary fudge only!) to:

def file
  content = supported? ? File.read(object.diskfile) : UNSUPPORTED
  puts content
  Base64.encode64(content)
end

Thus allowing a very rough-and-ready output to screen during the indexing of the content which is about to be encoded. Sure enough the .txt file content is just 'qwerty1' but the other formats - .doc, .docx and .pdf - all look like binary content, while a .jpg shows as 'unsupported' - Which to me indicates that in general things are working as expected, with the exception of the tika functionality.
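
Binary content being base64 encoded is what the attachment type expects, so a separate possibility is that the binary is being cut short when it is read. This is an assumption, not something confirmed in this thread: on Windows, Ruby's text-mode `File.read` translates line endings and can treat a 0x1A byte as end-of-file, whereas `File.binread` returns every byte. A minimal sketch, using a temp file in place of `object.diskfile`; `encode_file` is a hypothetical stand-in, not the plugin's method.

```ruby
require 'base64'
require 'tempfile'

# Hypothetical stand-in for the serializer's `file` method: reads the
# file in binary mode so no byte is translated or treated as end-of-file.
def encode_file(path)
  Base64.encode64(File.binread(path))
end

# Sample data: the 8-byte OLE2 signature followed by filler bytes.
sample = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b + ("\x00".b * 8)

Tempfile.create('sample.doc') do |f|
  f.binmode
  f.write(sample)
  f.flush
  encoded = encode_file(f.path)
  puts Base64.decode64(encoded).bytesize  # all 16 bytes survive the round trip
end
```

If the assumption holds, swapping `File.read` for `File.binread` in the serializer and reindexing would be a cheap experiment to try.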

I downloaded tika-app-1.10.jar and ran it on the .doc file using:

java -jar tika-app-1.10.jar F:\BitNami\redmine-2.6.7-1\apps\redmine\htdocs\files\2015\10\151023163538_testdoc1.doc

I get output of:

[image: Tika output for testdoc1.doc]

So this (or a subset of this maybe) is the sort of content I would expect to be being passed into the base64 encoding. Perhaps just the content from

qwerty1

could reasonably be expected to be placed into the coding for the file content..?

So, it would appear that tika is fully capable of converting the content to readable form prior to encoding, but in the indexing process it's just not happening.

I've also checked that the plugin is installed and in place (using the Chrome Sense plugin):
GET _nodes/plugins

{
  "cluster_name": "elasticsearch",
  "nodes": {
    "WSGP5SSSQGGoEkTsBP3ReA": {
      "name": "Volpan",
      "transport_address": "inet[/10.137.172.88:9300]",
      "host": "RedmineDev",
      "ip": "10.137.172.88",
      "version": "1.7.2",
      "build": "e43676b",
      "http_address": "inet[/10.137.172.88:9200]",
      "plugins": [
        {
          "name": "mapper-attachments",
          "version": "2.7.1",
          "description": "Adds the attachment type allowing to parse difference attachment formats",
          "jvm": true,
          "site": false
        },
        {
          "name": "analysis-morphology",
          "version": "NA",
          "description": "Morphology analysis support",
          "jvm": true,
          "site": false
        }
      ]
    }
  }
}

As you can see, I'm only running the one node and both of the required plugins are in place.

Can anyone please give me some guidance as to what I may be missing here?

Thanks - Steve

@carukc
Author

carukc commented Oct 25, 2015

Hi Steve,

Looks like you are getting close. I hope to have some time early next week to contribute to this. I'm not an ElasticSearch expert but I do have an interest in getting this working for our documents and PDFs.

It does look like Tika is capable of parsing so I guess the question is whether Tika is running properly within the 'Mapper Attachments Type' and whether that output is actually being stored in Elastic so that it can be found (perhaps it's being misfiled somehow).

I'll also see if I can get someone with a bit more elastic search experience to lend a hand.

Regards
Chris

@SteveDavis

Thanks Chris - Any help would be very much appreciated; I've spent way more time on this than I wanted to already - and I'm running out of ideas of what to do next; perhaps I'll try to work out a debug session somehow...

@nodecarter - I'd love to hear any thoughts or input you have on this based upon my previous comments - you seem to be 'the main man' on this!

Regards - Steve

@SteveDavis

OK - I've spent a further few days on this - but still with no good end result.

I've grabbed the source for the AttachmentMapper ElasticSearch plugin and built that, then used that to try and work out what's going on (I've not managed to get a full debugger working unfortunately, so it's a slow job using simple logging).

If I try a straightforward document parse in AttachmentMapper.java (part of the ElasticSearch mapper-attachments plugin) from within the routine:
public void parse(ParseContext context) throws IOException
using:

File file = new File("test1.doc");
String filecontent = tika.parseToString(file);
logger.warn("Basic .doc parsing output: " + filecontent);

(or the same with a pdf document) then it works fine, returning the content of the document as I'd expect.

So, once again this points to the fact that Tika (which I've also built from source now but cannot directly debug) is working as necessary and it's something about the content being passed to it from the Redmine ElasticSearch plugin that's not working as expected.

@carukc - Chris - Are you able to get one of your elasticsearch bods onto this? I've burned about as much time as I possibly can on this before I have to abandon it and move onto other projects.
or
@nodecarter - Are you able to put any thought into what's going on here?

Thanks - Steve
