Document Parsers #22

carukc · 2015-02-24T15:03:34Z

Hello, Cool plugin, I look forward to trying it out! This is a question rather than an issue.

Have you considered using the Elastic Search / Tika attachment plugins to parse the contents of documents for inclusion in Search?

I'm currently using dmsf for Documents (I definitely recommend trying it). It uses Xapian to provide some full text search capabilities, but I would love to be able to run more complex queries and to restrict document searches to the owner of a document, or at the very least to members of a particular project. From what I have read, Elastic itself can be configured to keep things private.

Thanks
Chris

espinosa · 2015-04-17T20:05:46Z

I have just discovered this plugin and have the same question. I also use dmsf but have had no success getting full-text search to work with Xapian.

carukc · 2015-04-18T11:43:31Z

The nice things about integrating more directly into Elastic Search are:

- Documents can be private, such that search results only include docs that the user has access to.
- Elastic Search supports a broader range of doc types.
- One can incorporate items from other systems in the Elastic Search search results. In other words, several systems can hook into the Elastic Search server.
- Elastic Search will scale better.
- etc.

I do not think that the integrations would be difficult and I would certainly be interested in co-funding a project to do the integration if others were also interested. Ping me if this also interests you.

SteveDavis · 2015-10-16T09:31:18Z

I would also be interested in this module being updated to include document content searches - I actually thought it did tbh...

nodecarter · 2015-10-16T09:53:55Z

This plugin requires the Mapper Attachments Type for Elasticsearch. "The mapper attachments plugin adds the attachment type to Elasticsearch using Apache Tika." So you can already search by attachment content.

However, there is a restriction on the size of attachments: currently it's 1 MB. I would like to add some settings (including the indexed size of attachments) to the plugin configuration but have no time for now. See #31.

SteveDavis · 2015-10-16T11:36:59Z

OK - So I just ran a test where I created a new document in 3 formats - .txt, .docx and .pdf. The documents contained a known string in each file. I uploaded these documents into a project in the 'Documents', 'Files' and 'DMSF' tabs. All files were very small (way less than 1MB)

When I then searched for the string, it only returned the .txt file version - and only within the 'Documents' folder.

Do I need to do some extra config somewhere to do searching in the DMSF upload location and/or the Files location? (Like others I'm using DMSF and would like to continue doing so, but had no success with Xapian - we're running on a Windows platform)

Also, do I need to do some config to search in .docx and .pdf files? and what about other formats?

Thanks - Steve

SteveDavis · 2015-10-23T16:24:53Z

I've been doing more digging on this. I have got a fresh Redmine Database and added 3 issues to it, each with an attachment containing the same known string (qwerty1). Two are in .txt files, one is in a .doc file. I have then reindexed and checked the content in the index itself using the Chrome 'Sense' plugin. The query I ran was:

POST redmineapp_production/_search?
{
  "query": { "match": { "_type": "issue" } },
  "_source": ["title", "attachments"]
}

And the result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "1",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.txt",
              "file": "cXdlcnR5MQ==\n",
              "content_type": "text/plain",
              "created_on": "2015-10-23T15:31:43Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "6dbd0fe19c9a301c4708287780df41a2",
              "description": "",
              "filesize": 7,
              "author_id": 1
            }
          ],
          "title": "Bug #1 (New): test issue 1"
        }
      },
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "2",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.doc",
              "file": "0M8R4KGx\n",
              "content_type": "application/msword",
              "created_on": "2015-10-23T15:35:38Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "4bf0320483081d85bfb19d1c74ac08d9",
              "description": "",
              "filesize": 22016,
              "author_id": 1
            }
          ],
          "title": "Bug #2 (New): test issue 2"
        }
      },
      {
        "_index": "redmineapp_production",
        "_type": "issue",
        "_id": "3",
        "_score": 1,
        "_source": {
          "attachments": [
            {
              "filename": "testdoc1.txt",
              "file": "cXdlcnR5MQ==\n",
              "content_type": "text/plain",
              "created_on": "2015-10-23T15:43:46Z",
              "downloads": 0,
              "author": "IT Services Admin",
              "digest": "6dbd0fe19c9a301c4708287780df41a2",
              "description": "",
              "filesize": 7,
              "author_id": 1
            }
          ],
          "title": "Bug #3 (New): Test Issue 3"
        }
      }
    ]
  }
}

So, in all 3 issues the attachment has been set up and the 'file' content set. Yet if I then look for the known string - 'qwerty1' - in issues, Redmine only returns back issues 1 and 3, not issue 2.

If I try the search before indexing then nothing is found, so I'm happy that it is the ElasticSearch plugin finding the results.

I've also confirmed the 'Hello World' example on this link: https://github.com/elastic/elasticsearch-mapper-attachments works as expected.

So, I believe everything in the index itself is correctly populated, but the actual search is failing to locate the content.

Interestingly, when I run this query (again within 'sense'):

POST redmineapp_production/_search?
{
  "query": {
    "bool": {
      "must": [
        { "match": { "_type": "issue" } },
        { "match": { "attachments.file": "qwerty1" } }
      ]
    }
  }
}

I only get the 2 .txt file attachment items returned, which would seem to indicate that while an index entry for the '.doc' file has been created, the actual search for the known string fails at that level too.

Does anyone have any ideas on this at all? If there's any more information that'd be needed to assist in working it out - please do let me know!

Thanks - Steve

SteveDavis · 2015-10-23T16:29:56Z

Interesting. Testing using: https://www.base64decode.org/ shows

"cXdlcnR5MQ==\n" decodes to qwerty1
but when I input "0M8R4KGx\n" it returns nothing

Does this imply that the base64 encoding for the .doc file content is incorrect? Or am I entirely missing the point somewhere?
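For what it's worth, the two strings can also be decoded locally and looked at as raw bytes rather than as text. The snippet below is only a rough sketch: the values are copied from the "file" and "filesize" fields in the result above, and `OLE2_MAGIC`, `SAMPLES` and `inspect_sample` are names made up for this illustration, not anything from the plugin.

```ruby
require "base64"

# Signature bytes that legacy OLE2 compound files (such as .doc) start with.
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# "file" and "filesize" values copied from the search result above.
SAMPLES = {
  "testdoc1.txt" => { file: "cXdlcnR5MQ==\n", filesize: 7 },
  "testdoc1.doc" => { file: "0M8R4KGx\n", filesize: 22016 },
}

# Decode the indexed content and summarise what was actually stored.
def inspect_sample(entry)
  bytes = Base64.decode64(entry[:file])
  {
    hex: bytes.unpack("H*").first,                  # raw bytes as hex
    decoded_size: bytes.bytesize,                   # bytes stored in the index
    complete: bytes.bytesize == entry[:filesize],   # matches the size on disk?
    ole2_prefix: OLE2_MAGIC.start_with?(bytes),     # a prefix of the .doc signature?
  }
end

SAMPLES.each { |name, entry| puts "#{name}: #{inspect_sample(entry)}" }
```

The .doc value decodes to just 6 bytes (d0 cf 11 e0 a1 b1), which are not printable characters, so an online decoder showing nothing does not necessarily mean the base64 is invalid. Those 6 bytes are the start of the OLE2 signature, against a reported filesize of 22016, so the content looks truncated rather than wrongly encoded. It may or may not be a coincidence that the next byte of that signature is 0x1A.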

SteveDavis · 2015-10-24T23:09:07Z

I suspect that what's actually happening here is the tika conversion from binary content to something readable is not happening - so effectively it's the binary content which has been base64 encoded. The reason I suspect this is because I also saw similarly peculiar results for .docx and .pdf files, so did some more digging....
I located the point at which the base64 encoding is performed (in attachment_serializer.rb):

def file
  content = supported? ? File.read(object.diskfile) : UNSUPPORTED
  Base64.encode64(content)
end

and changed this (as a temporary fudge only!) to:

def file
  content = supported? ? File.read(object.diskfile) : UNSUPPORTED
  puts content
  Base64.encode64(content)
end

Thus allowing a very rough-and-ready output to screen during the indexing of the content which is about to be encoded. Sure enough the .txt file content is just 'qwerty1' but the other formats - .doc, .docx and .pdf - all look like binary content, while a .jpg shows as 'unsupported' - Which to me indicates that in general things are working as expected, with the exception of the tika functionality.
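One thing that may be worth ruling out, given that this is on Windows: as far as I know, the attachment type expects the raw file bytes, base64 encoded, and does the Tika extraction itself on the Elasticsearch side, so seeing binary content here is not necessarily wrong. What could go wrong is the read itself. Ruby's `File.read` opens the file in text mode by default, and on Windows text mode can stop at a Ctrl-Z byte (0x1A) and translate line endings, which would cut binary formats short. The sketch below reads in binary mode instead; `encode_file_for_index` is a hypothetical helper name used only for illustration and is not the plugin's actual method.

```ruby
require "base64"

# Hypothetical variant of the read-and-encode step (an assumption, not plugin code).
# File.binread opens the file in binary mode, so the bytes are returned exactly
# as stored on disk, with no newline translation or Ctrl-Z handling.
def encode_file_for_index(path)
  Base64.encode64(File.binread(path))
end
```

If this is the cause, then after reindexing the decoded length of the "file" field for the .doc attachment should match its "filesize" value. I have not verified this against the plugin, so treat it as something to test rather than a confirmed fix.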

I downloaded tika-app-1.10.jar and ran it on the .doc file using:

java -jar tika-app-1.10.jar F:\BitNami\redmine-2.6.7-1\apps\redmine\htdocs\files\2015\10\151023163538_testdoc1.doc

I get output of:

So this (or a subset of this maybe) is the sort of content I would expect to be being passed into the base64 encoding. Perhaps just the content from

qwerty1

could reasonably be expected to be placed into the coding for the file content..?

So, it would appear that tika is fully capable of converting the content to readable form prior to encoding, but in the indexing process it's just not happening.

I've also checked that the plugin is installed and in place (using the Chrome Sense plugin):

GET _nodes/plugins

{
  "cluster_name": "elasticsearch",
  "nodes": {
    "WSGP5SSSQGGoEkTsBP3ReA": {
      "name": "Volpan",
      "transport_address": "inet[/10.137.172.88:9300]",
      "host": "RedmineDev",
      "ip": "10.137.172.88",
      "version": "1.7.2",
      "build": "e43676b",
      "http_address": "inet[/10.137.172.88:9200]",
      "plugins": [
        {
          "name": "mapper-attachments",
          "version": "2.7.1",
          "description": "Adds the attachment type allowing to parse difference attachment formats",
          "jvm": true,
          "site": false
        },
        {
          "name": "analysis-morphology",
          "version": "NA",
          "description": "Morphology analysis support",
          "jvm": true,
          "site": false
        }
      ]
    }
  }
}

As you can see, I'm only running the one node and both of the required plugins are in place.

Can anyone please give me some guidance as to what I may be missing here?

Thanks - Steve

carukc · 2015-10-25T11:56:35Z

Hi Steve,

Looks like you are getting close. I hope to have some time early next week to contribute to this. I'm not an ElasticSearch expert but I do have an interest in getting this working for our documents and PDFs.

It does look like Tika is capable of parsing so I guess the question is whether Tika is running properly within the 'Mapper Attachments Type' and whether that output is actually being stored in Elastic so that it can be found (perhaps it's being misfiled somehow).

I'll also see if I can get someone with a bit more elastic search experience to lend a hand.

Regards
Chris

SteveDavis · 2015-10-25T14:09:32Z

Thanks Chris - Any help would be very much appreciated; I've spent way more time on this than I wanted to already - and I'm running out of ideas of what to do next; perhaps I'll try to work out a debug session somehow...

@nodecarter - I'd love to hear any thoughts or input you have on this based upon my previous comments - you seem to be 'the main man' on this!

Regards - Steve

SteveDavis · 2015-11-10T17:48:53Z

OK - I've spent a further few days on this - but still with no good end result.

I've grabbed the source for the AttachmentMapper ElasticSearch plugin and built it, then used that build to try to work out what's going on. (I've not managed to get a full debugger working unfortunately, so it's a slow job using simple logging.)

If I try a straightforward document parse in AttachmentMapper.java (part of the ElasticSearch mapper-attachments plugin) from within the routine:
public void parse(ParseContext context) throws IOException
using:

File file = new File("test1.doc");
String filecontent = tika.parseToString(file);
logger.warn("Basic .doc parsing output: " + filecontent);

(or the same with a pdf document) then it works fine, returning the content of the document as I'd expect.

So, once again this points to the fact that Tika (which I've also built from source now but cannot directly debug) is working as necessary and it's something about the content being passed to it from the Redmine ElasticSearch plugin that's not working as expected.

@carukc - Chris - Are you able to get one of your elasticsearch bods onto this? I've burned about as much time as I possibly can on this before I have to abandon it and move onto other projects.
or
@nodecarter - Are you able to put any thought into what's going on here?

Thanks - Steve
