Extract a gate library that only does NLP #65

Open
bamthomas opened this issue Jan 15, 2019 · 8 comments

@bamthomas

We are using Gate in our project http://github.com/ICIJ/datashare among other NLP pipelines (OpenNLP, CoreNLP, IxaPipe and Mitie).

We already have a library that extracts text from various files, http://github.com/ICIJ/extract, which also uses Tika (pdfbox...). Including GATE gives us conflicts between the Tika versions.

Would it be possible to extract a GATE library that only does NLP annotation (text -> annotated text), without the text extraction parts?

I can try to make a PR, but first I wanted to know what you think about this.

Thank you for your answer(s).

@greenwoodma
Contributor

We use Tika for loading PDFs and some of the other document formats. If you don't need those formats (i.e. you just load plain text, html or XML) then you might be able to just exclude Tika from being a GATE dependency.
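A minimal sketch of what such an exclusion could look like in a consuming project's pom, assuming GATE is pulled in as `uk.ac.gate:gate-core` and that its Tika artifacts live under the `org.apache.tika` group. The `${gate.version}` property is a hypothetical placeholder, and the wildcard exclusion needs Maven 3.2.1 or later.

```xml
<!-- Sketch only: coordinates are assumptions, check them against your own build. -->
<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <!-- hypothetical property; set it to the GATE version you actually use -->
  <version>${gate.version}</version>
  <exclusions>
    <!-- Drop every Tika artifact that gate-core would bring in transitively. -->
    <exclusion>
      <groupId>org.apache.tika</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With this in place GATE would only be able to load the formats that do not go through Tika, as described above.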

@johann-petrak
Contributor

Is there a reason why we need to depend on such an old version of Tika?
I think we depend on 1.7 from January 2015 and the current version is 1.20.

BTW @bamthomas what kind of conflict do you get exactly?

@greenwoodma
Contributor

> Is there a reason why we need to depend on such an old version of Tika?
> I think we depend on 1.7 from January 2015 and the current version is 1.20.
>
> BTW @bamthomas what kind of conflict do you get exactly?

I think when I last tried to update it I got some odd errors from the unit tests, and I didn't have time to investigate them properly, so I left it on 1.7. It would make sense to upgrade it if we can.

@johann-petrak
Contributor

While waiting for a long download I tried Tika version 1.20 with the latest 8.6-SNAPSHOT code of gate-core. It turns out that two of the libraries that are currently excluded now need to be included in the pom dependencies:
com.adobe.xmp/xmpcore and com.drewnoakes/metadata-extractor

When I remove those from the excludes, the compile and unit tests work fine.
(Not including those gives class not found exceptions when running the unit tests. After including them, there is still a warning about xerial's sqlite-jdbc being missing, but the tests pass.)

Using that GATE version from some of my LF tests and pipelines did not show any obvious bugs or problems.

@greenwoodma
Contributor

Odd. I thought both of those were only used by the image formats which we don't need, but if not excluding them works, then I say make the changes and let's update to Tika 1.20.

@bamthomas
Author

bamthomas commented Jan 16, 2019

The conflict seems to be on pdfbox. When I include GATE, I get errors like:

java.io.IOException:
	at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:274)
	at java.io.Reader.read(Reader.java:140)
	at org.icij.spewer.Spewer.copy(Spewer.java:104)
	at org.icij.spewer.Spewer.toString(Spewer.java:114)
	at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.getMap(ElasticsearchSpewer.java:115)
	at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.prepareRequest(ElasticsearchSpewer.java:81)
	at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.indexDocument(ElasticsearchSpewer.java:134)
	at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.write(ElasticsearchSpewer.java:72)
	at org.icij.extract.extractor.Extractor.extract(Extractor.java:272)
	at org.icij.extract.extractor.DocumentConsumer.lambda$accept$0(DocumentConsumer.java:125)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.rtf.TextExtractor
	at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:97)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
	at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:235)

The same happens with other file types. I don't get these errors when I'm not including the GATE jar. I have relocated the tika/lucene/pdfbox/fontbox/james libraries with the Maven Shade plugin.

But yes, I could try to remove the dependencies when packaging my jar, even though I don't find this elegant.

@johann-petrak
Contributor

I do not really understand that exception, but in the meantime we have upgraded the Tika dependency of gate-core version 8.6-SNAPSHOT to Tika version 1.20. You may want to check whether using that version improves anything (the SNAPSHOT is staged on our own repo at http://repo.gate.ac.uk/content/groups/public/).
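For trying the snapshot, a pom fragment along these lines might work. It is only a sketch: the repository `id` is a hypothetical name, and the `uk.ac.gate:gate-core` coordinates are assumed rather than confirmed in this thread.

```xml
<!-- Sketch only: makes the GATE repository visible and enables snapshot resolution. -->
<repositories>
  <repository>
    <id>gate-repo</id> <!-- hypothetical id, any unique name will do -->
    <url>http://repo.gate.ac.uk/content/groups/public/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<!-- Inside <dependencies>: coordinates are an assumption. -->
<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <version>8.6-SNAPSHOT</version>
</dependency>
```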

Alternatively, you could try declaring a dependency on your preferred Tika version in your top-level pom, which may then override the version inferred from the GATE dependency, but I am not sure exactly how this gets handled by the shade plugin.
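One way to pin the Tika version from the top-level pom is a `dependencyManagement` section, which Maven applies to transitive dependencies as well. The sketch below assumes the `tika-core` and `tika-parsers` artifacts are the ones involved. The `tika.version` property name is hypothetical, and 1.20 is only the version mentioned in this thread, so replace it with whatever your extraction library needs. How this interacts with shade relocation is not covered here.

```xml
<!-- Sketch only: artifact list and version are assumptions to adapt. -->
<properties>
  <tika.version>1.20</tika.version> <!-- hypothetical property name -->
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Forces this version wherever these artifacts appear in the dependency tree. -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>${tika.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>${tika.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree` afterwards should show which Tika version was actually resolved.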

@greenwoodma
Contributor

There is a trick to do this by replacing the default creole.xml that gets loaded at Gate.init() but it requires knowing a lot about the internals of GATE and it needs to be an actual File. We're considering including a minimal version of the file inside gate-core.jar so that a single method call before initialization would allow you to switch to this version. You could then exclude Tika (and any other libs needed by the default resources) which would solve the original issue.

3 participants