Extract a gate library that only does NLP #65
Comments
We use Tika for loading PDFs and some of the other document formats. If you don't need those formats (i.e. you just load plain text, HTML or XML), then you might be able to simply exclude Tika from the GATE dependency.
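A minimal sketch of what such an exclusion could look like in a downstream pom.xml. It assumes GATE is pulled in as `uk.ac.gate:gate-core` and that the Tika artifacts all sit under the `org.apache.tika` group ID; `${gate.version}` is a hypothetical property, and the wildcard exclusion needs Maven 3.2.1 or later. Whether GATE still loads plain text, HTML and XML correctly without Tika on the classpath has not been verified here.

```xml
<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <!-- hypothetical property; set it to the GATE version you use -->
  <version>${gate.version}</version>
  <exclusions>
    <!-- drop every Tika artifact that GATE would bring in transitively -->
    <exclusion>
      <groupId>org.apache.tika</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running `mvn dependency:tree` afterwards should show whether any Tika, pdfbox or fontbox artifacts still arrive through another path.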
Is there a reason why we need to depend on such an old version of Tika? BTW @bamthomas what kind of conflict do you get exactly? |
I think when I last tried to update it I got some odd errors from the unit tests, and I didn't have the time to investigate them properly, so I left it on 1.7. It would make sense to upgrade it if we can.
While waiting for a long download I tried Tika version 1.20 with the latest 8.6-SNAPSHOT code of gate-core. It turns out that two of the libraries that are currently excluded now need to be included in the pom dependencies. When I remove those from the excludes, compilation and the unit tests work fine. Using that GATE version from some of my LF tests and pipelines did not show any obvious bugs or problems.
Odd. I thought both of those were only used by the image formats, which we don't need, but if not excluding them works, then I say make the changes and let's update to Tika 1.20.
The conflict seems to be in pdfbox. When I include GATE, I get errors like:
and other file types, which I don't get when I'm not including the GATE jar. I have relocated the tika/lucene/pdfbox/fontbox/james libraries with the Maven Shade plugin. But yes, I could try to remove the dependencies when packaging my jar, even though I don't find this elegant.
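For readers unfamiliar with relocation, here is a rough sketch of the kind of Maven Shade configuration being described. It is not the poster's actual setup: the `shaded.` prefix and the set of relocated packages are assumptions chosen for illustration, and only pdfbox, fontbox and Tika are shown.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- move the bundled copies into a private namespace (prefix is hypothetical) -->
          <relocation>
            <pattern>org.apache.pdfbox</pattern>
            <shadedPattern>shaded.org.apache.pdfbox</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.fontbox</pattern>
            <shadedPattern>shaded.org.apache.fontbox</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.apache.tika</pattern>
            <shadedPattern>shaded.org.apache.tika</shadedPattern>
          </relocation>
        </relocations>
        <transformers>
          <!-- merges META-INF/services files and applies the relocations to them -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Relocation only rewrites the classes packaged into the shaded jar, so a second, unrelocated copy of the same library arriving through the GATE dependency can still clash at runtime.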
I do not really understand that exception, but we have in the meantime upgraded the Tika dependency of gate-core version 8.6-SNAPSHOT to Tika version 1.20. You may want to check whether using that version improves anything (the SNAPSHOT is staged on our own repo at http://repo.gate.ac.uk/content/groups/public/). Alternatively, you could try declaring a dependency on your preferred Tika version in your top-level pom, which may then override the versions pulled in transitively through the GATE dependency, but I am not sure exactly how this is handled by the shade plugin.
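Both suggestions could be sketched in a top-level pom.xml roughly as follows. The repository `id` and the `${tika.version}` property are hypothetical, and the Tika artifact names (`tika-core`, `tika-parsers`) are assumed to be the ones GATE depends on. Maven applies `dependencyManagement` versions to transitive dependencies as well, but how the result interacts with shading is untested here.

```xml
<repositories>
  <repository>
    <!-- hypothetical id; the URL is the one given above -->
    <id>gate-repo</id>
    <url>http://repo.gate.ac.uk/content/groups/public/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<dependencyManagement>
  <dependencies>
    <!-- pin the Tika version for the whole build, including transitive uses -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>${tika.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>${tika.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>uk.ac.gate</groupId>
    <artifactId>gate-core</artifactId>
    <version>8.6-SNAPSHOT</version>
  </dependency>
</dependencies>
```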
There is a trick to do this by replacing the default creole.xml that gets loaded at |
We are using GATE in our project http://github.com/ICIJ/datashare, alongside other NLP pipelines (OpenNLP, CoreNLP, IxaPipe and Mitie).
We already have a library that extracts text from various file types, http://github.com/ICIJ/extract, which also uses Tika (pdfbox...). This leads to conflicts between the Tika versions.
Would it be possible to extract a GATE library that only does NLP annotation (text -> annotated text), without the text extraction parts?
I can try to make a PR, but before that I wanted to know what you think about this.
Thank you for your answer(s).