AllLanguagesTokenizer might run into StringIndexOutOfBoundsException #66

Open
farazbhinder opened this issue Nov 24, 2017 · 1 comment

farazbhinder commented Nov 24, 2017

Hi,

While running HeidelTime on the Estonian Wikipedia dump dated 2017-10-20 (available at https://dumps.wikimedia.org/etwiki/20171020/), AllLanguagesTokenizer runs into the following exception:

org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
	at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
	at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:893)
	at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:575)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1960)
	at org.apache.uima.jcas.tcas.Annotation.getCoveredText(Annotation.java:122)
	at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.sentenceTokenize(AllLanguagesTokenizer.java:245)
	at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.process(AllLanguagesTokenizer.java:34)
	at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
	... 4 more

Please have a look into it. Thank you.

Best regards,
Faraz Ahmad
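
For context: the stack trace boils down to a plain String.substring call. Annotation.getCoveredText() returns the substring of the document text between the annotation's begin and end offsets, so a token whose begin offset is negative reproduces exactly the "String index out of range: -1" message above. A minimal illustration (the text literal is only a placeholder, not taken from the dump):

public class SubstringDemo {
    public static void main(String[] args) {
        // A token annotation with begin = -1 leads getCoveredText() to a call like this, which
        // throws java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        String documentText = "Tallinna placeholder text";
        System.out.println(documentText.substring(-1, 10));
    }
}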

farazbhinder commented Feb 17, 2018

The Estonian Wikipedia document that causes this issue is https://et.wikipedia.org/wiki/?curid=16992
This document, as stored in the MongoDB collection, contains special Unicode LSEP (line separator, U+2028) characters, and I think they are causing the problem. While debugging I found that in AllLanguagesTokenizer.java, the iterator created in the line
FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
in the sentenceTokenize function does not return tokens with correct begin/end offsets.

For instance, the first word of Estonian Wikipedia document 16992 is "Tallinna", so when the Token t is assigned in the first iteration of the while (tokIt.hasNext()) loop by
t = (Token) tokIt.next();
t should have begin 0 and end 8, but it actually has begin -1 and end 10.
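
A minimal sketch of the kind of offset sanity check described above, assuming a JCas that has already been processed by AllLanguagesTokenizer; the class name TokenOffsetCheck and the import path for Token are assumptions, not part of the report:

import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import de.unihd.dbs.uima.types.heideltime.Token; // assumed package for the HeidelTime Token type

public class TokenOffsetCheck {
    // Prints tokens whose offsets cannot be valid for the current document text,
    // without calling getCoveredText() (which would throw for begin = -1).
    public static void dumpSuspiciousTokens(JCas jcas) {
        int docLen = jcas.getDocumentText().length();
        FSIterator<Annotation> tokIt = jcas.getAnnotationIndex(Token.type).iterator();
        while (tokIt.hasNext()) {
            Annotation t = tokIt.next();
            if (t.getBegin() < 0 || t.getEnd() > docLen || t.getBegin() > t.getEnd()) {
                // e.g. for document 16992 the first token reports begin=-1, end=10
                System.err.println("bad token offsets: begin=" + t.getBegin() + ", end=" + t.getEnd());
            }
        }
    }
}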

The text of document 16992 is attached:
e1.txt
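
One possible workaround, sketched here under the assumption that the raw document text from MongoDB is accessible before it is set on the CAS (this is not an official HeidelTime fix), is to normalize the Unicode separator characters so the tokenizer never sees them:

public final class LsepCleaner {
    private LsepCleaner() {}

    // Replaces U+2028 LINE SEPARATOR (LSEP) and U+2029 PARAGRAPH SEPARATOR with plain
    // newlines; the one-to-one character replacement keeps the text length unchanged.
    public static String clean(String documentText) {
        return documentText.replace('\u2028', '\n').replace('\u2029', '\n');
    }
}

For example, the collection reader could call LsepCleaner.clean(text) before jcas.setDocumentText(text).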
