AllLanguagesTokenizer might run into StringIndexOutOfBoundsException #66

Open
farazbhinder opened this issue Nov 24, 2017 · 1 comment

farazbhinder commented Nov 24, 2017

Hi,

While running HeidelTime on the Estonian Wikipedia dump dated 2017-10-20 (available at https://dumps.wikimedia.org/etwiki/20171020/), AllLanguagesTokenizer runs into the following exception:

org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
	at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
	at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.processNext(ProcessingUnit.java:893)
	at org.apache.uima.collection.impl.cpm.engine.ProcessingUnit.run(ProcessingUnit.java:575)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1960)
	at org.apache.uima.jcas.tcas.Annotation.getCoveredText(Annotation.java:122)
	at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.sentenceTokenize(AllLanguagesTokenizer.java:245)
	at de.unihd.dbs.uima.annotator.alllanguagestokenizer.AllLanguagesTokenizer.process(AllLanguagesTokenizer.java:34)
	at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
	... 4 more

Please have a look into it. Thank you.

Best regards,
Faraz Ahmad
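
For context: the stack trace boils down to a plain String.substring call. Annotation.getCoveredText() returns the substring of the document text between the annotation's begin and end offsets, so a token whose begin offset is negative reproduces exactly the "String index out of range: -1" message above. A minimal illustration (the text literal is only a placeholder, not taken from the dump):

public class SubstringDemo {
    public static void main(String[] args) {
        // A token annotation with begin = -1 leads getCoveredText() to a call like this, which
        // throws java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        String documentText = "Tallinna placeholder text";
        System.out.println(documentText.substring(-1, 10));
    }
}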

farazbhinder commented Feb 17, 2018

The Estonian Wikipedia document that causes this issue is https://et.wikipedia.org/wiki/?curid=16992
This document, as stored in the MongoDB collection, contains special Unicode LSEP (line separator, U+2028) characters, and I think they are causing the problem. While debugging I found that in AllLanguagesTokenizer.java, the iterator created in the line
FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
in the sentenceTokenize function does not return tokens with correct begin/end offsets.

For instance, the first word of Estonian Wikipedia document 16992 is "Tallinna", so when the Token t is assigned in the first iteration of the while (tokIt.hasNext()) loop by
t = (Token) tokIt.next();
t should have begin 0 and end 8, but it actually has begin -1 and end 10.
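
A minimal sketch of the kind of offset sanity check described above, assuming a JCas that has already been processed by AllLanguagesTokenizer; the class name TokenOffsetCheck and the import path for Token are assumptions, not part of the report:

import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import de.unihd.dbs.uima.types.heideltime.Token; // assumed package for the HeidelTime Token type

public class TokenOffsetCheck {
    // Prints tokens whose offsets cannot be valid for the current document text,
    // without calling getCoveredText() (which would throw for begin = -1).
    public static void dumpSuspiciousTokens(JCas jcas) {
        int docLen = jcas.getDocumentText().length();
        FSIterator<Annotation> tokIt = jcas.getAnnotationIndex(Token.type).iterator();
        while (tokIt.hasNext()) {
            Annotation t = tokIt.next();
            if (t.getBegin() < 0 || t.getEnd() > docLen || t.getBegin() > t.getEnd()) {
                // e.g. for document 16992 the first token reports begin=-1, end=10
                System.err.println("bad token offsets: begin=" + t.getBegin() + ", end=" + t.getEnd());
            }
        }
    }
}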

The text of document 16992 is attached:
e1.txt
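
One possible workaround, sketched here under the assumption that the raw document text from MongoDB is accessible before it is set on the CAS (this is not an official HeidelTime fix), is to normalize the Unicode separator characters so the tokenizer never sees them:

public final class LsepCleaner {
    private LsepCleaner() {}

    // Replaces U+2028 LINE SEPARATOR (LSEP) and U+2029 PARAGRAPH SEPARATOR with plain
    // newlines; the one-to-one character replacement keeps the text length unchanged.
    public static String clean(String documentText) {
        return documentText.replace('\u2028', '\n').replace('\u2029', '\n');
    }
}

For example, the collection reader could call LsepCleaner.clean(text) before jcas.setDocumentText(text).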
