Hi,

While running HeidelTime on the Estonian Wikipedia dump dated 2017-10-20 (available at https://dumps.wikimedia.org/etwiki/20171020/), AllLanguagesTokenizer runs into the following exception:
The Estonian Wikipedia document that triggers this issue is https://et.wikipedia.org/wiki/?curid=16992
This document, in the MongoDB collection, contains special Unicode LSEP (LINE SEPARATOR, U+2028) characters, and I think they are causing this problem. While debugging I found that in AllLanguagesTokenizer.java, the iterator created in sentenceTokenize by the line FSIterator tokIt = jcas.getAnnotationIndex(Token.type).iterator();
does not return correct indices for the tokens.
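To illustrate how LSEP characters can throw token offsets off, here is a minimal, standalone Java sketch (not HeidelTime code, and the variable names are my own): U+2028 counts as one character in the document text, so if a tokenizer silently drops or normalizes it while computing offsets against the original string, every reported begin/end after that point is shifted.

```java
public class LsepOffsetDemo {
    public static void main(String[] args) {
        // Hypothetical document text starting with U+2028 (LINE SEPARATOR),
        // as found in the Estonian Wikipedia document.
        String text = "\u2028Tallinna on linn";

        // In the raw text, "Tallinna" begins at index 1, not 0:
        // the invisible LSEP character occupies index 0.
        int begin = text.indexOf("Tallinna");
        int end = begin + "Tallinna".length();
        System.out.println("begin=" + begin + " end=" + end); // begin=1 end=9
    }
}
```

A tokenizer that strips LSEP before tokenizing but keeps offsets relative to its modified copy would report indices that no longer match the original document, which is consistent with the mismatched begin/end values described below.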
For instance, the first word of the Estonian Wikipedia document 16992 is "Tallinna", so when Token t is assigned in the first iteration of the while (tokIt.hasNext()) loop via t = (Token) tokIt.next();,
t should have begin 0 and end 8, but instead it has begin -1 and end 10.
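As a possible workaround while the tokenizer is being fixed, one could pre-normalize the document text before handing it to HeidelTime. This is a sketch under my own assumptions (the helper name is hypothetical, not a HeidelTime API); replacing LSEP/PSEP with a plain space rather than deleting them keeps the string length unchanged, so character offsets stay aligned with the stored document.

```java
public class StripLineSeparators {
    /**
     * Replace Unicode LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029)
     * with ordinary spaces. Using a same-length replacement preserves all
     * character offsets relative to the original text.
     */
    static String normalize(String s) {
        return s.replace('\u2028', ' ').replace('\u2029', ' ');
    }

    public static void main(String[] args) {
        String raw = "\u2028Tallinna on linn";
        String clean = normalize(raw);
        System.out.println(clean.length() == raw.length()); // true
        System.out.println(clean.indexOf("Tallinna"));      // 1
    }
}
```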
Please have a look into it. Thank you.
Best regards,
Faraz Ahmad