Show modifications made by TurkishSentenceNormalizer #220
Comments
This functionality does not exist yet. Implementing it may not be trivial, but I will see what I can do.
I will try to provide a pull request for this soon.
I tried to implement this in PR #224. Please have a look.
Thanks, I will have a look soon.
@mrmutator I have a couple of questions,
Hi,
The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string as a result. For my purposes, I run the tokenizer on the normalized string, but I need to know the original substring of each token from before the normalization. So it would be good if the normalize() method could, for example, return a mapping from each character of the normalized string to its substring in the original string.
For example:
tbrklr dimi
is normalized and then tokenized into [tebrikler], [değil], [mi],
so it would be good to know that the first token originates from the substring tbrklr, the second from the substring dimi, and the third also from the substring dimi (since a normalization step splits the word dimi into two tokens).
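
To make the request concrete, here is a minimal sketch of what such a mapping could look like. It is not part of the library: `AlignedToken` and `align` are hypothetical names, and the sketch assumes the simplest case, where each whitespace-separated input word is replaced by one or more normalized tokens. The per-word replacements are passed in by hand here, since the normalizer does not currently expose them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NormalizationAlignmentSketch {

    /** Hypothetical result type: a normalized token plus the [start, end) span of the input it came from. */
    static final class AlignedToken {
        final String normalized;
        final int originalStart; // inclusive offset into the original input
        final int originalEnd;   // exclusive offset into the original input

        AlignedToken(String normalized, int originalStart, int originalEnd) {
            this.normalized = normalized;
            this.originalStart = originalStart;
            this.originalEnd = originalEnd;
        }

        String originalSubstring(String input) {
            return input.substring(originalStart, originalEnd);
        }
    }

    /**
     * Hypothetical helper: walks the whitespace-separated words of the input and
     * attaches each word's span to every normalized token that replaces it.
     * replacements.get(i) holds the normalized tokens for the i-th input word.
     */
    static List<AlignedToken> align(String input, List<List<String>> replacements) {
        List<AlignedToken> result = new ArrayList<>();
        int wordIndex = 0;
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (wordIndex >= replacements.size()) {
                throw new IllegalArgumentException("more input words than replacement lists");
            }
            // A word split into several tokens yields several entries with the same span.
            for (String token : replacements.get(wordIndex)) {
                result.add(new AlignedToken(token, start, i));
            }
            wordIndex++;
        }
        if (wordIndex != replacements.size()) {
            throw new IllegalArgumentException("more replacement lists than input words");
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "tbrklr dimi";
        // "de\u011fil" is "değil", written with a Unicode escape to stay encoding-safe.
        List<List<String>> replacements = Arrays.asList(
                Arrays.asList("tebrikler"),
                Arrays.asList("de\u011fil", "mi"));
        for (AlignedToken t : align(input, replacements)) {
            System.out.println(t.normalized + " <- " + t.originalSubstring(input));
        }
    }
}
```

With the example input, all three tokens carry a span: `tebrikler` points at `tbrklr`, and both `değil` and `mi` point at `dimi`. A real implementation would also have to cover cases this sketch ignores, such as several input words being merged into one token.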