Show modifications made by TurkishSentenceNormalizer #220

Open
mrmutator opened this issue May 8, 2019 · 5 comments

Comments

@mrmutator

Hi,

The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string as a result. For my purposes, I run the tokenizer on the normalized string, but I need to know the original substring of each token from before the normalization. So it would be good if the normalize() method could, for example, return a mapping from each character of the normalized string to its substring in the original string.

For example:

tbrklr dimi is normalized and then tokenized into [tebrikler], [değil], [mi]. It would be good to know that the first token originates from the substring tbrklr, and that the second and third tokens both originate from the substring dimi (since a normalization step splits the word dimi into two tokens).
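
To make the request concrete, here is a minimal sketch of the kind of result being asked for: each normalized token carries the span of the original input it came from. AlignedToken and AlignmentExample are hypothetical names, not part of Zemberek, and the spans are written by hand for the example above rather than produced by the normalizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical type, not part of Zemberek: a normalized token plus the
// half-open [origStart, origEnd) span of the original input it came from.
final class AlignedToken {
  final String text;
  final int origStart;
  final int origEnd;

  AlignedToken(String text, int origStart, int origEnd) {
    this.text = text;
    this.origStart = origStart;
    this.origEnd = origEnd;
  }

  // Recovers the original substring this token was derived from.
  String source(String original) {
    return original.substring(origStart, origEnd);
  }
}

final class AlignmentExample {
  // Hand-written alignment for the input "tbrklr dimi". A real
  // implementation would have to compute these spans during normalization.
  static List<AlignedToken> example() {
    List<AlignedToken> tokens = new ArrayList<>();
    tokens.add(new AlignedToken("tebrikler", 0, 6));    // from "tbrklr"
    tokens.add(new AlignedToken("de\u011fil", 7, 11));  // from "dimi" (U+011F is g with breve)
    tokens.add(new AlignedToken("mi", 7, 11));          // also from "dimi" (split token)
    return tokens;
  }

  public static void main(String[] args) {
    String original = "tbrklr dimi";
    for (AlignedToken t : example()) {
      System.out.println(t.text + " <- " + t.source(original));
    }
  }
}
```

Two tokens sharing the same span is how the dimi split would show up in such a result.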

@ahmetaa
Owner

ahmetaa commented May 14, 2019

This functionality does not exist yet. Implementing this may not be trivial but I will see what I can do.

@mrmutator
Author

I will try to provide a pull request for this soon.

@mrmutator
Author

I tried to implement this in PR #224. Please have a look.

@mdakin
Collaborator

mdakin commented Jun 13, 2019

Thanks, I will have a look soon.

@mdakin
Collaborator

mdakin commented Jun 14, 2019

@mrmutator I have a couple of questions:

  1. Could you add some unit tests so that the different use cases are easily visible? (It is always good to have tests anyway.)
  2. This implementation creates a pair of ints (a range) for each character in the output. I presume there would be a lot of repetition among these ranges; in your example, all characters in [tebrikler] would point to the same range. So maybe the mapping should be per token instead of per character? Or maybe some kind of disjoint-set structure would help?
  3. Could you pass your code through a formatter? We use Google format (explained here: https://github.com/ahmetaa/zemberek-nlp/wiki/Zemberek-For-Developers#changing-code-style).
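
As a rough illustration of the repetition mentioned in point 2, the sketch below collapses consecutive identical per-character ranges into runs. RangeRuns is a hypothetical helper, the one-pair-per-output-character layout is an assumption rather than something taken from the PR's code, and run-length grouping is only one possible alternative to the per-token or disjoint-set ideas above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Zemberek or the PR.
final class RangeRuns {
  // Collapses consecutive identical [start, end) pairs into runs stored as
  // {start, end, count}, where count is the number of output characters
  // that share the range.
  static List<int[]> compress(int[][] perChar) {
    List<int[]> runs = new ArrayList<>();
    for (int[] r : perChar) {
      int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
      if (last != null && last[0] == r[0] && last[1] == r[1]) {
        last[2]++;  // same range as the previous character: extend the run
      } else {
        runs.add(new int[] {r[0], r[1], 1});  // new range: start a new run
      }
    }
    return runs;
  }

  // Assumed per-character ranges for the 18-character normalized output of
  // "tbrklr dimi": 9 characters from [0, 6), the space from [6, 7), and the
  // remaining 8 characters (second word, inserted space, third word) from [7, 11).
  static int[][] examplePerChar() {
    int[][] out = new int[18][];
    for (int i = 0; i < 9; i++) out[i] = new int[] {0, 6};
    out[9] = new int[] {6, 7};
    for (int i = 10; i < 18; i++) out[i] = new int[] {7, 11};
    return out;
  }
}
```

For this input, the 18 per-character pairs reduce to three runs, which is roughly the saving a per-token mapping would give.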
