Show modifications made by TurkishSentenceNormalizer #220

mrmutator · 2019-05-08T11:58:10Z

Hi,

The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string as a result. For my purposes, I run the tokenizer on the normalized string, but I need to know the original substring of each token from before the normalization. So it would be good if the normalize() method could, for example, return a mapping from each character of the normalized string to its substring in the original string.

For example:

tbrklr dimi is normalized and then tokenized into [tebrikler], [değil], [mi], so it would be good to know that the first token originates from the substring tbrklr, and that the second and third both originate from the substring dimi (since a normalization step splits the word dimi into two tokens).
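To make the request concrete, here is a minimal sketch of the kind of result that would be useful. It is not Zemberek code: `AlignedToken`, `source` and `exampleAlignment` are hypothetical names, and the offsets are written by hand for the example above rather than computed by the normalizer.

```java
import java.util.Arrays;
import java.util.List;

final class AlignedTokenSketch {

  /** A normalized token plus the half-open range [origStart, origEnd) it came from. */
  static final class AlignedToken {
    final String text;
    final int origStart;
    final int origEnd;

    AlignedToken(String text, int origStart, int origEnd) {
      this.text = text;
      this.origStart = origStart;
      this.origEnd = origEnd;
    }

    /** Returns the substring of the original input that produced this token. */
    String source(String original) {
      return original.substring(origStart, origEnd);
    }
  }

  /** Hand-written alignment for "tbrklr dimi"; an assumption, not normalizer output. */
  static List<AlignedToken> exampleAlignment() {
    return Arrays.asList(
        new AlignedToken("tebrikler", 0, 6), // from "tbrklr"
        new AlignedToken("değil", 7, 11),    // from "dimi"
        new AlignedToken("mi", 7, 11));      // also from "dimi"
  }

  public static void main(String[] args) {
    String original = "tbrklr dimi";
    for (AlignedToken t : exampleAlignment()) {
      System.out.println(t.text + " <- " + t.source(original));
    }
  }
}
```

Two tokens sharing the same original range is how the split of dimi would show up in such a result.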


ahmetaa · 2019-05-14T11:39:26Z

This functionality does not exist yet. Implementing this may not be trivial but I will see what I can do.

mrmutator · 2019-05-16T13:55:33Z

I will try to provide a pull request for this soon.

mrmutator · 2019-06-11T14:54:24Z

I tried to implement this in PR #224. Please have a look.

mdakin · 2019-06-13T08:59:49Z

Thanks, I will have a look soon.

mdakin · 2019-06-14T11:39:55Z

@mrmutator I have a couple of questions:

1. Could you add some unit tests so that the different use cases are easily visible? (It is always good to have tests.)
2. This implementation creates a pair of ints (a range) for each character in the output. I presume there would be a lot of repetition in these ranges; e.g. for your example, all characters in [tebrikler] would point to the same range. So maybe the mapping should be per token instead of per character? Or maybe some kind of disjoint-set structure would help?
3. Could you pass your code through a formatter? We use Google format (explained here: https://github.com/ahmetaa/zemberek-nlp/wiki/Zemberek-For-Developers#changing-code-style).
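To illustrate the repetition concern, here is a rough sketch that collapses a per-character range array into runs of identical ranges. It is only an illustration under assumptions: the `int[][]` layout, the `Run` class and the `collapse` method are hypothetical and are not taken from the PR, and the example offsets are hand-written, including the choice to map the space between [değil] and [mi] to dimi.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RangeRunsSketch {

  /** Output chars [outStart, outEnd) all map to the original range [origStart, origEnd). */
  static final class Run {
    final int outStart, outEnd, origStart, origEnd;

    Run(int outStart, int outEnd, int origStart, int origEnd) {
      this.outStart = outStart;
      this.outEnd = outEnd;
      this.origStart = origStart;
      this.origEnd = origEnd;
    }

    @Override
    public String toString() {
      return "out[" + outStart + "," + outEnd + ") -> orig[" + origStart + "," + origEnd + ")";
    }
  }

  /** Merges consecutive output characters that carry the same original range. */
  static List<Run> collapse(int[][] perChar) {
    List<Run> runs = new ArrayList<>();
    int runStart = 0;
    for (int i = 1; i <= perChar.length; i++) {
      boolean boundary =
          i == perChar.length || !Arrays.equals(perChar[i], perChar[runStart]);
      if (boundary) {
        runs.add(new Run(runStart, i, perChar[runStart][0], perChar[runStart][1]));
        runStart = i;
      }
    }
    return runs;
  }

  /** Hand-written per-character ranges for the 18-char output "tebrikler değil mi". */
  static int[][] examplePerChar() {
    int[][] perChar = new int[18][];
    Arrays.fill(perChar, 0, 9, new int[] {0, 6});    // "tebrikler" <- "tbrklr"
    Arrays.fill(perChar, 9, 10, new int[] {6, 7});   // space <- space
    Arrays.fill(perChar, 10, 18, new int[] {7, 11}); // "değil mi" <- "dimi" (assumed)
    return perChar;
  }

  public static void main(String[] args) {
    for (Run r : collapse(examplePerChar())) {
      System.out.println(r);
    }
  }
}
```

With these assumed offsets, the 18 per-character pairs collapse into three runs. Note that [değil] and [mi] end up in the same run, so a run-based representation alone does not recover token boundaries; a per-token mapping would still need the tokenizer's offsets.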
