unicode normalization #3

gregorycrane · 2024-10-13T14:02:02Z

I wonder whether we should normalize Unicode as part of our Atlas data prep. I was looking online for how to do it and found this code from some guy named Tauber ....
@jtauber @lcerrato @AlisonBabeu

from unicodedata import normalize
# m is defined elsewhere in the script this snippet came from (not shown here)
curword = normalize("NFC", m[1])

My thinking:

Anything in our repos should probably be normalized (e.g., the Greek from the Greco-Arabic corpus).
Anything we import into Atlas should be normalized as well. That would imply adding a normalization step to the Atlas data prep pipeline (I think).
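
To make the two points above concrete, here is a minimal sketch of what such a step might look like. The helper names `normalize_token` and `normalize_lines` are hypothetical and are not existing Atlas code; the only real API used is `unicodedata.normalize` from the Python standard library. It assumes NFC is the form we want.

```python
from unicodedata import normalize


def normalize_token(token: str) -> str:
    """Return the token in NFC (canonical composed) form."""
    return normalize("NFC", token)


def normalize_lines(lines):
    """Yield each line in NFC form (hypothetical hook for a data prep step)."""
    for line in lines:
        yield normalize("NFC", line)


# Decomposed: Greek small alpha followed by a combining acute accent (two code points).
decomposed = "\u03b1\u0301"
# Precomposed: Greek small alpha with tonos (one code point).
composed = "\u03ac"

print(decomposed == composed)                   # False
print(normalize_token(decomposed) == composed)  # True
```

The two strings look identical on screen but compare unequal until normalized, which is the kind of mismatch that would otherwise break lookups and alignment between texts from different sources.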

Thoughts?
