Convert traditional orthography into Latin or pronunciation text.
Text is a TypeScript library which transforms traditional orthography into Latin/Romanized text, using the Talk spec. TalkText can be used to render Tone, which is a unique and modern rune-like writing system for pronunciations.
Caveat: It is not always possible to transform traditional orthography into pronunciation text for every language. English is the clearest example: pronunciation cannot be reliably derived from spelling, so individual words must be memorized, and the same is true of some other languages. However, some languages do get pretty close to the correct pronunciation based purely on the native spelling, which is pretty cool. This library takes advantage of that fact!
- Script detection.
- Romanization transliterations of scripts/languages in various forms.
- Structured script data, such as which symbols are vowels.
- Keyboard layout data for various languages.
npm install @cluesurf/text
Here are some API examples.
import detect from '@cluesurf/text/detect'
detect([...'美丽的']) //=> { form: 'chinese', rank: 1 }
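To give a feel for what script detection involves, here is a minimal sketch that classifies symbols by Unicode block and returns the same `{ form, rank }` shape as the example above. It is not the library's implementation: the `detectSketch` name, the short `RANGES` table, and the reading of `rank` as the fraction of symbols that matched are all assumptions made for illustration.

```typescript
// Hypothetical sketch, NOT the implementation behind '@cluesurf/text/detect'.
type Detection = { form: string; rank: number }

// A handful of Unicode blocks, chosen only for illustration.
const RANGES: Array<{ form: string; from: number; to: number }> = [
  { form: 'chinese', from: 0x4e00, to: 0x9fff }, // CJK Unified Ideographs
  { form: 'arabic', from: 0x0600, to: 0x06ff }, // Arabic
  { form: 'devanagari', from: 0x0900, to: 0x097f }, // Devanagari
  { form: 'tibetan', from: 0x0f00, to: 0x0fff }, // Tibetan
]

function detectSketch(symbols: Array<string>): Detection | undefined {
  // Count how many symbols fall into each known block.
  const counts: Record<string, number> = {}
  for (const symbol of symbols) {
    const code = symbol.codePointAt(0)
    if (code === undefined) continue
    const match = RANGES.find(r => code >= r.from && code <= r.to)
    if (match) counts[match.form] = (counts[match.form] ?? 0) + 1
  }

  // Assumption: rank is the share of input symbols that matched the form.
  let best: Detection | undefined
  for (const form of Object.keys(counts)) {
    const rank = counts[form] / symbols.length
    if (!best || rank > best.rank) best = { form, rank }
  }
  return best
}

console.log(detectSketch(Array.from('美丽的')))
```

A block lookup like this cannot tell apart languages that share a script (Chinese and Japanese kanji, for instance), so a real detector needs more signal than this sketch uses.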
For these languages you can currently call make:
language | status |
---|---|
akkadian | ✔ |
arabic | ✔ |
chinese | ✔ |
coptic | ✔ |
devanagari | ✔ |
finnish | ✔ |
french | ✔ |
geez | ✔ |
georgian | ✔ |
gothic | ✔ |
gujarati | ✔ |
gurmukhi | ✔ |
hebrew | 🔧 |
irish | 🔧 |
italian | 🔧 |
japanese | 🔧 |
kannada | 🔧 |
korean | 🔧 |
latin | 🔧 |
malayalam | 🔧 |
navajo | 🔧 |
old-norse | 🔧 |
old-persian | 🔧 |
oriya | 🔧 |
pali | 🔧 |
runic | 🔧 |
swahili | 🔧 |
tamil | 🔧 |
telugu | 🔧 |
thai | 🔧 |
tibetan | 🔧 |
turkish | 🔧 |
ugaritic | 🔧 |
vietnamese | 🔧 |
welsh | 🔧 |
import make, {
symbols,
vowels,
boundVowels,
consonants,
} from '@cluesurf/text/arabic'
make('جَمِيل') //=> "djami_l"
vowels.forEach(console.log)
import make from '@cluesurf/text/chinese'
make('měi lì de') //=> "me\\/i li\\ tO"
import toWylie from '@cluesurf/text/tibetan/wylie/to'
import fromWylie from '@cluesurf/text/tibetan/wylie/from'
toWylie('རིག་པ་') //=> "rig pa"
fromWylie('rig pa') //=> "རིག་པ"
Take the generated TalkText (the ASCII output from the base make calls), and convert it into a more compact, human-readable, "simplified" form.
import talk from '@cluesurf/talk'
talk('rIg ph~a') //=> "ṙịg pɦa"
Take the generated TalkText and convert it into a format compatible with ToneText fonts.
import talk from '@cluesurf/text/chinese'
import tone from '@cluesurf/tone'
tone(talk('měi lì de')) //=> "me8i li6 tO"
...which is rendered as:
Here is a table of the languages we've looked at so far, showing which can and can't have pronunciations generated automatically.
language | automatic | note |
---|---|---|
Chinese (Mandarin) | yes but not perfect | Pinyin can be used to auto generate pronunciations, but it doesn't always accurately reflect how people actually say each word, so it would be better to manually write each pronunciation if possible. |
Korean | yes but not perfect | |
Sanskrit | yes | With Devanagari, each sound has an exact pronunciation in Sanskrit, so we can get pretty close to exact pronunciations automatically done. |
Finnish | yes | |
Navajo | yes | Since it was fairly recently transcribed into a Latin alphabet, it is phonetic for the most part. |
Akkadian | yes | Because it is no longer spoken, we have at least a standard way of representing things. |
Spanish | yes | |
Hebrew | partially yes, but only for consonants unless diacritics given | |
Arabic | partially yes, but only for consonants unless diacritics given | |
English | no | Too many words need to have pronunciation memorized. |
Tibetan | no | Modern Tibetan has evolved to where the script no longer is phonetic. |
Vietnamese | no | |
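If you want to act on this table in code, one option is to encode the "automatic" column as a lookup and flag anything that is not a plain "yes" for manual review. This is only a sketch: the `AUTOMATION` map, the `Automation` labels, and `needsManualReview` are hypothetical names and are not exported by this library.

```typescript
// Hypothetical helper built from the table above; not part of the package.
type Automation = 'yes' | 'imperfect' | 'partial' | 'no'

const AUTOMATION: Record<string, Automation> = {
  chinese: 'imperfect', // yes but not perfect
  korean: 'imperfect', // yes but not perfect
  sanskrit: 'yes',
  finnish: 'yes',
  navajo: 'yes',
  akkadian: 'yes',
  spanish: 'yes',
  hebrew: 'partial', // consonants only unless diacritics are given
  arabic: 'partial', // consonants only unless diacritics are given
  english: 'no',
  tibetan: 'no',
  vietnamese: 'no',
}

// Languages missing from the table are treated conservatively.
function needsManualReview(language: string): boolean {
  return AUTOMATION[language] !== 'yes'
}

console.log(needsManualReview('finnish')) // false
console.log(needsManualReview('english')) // true
```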
MIT
This is being developed by the folks at ClueSurf, a California-based project for helping humanity master information and computation. Find us on Twitter, LinkedIn, and Facebook. Check out our other GitHub projects as well!