Update to Cldr 44.1 #6

Merged 4 commits on Feb 29, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -7,7 +7,6 @@ build
dist
cldr-localenames-*
cldr
-cldr-json
.hypothesis
.venv/
junit/
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
+[submodule "language_data/data/cldr-json"]
+	path = language_data/data/cldr-json
+	url = git@github.com:unicode-org/cldr-json.git
7 changes: 7 additions & 0 deletions README.md
@@ -41,3 +41,10 @@ These are all extracted from the Unicode [CLDR][] data package, version 40, plus
`language_data` is usually installed as a dependency of `langcodes`, and doesn't make much sense without it. You can `pip install language_data` anyway if you want.

To install the `language_data` package in editable mode, run `poetry install` in the package root. (This is the equivalent of `pip install -e .`, which will hopefully become compatible again soon via PEP 660.)

+## Update CLDR data
+
+* Make sure submodules are up to date: `git submodule update --init`
+* Download CLDR data from https://cldr.unicode.org/index/downloads/
+* Unzip and copy `supplemental/languageInfo.xml` and `supplemental/supplementalData.xml` into `language_data/data`
+* `cd language_data && ../.venv/bin/python build_data.py`
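
The copy step in the README addition above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: `copy_supplemental_files` and its arguments are hypothetical names, and `cldr_root` is assumed to be the directory that directly contains `supplemental/` after unzipping. Only the two file names and the `language_data/data` destination come from the README text.

```python
import shutil
from pathlib import Path

# File names taken from the README step; everything else here is hypothetical.
SUPPLEMENTAL_FILES = ("languageInfo.xml", "supplementalData.xml")


def copy_supplemental_files(cldr_root: Path, data_dir: Path) -> list[Path]:
    """Copy the supplemental XML files from an unzipped CLDR release.

    Assumes `cldr_root` directly contains a `supplemental/` directory and
    that `data_dir` already exists.
    """
    copied = []
    for name in SUPPLEMENTAL_FILES:
        source = cldr_root / "supplemental" / name
        if not source.is_file():
            # Fail loudly rather than building from a stale copy.
            raise FileNotFoundError(f"expected {source} in the unzipped CLDR archive")
        target = data_dir / name
        shutil.copyfile(source, target)
        copied.append(target)
    return copied
```

It would be called with something like `copy_supplemental_files(Path("cldr-unzipped"), Path("language_data/data"))` before running `build_data.py`; the first path is a placeholder.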
16 changes: 11 additions & 5 deletions language_data/build_data.py
@@ -440,11 +440,17 @@ def build_data():
         language_data = read_cldr_name_file(langcode, 'languages')
         update_names(names_fwd, language_names_rev, language_data)

-        script_data = read_cldr_name_file(langcode, 'scripts')
-        update_names(names_fwd, script_names_rev, script_data)
-
-        territory_data = read_cldr_name_file(langcode, 'territories')
-        update_names(names_fwd, territory_names_rev, territory_data)
+        try:
+            script_data = read_cldr_name_file(langcode, 'scripts')
+            update_names(names_fwd, script_names_rev, script_data)
+        except FileNotFoundError:
+            pass
+
+        try:
+            territory_data = read_cldr_name_file(langcode, 'territories')
+            update_names(names_fwd, territory_names_rev, territory_data)
+        except FileNotFoundError:
+            pass

     iana_languages, iana_scripts, iana_territories = read_iana_registry_names()
     update_names(names_fwd, language_names_rev, iana_languages)
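
The change to `build_data()` tolerates locales that have no `scripts` or `territories` file. The same idea can be sketched as a small helper. This is an illustration only: `update_optional_names` is a hypothetical name, and `read_names` and `update` are stand-ins for `read_cldr_name_file` and a partially applied `update_names`, passed in so the sketch runs on its own.

```python
def update_optional_names(read_names, update, langcode, kind):
    """Apply `update` to the `kind` names for `langcode` if a file exists.

    Returns True when names were applied and False when the locale has no
    file of this kind.
    """
    try:
        data = read_names(langcode, kind)
    except FileNotFoundError:
        # Missing data for this locale is expected, so skip it quietly.
        return False
    # The update runs outside the try block, so an unrelated
    # FileNotFoundError raised while updating would still surface.
    update(data)
    return True
```

Unlike the diff, which wraps both calls in the `try`, this variant guards only the read. That is a design choice of the sketch, not something the PR does.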
1 change: 1 addition & 0 deletions language_data/data/cldr-json
Submodule cldr-json added at e0981e
21 changes: 15 additions & 6 deletions language_data/data/languageInfo.xml
@@ -10,11 +10,11 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatching>
<languageMatches type="written_new">
<paradigmLocales locales="en en_GB es es_419 pt_BR pt_PT"/>
-<matchVariable id="$enUS" value="AS+GU+MH+MP+PR+UM+US+VI"/>
+<matchVariable id="$enUS" value="AS+CA+GU+MH+MP+PH+PR+UM+US+VI"/>
<matchVariable id="$cnsar" value="HK+MO"/>
<matchVariable id="$americas" value="019"/>
<matchVariable id="$maghreb" value="MA+DZ+TN+LY+MR+EH"/>
-<languageMatch desired="no" supported="nb" distance="1"/> <!-- nonb -->
+<languageMatch desired="nb" supported="no" distance="1"/> <!-- nbno -->
<!-- languageMatch desired="ku" supported="ckb" distance="4" oneway="true"/ --> <!-- ku ⇒ ckb -->
<!-- languageMatch desired="ckb" supported="ku" percent="8" oneway="true"/ --> <!-- ckb ⇒ ku -->
<languageMatch desired="hr" supported="bs" distance="4"/> <!-- hr ⇒ bs -->
@@ -38,18 +38,23 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatch desired="ach" supported="en" distance="30" oneway="true"/> <!-- Acoli (Southern Luo dialect in Uganda): ach ⇒ en -->
<languageMatch desired="af" supported="nl" distance="20" oneway="true"/> <!-- Afrikaans: af ⇒ nl -->
<languageMatch desired="ak" supported="en" distance="30" oneway="true"/> <!-- Akan: ak ⇒ en -->
+<languageMatch desired="am" supported="en" distance="30" oneway="true"/> <!-- Amharic ⇒ English -->
<languageMatch desired="ay" supported="es" distance="20" oneway="true"/> <!-- Aymara: ay ⇒ es -->
<languageMatch desired="az" supported="ru" distance="30" oneway="true"/> <!-- Azerbaijani: az ⇒ ru -->
+<languageMatch desired="bal" supported="ur" distance="20" oneway="true"/> <!-- Baluchi ⇒ Urdu -->
<languageMatch desired="be" supported="ru" distance="20" oneway="true"/> <!-- Belarusian: be ⇒ ru -->
<languageMatch desired="bem" supported="en" distance="30" oneway="true"/> <!-- Bemba (Zambia): bem ⇒ en -->
<languageMatch desired="bh" supported="hi" distance="30" oneway="true"/> <!-- Bihari languages (gets canonicalized to bho): bh ⇒ hi -->
<languageMatch desired="bn" supported="en" distance="30" oneway="true"/> <!-- Bangla: bn ⇒ en -->
+<languageMatch desired="bo" supported="zh" distance="20" oneway="true"/> <!-- Tibetan ⇒ Chinese -->
<languageMatch desired="br" supported="fr" distance="20" oneway="true"/> <!-- Breton: br ⇒ fr -->
+<languageMatch desired="ca" supported="es" distance="20" oneway="true"/> <!-- Catalan ⇒ Spanish -->
<languageMatch desired="ceb" supported="fil" distance="30" oneway="true"/> <!-- Cebuano: ceb ⇒ fil -->
<languageMatch desired="chr" supported="en" distance="20" oneway="true"/> <!-- Cherokee: chr ⇒ en -->
<languageMatch desired="ckb" supported="ar" distance="30" oneway="true"/> <!-- Sorani Kurdish: ckb ⇒ ar -->
<languageMatch desired="co" supported="fr" distance="20" oneway="true"/> <!-- Corsican: co ⇒ fr -->
<languageMatch desired="crs" supported="fr" distance="20" oneway="true"/> <!-- Seselwa Creole French: crs ⇒ fr -->
+<languageMatch desired="cs" supported="sk" distance="20"/> <!-- Czech ⇔ Slovak -->
<languageMatch desired="cy" supported="en" distance="20" oneway="true"/> <!-- Welsh: cy ⇒ en -->
<languageMatch desired="ee" supported="en" distance="30" oneway="true"/> <!-- Ewe: ee ⇒ en -->
<languageMatch desired="eo" supported="en" distance="30" oneway="true"/> <!-- Esperanto: eo ⇒ en -->
@@ -88,9 +93,10 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatch desired="lo" supported="en" distance="30" oneway="true"/> <!-- Lao: lo ⇒ en -->
<languageMatch desired="loz" supported="en" distance="30" oneway="true"/> <!-- Lozi: loz ⇒ en -->
<languageMatch desired="lua" supported="fr" distance="30" oneway="true"/> <!-- Luba-Lulua: lua ⇒ fr -->
+<languageMatch desired="mai" supported="hi" distance="20" oneway="true"/> <!-- Maithili ⇒ Hindi -->
<languageMatch desired="mfe" supported="en" distance="30" oneway="true"/> <!-- Morisyen: mfe ⇒ en -->
<languageMatch desired="mg" supported="fr" distance="30" oneway="true"/> <!-- Malagasy: mg ⇒ fr -->
-<languageMatch desired="mi" supported="en" distance="20" oneway="true"/> <!-- Maori: mi ⇒ en -->
+<languageMatch desired="mi" supported="en" distance="20" oneway="true"/> <!-- Māori: mi ⇒ en -->

<!-- CLDR-13625: Macedonian should not fall back to Bulgarian -->
<!-- languageMatch desired="mk" supported="bg" distance="30" oneway="true"/--> <!-- Macedonian: mk ⇒ bg -->
@@ -137,12 +143,14 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatch desired="tt" supported="ru" distance="30" oneway="true"/> <!-- Tatar: tt ⇒ ru -->
<languageMatch desired="tum" supported="en" distance="30" oneway="true"/> <!-- Tumbuka: tum ⇒ en -->
<languageMatch desired="ug" supported="zh" distance="20" oneway="true"/> <!-- Uighur: ug ⇒ zh -->
+<languageMatch desired="uk" supported="ru" distance="20" oneway="true"/> <!-- Ukrainian ⇒ Russian -->
<languageMatch desired="ur" supported="en" distance="30" oneway="true"/> <!-- Urdu: ur ⇒ en -->
<languageMatch desired="uz" supported="ru" distance="30" oneway="true"/> <!-- Uzbek: uz ⇒ ru -->
<languageMatch desired="wo" supported="fr" distance="30" oneway="true"/> <!-- Wolof: wo ⇒ fr -->
<languageMatch desired="xh" supported="en" distance="30" oneway="true"/> <!-- Xhosa: xh ⇒ en -->
<languageMatch desired="yi" supported="en" distance="30" oneway="true"/> <!-- Yiddish: yi ⇒ en -->
<languageMatch desired="yo" supported="en" distance="30" oneway="true"/> <!-- Yoruba: yo ⇒ en -->
+<languageMatch desired="za" supported="zh" distance="20" oneway="true"/> <!-- Zhuang languages ⇒ Chinese -->
<languageMatch desired="zu" supported="en" distance="30" oneway="true"/> <!-- Zulu: zu ⇒ en -->

<!-- START generated by GenerateLanguageMatches.java: don't manually change -->
@@ -359,8 +367,10 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatch desired="yue" supported="zh" distance="10" oneway="true"/> <!-- Chinese, Cantonese -->
<!-- END generated by GenerateLanguageMatches.java -->
<languageMatch desired="*" supported="*" distance="80"/> <!-- * ⇒ * -->
+<languageMatch desired="am_Ethi" supported="en_Latn" distance="10" oneway="true"/>
<languageMatch desired="az_Latn" supported="ru_Cyrl" distance="10" oneway="true"/> <!-- az; Latn ⇒ ru; Cyrl -->
<languageMatch desired="bn_Beng" supported="en_Latn" distance="10" oneway="true"/> <!-- bn; Beng ⇒ en; Latn -->
+<languageMatch desired="bo_Tibt" supported="zh_Hans" distance="10" oneway="true"/>
<languageMatch desired="hy_Armn" supported="ru_Cyrl" distance="10" oneway="true"/> <!-- hy; Armn ⇒ ru; Cyrl -->
<languageMatch desired="ka_Geor" supported="en_Latn" distance="10" oneway="true"/> <!-- ka; Geor ⇒ en; Latn -->
<languageMatch desired="km_Khmr" supported="en_Latn" distance="10" oneway="true"/> <!-- km; Khmr ⇒ en; Latn -->
@@ -382,9 +392,8 @@ For terms of use, see http://www.unicode.org/copyright.html
<languageMatch desired="uz_Latn" supported="ru_Cyrl" distance="10" oneway="true"/> <!-- uz; Latn ⇒ ru; Cyrl -->
<languageMatch desired="yi_Hebr" supported="en_Latn" distance="10" oneway="true"/> <!-- yi; Hebr ⇒ en; Latn -->
<languageMatch desired="sr_Latn" supported="sr_Cyrl" distance="5"/> <!-- sr; Latn ⇒ sr; Cyrl -->
-<languageMatch desired="zh_Hans" supported="zh_Hant" distance="15" oneway="true"/> <!-- zh; Hans ⇒ zh; Hant -->
-<languageMatch desired="zh_Hant" supported="zh_Hans" distance="19" oneway="true"/> <!-- zh; Hant ⇒ zh; Hans -->
-<!-- zh_Hani: Slightly bigger distance than zh_Hant->zh_Hans -->
+<languageMatch desired="za_Latn" supported="zh_Hans" distance="10" oneway="true"/>
+<!-- zh_Hani: Slightly bigger distance than zh_Hant->zh_Hans was before CLDR-14355 -->
<languageMatch desired="zh_Hani" supported="zh_Hans" distance="20" oneway="true"/>
<languageMatch desired="zh_Hani" supported="zh_Hant" distance="20" oneway="true"/>
<!-- Latin transliterations of some languages, initially from CLDR-13577 -->
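
For readers unfamiliar with the `languageInfo.xml` format shown in this diff, the `matchVariable` and `languageMatch` elements can be read with the standard library as sketched below. This is not how `build_data.py` consumes the file. `read_language_matches` and `SAMPLE` are hypothetical, and the sample rows are a small excerpt of lines that appear in the diff. The two-way handling follows the file's own comments, where rules without `oneway="true"` are marked `⇔`.

```python
import xml.etree.ElementTree as ET

# A few rows copied from the diff above, wrapped in their parent element.
SAMPLE = """\
<languageMatches type="written_new">
  <matchVariable id="$cnsar" value="HK+MO"/>
  <languageMatch desired="nb" supported="no" distance="1"/>
  <languageMatch desired="uk" supported="ru" distance="20" oneway="true"/>
  <languageMatch desired="zh_Hani" supported="zh_Hans" distance="20" oneway="true"/>
</languageMatches>
"""


def read_language_matches(xml_text):
    """Return (match_variables, distances) parsed from languageMatches XML."""
    root = ET.fromstring(xml_text)
    # Each matchVariable is a "+"-separated list of territory codes.
    variables = {
        elem.get("id"): elem.get("value").split("+")
        for elem in root.iter("matchVariable")
    }
    distances = {}
    for elem in root.iter("languageMatch"):
        desired, supported = elem.get("desired"), elem.get("supported")
        distance = int(elem.get("distance"))
        distances[(desired, supported)] = distance
        # Without oneway="true" the rule applies in both directions.
        if elem.get("oneway") != "true":
            distances.setdefault((supported, desired), distance)
    return variables, distances
```

With the sample above, `("nb", "no")` and `("no", "nb")` both map to distance 1, while `("uk", "ru")` has no reverse entry.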