From 4a0eb7845b8c5cd614a33a333e20f959e4e3c796 Mon Sep 17 00:00:00 2001 From: Peter Edberg <42151464+pedberg-icu@users.noreply.github.com> Date: Tue, 22 Mar 2022 22:11:06 -0700 Subject: [PATCH] CLDR-15398 Fix mistakes (#1785) (#1831) See #1785 (cherry picked from commit 5aca9fd2974ed7e2fb1496beffb7ae28c8a92ffa) Co-authored-by: Ivan Panchenko <39594356+ivan-pan@users.noreply.github.com> --- docs/ldml/tr35.md | 138 +++++++++++++++++++++++----------------------- 1 file changed, 69 insertions(+), 69 deletions(-) diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index f97e6a76cca..a26c6dfccba 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -29,7 +29,7 @@ _This is a draft document which may be updated, replaced, or superseded by other _Please submit corrigenda and other comments with the CLDR bug reporting form [[Bugs](https://cldr.unicode.org/index/bug-reports)]. Related information that is useful in understanding this document is found in the [References](#References). For the latest version of the Unicode Standard see [[Unicode](https://www.unicode.org/versions/latest/)]. For a list of current Unicode Technical Reports see [[Reports](https://www.unicode.org/reports/)]. For more information about versions of the Unicode Standard, see [[Versions](https://www.unicode.org/versions/)]._ ->**_NOTE: The source for the LDML specification has been converted to Github Markdown (GFM) instead of HTML. The formatting is now simpler, but some features — such as formatting for table captions — may not be complete by the release date. Improvements in the formatting for the v39 specification are planned for after the release, but no substantive changes would be made to the content._** +>**_NOTE: The source for the LDML specification has been converted to GitHub Markdown (GFM) instead of HTML. The formatting is now simpler, but some features — such as formatting for table captions — may not be complete by the release date. 
Improvements in the formatting for the v39 specification are planned for after the release, but no substantive changes would be made to the content._** ## Parts @@ -198,7 +198,7 @@ _**UAX35-C1.**_ An implementation that claims conformance to this specification * For example, an implementation might claim conformance to all LDML features except for _transforms_ and _segments_. 2. Interpret the relevant elements and attributes of LDML documents in accordance with the descriptions in those sections. * For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to [Date Field Symbol Table](tr35-dates.md#Date_Field_Symbol_Table). -3. Declare which types of CLDR data that it uses. +3. Declare which types of CLDR data it uses. * For example, an implementation might declare that it only uses language names, and those with a _draft_ status of _contributed_ or _approved_. _**UAX35-C2.**_ An implementation that claims conformance to Unicode locale or language identifiers shall: @@ -339,12 +339,12 @@ A [`unicode_locale_id`](#unicode_locale_id) has _canonical syntax_ when: For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes `"foo"` and `"bar"` in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification. -NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in [Section 4.1](https://tools.ietf.org/search/bcp47#section-4.1) of BCP47. Here are the considerations that lead to that decision: +NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in [Section 4.1](https://tools.ietf.org/search/bcp47#section-4.1) of BCP 47. Here are the considerations that lead to that decision: * The ordering in Section 4.1 is recommended, but not required for conformance. 
In particular, use of and ordering by Prefix is recommended but not required. * Moreover, [Section 4.5](https://tools.ietf.org/search/bcp47#section-4.5) states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.” - * The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP47, especially especially for language matching and language fallback. + * The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback. * Robust implementations will accept the variants in any order, just as they accept extensions in any order. - * A canonical order allows for determination of identity of identifers via string comparison. + * A canonical order allows for determination of identity of identifiers via string comparison. * The ordering in Section 4.1 does not result in a determinant order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another. * Pure alphabetical order is determinant and simple to implement while the ordering in Section 4.1 is indeterminant, more complex, and provides no significant benefit in modern applications. 
@@ -378,7 +378,7 @@ Unicode language and locale identifiers inherit the design and the repertoire of * No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47) * A tag must not start with the subtag "x": thus a _privateuse_ (eg x-abc) can only be after a language subtag, like "und" * It allows for certain semantic additions and constraints: - * Certain codes that are private-use in BCP-47 and ISO are given semantics by LDML + * Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML * Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see [Section 4.1.2](https://tools.ietf.org/search/bcp47#section-4.1.2)) * It allows certain syntax for backwards compatibility (not BCP 47-compatible): * The "\_" character for field separator characters, as well as the "-" used in [[BCP47](#BCP47)] (however, the canonical form is with "-") @@ -396,7 +396,7 @@ The different identifiers can be converted to one another as described in this s ##### BCP 47 Language Tag to Unicode BCP 47 Locale Identifier -A valid [[BCP47](#BCP47)] language tag can be converted to a valid Unicode BCP 47 locale identifier according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization) +A valid [[BCP47](#BCP47)] language tag can be converted to a valid Unicode BCP 47 locale identifier according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization). The result is a Unicode BCP 47 locale identifier, in canonical form. It is both a BCP 47 language tag and a Unicode locale identifier. Because the process maps from all BCP 47 language tags into a subset of BCP 47 language tags, the format changes are not reversible, much as a lowercase transformation of the string “McGowan” is not reversible. 
@@ -407,7 +407,7 @@ _Examples_ | `en-US` | `en-US` | no changes | | `iw-FX` | `he-FR` | BCP 47 canonicalization | | `cmn-TW` | `zh-TW` | language alias | -| `zh-cmn-TW` | `zh-TW` | BCP 47 canonicalization , then language alias | +| `zh-cmn-TW` | `zh-TW` | BCP 47 canonicalization, then language alias | | `sr-CS` | `sr-RS` | territory alias | | `sh` | `sr-Latn` | multiple replacement subtags | | `sh-Cyrl` | `sr-Cyrl` | no replacement with multiple replacement subtags | @@ -451,7 +451,7 @@ _Examples:_ ##### Truncation -BCP47 requires that implementations allow for language tags of at least 35 characters, in [Section 4.1.1](https://tools.ietf.org/search/bcp47#section-4.4.1). +BCP 47 requires that implementations allow for language tags of at least 35 characters, in [Section 4.1.1](https://tools.ietf.org/search/bcp47#section-4.4.1). To allow for use of extensions, CLDR extends that minimum to 255 for Unicode locale identifiers. Theoretically, a language tag could be far longer, due to the possibility of a large number of variants and extensions. In practice, the typical size of a locale or language identifier will be much smaller, so implementations can optimize for smaller sizes, as long as there is an escape mechanism allowing for up to 255. @@ -546,7 +546,7 @@ Unicode identifiers give specific semantics to certain Unicode Script values. Fo The private use subtags listed as **excluded** in _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)_ will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications. -#### `unicode_region_subtag` (also known as a _Unicode region code,_ or _a Unicode territory code) +#### `unicode_region_subtag` (also known as a _Unicode region code,_ or a _Unicode territory code_) Subtags in the region.xml file (see _Section 3.11 [Validity Data](#Validity_Data)_). 
These are based on [[BCP47](#BCP47)] subtag values marked as **Type: region** @@ -659,7 +659,7 @@ See also _Section 3.5.1 [Unknown or Invalid Identifiers](#Unknown_or_Invalid_Ide The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [[RFC6067](#RFC6067)] and extension 't' for transformed content [[RFC6497](#RFC6497)]. The Unicode BCP 47 extension data defines the complete list of valid subtags. -These subtags are all in lowercase (that is the canonical casing for these subtags), however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions are alphanumeric characters in length of two to eight that meet the rule `extension` in the [[BCP47](#BCP47)] +These subtags are all in lowercase (that is the canonical casing for these subtags), however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions are alphanumeric characters in length of two to eight that meet the rule `extension` in the [[BCP47](#BCP47)]. **The -u- Extension.** The syntax of 'u' extension subtags is defined by the rule `unicode_locale_extensions` in [Section 3.2 Unicode locale identifier](#Unicode_locale_identifier), except the separator of subtags `sep` must be always hyphen '-' when the extension is used as a part of BCP 47 language tag. @@ -778,7 +778,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other "h24" Hour system using 1–24; corresponds to 'k' in pattern -A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale‘s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml. 
+A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml. "lb" Line break style "strict" @@ -852,7 +852,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other "tz"
(timezone) Time zone Unicode short time zone IDs -

Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the file common/bcp47/timezone.xml file, plus a few extra values.

+

Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the common/bcp47/timezone.xml file, plus a few extra values.

For more information, see Section 3.6.3 Time Zone Identifiers.

CLDR provides data for normalizing timezone codes.

@@ -887,13 +887,13 @@ LDML inherits time zone IDs from the tz database [[Olson](#Olson)]. Because thes The short identifiers use UN/LOCODE [[LOCODE](#LOCODE)] (excluding a space character) codes where possible. For example, the short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles, US is "US LAX"). Identifiers of length not equal to 5 are used where there is no corresponding UN/LOCODE, such as "usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so that they do not overlap with future UN/LOCODE. -Although the first two letters of a short identifier may match an ISO 3166 two-letter country code, a user should not assume that the time zone belongs to the country. The first two letters in an identifier of length not equal to 5 has no meaning. Also, the identifiers are stabilized, meaning that they will not change no matter what changes happen in the base standard. So if Hawaii leaves the US and joins Canada as a new province, the short time zone identifier "ushnl" would not change in CLDR even if the UN/LOCODE changes to "cahnl" or something else. +Although the first two letters of a short identifier may match an ISO 3166 two-letter country code, a user should not assume that the time zone belongs to the country. The first two letters in an identifier of length not equal to 5 have no meaning. Also, the identifiers are stabilized, meaning that they will not change no matter what changes happen in the base standard. So if Hawaii leaves the US and joins Canada as a new province, the short time zone identifier "ushnl" would not change in CLDR even if the UN/LOCODE changes to "cahnl" or something else. There is a special code "unk" for an Unknown or Invalid time zone. This can be expressed in the tz database style ID "Etc/Unknown", although it is not defined in the tz database. 
**Stability of Time Zone Identifiers** -Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (as those found in **zone.tab** file) might be changed time to time. For example, "Asia/Culcutta" was replaced with "Asia/Kolkata" and moved to **backward** file in the tz database. CLDR contains locale data using a time zone ID from the tz database as the key, stability of the IDs is cirtical. +Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (as those found in **zone.tab** file) might be changed time to time. For example, "Asia/Culcutta" was replaced with "Asia/Kolkata" and moved to **backward** file in the tz database. CLDR contains locale data using a time zone ID from the tz database as the key, stability of the IDs is critical. To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule applied to the `alias` attribute in the `` element for "tz" - the first "long" ID is the CLDR canonical "long" time zone ID. @@ -1080,14 +1080,14 @@ It is strongly recommended that all API methods accept all possible aliases for The subdivision codes designate a subdivision of a country or region. They are called various names, such as a _state_ in the United States, or a _province_ in Canada. The codes in CLDR are based on ISO 3166-2 subdivision codes. The ISO codes have a region code followed by a hyphen, then a suffix consisting of 1..3 ASCII letters or digits. -The CLDR codes are designed to work in a [unicode_locale_id](#unicode_locale_id) (BCP47), and are thus all lowercase, with no hyphen. For example, the following are valid, and mean “English as used in California, USA”. +The CLDR codes are designed to work in a [unicode_locale_id](#unicode_locale_id) (BCP 47), and are thus all lowercase, with no hyphen. For example, the following are valid, and mean “English as used in California, USA”. 
* en-u-sd-**usca** * en-US-u-sd-**usca** CLDR has additional subdivision codes. These may start with a 3-digit region code or use a suffix of 4 ASCII letters or digits, so they will not collide with the ISO codes. Subdivision codes for unknown values are the region code plus "zzzz", such as "uszzzz" for an unknown subdivision of the US. Other codes may be added for stability. -Like BCP 47, CLDR requires stable codes, which are not guaranteed for ISO 3166-2 (nor have the ISO 3166-2 codes been stable in the past). If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ICU 3166-2 code is reused (for the same region), then CLDR will define a new equivalent code using these a 4-character suffixes. +Like BCP 47, CLDR requires stable codes, which are not guaranteed for ISO 3166-2 (nor have the ISO 3166-2 codes been stable in the past). If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ICU 3166-2 code is reused (for the same region), then CLDR will define a new equivalent code using these as 4-character suffixes. ##### 3.6.5.1 Validity @@ -1141,7 +1141,7 @@ In the transformed content 't' data file, the `name` attribute in a `` elem ```xml - + ``` The data above indicates: @@ -1158,7 +1158,7 @@ The attributes are: **description** -> A description of the name, with all and only that information necessary to distinguish one name from | American Library others with which it might be confused. Descriptions are not intended to provide general background information. +> A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information. 
**since** @@ -1176,7 +1176,7 @@ LDML version before 1.7.2 used slightly different syntax for variant subtags and #### 3.8.1 Old Locale Extension Syntax -LDML 1.7 or older specification used different syntax for representing unicode locale extensions. The previous definition of Unicode locale extensions had the following structure: +LDML 1.7 or older specification used different syntax for representing Unicode locale extensions. The previous definition of Unicode locale extensions had the following structure: | | EBNF | | ----------------------------- | ---- | @@ -1221,7 +1221,7 @@ Old LDML specification allowed codes other than registered [[BCP47](#BCP47)] var | `NYNORSK` | Nynorsk, variant of "`no`" Norwegian. Use primary language subtag "`nn`" to indicate this. | | `POSIX` | POSIX variation of locale data. Use Unicode locale extension `-u-va-posix` to indicate this. | | `POLYTONI` | Polytonic, variant of "`el`" Greek. Use [[BCP47](#BCP47)] variant subtag `polyton` to indicate this. | -| `SAAHO` | The Saaho variant of Afar. Use primary language subtag "`ssy`" to indicated this. | +| `SAAHO` | The Saaho variant of Afar. Use primary language subtag "`ssy`" to indicate this. | When converting to old syntax, the Unicode locale extension "`-u-va-posix`" should be converted to the "`POSIX`" variant, _not_ to old extension syntax like "`@va=posix`". This is an exception: The other mappings above should not be reversed. @@ -1236,7 +1236,7 @@ Examples: #### 3.8.3 Relation to OpenI18n -The locale id format generally follows the description in the _OpenI18N Locale Naming Guideline_ [[NamingGuideline](#NamingGuideline)], with some enhancements. The main differences from the those guidelines are that the locale id: +The locale id format generally follows the description in the _OpenI18N Locale Naming Guideline_ [[NamingGuideline](#NamingGuideline)], with some enhancements. The main differences from those guidelines are that the locale id: 1. 
does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8, although that can be transcoded to other encodings as well.) 2. adds the ability to have a variant, as in Java @@ -1275,7 +1275,7 @@ In addition, exceptions are often caught at a higher level; they do not end up b People have very slippery notions of what distinguishes a language code versus a locale code. The problem is that both are somewhat nebulous concepts. -In practice, many people use [[BCP47](#BCP47)] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [[BCP47](#BCP47)] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives an [[BCP47](#BCP47)] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" versus "\_" (for example, _zh-TW_ for language code, _zh_TW_ for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "\_" as equivalent when interpreting either one on input. +In practice, many people use [[BCP47](#BCP47)] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [[BCP47](#BCP47)] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives a [[BCP47](#BCP47)] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. 
Some people recommend distinguishing on the basis of "-" versus "\_" (for example, _zh-TW_ for language code, _zh_TW_ for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "\_" as equivalent when interpreting either one on input. Another reason for the conflation of these codes is that _very_ little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really does not make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different in kind than other language differences between regions. @@ -1293,10 +1293,10 @@ Criteria for what makes a written language should be purely pragmatic; _what wou So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication: -2. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader." -3. "Theatre Centre News: The date of the last version of this document was 20/3/2003. 
A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader." +2. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader." +3. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader." -Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was _not_ acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there limits on what is acceptable English, and "2003年3月20日", for example, is _not_. +Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was _not_ acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is _not_. 
Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well. @@ -1310,7 +1310,7 @@ Hybrid locales have intermixed content from 2 (or more) languages, often with on Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg… -While text in a document can be tagged as partly in one language and partly in another, that is not the same having a hybrid locale. There is a difference between having a Spanglish document, and a Spanish document that has some passages quoted in English. Fine-grained tagging doesn't handle grammatical combinations like Denglisch “[​gedownloadet](https://www.duden.de/rechtschreibung/downloaden)”, which is neither English nor German — similarly the Franglais “[downloadé](https://www.le-dictionnaire.com/definition.php?mot=downloader)”. More importantly, it doesn’t work for the very common use case for a [unicode_locale_id](#unicode_locale_id): _locale selection_. +While text in a document can be tagged as partly in one language and partly in another, that is not the same having a hybrid locale. There is a difference between having a Spanglish document, and a Spanish document that has some passages quoted in English. Fine-grained tagging doesn't handle grammatical combinations like Tanglish “Enna matteru?” (_What’s the matter?_), which is neither standard Tamil nor standard English. More importantly, it doesn’t work for the very common use case for a [unicode_locale_id](#unicode_locale_id): _locale selection_. 
To communicate requests for localized content and internationalization services, locales are used. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc.). To allow an application to support Spanglish or Hinglish locale selection, [unicode_locale_id](#unicode_locale_id)s can represent hybrid locales using the T extension key-value 'h0-hybrid'. (For more information on the T extension, see _Section 3.7 [Unicode BCP 47 T Extension](#t_Extension)._) @@ -1365,7 +1365,7 @@ The directory [common/validity](https://github.com/unicode-org/cldr/releases/tag * Note that some two-letter region codes are macroregions, and (in the future) some three-digit codes may be regular codes. * For details as to which regions are contained within which macroregions, see the `` element of the supplemental data. * **deprecated** — codes that should not be used. The `` element in the supplementalMeta file contains more information about these codes, and which codes should be used instead. -* **private_use** — codes that, for CLDR, are considered private use. Note that some private-use codes in a source standard such as BCP47 have defined CLDR semantics, and are considered regular codes. For more information, see _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)._ +* **private_use** — codes that, for CLDR, are considered private use. Note that some private-use codes in a source standard such as BCP 47 have defined CLDR semantics, and are considered regular codes. For more information, see _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)._ * **reserved** — codes that are private use in a source standard, but are reserved for future use as regular codes by CLDR. The list of subtags for each idStatus use a compact format as a space-delimited list of StringRanges, as defined in _Section [5.3.4 String Range](#String_Range)._ The separator for each StringRange is a "~". 
@@ -1414,7 +1414,7 @@ If a type and key are supplied in the locale id, then logically the chain from t Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data: "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see _CLDR Information: Section 9.3 [Default Content](tr35-info.md#Default_Content)._ -Certain data items depend only on the region specified in a locale id (by a [unicode_region_subtag](#unicode_region_subtag_validity) or an “rg” [Region Override](#RegionOverride) key) , and are obtained from supplemental data rather than through locale resources. For example: +Certain data items depend only on the region specified in a locale id (by a [unicode_region_subtag](#unicode_region_subtag_validity) or an “rg” [Region Override](#RegionOverride) key), and are obtained from supplemental data rather than through locale resources. For example: * The currency for the specified region (see [Supplemental Currency Data](tr35-numbers.md#Supplemental_Currency_Data)) * The measurement system for the specified region (see [Measurement System Data](tr35-general.md#Measurement_System_Data)) @@ -1439,7 +1439,7 @@ lang_region (aliases to lang_script_region) There are actually two different kinds of inheritance fallback: _resource bundle lookup_ and _resource item lookup_. 
For the former, a process is looking to find the first, best resource bundle it can; for the later, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton. -These are closely related, but distinct, processes. They are illustrated in the table [Lookup Differences](#Lookup-Differences), where "key" stands for zero or more key/type pairs. Logically speaking, when looking up an item for a given locale, you first do a resource bundle lookup to find the best bundle for the locale, then you do a inherited item lookup starting with that resource bundle. +These are closely related, but distinct, processes. They are illustrated in the table [Lookup Differences](#Lookup-Differences), where "key" stands for zero or more key/type pairs. Logically speaking, when looking up an item for a given locale, you first do a resource bundle lookup to find the best bundle for the locale, then you do an inherited item lookup starting with that resource bundle. The table [Lookup Differences](#Lookup-Differences) uses the naïve resource bundle lookup for illustration. More sophisticated systems will get far better results for resource bundle lookup if they use the algorithm described in _Section 4.4 [Language Matching](#LanguageMatching)_. That algorithm takes into account both the user’s desired locale(s) and the application’s supported locales, in order to get the best match. 
@@ -1498,9 +1498,9 @@ For the purposes of CLDR, everything with the `` dtd is treated logically -_Both the resource bundle inheritance and the inherited item inheritance use the parentLocale data, where available, instead of simple trunctation._ +_Both the resource bundle inheritance and the inherited item inheritance use the parentLocale data, where available, instead of simple truncation._ -The fallback is a bit different for these two cases; internal aliases and keys are are not involved in the bundle lookup, and the default locale is not involved in the item lookup. If the default-locale were used in the resource-item lookup, then strange results will occur. For example, suppose that the default locale is Swedish, and there is a Nama locale but no specific inherited item for collation. If the default-locale were used in resource-item lookup, it would produce odd and unexpected results for Nama sorting. +The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. If the default-locale were used in the resource-item lookup, then strange results will occur. For example, suppose that the default locale is Swedish, and there is a Nama locale but no specific inherited item for collation. If the default-locale were used in resource-item lookup, it would produce odd and unexpected results for Nama sorting. The default locale is not even always used in resource bundle inheritance. For the following services, the fallback is always directly to the root locale rather than through default locale. @@ -1548,7 +1548,7 @@ For example, with /currency [@type="CVE"], the decimal symbol for almost all loc The following attributes use lateral inheritance for **all elements** with the DTD root = ldml, except where otherwise noted. The process is applied recursively. 
-| Atttribute | Fallback | Exception Elements | +| Attribute | Fallback | Exception Elements | | ---------- | -------------------------------------- | --------------------------- | | alt | __no alt attribute__ | _none_ | | case | "nominative" → ∅ | caseMinimalPairs | @@ -1737,7 +1737,7 @@ Two LDML element chains are _equivalent_ when they would be identical if all att * `//ldml/localeDisplayNames/languages/language[@type="ar"]` * `//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]` -For any locale ID, an _locale chain_ is an ordered list starting with the root and leading down to the ID. For example: +For any locale ID, a _locale chain_ is an ordered list starting with the root and leading down to the ID. For example: > @@ -1793,7 +1793,7 @@ Any data can be added to that file, and the status will all be `draft="unconfirm ``` -However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in _[Section 5.6 Canonical Form](#Canonical_Form)_. If an LDML file does has draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file. +However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in _[Section 5.6 Canonical Form](#Canonical_Form)_. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file. More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Sections 1, 2, and 4 are simply formalizations of what is in LDML already. Item 3 adds the new element. @@ -1894,7 +1894,7 @@ Examples of lookup for Chinese collation types. 
Note: ``` -For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever if no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'. +For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'. #### 4.2.6 Inheritance vs Related Information @@ -1973,9 +1973,9 @@ So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh" returns "zh_Ha In more detail, the data is designed to be used in the following operations. -Note that as of CLDR v24, any field present in the 'from' field, is also present in the 'to' field, so an input field will not change in "Add Likely Subtags" operation. The data and operations can also be used with language tags using [[BCP47](#BCP47)] syntax, with the appropriate changes. In addition, certain common 'denormalized' language subtags such as 'iw' (for 'he') may occur in both the 'from' and 'to' fields. This allows for implementations that use those denormalized subtags to use the data with only minor changes to the operations. +Note that as of CLDR v24, any field present in the 'from' field is also present in the 'to' field, so an input field will not change in "Add Likely Subtags" operation. The data and operations can also be used with language tags using [[BCP47](#BCP47)] syntax, with the appropriate changes. In addition, certain common 'denormalized' language subtags such as 'iw' (for 'he') may occur in both the 'from' and 'to' fields. 
This allows for implementations that use those denormalized subtags to use the data with only minor changes to the operations. -An implementation may choose exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it. +An implementation may choose to exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it. _**Add Likely Subtags:**_ _Given a source locale X, to return a locale Y where the empty subtags have been filled in by the most likely subtags._ This is written as X ⇒ Y ("X maximizes to Y"). @@ -1985,18 +1985,18 @@ This operation is performed in the following way. 1. **Canonicalize.** 1. Make sure the input locale is in canonical form: uses the right separator, and has the right casing. - 2. Replace any deprecated subtags with their canonical values using the `` data in supplemental metadata. Use the first value in the replacement list, if it exists. Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ". + 2. Replace any deprecated subtags with their canonical values using the `` data in supplemental metadata. Use the first value in the replacement list, if it exists. Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or "mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ". 3. 
If the tag is a legacy language tag (marked as “Type: grandfathered” in BCP 47; see `` in the supplemental data), then return it. 4. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur. 5. Get the components of the cleaned-up source tag _(languages, scripts,_ and _regions_), plus any variants and extensions. -2. **Lookup.** Lookup each of the following in order, and stop on the first match: +2. **Lookup.** Look up each of the following in order, and stop on the first match: 1. _languages_scripts_regions_ 2. _languages_regions_ 3. _languages_scripts_ 4. __languages__ 5. und\__scripts_ 3. **Return** - 1. If there is no match,either return + 1. If there is no match, either return 1. an error value, or 2. the match for "und" (in APIs where a valid language tag is required). 2. Otherwise there is a match = _languagem_scriptm_regionm_ @@ -2009,8 +2009,8 @@ _Example1:_ * Input is ZH-ZZZZ-SG. * Normalize to zh_SG. -* Lookup in table. No match. -* Lookup zh, and get the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG. +* Look up in table. No match. +* Look up zh, and get the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG. To find the most likely language for a country, or language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW. @@ -2076,7 +2076,7 @@ hr-Latn hr ``` -Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content. 
+Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content. Language Matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain translation for the key "CN" (for the country China). It is best to return "chine", rather than falling back to the value default language such as Russian and getting "Китай". The language matching data can be used to get the closest fallback locales (of those supported) to a given language. @@ -2088,7 +2088,7 @@ When such fallback is used for inherited item lookup, the normal order of inheri That is, we first look in **nb-NO**. If there is no value for **P** there, then we look in **nb**. If there is no value for **P** there, we return the value for **P** in root (or a code value, if there is nothing there). Remember that if there is an `alias` element along this path, then the lookup may restart with a different path in **nb-NO** (or another locale). -However, suppose that **nb-NO** has the fallback values **[nn da sv en]**, derived from language matching. In that case, an implementation _may_ progressively lookup each of the listed locales, with the appropriate substitutions, returning the first value that is not found in **root**. 
This follows roughly the following pseudocode: +However, suppose that **nb-NO** has the fallback values **[nn da sv en]**, derived from language matching. In that case, an implementation _may_ progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in **root**. This follows roughly the following pseudocode: ```c value = lookup(P, nb-NO); if (locationFound != root) return value; @@ -2133,7 +2133,7 @@ To find the matching distance MD between any two languages, perform the followin 3. For each subtag in {language, script, region} 1. If respective subtags in each language tag are identical, remove the subtag from each (logically) and continue. 2. Traverse the languageMatching data until a match is found. - * * matches any field. + * \* matches any field. * If the oneway flag is false, then the match is symmetric; otherwise only match one direction. * For region matching, use the mechanisms in **Section 4.4.1 [Enhanced Language Matching](#EnhancedLanguageMatching)**. 3. Add the `distance` attribute value to MD. @@ -2205,7 +2205,7 @@ _Example:_ … ``` -The **matchVariable** allows for a rule to matche to multiple regions, as illustrated by **\$maghreb**. The syntax is simple: it allows for + for _union_ and - for _set difference_, but no precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as (A+B)-(A+D). The variable **id** has a value of the form [$][a-zA-Z0-9]+. If $X is defined, then $!X automatically means all those regions that are not in $X. +The **matchVariable** allows for a rule to match to multiple regions, as illustrated by **\$maghreb**. The syntax is simple: it allows for + for _union_ and - for _set difference_, but no precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as (A+B)-(A+D). The variable **id** has a value of the form [$][a-zA-Z0-9]+. If $X is defined, then $!X automatically means all those regions that are not in $X. 
When the set is interpreted, then macrolanguages are (logically) transformed into a list of their contents, so “053+GB” → “AU+GB+NF+NZ”. This is done recursively, so 009 → “053+054+057+061+QO” → “AU+NF+NZ+FJ+NC+PG+SB +VU...”. Note that we use 019 for all of the Americas in the variables above, because en-US should be in the same cluster as es-419 and its contents. @@ -2332,7 +2332,7 @@ This element is designed to allow for arbitrary additional annotation and data t ##### 5.1.1.1 Sample Special Elements -The elements in this section are _**not**_ part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Repository. They may change or be removed future versions of this document, and are present her more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.) +The elements in this section are _**not**_ part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.) * [https://unicode.org/cldr/dtd/1.1/ldmlICU.dtd](https://unicode.org/cldr/dtd/1.1/ldmlICU.dtd) * [https://unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd](https://unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd) @@ -2400,7 +2400,7 @@ Consider the following example in root: ``` -If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at the that path. If not found there, then the resource bundle at "de" will be searched, and so on. 
When the alias is found in root, then the search is restarted, but searching for format/**wide** element instead of format/abbreviated. +If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at that path. If not found there, then the resource bundle at "de" will be searched, and so on. When the alias is found in root, then the search is restarted, but searching for format/**wide** element instead of format/abbreviated. If the `path` attribute is present, then its value is an [[XPath](#XPath)] that points to a different node in the tree. For example: @@ -2519,7 +2519,7 @@ Many elements can have a display name. This is a translated name that can be pre ``` -Where present, the display names must be unique; that is, two distinct code would not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [[Data Formats](#DataFormats)]. +Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [[Data Formats](#DataFormats)]. #### 5.1.4 Escaping Characters @@ -2558,9 +2558,9 @@ The `draft` attribute should only occur on "leaf" elements, and is deprecated el #### 5.2.3 Attribute alt -This attribute labels an alternative value for an element. 
The value is a _descriptor_ indicates what kind of alternative it is, and takes one of the following +This attribute labels an alternative value for an element. The value is a _descriptor_ that indicates what kind of alternative it is, and takes one of the following -* `variantname` meaning that the value is a variant of the normal value, and may be used in its place in certain circumstances. If a variant value is absent for a particular locale, the normal value is used. The variant mechanism should only be used when such a fallback is acceptable. +* `variantname` means that the value is a variant of the normal value, and may be used in its place in certain circumstances. If a variant value is absent for a particular locale, the normal value is used. The variant mechanism should only be used when such a fallback is acceptable. * `proposed`, optionally followed by a number, indicating that the value is a proposed replacement for an existing value. * `variantname-proposed`, optionally followed by a number, indicating that the value is a proposed replacement variant value. @@ -2713,11 +2713,11 @@ Anything else following a backslash is mapped to itself, except the property syn Any code point formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \\x, \\u and \\U escapes create literal code points. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary code points in an ASCII source file, and any resulting code points are _**not**_ tagged as literals.) -Unicode property sets are defined as described as described in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [[ICUUnicodeSet](#ICUUnicodeSet)]. 
+Unicode property sets are defined as described in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [[ICUUnicodeSet](#ICUUnicodeSet)]. ##### 5.3.3.2 Unicode Properties -Briefly, Unicode property sets are specified by any Unicode property and a value of that property, such as **[:General_Category=Letter:]**. for Unicode letters or **\\p\{uppercase}** is the set of upper case letters in Unicode. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [[UAX44](https://unicode.org/reports/tr41/#UAX44)]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of `"="`. For example, you can match letters by using the POSIX-style syntax: +Briefly, Unicode property sets are specified by any Unicode property and a value of that property, such as **[:General_Category=Letter:]** for Unicode letters or **\\p\{uppercase}** for the set of upper case letters in Unicode. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [[UAX44](https://unicode.org/reports/tr41/#UAX44)]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of `"="`. For example, you can match letters by using the POSIX-style syntax: **[:General_Category=Letter:]** @@ -2905,7 +2905,7 @@ Note that there was one case that had to be corrected in order to make this true 8. All attributes with defaulted values are suppressed. 9. The draft and `alt="proposed.*"` attributes are only on leaf elements. 10. The tzid are canonicalized in the following way: - * All tzids as of as CLDR 1.1 (2004.06.08) in zone.tab are canonical. 
+ * All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical. * After that point, the first time a tzid is introduced, that is the canonical form. That is, new IDs are added, but existing ones keep the original form. The _TZ_ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when `America/Argentina/Catamarca` was introduced as the new name for the previous `America/Catamarca` , a link was added in the backward file. @@ -2991,11 +2991,11 @@ There is additional information in the attributeValueValidity.xml file that is u $_bcp47_cu ``` -The element values may be literals, regular expressions, or variables (some of which are set programmatically according to other CLDR data, such as the above. However, the information as this point does not cover all attribute values, is used only for testing, and should not be used in implementations since the structure may change without notice. +The element values may be literals, regular expressions, or variables (some of which are set programmatically according to other CLDR data, such as the above). However, the information at this point does not cover all attribute values, is used only for testing, and should not be used in implementations since the structure may change without notice. #### 5.7.1 Attribute Value Constraints -The following are constraints on the attribute values. Note: in future versions, the format may change, and/or the constaints may be tightened. +The following are constraints on the attribute values. Note: in future versions, the format may change, and/or the constraints may be tightened. | Constraint | Comments | | ------------------------- | -------- | @@ -3172,7 +3172,7 @@ The `` element was introduced in CLDR 21. The values for ### A.13 Deprecated subelements of `` -* `` and ``: Replaced with `` and ``. +* `` and ``: Replaced with `` and ``. 
### A.14 Element cp @@ -3349,7 +3349,7 @@ The `languageAlias`, `scriptAlias`, `territoryAlias`, and `variantAlias` element > Note: in the following discussion, the separator '-' is used. That is also used in examples of XML alias data, even though for compatibility reasons that alias data actually uses '\_' as a separator. The processing can also be applied to syntax while maintaining the separator '\_', _mutatis mutandis_. CLDR also uses “territory” and “region” interchangeably. -> Also note that the discussion of canonicalization assumes BCP47 +> Also note that the discussion of canonicalization assumes BCP 47 > input data. If input data is a CLDR or ICU locale ID such > as `en_US_POSIX`, a conversion step must be done prior to > canonicalization. @@ -3359,7 +3359,7 @@ The `languageAlias`, `scriptAlias`, `territoryAlias`, and `variantAlias` element #### 1. Multimap interpretation -Interpret each languageId as a multimap from a _fieldId_ (language, script, region, variants) to a ** sorted set** of field values. +Interpret each languageId as a multimap from a _fieldId_ (language, script, region, variants) to a **sorted set** of field values. _Examples:_ @@ -3381,7 +3381,7 @@ _Examples:_ For the `languageAlias` elements, the _type_ and _replacements_ are languageIds. -For the script-, territory- (aka region), and variant- Alias elements, the type and replacements are interpreted as a languageIds, _after_ prefixing with “und-”. Thus +For the script-, territory- (aka region), and variant- Alias elements, the type and replacements are interpreted as a languageId, _after_ prefixing with “und-”. Thus ```xml @@ -3401,7 +3401,7 @@ A rule matches a source if and only for all fields, each _source_ field ⊇ _typ _Examples:_ -`source=“ja-heploc-hepburn”` and `type=”und-hepburn”` +`source="ja-heploc-hepburn"` and `type="und-hepburn"` @@ -3410,7 +3410,7 @@ _Examples:_ so the rule matches the source. 
(Note that order of variants is immaterial to matching) -`source=“ja-hepburn”` and `type=”und-hepburn-heploc”` +`source="ja-hepburn"` and `type="und-hepburn-heploc"`
{ja} ⊇ {}success, und = {}
@@ -3432,15 +3432,15 @@ _Example:_ > source=ja-Latn-fonipa-hepburn-heploc > -> rule =”\ rule ="\ -> replacement="und-alalc97">” +> replacement="und-alalc97">" > -> result=”ja-Latn-alalc97-fonipa” // note that CLDR canonical order of variants is alphabetical +> result="ja-Latn-alalc97-fonipa" // note that CLDR canonical order of variants is alphabetical ##### Territory Exception -If the field = territory, and the replacement.field has more than one value, then look up the most likely territory\* for the base language code (and script, if there is one). If that likely territory is in the list of replacements, use it. Otherwise, use the first territory in the list. +If the field = territory, and the replacement.field has more than one value, then look up the most likely territory for the base language code (and script, if there is one). If that likely territory is in the list of replacements, use it. Otherwise, use the first territory in the list. #### 5. Canonicalizing Syntax @@ -3451,7 +3451,7 @@ To canonicalize the syntax of _source_: * Note: These are only for specialized use. * Casing * Put any script subtag inside unicode_language_id into title case (eg, Hant) - * Put any region subtag inside unicode_language_id int uppercase (eg, DE) + * Put any region subtag inside unicode_language_id into uppercase (eg, DE) * Put all other subtags into lowercase (eg, en, fonipa) * Order * Put any variants into alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa) @@ -3509,7 +3509,7 @@ To canonicalize a given _source_: 2. Else if there is an extlang subtag, then apply Step 3 of BCP 47 [Section 4.5](https://tools.ietf.org/search/bcp47#section-4.5) to remove the extlang subtag (possibly adjusting the language subtag). 1. Don’t apply any of the other canonicalization steps in that section, however. 3. Else if the first subtag is "x", prefix by "und-". - 4. **Note:** there are currently no valid 4-letter primary language subtags. 
While it is extremely unlikely that BCP47 would ever register them, if so then _languageAlias_ mappings will be supplied for them, mapping to defined CLDR language subtags (from the `idStatus="reserved"` set). + 4. **Note:** there are currently no valid 4-letter primary language subtags. While it is extremely unlikely that BCP 47 would ever register them, if so then _languageAlias_ mappings will be supplied for them, mapping to defined CLDR language subtags (from the `idStatus="reserved"` set). 3. Find the first matching rule in **Alias Rules** (from **Preprocessing**) 1. If there are none, return _source_ 4. Transform _source_ according to that rule