🌐 iso·gloss

ISO 639 and IETF Language Code Lookup Tool

isogloss is a Python-based command-line tool for looking up language details from ISO 639 codes and IETF (BCP-47) language tags. For each code or tag it reports the language's names, native names, scope, type, and equivalent codes in the other ISO 639 parts.

There is also a web-based version here. Its BCP-47 parser has some known issues, documented below in the "Errata" section.

Elsewhere, the word isogloss means a boundary line on a map denoting the regional use of a particular linguistic characteristic, but in this case it just seemed to fit.

Features

  • Look up language details using ISO 639-1, 639-2/B, 639-2/T, or 639-3 codes.
  • Look up language details by language name.
  • Look up language details using IETF BCP-47 language tags.
    • Examples: en-GB, en-US, sv-SE, zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1, and so on.

Installation

Clone the repository to your local machine:

git clone https://github.com/thunderpoot/isogloss.git

Create a virtual environment and install the requirements:

python3.11 -m venv venv
source venv/bin/activate
pip install unidecode

Usage

The script can be run directly from the command line. Below are some examples of how to use it:

To look up information by ISO 639 code:

$ isogloss/isogloss.py -c swe
{
  "639-1": "sv",
  "Scope": "Individual",
  "Type": "Living",
  "Native name(s)": "svenska",
  "Other name(s)": "",
  "639-2/T": "swe",
  "639-2/B": "",
  "639-3": "swe",
  "Name(s)": "Swedish"
}

To look up information by language name:

$ isogloss/isogloss.py -n "egyptian arabic"
{
    "Egyptian Arabic": "arz"
}

To look up a language by its native name:

$ isogloss/isogloss.py -n 日本語
{
    "\u65e5\u672c\u8a9e Nihongo": "jpn"
}
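The `\uXXXX` sequences in the output are ordinary JSON escapes for non-ASCII characters, so any JSON parser recovers the original text. A minimal sketch in Python, using the output above pasted in as a string (the variable names are only illustrative):

```python
import json

# The tool's output from the example above, kept as a raw string so the
# backslash escapes reach the JSON parser untouched.
raw = r'{"\u65e5\u672c\u8a9e Nihongo": "jpn"}'

result = json.loads(raw)
name, code = next(iter(result.items()))
# name is now "日本語 Nihongo" and code is "jpn"

# Re-serialise without escaping, for display in a UTF-8 terminal.
readable = json.dumps(result, ensure_ascii=False, indent=4)
```

Piping the tool's output through `jq`, as in one of the IETF examples further down, has the same effect on the command line.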

A name search can return multiple results:

$ isogloss/isogloss.py -n norwegian
{
    "Norwegian Nynorsk": "nno",
    "Nynorsk, Norwegian": "nno",
    "Bokm\u00e5l, Norwegian": "nob",
    "Norwegian Bokm\u00e5l": "nob",
    "Norwegian": "nor",
    "Norwegian Sign Language": "nsl",
    "Traveller Norwegian": "rmg"
}

Language names are normalised, allowing for case-insensitive and accent-insensitive matching when searching:

$ isogloss/isogloss.py -n espanol
{
    "Judeo-espa\u00f1ol": "lad",
    "espa\u00f1ol": "spa"
}
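One way to get this kind of matching is to strip combining accents and fold case before comparing. The sketch below is an illustration only, not the tool's own code: it uses the standard library's `unicodedata` as a stand-in (the install step suggests the tool relies on `unidecode`), the `normalise` and `search` helpers are hypothetical, the substring comparison is an assumption inferred from the output above, and the three-entry `names` table is made up for the demo.

```python
import unicodedata


def normalise(name):
    """Strip accents and fold case so 'Español' and 'espanol' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def search(query, names):
    """Return every name whose normalised form contains the normalised query."""
    needle = normalise(query)
    return {
        name: code
        for name, code in names.items()
        if needle in normalise(name)
    }


# Tiny illustrative table; the real data lives in data/consolidated_langs.json.
names = {"Judeo-español": "lad", "español": "spa", "Swedish": "swe"}
matches = search("ESPANOL", names)
# matches == {"Judeo-español": "lad", "español": "spa"}
```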

To look up information by IETF language tag:

$ isogloss/isogloss.py -i fr-FR
{
    "Language": {
        "639-1": "fr",
        "Scope": "Individual",
        "Type": "Living",
        "Native name(s)": "fran\u00e7ais",
        "Other name(s)": "",
        "639-2/T": "fra",
        "639-2/B": "fre",
        "639-3": "fra",
        "Name(s)": "French"
    },
    "Region": "France"
}

$ isogloss/isogloss.py -i zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1
{
    "Primary Language": {
        "639-1": "zh",
        "639-2/B": "chi",
        "639-2/T": "zho",
        "639-3": "zho",
        "Deprecated": false,
        "Name(s)": "Chinese",
        "Native name(s)": "\u4e2d\u6587 Zh\u014dngw\u00e9n; \u6c49\u8bed; \u6f22\u8a9e H\u00e0ny\u01d4",
        "Other name(s)": "",
        "Scope": "Macrolanguage",
        "Type": "Living"
    },
    "Extended Languages": [
        {
            "639-1": "",
            "639-2/B": "",
            "639-2/T": "",
            "639-3": "cmn",
            "Deprecated": false,
            "Name(s)": "Mandarin Chinese",
            "Native name(s)": "",
            "Other name(s)": "",
            "Scope": "Individual",
            "Type": "Living"
        }
    ],
    "Script": "Han (Simplified variant)",
    "Region": "China",
    "Variant": "pinyin",
    "Extension": "ud1-p9t4",
    "Private Use": "x-private1"
}

$ isogloss/isogloss.py -i ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1 | jq
{
  "Primary Language": {
    "639-1": "ar",
    "639-2/B": "",
    "639-2/T": "ara",
    "639-3": "ara",
    "Deprecated": false,
    "Name(s)": "Arabic",
    "Native name(s)": "العربية; al'Arabiyyeẗ",
    "Other name(s)": "",
    "Scope": "Macrolanguage",
    "Type": "Living"
  },
  "Extended Languages": [
    {
      "639-1": "",
      "639-2/B": "",
      "639-2/T": "",
      "Deprecated": true,
      "Language Name(s)": "South Levantine Arabic",
      "Language Type": "Living",
      "Native name(s)": "",
      "Other name(s)": "",
      "Scope": "Individual"
    },
    {
      "639-1": "",
      "639-2/B": "",
      "639-2/T": "",
      "639-3": "apc",
      "Deprecated": false,
      "Name(s)": "Levantine Arabic",
      "Native name(s)": "",
      "Other name(s)": "",
      "Scope": "Individual",
      "Type": "Living"
    },
    {
      "639-1": "",
      "639-2/B": "",
      "639-2/T": "",
      "639-3": "apd",
      "Deprecated": false,
      "Name(s)": "Sudanese Arabic",
      "Native name(s)": "",
      "Other name(s)": "",
      "Scope": "Individual",
      "Type": "Living"
    }
  ],
  "Script": "Arabic",
  "Region": "Cabo Verde",
  "Variant": "arevela",
  "Extension": "g-231243-r-sdarre",
  "Private Use": "x-private-x-private1"
}
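The sections in the output above follow the positional layout of a language tag: primary language, extended languages, script, region, variants, extensions, then private use. As a rough sketch of that layout (not the tool's parser), the hypothetical `split_tag` function below classifies subtags by length and character class as described in RFC 5646, the current BCP 47 syntax. It does no registry lookups and, as a simplification, rejects grandfathered tags such as i-klingon and tags that consist only of private use.

```python
def split_tag(tag):
    """Split a language tag into its positional parts, or raise ValueError."""
    subtags = tag.split("-")
    if not all(s.isascii() and s.isalnum() for s in subtags):
        raise ValueError(f"not a well-formed tag: {tag!r}")
    if not (2 <= len(subtags[0]) <= 8 and subtags[0].isalpha()):
        raise ValueError(f"unsupported primary language in {tag!r}")

    parts = {"language": subtags[0].lower(), "extlangs": [], "script": None,
             "region": None, "variants": [], "extensions": [],
             "private_use": None}
    i, n = 1, len(subtags)

    # Extended languages: up to three subtags of exactly three letters.
    while (i < n and len(parts["extlangs"]) < 3
           and len(subtags[i]) == 3 and subtags[i].isalpha()):
        parts["extlangs"].append(subtags[i].lower())
        i += 1
    # Script: four letters, conventionally title-cased.
    if i < n and len(subtags[i]) == 4 and subtags[i].isalpha():
        parts["script"] = subtags[i].title()
        i += 1
    # Region: two letters or three digits.
    if i < n and ((len(subtags[i]) == 2 and subtags[i].isalpha())
                  or (len(subtags[i]) == 3 and subtags[i].isdigit())):
        parts["region"] = subtags[i].upper()
        i += 1
    # Variants: 5-8 characters, or 4 characters starting with a digit.
    while i < n and (5 <= len(subtags[i]) <= 8
                     or (len(subtags[i]) == 4 and subtags[i][0].isdigit())):
        parts["variants"].append(subtags[i].lower())
        i += 1
    # Extensions: a one-character singleton (not "x") plus 2-8 char subtags.
    while i < n and len(subtags[i]) == 1 and subtags[i].lower() != "x":
        ext = [subtags[i].lower()]
        i += 1
        while i < n and 2 <= len(subtags[i]) <= 8:
            ext.append(subtags[i].lower())
            i += 1
        if len(ext) == 1:
            raise ValueError(f"empty extension in {tag!r}")
        parts["extensions"].append("-".join(ext))
    # Private use: "x" and everything after it.
    if i < n and subtags[i].lower() == "x" and i + 1 < n:
        parts["private_use"] = "-".join(subtags[i:]).lower()
        i = n
    if i < n:
        raise ValueError(f"unexpected subtag {subtags[i]!r} in {tag!r}")
    return parts


parts = split_tag(
    "ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1"
)
# parts["extlangs"]   == ["ajp", "apc", "apd"]
# parts["extensions"] == ["g-231243", "r-sdarre"]
```

A strict reading like this one rejects the zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1 example, because ud1 is three characters long and so is not a single-character extension singleton; the command-line tool is more lenient there.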

Files

  • data/consolidated_langs.json: Contains language data in JSON format used for the lookup.
  • data/region_names.json: Contains region data in JSON format used for the BCP-47 lookup.
  • data/script_codes.json: Contains script code data in JSON format used for the BCP-47 lookup.
  • data/deprecated-639-3.csv: Contains deprecated ISO 639-3 codes in CSV format, for quick reference.

Errata

There are known issues with the BCP-47 parser in the web interface. It validates input with regular expressions, which classify tags as shown in the examples below.

Examples of valid tags:

  • en
  • fr-CA
  • i-klingon
  • az-Arab-IR
  • sr-Cyrl-RS
  • zh-cmn-Hans
  • ja-JP-x-tokyo
  • uz-Cyrl-UZ-1992
  • bo-Tibt-x-dialect
  • zh-cmn-Hans-CN-x-private1
  • hy-Latn-IT-arevela-x-test

Examples of invalid tags (malformed):

  • en-GB-oed-x-private
  • de-CH-1901-co-phonebk-sc-gothic-x-bavaria

(and more)

Examples of inputs that reveal parsing bugs:

  • ca-valencia-nedis (Highlighted input section is missing "valencia")
  • en-US-u-islamcal (Variant "u" and Extension "islamcal", Extension section says "u - islamcal")
  • es-419-fonipa (Extended languages blank)
  • de-Latf-1901 (Region undefined)
  • sl-rozaj (rozaj is coloured differently in the result container to how it is in the highlighted input section)
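For comparison, here is a minimal sketch of regular-expression validation in Python. It is not the web interface's code (that is JavaScript, and its patterns are not reproduced here); the `check` function and the pattern names are hypothetical, and the grammar follows RFC 5646, the current BCP 47 syntax. It accepts the valid examples above, rejects the malformed ones, and shows how the bug-revealing inputs would be expected to split.

```python
import re

# Irregular grandfathered tags from RFC 5646; they do not fit the grammar below.
IRREGULAR = {
    "en-gb-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao", "i-tay",
    "i-tsu", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
}

LANGTAG = re.compile(r"""
    (?P<language>
        [a-z]{2,3}(?:-[a-z]{3}){0,3}    # 2-3 letters plus up to three extlangs
      | [a-z]{4,8}                      # longer language subtags
    )
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?
""", re.IGNORECASE | re.VERBOSE)

PRIVATE_ONLY = re.compile(r"x(?:-[a-z0-9]{1,8})+", re.IGNORECASE)


def check(tag):
    """Return the tag's subtag groups if it is well-formed, else None."""
    if tag.lower() in IRREGULAR:
        return {"grandfathered": tag}
    if PRIVATE_ONLY.fullmatch(tag):
        return {"private": tag}
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    groups = match.groupdict()
    # Turn "-valencia-nedis" into ["valencia", "nedis"]; drop the leading "-".
    groups["variants"] = [v for v in groups["variants"].split("-") if v]
    groups["extensions"] = groups["extensions"].lstrip("-")
    return groups


for tag in ["ca-valencia-nedis", "en-US-u-islamcal", "es-419-fonipa",
            "de-Latf-1901", "sl-rozaj", "en-GB-oed-x-private"]:
    print(tag, "->", check(tag))
```

Under this grammar, ca-valencia-nedis carries two variants, en-US-u-islamcal carries one extension (u-islamcal) and no variant, and de-Latf-1901 has a script and a variant but no region.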

Contributing

Contributions, issues, and feature requests are welcome!

Author

Written by T E Vaughan

Sponsorship

If you find this project useful, please consider sponsoring my work. <3

Related Standards and RFCs

The codes used in this program conform to the following ISO standards:

Standards

RFCs

  • RFC 1766 Tags for the Identification of Languages
  • RFC 4646 Tags for Identifying Languages
  • RFC 4647 Matching of Language Tags

License

This project is MIT licensed.