This repository reports results for the COLING 2022 paper "Assessing Digital Language Support (DLS) on a Global Scale," by Gary F. Simons, Abbey L. Thomas, and Chad K. White, [publication, preprint]. The full result set reports the observed DLS for each of the 7,829 individual languages that were included in the ISO 639-3 standard when the system was run (May 2021). This open repository presents the detailed results for a systematic sample of 10% of the languages. A dataset of the DLS levels for all languages and a web service for visualizing the results are available from Derivation. Access to the dataset is free for academic users.
The results are reported in three data files and the data columns are defined in the subsections that follow:
- List of all DLS features.csv identifies the 143 DLS features that were harvested with a count of how many languages were supported by each.
- DLS scores for 10% of languages.csv reports the Digital Language Support level and detailed scores for a 10% sample of languages.
- DLS features for 10% of languages.csv lists the specific DLS features that are supported for each of the languages in the 10% sample.
This is a UTF-8 encoded CSV file consisting of a header row containing the column names, followed by one row for every DLS feature that was harvested. This run of the system (in May 2021) is based on 143 features. If you are aware of additional tools, apps, or systems that support a significant number of languages, you are invited to share your ideas with the authors ([email protected], [email protected]) so that more features can be added. The columns are as follows:
Our method uses the following seven categories of digital language support. They are listed below from easiest (most commonly supported) to hardest (least commonly supported) as determined by the results of our analysis:
- Content — A service offering content in many languages
- Encoding — A system component for representing languages (e.g., keyboards, fonts)
- Surface — A tool with surface-level processing (like spell checking or stemming)
- Localized — A system with a localized user interface (e.g., OS, browsers, messaging)
- Meaning — A tool with meaning-level processing (like machine translation)
- Speech — A tool for speech processing (e.g., speech-to-text, text-to-speech)
- Assistant — An intelligent virtual assistant
A label that is used to identify the feature in other tables of results.
An expanded description of the feature.
The number of individual languages in ISO 639-3 that the feature was found to support.
This is a UTF-8 encoded CSV file consisting of a header row containing the column names, followed by one row for each of the 783 languages in the 10% sample. The languages are listed in order of the Proportional_Score from highest to lowest, and secondarily by the ISO 639 code when the scores are the same. The columns are as follows:
The name used by ISO 639-3 as a standard name for distinguishing the language from all others. These names contain disambiguating information in parentheses when they would otherwise be identical.
The standard three-letter identifier for the language. Documentation about the language can be found by appending the identifier to URLs on various documentation sites. For example:
- https://iso639-3.sil.org/code/aaa
- https://www.ethnologue.com/language/aaa
- http://glottolog.org/glottolog?iso=aaa
- https://en.wikipedia.org/wiki/ISO_639:aaa
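The URL patterns above can also be generated from a code programmatically. The following is a minimal Python sketch; the function name `documentation_urls` is an illustrative assumption and is not part of this repository.

```python
def documentation_urls(iso_code: str) -> list[str]:
    """Return the four documentation URLs listed above for one ISO 639-3 code."""
    code = iso_code.strip().lower()
    # ISO 639-3 identifiers are exactly three ASCII letters.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter ISO 639-3 code: {iso_code!r}")
    return [
        f"https://iso639-3.sil.org/code/{code}",
        f"https://www.ethnologue.com/language/{code}",
        f"http://glottolog.org/glottolog?iso={code}",
        f"https://en.wikipedia.org/wiki/ISO_639:{code}",
    ]


for url in documentation_urls("aaa"):
    print(url)
```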
This column reports the language type as indicated in the ISO 639-3 code tables (https://iso639-3.sil.org/about/types). The possible values are:
- Living — There are people still living who learned the language as a first language.
- Extinct — The last known speaker died in relatively recent times (e.g. in the last few centuries). Note that this category combines what are distinguished as Dormant versus Extinct in EGIDS (as reported by Ethnologue).
- Ancient — The language went extinct in ancient times (e.g. more than a millennium ago).
- Historic — The language is considered to be distinct from any modern languages that are descended from it: for instance, Old English and Middle English. In these cases, the language did not become extinct; rather, it changed into a different language over time.
- Constructed — The language was created by known "inventors" rather than having evolved naturally.
The name of the DLS level to which the language is assigned: Still, Emerging, Ascending, Vital, or Thriving. See the article for the formal definitions.
The raw DLS score. This is the sum of the level scores (0 to 4) on the subscales for the seven support categories. Thus raw scores range from 0 (when no digital support has been observed for the language) to 28 (when a language is maximally supported).
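The arithmetic behind the raw score can be illustrated with a short Python sketch. The function name `raw_score` and the example profile are hypothetical; the profile is made up for illustration and is not taken from the results.

```python
CATEGORIES = (
    "Content", "Encoding", "Surface", "Localized",
    "Meaning", "Speech", "Assistant",
)


def raw_score(levels: dict[str, int]) -> int:
    """Sum the level scores (0 to 4) over the seven support categories."""
    for category in CATEGORIES:
        level = levels[category]
        if not 0 <= level <= 4:
            raise ValueError(f"{category} level must be 0..4, got {level}")
    return sum(levels[category] for category in CATEGORIES)


# Hypothetical profile, for illustration only.
example = {
    "Content": 4, "Encoding": 3, "Surface": 2, "Localized": 1,
    "Meaning": 0, "Speech": 0, "Assistant": 0,
}
print(raw_score(example))  # 4 + 3 + 2 + 1 = 10
```

A language with level 4 in every category reaches the maximum of 7 × 4 = 28.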
The adjusted DLS score. Following Item Response Theory (which Mokken scale analysis is based on; see section 5.2 of the paper), rather than scoring each test item as 0 or 1, the item can be scored as the probability that a subject would produce a positive (or correct) response on that item, given their total score on the rest of the test items. The adjusted score is the sum of all the probabilities for positive responses. The probabilities are calculated from the Item Response Function for each item; these functions are derived by means of logistic regression. In educational testing, scoring each positive response as a probability is a way of controlling for random guessing. In our application to DLS it can control for "random" developments that do not have the underpinnings of the expected lower categories of support, such as when there is a one-time philanthropic gesture by a large company or the heroic efforts of a solitary developer.
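The idea of scoring a positive response as a probability can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: it assumes a two-parameter logistic Item Response Function of the rest score, it assumes that items without a positive response contribute zero, and the names `item_response_probability`, `adjusted_score`, and the `(intercept, slope)` parameters are all hypothetical. The fitted functions themselves are described in the paper.

```python
import math


def item_response_probability(rest_score: float, intercept: float, slope: float) -> float:
    """Logistic curve: probability of a positive response given the rest score."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * rest_score)))


def adjusted_score(responses: dict[str, int], params: dict[str, tuple[float, float]]) -> float:
    """Sum the probabilities of the items that received a positive response.

    responses maps item name -> 0 or 1.
    params maps item name -> (intercept, slope); hypothetical values that
    would in practice come from a logistic regression.
    """
    total = sum(responses.values())
    score = 0.0
    for item, response in responses.items():
        if response:
            # Rest score: the total over all other items.
            rest = total - response
            intercept, slope = params[item]
            score += item_response_probability(rest, intercept, slope)
    return score
```

Under these assumptions, a positive response that is unlikely given the rest of the profile adds much less than 1 to the total, which is how isolated developments are discounted.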
The adjusted DLS score converted to a proportion. The score ranges from 0 (when no digital support has been observed for the language) to 1.0 (when a language is maximally supported).
These columns report the total adjusted score for just the items in the named support category. The value in the Adjusted_Score column is the sum of these subscale scores.
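The relationship between the subscale columns and Adjusted_Score can be checked with a short Python sketch. It assumes that the subscale columns are named after the seven support categories, which should be verified against the actual header row. The sample rows below are made-up placeholders, not values from the dataset, and the function name `subscale_mismatches` is hypothetical.

```python
import csv
import io
import math

CATEGORIES = (
    "Content", "Encoding", "Surface", "Localized",
    "Meaning", "Speech", "Assistant",
)

# Made-up placeholder rows, not real results from the dataset.
SAMPLE = """Adjusted_Score,Content,Encoding,Surface,Localized,Meaning,Speech,Assistant
6.5,2.5,2.0,1.0,1.0,0.0,0.0,0.0
1.25,1.0,0.25,0.0,0.0,0.0,0.0,0.0
"""


def subscale_mismatches(csv_file, tolerance: float = 1e-6) -> list[int]:
    """Return the file line numbers whose subscales do not sum to Adjusted_Score."""
    mismatches = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(csv.DictReader(csv_file), start=2):
        subtotal = sum(float(row[category]) for category in CATEGORIES)
        if not math.isclose(subtotal, float(row["Adjusted_Score"]), abs_tol=tolerance):
            mismatches.append(line_number)
    return mismatches


print(subscale_mismatches(io.StringIO(SAMPLE)))
```

To run the check on the real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object instead of the `io.StringIO` sample. If the published values are rounded, the tolerance may need to be loosened.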
This is a UTF-8 encoded CSV file consisting of a header row containing the column names, followed by one row for every DLS feature that was harvested for every language in the 10% sample. The languages are listed by descending order of Proportional_Score (as in the preceding table), with the features ordered by Support_Category and Feature_Name. The columns are as follows:
The name used by ISO 639-3 as a standard name for distinguishing the language from all others. These names contain disambiguating information in parentheses when they would otherwise be identical.
The standard three-letter identifier for the language.
The adjusted proportional DLS score (as reported in the preceding table).
Our method uses the following seven categories of digital language support. They are listed below from easiest (most commonly supported) to hardest (least commonly supported) as determined by the results of our analysis:
- Content — A service offering content in many languages
- Encoding — A system component for representing languages (e.g., keyboards, fonts)
- Surface — A tool with surface-level processing (like spell checking or stemming)
- Localized — A system with a localized user interface (e.g., OS, browsers, messaging)
- Meaning — A tool with meaning-level processing (like machine translation)
- Speech — A tool for speech processing (e.g., speech-to-text, text-to-speech)
- Assistant — An intelligent virtual assistant
A label that identifies a feature that was found to support this language. Consult the List of all DLS features to find an expanded description of the feature and the total number of individual languages in ISO 639-3 which the feature was found to support.
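The lookup described above can be sketched in Python by joining the two files on the feature label. Only Feature_Name and Support_Category are column names confirmed by this document; the column names `ISO_639`, `Description`, and `Language_Count`, the function name `describe_features`, and all sample rows are hypothetical placeholders to be replaced with the names in the actual header rows.

```python
import csv
import io

# Placeholder rows standing in for "List of all DLS features.csv".
FEATURES_CSV = """Feature_Name,Description,Language_Count
example_feature_a,Placeholder description A,120
example_feature_b,Placeholder description B,45
"""

# Placeholder rows standing in for "DLS features for 10% of languages.csv".
LANGUAGE_FEATURES_CSV = """ISO_639,Support_Category,Feature_Name
aaa,Content,example_feature_a
aaa,Encoding,example_feature_b
"""


def describe_features(features_file, language_features_file, iso_code: str):
    """Return (feature, description, language count) for one language."""
    # Index the feature list by its label for constant-time lookup.
    catalog = {row["Feature_Name"]: row for row in csv.DictReader(features_file)}
    described = []
    for row in csv.DictReader(language_features_file):
        if row["ISO_639"] == iso_code:
            entry = catalog[row["Feature_Name"]]
            described.append(
                (row["Feature_Name"], entry["Description"], int(entry["Language_Count"]))
            )
    return described


for item in describe_features(
    io.StringIO(FEATURES_CSV), io.StringIO(LANGUAGE_FEATURES_CSV), "aaa"
):
    print(item)
```

With the real files, pass file objects opened with `encoding="utf-8"` and `newline=""` in place of the `io.StringIO` samples.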