[1603:1:51] /Dictiōnāria Linguārum ad MMXXII ex Numerordĭnātĭo/@lat-Latn
#9
…ents of language codes from other sources
…ed yet) query generation to extract translations
…enerate raw SPARQL query and generate CSV/TSV directly
[1603:1:51] /Curated table of language codes/ (internal use; conversion/validation from different providers)
[1603:1:51] /Dictiōnāria Linguārum ad MMXXII ex Numerordĭnātĭo/@lat-Latn
Status quo
The current working draft has only about 20 languages. Even adding and reviewing them manually already takes some time (and we have not even reached the lesser-known languages). Add to this that there are over 300 languages, and at least the top 100 would quite often already have content for the concepts we do not create from near scratch.

Why?
It turns out that this table will be quite important. The way we use it allows us to multiply manually tagged concepts with existing Wikidata terms. While these may contain human errors, in general Wikipedia has quite decent self-moderation. The article Property Label Stability in Wikidata (https://dl.acm.org/doi/fullHtml/10.1145/3184558.3191643) can give an idea.

The linking back problem
One of the usages (for us here), in addition to compiling concepts, is to tag existing Q/P/Ls from Wikidata when relevant. For things related to the humanitarian sector, it is actually quite frequent that a lot of translations already exist.

Use case [
|
To allow the Generic tooling for explain files of published dictionaries (file validation; human explanation) #12 to be fully automatable, we are also explicitly adding some conventions on the language codes to be used for what is actually a concept code (a strict identifier).

Open points
The
|
At this moment, we have around 50 languages prepared. Some However, just to get existing terms on [

Points of improvement discovered
On Cōdex (PDF files) should display characters of all languages it contains #13, the number of languages is such that we already need to document how users can check them when accessing the CSVs and XMLs directly, and we also need to embed the fonts in the PDFs. Otherwise, people will not be able to render the translations at all.

New repository tags
The label reconciliatio-erga-verba (link: https://github.com/EticaAI/multilingual-lexicography/labels/reconciliatio-erga-verba) is the one we will use for issues related to the reconciliation of language terms with the concepts. Currently, the only topic on this was the Wikidata MVP.

The label praeparatio-ex-codex (link: https://github.com/EticaAI/multilingual-lexicography/labels/praeparatio-ex-codex) is focused on Cōdex preparation. Every Cōdex has a hard dependency on [1603:1:51] (this issue here). Other hard dependencies will happen, but the dictionary which explains what each language is is obviously necessary. We still do not have a main dictionary to explain the non-linguistic concept attributes.
|
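As a rough illustration of how a user could check which writing systems a CSV or XML export contains (and therefore which fonts are needed), here is a minimal sketch. It is an assumption of this note, not part of the repository tooling: the function name `rough_scripts` is hypothetical, and taking the first word of the Unicode character name is only an approximation of the real Unicode script property.

```python
import unicodedata
from collections import Counter


def rough_scripts(text: str) -> Counter:
    """Count letters per approximate script.

    Heuristic (an assumption of this sketch): the first word of the
    Unicode character name, e.g. 'LATIN', 'CYRILLIC', 'ARABIC'.
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ")[0]] += 1
    return counts


if __name__ == "__main__":
    # "C\u014ddex" is "Cōdex" with a precomposed o-macron
    sample = "C\u014ddex \u0434\u0430"
    for script, total in rough_scripts(sample).most_common():
        print(script, total)
```

Running this over the text of an exported file gives a quick list of scripts to compare against the fonts installed locally or embedded in the PDF.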
Example files (not even finished yet): Hard work pays off. The point here is that, the way the dictionaries are done, preparing new ones is much, much faster, and the final documentation can have not only the language terms (in this case they could go up to 227 for a concept) but also how others can review them. Latin terms are so hard that we keep them at a minimum, so over time we could automate Cōdex for other writing systems (including using numerals other than 0123456789), as it will be easier for others.

Obviously there are several strategies to compile the translations for concepts, so 1603:45:1 is easier to bootstrap (also, the nature of the concepts means the major ones are highly reviewed/protected). But compared to the early days of HXLTM, it is now easier not only to have a few terminological translations but to have over 100 (with continuous improvements), while bootstrapping new ones takes days (actually just a few hours) based on new needs. 🙂

Anyway, there are a few dozen dictionaries which are viable to compile/document without starting specific translation initiatives. And they already have encyclopedic language variants (which could be reviewed, but the baseline is already good). As strange as it may seem, this approach scales better than the data cleaning necessary on the https://github.com/EticaAI/tico-19-hxltm "terminologies". I mean, while HXLTM itself is documented so that anyone can scale translations, we are optimizing further to do the bootstrapping ourselves. Something like the terminologies of TICO-19 (and the way Google/Facebook translated them with paid professional translators) would be unlikely to be as efficient as what we could do using only volunteers. I am not saying this to compete with the way Google/Facebook did TICO-19, but to mitigate errors we could make ourselves when doing something similar at scale.
Also, both Google and Facebook complained about the lack of source content with open licenses, so the best course of action would be to create content instead of using content from others. |
…ginarum-limitibus (to cope with time limits of Wikidata SPARQL backend
…nks instead of need to know the limit upfront
…ges); avoid timeout; @todo need actually get the second page
… is merged again; fixes issue with last commit
…hi/) et Lingua Abasgica (Abecedarium Cyrillicum)
We're adding so many languages recently that the SPARQL backend queries may be timing out again, or failing with some other sort of error. Better to break them up again. Last time we requested all Q items at once, but split the language translations into 3 batches. These batches are then merged using HXL Standard CLI tools. |
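The batching described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`chunk`, `build_label_query`, `merge_batches`) are hypothetical, the query shape is a generic label lookup rather than the project's actual query, and the merge step uses plain Python dictionaries as a stand-in for the HXL Standard CLI tools. No network call is made; the query text would be sent to the Wikidata SPARQL endpoint separately.

```python
from typing import Dict, Iterable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_label_query(qids: List[str], langs: List[str]) -> str:
    """Build one SPARQL query: all Q items, but only one batch of languages.

    Relies on the `wd:` and `rdfs:` prefixes that the Wikidata Query
    Service predefines.
    """
    values = " ".join(f"wd:{q}" for q in qids)
    lang_list = ", ".join(f'"{lang}"' for lang in langs)
    return (
        "SELECT ?item ?lang ?label WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item rdfs:label ?label .\n"
        "  BIND(LANG(?label) AS ?lang)\n"
        f"  FILTER(?lang IN ({lang_list}))\n"
        "}"
    )


def merge_batches(
    batches: Iterable[Dict[str, Dict[str, str]]]
) -> Dict[str, Dict[str, str]]:
    """Merge per-batch results shaped as {qid: {lang: label}} into one table."""
    merged: Dict[str, Dict[str, str]] = {}
    for rows in batches:
        for qid, labels in rows.items():
            merged.setdefault(qid, {}).update(labels)
    return merged


if __name__ == "__main__":
    languages = ["pt", "la", "ar", "ru", "es"]
    for batch in chunk(languages, 2):
        print(build_label_query(["Q1065"], batch))
        print()
```

Keeping the item list fixed and varying only the language batch means every result table shares the same key (the Q item), which is what makes the later merge straightforward.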
…a Afganica (Abecedarium Arabicum) et Macrolinguae Quechuae (Abecedarium Latinum))
Fantastic! On the screenshot, except for the number of pages (we're using A5 'pocket format', not A4, so it doubles the pages) and the additional quantity of concepts, it is possible to compare the difference. We have more than total options for Q1065 because the Cōdex:
Next steps
We could at least add a language for everything that has at least a first page on Wikipedia (I think it would be around 300, not far from what we have now). This is still not every option Wikidata could provide. However, we can simply add a small number of languages as people get interested. It is not a problem to add languages which are not yet on Wikidata.

How relevant this table is
For now we're using Wikimedia, but we can interlink with other potential sources. The automation here allows us to get more and more efficient. However, particularly for macrolanguages, there are A LOT of missing codes, and they will need a lot of discussion, since we could have more volunteers willing to share their work than well-documented codes. The work behind the Cōdex [1603:1:51] //Dictiōnāria Linguārum// was both to explain what exists and to make it easier to add new ones.

The dictionaries are getting bigger while allowing structured translation initiatives
What we're doing is very technical. It's so specialized (both how hardcore it is to glue the technological part together AND understanding the languages) that it allows us to be very efficient at bridging the people willing to help with humanitarian work and human rights in general. There are far more people willing to help with causes than there is capacity to deal with their contributions in a way that is very shareable and reusable.

Even without calls to action we already have decent compilations. But the ideal use case would be to document how people could add translations via Wikidata (without needing to create Wikipedia pages). This would start to fill a lot of gaps. Most people know how to help on Wikipedia, but Wikidata (except for concepts with a lot of visibility, such as Q1065) is quite friendly to new translations.
Our pending submission on The Humanitarian Data Exchange is not incompetence from our side
While we are still waiting for The Humanitarian Data Exchange (https://data.humdata.org/) to accept us, @HXL-CPLP / @EticaAI (as they are supposed to do), in the meantime most features will tend to be related to making it easier for humans to make corrections on content that is already encyclopedic-level. I mean: we are already optimizing for what comes later, but while waiting, yes, we are taking notes on how hard it is to be accepted.

Under no circumstances will we accept any organization from the global north subjecting us to any type of partnership which is actually more harmful than caring about affected people, just because they are allowed to share the work while we still have no explanation of why ours is not considered humanitarian.

By the way, we are not just "dumping" Wikidata labels, but doing research on areas that need focus (and I mean not only our discussions on https://github.com/SEMICeu/Core-Person-Vocabulary, there is much more going on), and the preparation of concepts and documentation is far more advanced than what would be expected even from entire initiatives, which typically would only share final work in English (likely only in PDF format). We are not just providing terminology translations into hundreds of languages in machine-readable format, but also in English, since the international community fails even on the basics of data interoperability. |
While we could pack several existing external language codes for data exchange, we will definitely use some languages much more heavily. Also, some data source providers may actually use non-standard codes, so sooner or later we would need to do this.
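A minimal sketch of such a conversion table, assuming a layout that is not taken from the repository: the column names (`internal`, `iso639_3`, `iso639_1`, `wikimedia`) and the helper names are hypothetical, and the internal form follows the `lat-Latn` style used in this issue's title. The `als` row illustrates a non-standard provider code: Wikimedia uses `als` for Alemannic, while ISO 639-3 assigns `als` to Tosk Albanian and `gsw` to Alemannic.

```python
from typing import Dict, List, Optional

# Hypothetical curated rows; an empty string means "no code at this provider".
CODE_TABLE: List[Dict[str, str]] = [
    {"internal": "por-Latn", "iso639_3": "por", "iso639_1": "pt", "wikimedia": "pt"},
    {"internal": "lat-Latn", "iso639_3": "lat", "iso639_1": "la", "wikimedia": "la"},
    {"internal": "gsw-Latn", "iso639_3": "gsw", "iso639_1": "", "wikimedia": "als"},
]


def build_index(table: List[Dict[str, str]], provider: str) -> Dict[str, str]:
    """Map one provider's codes to the internal code, rejecting duplicates."""
    index: Dict[str, str] = {}
    for row in table:
        code = row.get(provider, "")
        if not code:
            continue
        if code in index:
            raise ValueError(f"duplicate {provider} code: {code}")
        index[code] = row["internal"]
    return index


def to_internal(
    code: str, provider: str, table: Optional[List[Dict[str, str]]] = None
) -> Optional[str]:
    """Convert a provider-specific code to the internal code, or None."""
    rows = CODE_TABLE if table is None else table
    return build_index(rows, provider).get(code.strip().lower())


if __name__ == "__main__":
    print(to_internal("als", "wikimedia"))  # provider-specific meaning
    print(to_internal("PT", "iso639_1"))
```

Looking codes up per provider, instead of in one shared namespace, is the design choice that keeps a clash like `als` from silently mapping to the wrong language.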