Preparation to move hxltmcli, hxltmdexml, ontologia/cor.hxltm.yml and documentation at https://hdp.etica.ai/hxltm to exclusive repository #25


The EticaAI/HXL-Data-Science-file-formats repository is already something of a monorepo (see https://en.wikipedia.org/wiki/Monorepo). The recent simplifications to require fewer dependencies did not remove the case for splitting it up. After trying to apply it to more real test cases, like the Translation Initiative for COVID-19 (TICO-19), I believe that HXLTM, even with improvements to make it friendlier for bilingual files, should at least be much better documented. Note that TICO-19 is actually close to a best-case scenario: any other initiative would likely have far fewer people with an information technology background.

Note that, in general, bilingual is supposed to be one of the easier cases (HXLTM focuses on multilingual by default). But the way people submitted translations to TICO-19 (as translation pairs) makes this kind of optimization necessary.
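To make the bilingual case concrete, here is a minimal sketch in Python of what such a translation-pair dataset could look like in HXLTM style. The hashtag attributes follow the general HXLTM pattern (ISO 639-3 code, BCP 47 tag, ISO 15924 script), but treat the exact attributes here as illustrative, not normative; ontologia/cor.hxltm.yml is the authoritative reference.

```python
import csv

# Minimal bilingual HXLTM-style sketch (translation pairs).
# The exact hashtag attributes are illustrative, not normative;
# see ontologia/cor.hxltm.yml for the real ones.
rows = [
    ["#item+conceptum+codicem",          # concept identifier
     "#item+rem+i_eng+i_en+is_latn",     # source: English, Latin script
     "#item+rem+i_por+i_pt+is_latn"],    # target: Portuguese, Latin script
    ["term_001", "Vaccine", "Vacina"],
    ["term_002", "Mask", "Máscara"],
]

with open("exemplum.tm.hxl.csv", "w", newline="", encoding="utf-8") as fp:
    csv.writer(fp).writerows(rows)
```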

Beyond just "software documentation"

One of the early challenges in the TICO-19 conversion is actually not even file conversion. Simply because there are SO MANY LANGUAGES, the merge back, as described in fititnt/hxltm-action#5 (comment), starts to get very repetitive.
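As a sketch of how that repetition could be scripted, the loop below generates one XLIFF per target language. The hxltmcli flags shown (--fontem-linguam, --objectivum-linguam, --objectivum-XLIFF) are my recollection of the CLI and should be verified against `hxltmcli --help`; the file names and language list are placeholders.

```python
import subprocess

# ASSUMPTION: the flag names below match the installed hxltmcli;
# verify with `hxltmcli --help` before relying on this.
TARGETS = ["por-Latn@pt", "spa-Latn@es", "fra-Latn@fr"]  # placeholders

for linguam in TARGETS:
    bcp47 = linguam.split("@", 1)[1]
    subprocess.run(
        [
            "hxltmcli",
            "fontem.tm.hxl.csv",         # source translation memory
            f"objectivum.{bcp47}.xlf",   # one XLIFF per target language
            "--fontem-linguam", "eng-Latn@en",
            "--objectivum-linguam", linguam,
            "--objectivum-XLIFF",
        ],
        check=True,
    )
```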

Maybe we should even document how users could drop files into some folder (possibly with drivers to fetch from Google Drive or whatever the average user prefers, so they would not need to know git or anything like it).

The language codes problem

The way different providers explain which language their terms are in is not consistent, and this badly breaks any automation. Assuming that the average big provider follows the IETF BCP 47 language tag specification to the letter is too optimistic, so when people read how to use hxltmcli/hxltmdexml and the ontologia, it is reasonable to assume we will have to give them a crash course on the other standards as well.
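One possible mitigation (an assumption on my part, not something the hxltm toolchain ships with) is to normalize whatever codes a provider supplies before they reach the ontologia. The sketch below uses the third-party langcodes package:

```python
# pip install langcodes
import langcodes

# Codes as they might arrive from different providers:
# underscores instead of hyphens, deprecated tags, etc.
messy = ["pt_BR", "iw", "zh-CN", "fil"]

for code in messy:
    tag = langcodes.standardize_tag(code)             # canonical BCP 47 tag
    alpha3 = langcodes.Language.get(tag).to_alpha3()  # ISO 639-2/3 code
    print(f"{code} -> {tag} ({alpha3})")
```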

About minimum standards on how to collect terminology

I will not say much about this in this issue. But even more critical than choosing language codes that really mean what someone could submit to some more global initiative, one of the main challenges is still how the translations are collected in the first place. So if we create a dedicated place that explains how to use the data convention, and (even without creating dedicated "best practices") give intentional nudges on how to cope with anti-patterns in terminology translation, this would acknowledge that the quality of the translations depends heavily on how well documented the bootstrapping material is.

Potential example approach

Maybe we should even intentionally create a specialized tagging subtag for the case where the source translation is not good enough to serve as the source term, to be used as the source term when exporting to formats intended to receive translations back, like XLIFF. This fixes two points:

  • The first: anyone can hotfix translations before generating a new XLIFF, without publicly saying that the source term was bad, yet without hurting existing translations.
    • This could also be used when the source language term is under copyright.
  • The second: it tolerates translations of terms that became some sort of standard and cannot be changed, because changing them would break software.

Please note that we already have ways to add more description to terms, but if users don't use that, we could still document this trick. A sketch of what such a convention could look like follows.
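Purely hypothetical illustration: the +alt_fontem attribute below is invented for this sketch and is not in ontologia/cor.hxltm.yml. It only shows how a substitute source column could sit next to the real source term, with an exporter preferring it when non-empty.

```python
# HYPOTHETICAL convention: +alt_fontem is invented for this sketch
# (it is NOT in ontologia/cor.hxltm.yml). When the substitute column
# is non-empty, an exporter would emit it as the XLIFF source text
# instead of the original term.
rows = [
    ["#item+conceptum+codicem",
     "#item+rem+i_eng+i_en+is_latn",              # real source term
     "#item+rem+alt_fontem+i_eng+i_en+is_latn"],  # substitute source
    ["term_001", "Covid", "COVID-19"],  # weak source, fixed quietly
    ["term_002", "Mask", ""],           # empty: keep the original
]

def fontem_pro_xliff(row):
    """Prefer the substitute source term when one is given."""
    return row[2] or row[1]

for row in rows[1:]:
    print(row[0], "->", fontem_pro_xliff(row))
```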
