MVP of better management of Dictiōnāria + Cōdex automated updates (necessary for future schedule/cron) #29
A few weeks ago a column to add tags was added to 1603:1:1. Since the cron jobs could evolve over time and also allow more complex rules, the CLI that outputs the result could also allow pre-filtering based on such tags. This would require more options on the CLI, but at the same time would make it more flexible. For example, we have different tags both for what is considered "public" (something can be for both internal and public use, like 1603:1:51) and for what uses Wikidata Q. The first case is a candidate to update a release on publishing channels (like the CDN, but it could be CKAN or something else). The second tag is a hint that such a group of dictionaries actually requires updating translations from Wikidata Q (which means more downloads). Something to obviously implement is a way to check if the dictionaries were recently published (maybe defaulting to around 7 days?) based on previously saved status, and to take this into account when deciding whether a re-run is needed. This could distribute the jobs. However, such features would obviously need to be ignored when running as local testing or when humans require an update despite the default rules. Note: a hard-coded filter is only to try to process what on 1603:1:1 has at least one tag. At the moment not all dictionaries which already have content have such tags.
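As a rough illustration of the tag + "recently published" pre-filter idea (a sketch only; the tag names, the status fields and the 7-day default are assumptions, not the actual tooling):

# Hypothetical sketch: pre-filter groups of dictionaries by tag and by how
# recently they were published. The field names, the status layout and the
# 7-day window are assumptions, not the actual 1603 tooling.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)  # assumed "recently published" window

def needs_rerun(group, required_tags, now=None):
    """Return True when a group matches the tag pre-filter and was not
    published inside the staleness window."""
    now = now or datetime.now(timezone.utc)
    tags = set(group.get("ix_n1603ia", []))
    if not tags & required_tags:      # hard-coded rule: at least one tag
        return False
    last_run = group.get("last_published_at")
    if last_run is None:
        return True                   # never published: always a candidate
    return now - datetime.fromisoformat(last_run) > STALE_AFTER

# Example with made-up entries and made-up tag names:
groups = [
    {"codex": "1603_63_101", "ix_n1603ia": ["publicum"],
     "last_published_at": "2022-01-01T00:00:00+00:00"},
    {"codex": "1603_84_1", "ix_n1603ia": [], "last_published_at": None},
]
print([g["codex"] for g in groups if needs_rerun(g, {"publicum", "wikiq"})])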
…ed to parse --quaero-ix_n1603ia=... rules
Ok. First full drill tested yesterday. Beyond loading groups of dictionaries from the main table [1603:1:1], the only filter implemented was the tag-based one (--quaero-ix_n1603ia). Some small bugs appeared. Some of them are related to the old strategy (only act if the file is missing on disk), but that was too primitive to scale at this point.

Notable to-do's

1. Configurable limit of cron jobs to return. Depending on the context, all (or a very large number of) dictionaries would be marked to update immediately. This is a recipe for things to go wrong. One way to help with this is to at least make the maximum number of jobs to return configurable with --ex-opus-tempora(...).

2. Configurable sort of cron jobs (generic). The reasoning is the same as the previous point. Currently the default sorting is the order of the Numerordinatio of the Cōdex, but it makes sense to expose the ordering through additional CLI parameters.

3. Opinionated sort of cron jobs to deal with failures (see the sketch after this list). The previous topic may be too complicated to generalize, so this point tends to be better handled on the programming side. All other configurations are likely to keep suggesting jobs that failed recently, so it could get stuck. We also need to consider cases where the failure is not the servers, but misconfiguration of the reference tables (edited by humans), which can cause weird bugs (like generating an invalid SPARQL query); the cron job manager could then give up on all other work thinking it's a busy hour, while the issue is specific to one place. This doesn't mean pushing failed jobs to the very end of the list, but for sure they shouldn't be the first ones.

4. Requisites to decide if Wikidata Q needs update: CRC of Wikidata Q codes per focused group of dictionaries + CRC of concepts from 1603:1:51 + a default time after which Wikidata Q is assumed stale. The Wikidata Q fetching is already quite intensive. We didn't implement more than labels, so it is realistic to think it could easily get much, much more intense with other data (such as, at the very least, the alternative labels). Beyond the last successful run time, there are additional reasons to consider the data immediately stale (which is relevant when running locally): if the concepts of the focused group of dictionaries changed Wikidata Q codes or... if the number of languages on 1603:1:51 changed.

5. Define some defaults to allow weekly updates. This can't be exactly 7 days, because runs have natural delays to finish (from 1 min to 6 min; likely more, especially if we add more PDF versions), so the next run would always slip a bit. The major reasons to re-run would be either the network (like timeout errors on Wikidata) or the last run a week ago being so full that it would need to run again. Maybe 6 days?

6. Define some default value after which 100% of the data is considered stale (then, after downloads, maybe check hashes); maybe 14 days? 30 days? This actually should depend on several other factors, but it could be used as a last resort (like if all other checks failed).

Potential to-do's (may be micro-optimization)

1. Predictable affinity for run days of the week. From time to time, all groups of dictionaries could end up regenerated on nearly the same day, but ideally the natural tendency would push them toward a more predictable, deterministic schedule. Without this, one day of the week would always be prone to overloading servers with too many requests. Maybe this could use some pattern on the Numerordinatio to avoid human decision.

2. Notify if failed after many retry attempts. Things are expected to fail quite often, which means notifying humans without trying a few more times would create several near false positives. This needs optimization (likely over time).
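Just to make points 1 and 3 concrete, here is a rough, non-authoritative sketch (assumed data shapes and thresholds; not the actual CLI code) of selecting a limited batch of jobs while pushing recently-failed ones away from the front:

# Rough sketch (assumed data shapes; not the actual CLI implementation):
# pick up to `limit` cron jobs, keeping the Numerordinatio order of the
# Cōdex as the baseline, but moving recently-failed jobs away from the front.
from datetime import datetime, timedelta, timezone

def select_jobs(jobs, limit=5, fail_cooldown=timedelta(hours=6), now=None):
    """jobs: list of dicts with 'codex' (Numerordinatio string) and an
    optional 'last_failed_at' (ISO 8601 string with timezone)."""
    now = now or datetime.now(timezone.utc)

    def failed_recently(job):
        when = job.get("last_failed_at")
        return bool(when) and now - datetime.fromisoformat(when) < fail_cooldown

    ordered = sorted(jobs, key=lambda job: job["codex"])   # baseline order
    ordered = sorted(ordered, key=failed_recently)         # healthy jobs first
    return ordered[:limit]

# Example with made-up status entries:
jobs = [
    {"codex": "1603_63_101", "last_failed_at": "2022-02-01T10:00:00+00:00"},
    {"codex": "1603_25_1", "last_failed_at": None},
]
print([job["codex"] for job in select_jobs(jobs, limit=2)])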
Okay. We have at least 17 groups of dictionaries at the moment (already not counting the ones which are not labeled with ix_n1603ia). Some are still using the old syntax (which still works) but miss some features of the new Cōdex versions. Some features of this issue may be left for later (when it becomes more viable to do full automation on some remote worker), but most of the hard parts would already be done earlier.

The temporary alternative to a smarter scheduler: full list in random order + limit on the number of results

Full command: (…)

Both because I'm testing bugs and also to take the chance to start updating dictionaries that are not being worked on, we're doing just what this heading says (a minimal sketch of the idea is below). Another reason to keep this strategy for some time is that the library is so big (and we need to process all items to get a full picture) that we're adding more metadata to the 1603.cdn.statum.yml. If the new feature were implemented too soon, it would require waiting every time for the full thing to run (which would take hours).

The need for a better general view (off topic for this issue)

The areas of the dictionaries are so different from each other that someone looking at the full index is unlikely to get a focused page on what they are interested in. But this is essential complexity, because HXLTM was designed to allow totally different types of information to be compiled, including with annexes (such as images), which don't make a lot of sense in most dictionaries people are aware of (but do in a medical atlas). However, even without the intent to create dedicated pages for some topics, we can mitigate a bit of the chaos of people simply not going to the current index page at HXL-CPLP-Vocab_Auxilium-Humanitarium-API/1603_1_1 https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=2095477004
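A minimal sketch of that stopgap strategy (illustrative names only; the real CLI options and group codes differ):

# Minimal sketch of the stopgap above: take the full list of groups of
# dictionaries, shuffle it, and keep only the first N. Names are
# illustrative; the real CLI options and codes differ.
import random

def pick_random_batch(codices, limit=5, seed=None):
    rng = random.Random(seed)
    shuffled = list(codices)   # do not mutate the caller's list
    rng.shuffle(shuffled)
    return shuffled[:limit]

# Example: 17 groups, update 5 of them at a time in random order.
print(pick_random_batch(["1603_{0}".format(n) for n in range(17)], limit=5))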
We're having these issues with zeroes on XLSX files, the same issue as this comment wireservice/csvkit#639 (comment). We're using in2csv to extract the HXLated versions from a single XLSX file with everything in it, instead of downloading them one by one with hxltmcli (which uses libhxl-python and works perfectly with remote GSheets and local CSVs, but has some edge cases with XLSXs, likely a common issue shared with other tools because of package dependencies). For some time the HXLTM files may have integers like "10" appearing on the CSVs as values such as "10.0".
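If at some point we want to normalize those values, a small post-processing pass could look like this (a hypothetical step, not part of the current pipeline):

# Hypothetical post-processing step (not part of the current pipeline):
# rewrite a CSV so values that are really integers ("10.0") become "10".
import csv

def strip_false_decimals(value):
    # Only touch strings like "10.0" or "7.00"; leave "10.5", "", text alone.
    if value and "." in value and value.replace(".", "", 1).isdigit():
        as_float = float(value)
        if as_float.is_integer():
            return str(int(as_float))
    return value

def normalize_csv(path_in, path_out):
    with open(path_in, newline="", encoding="utf-8") as fin, \
         open(path_out, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow([strip_false_decimals(cell) for cell in row])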
Hmm... I think that after some refactoring, the empty columns stopped being removed from the final .no11.tm.hxl.csv files. Need some post-processing just in case.

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ frictionless validate 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
# invalid: 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
==== ===== =============== =================================================================================================================
row field code message
==== ===== =============== =================================================================================================================
None 10 blank-label Label in the header in field at position "10" is blank
None 11 blank-label Label in the header in field at position "11" is blank
None 12 blank-label Label in the header in field at position "12" is blank
None 13 blank-label Label in the header in field at position "13" is blank
None 14 blank-label Label in the header in field at position "14" is blank
None 15 blank-label Label in the header in field at position "15" is blank
None 16 blank-label Label in the header in field at position "16" is blank
None 17 blank-label Label in the header in field at position "17" is blank
None 18 blank-label Label in the header in field at position "18" is blank
None 19 blank-label Label in the header in field at position "19" is blank
None 20 blank-label Label in the header in field at position "20" is blank
None 21 blank-label Label in the header in field at position "21" is blank
None 22 duplicate-label Label "#item+rem+i_qcc+is_zxxx+ix_wikiq" in the header at position "22" is duplicated to a label: at position "5"
==== ===== =============== =================================================================================================================
HUMMMMM.... the reason for the issue (empty columns) is actually quite curious: when automatically extracting data from the XLSX, more columns than necessary are extracted. In theory this could be fixed by the user (by deleting the extra columns), but this would be too annoying to document, so let's automate it. (Image with context of why it happens.)
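One way the automation could look (an illustrative sketch with assumed paths, not the actual post-processing code): drop columns whose header and every cell are empty, which is exactly what frictionless flags above as blank labels.

# Illustrative only: remove fully empty columns from an extracted CSV.
import csv

def drop_empty_columns(path_in, path_out):
    with open(path_in, newline="", encoding="utf-8") as fin:
        rows = list(csv.reader(fin))
    if not rows:
        return
    # Keep a column only if at least one row (header included) has content.
    keep = [
        index
        for index in range(len(rows[0]))
        if any(row[index].strip() for row in rows if index < len(row))
    ]
    with open(path_out, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in rows:
            writer.writerow([row[index] for index in keep if index < len(row)])

# Example (hypothetical paths):
# drop_empty_columns("extracted.tmp.csv", "1603/63/101/1603_63_101.no11.tm.hxl.csv")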
Okay. Almost there. The last validation error is the duplicated key column used to merge the Wikidata Q terms into the index of the dictionaries.

frictionless validate 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
# invalid: 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
==== ===== =============== =================================================================================================================
row field code message
==== ===== =============== =================================================================================================================
None 10 duplicate-label Label "#item+rem+i_qcc+is_zxxx+ix_wikiq" in the header at position "10" is duplicated to a label: at position "5"
==== ===== =============== =================================================================================================================

The problem (which needs a hotfix)

The hxlmerge CLI is already being used far beyond its tested strategies, so depending on the number of columns (I don't remember now exactly which byte it is, but after 100 languages it surely starts to happen) it will discard one column and raise an error like this: (…)
This error is deterministic (always the same, as if some strategy decides how many columns to check). But the way the merging works, it is all in memory, so the merge operations may already have more bugs because of a previous issue. In this case, the documented hxlmerge (…).

Potential hotfix

A potential hotfix here is to create another temporary file, use a different name for the key column, and, after the merge on the temporary file, discard the duplicated column (a rough sketch of that last step is below). Not ideal, but considering the amount of files and rewrites we're doing, it is pretty okay. Also it would not break if this gets fixed in the library (or, if it does break, we would know and simply use …).
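To illustrate the "discard the duplicated column afterwards" half of the hotfix (a sketch with assumed paths; the real pipeline may do this differently):

# Sketch of the clean-up half of the hotfix: after merging into a temporary
# file, keep only the first occurrence of any duplicated header (for example
# a second "#item+rem+i_qcc+is_zxxx+ix_wikiq") and drop the extra column.
import csv

def drop_duplicated_headers(path_in, path_out):
    with open(path_in, newline="", encoding="utf-8") as fin:
        rows = list(csv.reader(fin))
    if not rows:
        return
    seen = set()
    keep = []
    for index, label in enumerate(rows[0]):
        if label in seen:
            continue          # duplicated key column from the merge: skip it
        seen.add(label)
        keep.append(index)
    with open(path_out, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in rows:
            writer.writerow([row[index] for index in keep if index < len(row)])

# Hypothetical usage after the merge step:
# drop_duplicated_headers("1603_63_101.no11.tm.hxl.csv.tmp",
#                         "1603/63/101/1603_63_101.no11.tm.hxl.csv")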
Related
Since we already have several dictionaries (some more complete than others), it's getting complicated to call them manually. The fetching of the Wikidata Q labels in particular is prone to timeouts (it varies by hour of the day), so the tooling needs to deal with remote calls failing, retrying with some delay, or eventually giving up and trying again hours later.
For the sake of this Minimal Viable Product, the idea is to at least start using 1603:1:1 as the starting point to know all available dictionaries, then invoke them one by one instead of adding them directly to shell scripts.
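As an illustration of the retry-with-delay idea (assumed names and delays; not the project's actual code):

# Illustrative retry helper: retry a flaky remote call, such as a Wikidata
# fetch, with a growing delay, and give up after a few attempts so a
# scheduler can try again hours later.
import time

def call_with_retries(remote_call, attempts=3, base_delay=30):
    """remote_call: a zero-argument callable that raises on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return remote_call()
        except Exception:                     # e.g. timeout from Wikidata
            if attempt == attempts:
                raise                         # let the caller reschedule later
            time.sleep(base_delay * attempt)  # 30 s, 60 s, ... before retrying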