New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37
The CLI we use to query the main 1603_1_1.py (and to glue other operations together, including exporting the Codex) is, not surprisingly, getting a lot of options. For future reference, I will leave here what it looks like at this moment. Since some options, such as --codex-de, are shared, I think it is better to create options that state explicitly which operation is in active use instead of implicitly assuming it from the most common uses.

The full output of `./999999999/0/1603_1.py --help`:

```
fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ ./999999999/0/1603_1.py --help
usage: 1603_1 [-h] [--punctum-separato-de-resultatum [RESULTATUM_SEPARATO]]
              [--punctum-separato-de-fontem [FONTEM_SEPARATO]]
              [--de-archivum] [--data-apothecae-ad [DATA_APOTHECAE_AD]]
              [--data-apothecae-ex DATA_APOTHECAE_EX]
              [--data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO]
              [--data-apothecae-formato [{datapackage,sqlite}]]
              [--dictionaria-numerordinatio] [--codex-de [CODEX_DE]]
              [--objectivum-linguam [OBJECTIVUM_LINGUAM]]
              [--auxilium-linguam AUXILIUM_LINGUAM] [--codex-copertae]
              [--codex-in-tabulam-json] [--status-quo] [--status-in-markdown]
              [--status-in-datapackage] [--ex-librario [EX_LIBRARIO]]
              [--ex-opere-temporibus [EX_OPERE_TEMPORIBUS]]
              [--quaero-ix_n1603ia [QUAERO_IX_N1603IA]]
              [--quaero-numerordinatio [QUAERO_NUMERORDINATIO]]
              [--in-limitem [IN_LIMITEM]]
              [--in-ordinem [{numerordinatio,chaos}]]
              [--objectivum-formatum-asciidoctor]
              [infile]

Explain the dictionaries

positional arguments:
  infile                HXL file to read (if omitted, use standard input).

optional arguments:
  -h, --help            show this help message and exit
  --punctum-separato-de-resultatum [RESULTATUM_SEPARATO]
                        Character(s) used as separator for generate output.
                        Used only for tabular results. Defaults to tab " "
  --punctum-separato-de-fontem [FONTEM_SEPARATO]
                        Character(s) used as separator from input file. Used
                        only for tabular results. Defaults to comma ","

Archivum:
  (DEFAULT USE) Use archive as source (directory not ready yet)

  --de-archivum         Parse single archive

Data apothēcae:
  data apothēcae. (One) Warehouse of datasets. Compile selected dictionaries
  to a single place (likely single database entry point)

  --data-apothecae-ad [DATA_APOTHECAE_AD]
                        Path to file (or reference to database) to store
                        result
  --data-apothecae-ex DATA_APOTHECAE_EX
                        Comma-separated list of dictionaries to initialize
  --data-apothecae-ex-archivo DATA_APOTHECAE_EX_ARCHIVO
                        Path to file with list (one item per line) of
                        dictionaries to initialize
  --data-apothecae-formato [{datapackage,sqlite}]
                        Output format. Default will try make a guess from
                        --data-apothecae-ad pattern.

Dictionaria:
  Generate dictionaries. No input required (uses disk 1603 and 999999999/1603
  data files)

  --dictionaria-numerordinatio
                        Dictionary of all possible values on stricter
                        Numerordĭnātĭo (HXLStantad container)
  --objectivum-formatum-asciidoctor
                        (Default) Output Asciidoctor format

Codex:
  Book/manual creation

  --codex-de [CODEX_DE]
                        Generate documentation of dictionaries
  --objectivum-linguam [OBJECTIVUM_LINGUAM]
                        Target natural language (use if not auto-detected).
                        Must be like {ISO 639-3}-{ISO 15924}. Example:
                        arb-Arab. Default: mul-Zyyy
  --auxilium-linguam AUXILIUM_LINGUAM
                        Define auxiliary languages. Must be like
                        {ISO 639-3}-{ISO 15924}. Example: "ina-Latn,ile-Latn".
                        Accepts multiple values.
  --codex-copertae      Pre-calculate the codex, but only generate Codex
                        cover (SVG)
  --codex-in-tabulam-json
                        Pre-calculate the codex, but only generate Tabular
                        Data (MediaWiki syntax 1) (JSON). See
                        https://www.mediawiki.org/wiki/Help:Tabular_Data

Status quō:
  Calculate current situation. Used to take other actions. Requires
  --codex-de 1603_NN_NN (focused Codex). Works with --quaero-ix_n1603ia.

  --status-quo          Compute the status quo, using a codex as initial
                        reference
  --status-in-markdown  Return status in Markdown (instead of YAML)
  --status-in-datapackage
                        Return status in frictionless datapackage.json. With
                        --ex-librario returns profile data-package-catalog.
                        (low level of details)
  --ex-librario [EX_LIBRARIO]
                        Status novō. New state. Persist changes if necessary

Opus temporibus:
  Crontab/cronjob information

  --ex-opere-temporibus [EX_OPERE_TEMPORIBUS]
                        ex opere temporibus. Out of work times (crontab)
  --quaero-ix_n1603ia [QUAERO_IX_N1603IA]
                        Query ix_n1603ia. Rudimentar && (AND) and || (OR).
                        Use var<1 to test 0 or undefined. Filter.
                        Ex. "{publicum}>10 && {internale}<1"
  --quaero-numerordinatio [QUAERO_NUMERORDINATIO]
                        Query Numerordĭnātĭo. Additional filter list for
                        focused base of dictionaries. Ideal to check if some
                        groups meet other filters. Example: if result return
                        empty and other queries are to check if need to fetch
                        again from Wikidata Q, then you assume no new fetch
                        is necessary
  --in-limitem [IN_LIMITEM]
                        /Against the limit of/. Limit maximum number of cron
                        jobs to show.
  --in-ordinem [{numerordinatio,chaos}]
                        /Against arrangement (ordering) of/. Sort result list
                        to this rule. Options: numerordinatio=natural order;
                        chaos=random order

Exemplōrum gratiā:

    printf "#item+conceptum+codicem,#item+rem+i_qcc+is_zxxx+ix_wikiq" | ./999999999/0/1603_1.py --de-archivum
    cat 1603/1/1/1603_1_1.no1.tm.hxl.csv | ./999999999/0/1603_1.py --de-archivum
    ./999999999/0/1603_1.py --de-archivum 1603/1/1/1603_1_1.no1.tm.hxl.csv
    ./999999999/0/1603_1.py --dictionaria-numerordinatio
    ./999999999/0/1603_1.py --codex-de 1603_63_101
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-copertae
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --codex-in-tabulam-json
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="cdn"
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --status-quo --ex-librario="locale" --status-in-markdown
    ./999999999/0/1603_1.py --codex-de 1603_63_101 --ex-opere-temporibus='cdn'
    ./999999999/0/1603_1.py --ex-opere-temporibus='cdn' --quaero-ix_n1603ia='({publicum}>=9)&&({victionarium_q}>9)'
    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.datapackage.json'
    ./999999999/0/1603_1.py --data-apothecae-ex='1603_45_1,1603_45_31' --data-apothecae-ad='apothecae.sqlite'
```
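As a side note on making the active operation explicit: the sketch below (hypothetical flag set and choices, not the real 1603_1.py argument parser) shows the general argparse pattern of one explicit operation selector plus shared options that only parameterize it.

```python
# Minimal sketch (assumption: hypothetical flag names and choices, not the
# real 1603_1.py CLI). One explicit selector states which operation runs;
# shared options such as --codex-de merely parameterize it.
import argparse

parser = argparse.ArgumentParser(prog='1603_1_sketch')
parser.add_argument(
    '--methodus',
    choices=['de-archivum', 'codex', 'status-quo', 'data-apothecae'],
    default='de-archivum',
    help='Which operation to run (explicit, instead of inferred from other flags)')
parser.add_argument(
    '--codex-de',
    help='Dictionary group, e.g. 1603_63_101 (shared by several operations)')
parser.add_argument(
    '--data-apothecae-ad',
    help='Target file or database, only meaningful for --methodus=data-apothecae')

args = parser.parse_args(['--methodus', 'codex', '--codex-de', '1603_63_101'])
print(args.methodus, args.codex_de)
```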
…es to SQLite (still with some off-by-one error)
…ite (still need to fix issue when tables have over 200 columns, e.g. with translations)
Oh, f**k me. Except for SQLite (which is quite flexible), PostgreSQL is not okay with it (and the same is likely to happen with other databases). TODO: in addition to an RDF with HXL and an RDF with BCP47 language tags, we will need... something which could still be compatible, but if a user were to ingest it into SQL databases, the new identifier should be predictable and reversible. For the database format, we may also need some additional way to store only the relations of the pivots, without this having to become part of the header alone (as we do with the tabular format on CSV).
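To make "predictable and reversible" concrete, here is a minimal sketch (my own illustrative encoding, not a decision of this project) of a reversible mapping from an HXL-style header to an identifier that only uses characters every SQL database accepts; the real convention would still need to stay compact enough for limits such as PostgreSQL's 63-character identifier cap.

```python
# Minimal sketch (assumption: illustrative encoding, not the project's actual
# naming convention; assumes ASCII headers, which HXL hashtags are).
# Every character outside [a-z0-9] is escaped as _xHH_, so the original
# header can always be reconstructed exactly.
import re


def encode_column(header: str) -> str:
    """Encode an HXL-style header into a SQL-safe, reversible identifier."""
    out = []
    for ch in header:
        if re.fullmatch(r'[a-z0-9]', ch):
            out.append(ch)
        else:
            out.append('_x{:02x}_'.format(ord(ch)))
    return ''.join(out)


def decode_column(identifier: str) -> str:
    """Reverse encode_column()."""
    return re.sub(r'_x([0-9a-f]{2})_',
                  lambda m: chr(int(m.group(1), 16)), identifier)


header = '#item+rem+i_qcc+is_zxxx+ix_wikiq'
encoded = encode_column(header)
assert decode_column(encoded) == header
print(encoded)  # _x23_item_x2b_rem_x2b_i_x5f_qcc_x2b_is_x5f_zxxx_x2b_ix_x5f_wikiq
```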
By the way, the naming strategy for the tables is already very, very compact and predictable. The names may be weird for humans, but since most users would use better interfaces, the use of numbers is a non-issue. However, the way to encode the columns (at least outside SQLite or plain CSV) needs more fine-tuning. Anyway, we can still take more time on the imports to PostgreSQL/MySQL (traditional databases). We're already looking into R2RML (https://www.w3.org/TR/r2rml/) or, more generally, Ontology-Based Data Access (OBDA), so it would naturally already take some time to think about how to name the columns in the relational database format. In most strategies the R2RML is created by humans (even if with graphical interfaces), but we cannot do it that way. So the way the columns are named might also be designed in a way that simplifies generating R2RML.
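Since the R2RML here would have to be generated by software rather than written by hand, here is a minimal sketch (hypothetical helper and placeholder IRIs, nothing from this repository) of how predictable table and column names would let a TriplesMap be emitted mechanically:

```python
# Minimal sketch (assumption: placeholder namespace and helper names, not the
# project's final convention). Predictable column names are what makes this
# mechanical generation possible at all.

def r2rml_triples_map(table: str, key_column: str, columns: list) -> str:
    """Emit one R2RML TriplesMap (Turtle) for a table whose columns follow a
    predictable naming convention."""
    lines = [
        '@prefix rr: <http://www.w3.org/ns/r2rml#> .',
        '@prefix ex: <urn:example:1603:> .',  # placeholder vocabulary namespace
        '',
        f'<#TriplesMap_{table}>',
        f'    rr:logicalTable [ rr:tableName "{table}" ] ;',
        f'    rr:subjectMap [ rr:template "urn:example:1603:{table}/{{{key_column}}}" ] ;',
    ]
    for column in columns:
        lines.append('    rr:predicateObjectMap [')
        lines.append(f'        rr:predicate ex:{column} ;')
        lines.append(f'        rr:objectMap [ rr:column "{column}" ]')
        lines.append('    ] ;')
    lines[-1] = lines[-1].rstrip(';') + '.'  # close the last property with a period
    return '\n'.join(lines)


print(r2rml_triples_map('1603_45_16', 'codicem', ['codicem', 'ix_wikiq']))
```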
… and BCP47, not yet entire header group
Done first test. The FKs are not optimized, but it is now viable to also load some more standard SQL databases. TL;DR:
1. Generate the datapackage (this step allows selecting what should be imported from the CSVs into other storages):

```bash
DATA_APOTHECAE_MINIMIS=1 ./999999999/0/1603_1.py --methodus='data-apothecae' --data-apothecae-ad-stdout --data-apothecae-formato='datapackage' --data-apothecae-ex-suffixis='no1.bcp47.csv' --data-apothecae-ex-praefixis='1603_45_16' > ./apothecae~1603_45_16.datapackage.json
```

1.1. SQLite (already was working). SQLite is quite permissive about what it accepts: it was already allowing data fields with over 60 characters (PostgreSQL and others don't). A conceptual sketch of what this script does appears after this list.

```bash
./999999999/0/frictionless_to_sqlite.py --datapackage='apothecae~1603_45_16.datapackage.json' --sqlite='999999/0/apothecae~1603_45_16.sqlite'
```

1.2. PostgreSQL. Note: this requires PostgreSQL already installed (SQLite can work with file-based access).

```bash
./999999999/0/frictionless_to_postgresql.py --datapackage='apothecae~1603_45_16.datapackage.json' --postgresql='fititnt:pass@localhost/mdciii'
```

1.3. Microsoft Excel (frictionless actually also accepts LibreOffice, but that is not implemented yet). It turns out Excel can load over 470 sheets just fine. The file on disk is 17.3 MB, but I think it uses over 470 MB of RAM via LibreOffice. However, at least if the final intent is to use it from the command line afterwards, I really would recommend exporting every sheet to CSV or another storage, because opening and closing sheet by sheet via the command line requires a very significant startup time (the file is stored as a zip of XMLs, not as optimized as CSVs or SQLite would be).

```bash
time ./999999999/0/frictionless_to_excel.py --datapackage='apothecae~1603_45_16.datapackage.json' --excel='999999/0/apothecae~1603_45_16.xlsx'
# real 1m56,411s
# user 1m55,135s
# sys  0m1,042s
```
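For reference, here is a minimal sketch of the general idea behind frictionless_to_sqlite.py (this is not the script's actual code; it assumes the datapackage.json lists local CSV resources with name and path keys, and it uses plain pandas plus sqlite3 instead of the frictionless library):

```python
# Minimal sketch (assumption: NOT the real frictionless_to_sqlite.py; it only
# illustrates the general idea). Reads a datapackage.json and loads each local
# CSV resource into a SQLite table named after the resource.
import json
import sqlite3

import pandas as pd


def datapackage_to_sqlite(datapackage_path: str, sqlite_path: str) -> None:
    with open(datapackage_path, encoding='utf-8') as handle:
        package = json.load(handle)

    connection = sqlite3.connect(sqlite_path)
    try:
        for resource in package.get('resources', []):
            # Assumes each resource is a local CSV with 'name' and 'path' keys.
            frame = pd.read_csv(resource['path'])
            frame.to_sql(resource['name'], connection,
                         if_exists='replace', index=False)
        connection.commit()
    finally:
        connection.close()


if __name__ == '__main__':
    datapackage_to_sqlite('apothecae~1603_45_16.datapackage.json',
                          'apothecae~1603_45_16.sqlite')
```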
Idea of this issue
TODO: Experimental CLI feature to bootstrap a database from selected dictionaries (...somewhat equivalent to bootstrapping a data warehouse)
It does not make sense to pre-generate binary databases for end users; that would be somewhat of a waste of space. Also, users may be more interested in some dictionaries than in others, so even a near-single global database would be too big, would potentially be in an inconsistent state from time to time, and would obviously make the compilation times absurdly long.
However, sooner or later people (or at least we, for our internal use) will want to ingest everything of interest into some relational database. In fact, this would be a side effect of better data formats for describing the datasets, such as Frictionless Data or W3C Tabular Data.
However, we can cut a lot of time (and a lot of pain, such as commands to re-ingest dictionaries one by one) by simply providing a flow (even if it relies on the experimental features of Frictionless Data) that is already optimized to create the full database from pre-selected groups of dictionaries. This would also be more aligned with the philosophy of automating what would otherwise require more documentation, AND it could help give a better overview of the datasets without going through them one by one.
Other comments
The common use case here assumes that data related to dictionaries can be re-bootstrapped and that, once finished, no more writes occur (at least not on the reference tables). So SQLite would be a perfect fit (even for production use and huge databases, as long as no concurrent writes are necessary). However, PostgreSQL (or whatever the user wants to convert the SQLite file into) would be another alternative.
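As an illustration of the "bootstrap once, then only read" pattern (plain Python standard library, nothing project-specific): once the warehouse is built, consumers can open the SQLite file in read-only mode so that accidental writes are rejected by SQLite itself.

```python
# Minimal sketch (standard library only): open the bootstrapped database in
# read-only mode, so the "no writes after bootstrap" assumption is enforced
# by SQLite itself.
import sqlite3

# The ?mode=ro URI parameter makes SQLite reject any write attempt.
connection = sqlite3.connect(
    'file:apothecae~1603_45_16.sqlite?mode=ro', uri=True)
try:
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    for (table_name,) in cursor:
        print(table_name)
finally:
    connection.close()
```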
Open room for conventions to store Common Operational Datasets (at least COD-ABs)
[1603.45.16] /"Ontologia"."United Nations"."P"/@eng-Latn #2

While the dictionaries we're creating have their index handcrafted (even if the terminology translations are compiled with software), the perfect first candidates to optimize for users to ingest in a predictable way would be the CODs.
Note: in case we fetch data from other sources (such as @digital-guard), the actual use case here would focus on live data, not archived data.
Before going to CODs, we need to optimize the dictionaries that explain them
To have a sane way to ingest data, we would first need the dictionaries from [1603:??] Geographia (create base numerospace) #31 to already be done.
Our dictionaries can reuse other dictionaries (so things get better over time), and at least for concepts related to places, the number used to access the dictionary can actually encode the country.