[meta issue] HXL and data directly from and to SQL databases #10

Open
fititnt opened this issue Feb 19, 2021 · 1 comment
Labels
data-transformation https://en.wikipedia.org/wiki/Data_transformation

Comments


fititnt commented Feb 19, 2021

  • TL;DR:
    • This issue is just a simple place to collect information related to the use of SQL storage. It may or may not go beyond proof of concepts.
    • To add value (rather than merely over-complicating a database dump to CSV and a database import, which could be done with external tools and documentation), any command line tool should at least read directly from one or more databases and allow creating a valid SQL file that could be used to import the data again.
      • If this becomes too hard, at least we could document scripts to convert CSV to SQL.
    • Interoperability (think of planning taxonomies and how to save the equivalent of HXL hashtags as database column names) is the main objective, even if this means just preparing documentation and cutting performance features that could be handled by external tools, rather than HXL using an SQL database directly.
      • The direct implication of this (my guess, not tested) is that most HXL parser tools, not just libhxl-python (whether used directly or as a library by tools like the ones here), are still unlikely to optimize commands by converting them to equivalent SQL SELECTs, and will still have to work with temporary files (which could still be acceptably fast).
        • In other words: HXL importers/exporters (in theory, not tested) should not break hard if you run out of memory (unlike some other data mining tools, which fill your memory until the computer crashes), but large datasets may be slow.
      • Please note that even if we compare HXL tools with programs that can load data from databases, most of those are also optimized for files on disk, or even have to load the entire dataset into memory.
        • And some enterprise ones (if they are not already expensive) seem to charge extra to allow working directly from a database rather than from their proprietary file format.
        • But even if HXL tools or HXL-proxy cannot be super optimized for gigabyte-scale database processing (like 50 ms response times with very optimized indexes and SELECTs), they could still be useful for anyone who would use HXL to merge several datasets in a single place.
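The "dump to a valid SQL file" idea above could be sketched roughly like this. This is a hypothetical helper, not part of libhxl-python; the function name, the default table name `hxl_data`, and the choice to use the HXL hashtag row directly as (quoted) column names are all assumptions for illustration:

```python
import csv
import io

def hxlated_csv_to_sql(csv_text, table="hxl_data"):
    """Turn an HXLated CSV (header row, then HXL hashtag row, then data)
    into a SQL script that could be used to import the data again.

    Hypothetical sketch: column names are the HXL hashtags themselves,
    double-quoted so engines such as PostgreSQL or SQLite accept the
    leading '#'. Everything is typed as TEXT for simplicity."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, hashtags, data = rows[0], rows[1], rows[2:]
    cols = ", ".join('"%s" TEXT' % tag for tag in hashtags)
    sql = ['CREATE TABLE "%s" (%s);' % (table, cols)]
    for row in data:
        # Escape single quotes the standard SQL way ('' inside a literal)
        values = ", ".join("'%s'" % v.replace("'", "''") for v in row)
        sql.append('INSERT INTO "%s" VALUES (%s);' % (table, values))
    return "\n".join(sql)

example = "Country,Population\n#country,#population\nBrazil,212000000\n"
print(hxlated_csv_to_sql(example))
```

A real implementation would also need proper type mapping (see the StorageTypes discussion below) and safer parameter handling, but the round-trip idea is this simple at its core.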

This issue is a draft. Some extra information may be edited or added later.

fititnt commented Feb 19, 2021

TL;DR of this post: maybe HXLMeta (Usable Class) #9 could also have a local database to use as a helper.

Maybe most databases accept HXL Hashtags without changes

Good news: PostgreSQL actually accepts # as the first character of column names. (Also tested with MariaDB, so maybe others do too.)

Until last week I believed that only SQLite accepted almost anything as a column name, so I was concerned about what to use to replace # and + on almost every other database engine. But since current databases may actually allow both # and +, this simplifies things a lot!
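A quick sanity check of the claim above, using SQLite (the same works in PostgreSQL with double-quoted identifiers). The table and column names here are just made-up examples:

```python
import sqlite3

# Check that an embedded engine accepts HXL hashtags (with '#' and '+')
# as column names, as long as the identifiers are double-quoted.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("#adm1+name" TEXT, "#population" INTEGER)')
conn.execute("INSERT INTO t VALUES (?, ?)", ("Coastal Region", 75000))
row = conn.execute('SELECT "#adm1+name", "#population" FROM t').fetchone()
print(row)  # → ('Coastal Region', 75000)
```

The only caveat is that every query touching these columns must quote them, since a bare `#` is not valid in an unquoted identifier.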

Tests with csvsql

csvsql uses SQLAlchemy. The file is actually not that complex, so in the worst-case scenario we could just implement the same thing. (But as a reference: as expected, csvsql exporting from generic CSV files may be OK, but I have not yet tested whether the types would be more generic.)

BUT since on the HXLMeta issue we're already mapping more exact StorageTypes (and this is likely to take much more time to get right), I think that for exporting from HXLated datasets to the most common SQL databases we may not need something like SQLAlchemy at all.

But for importing into HXL tools, some abstraction like SQLAlchemy (at least for Python HXL tools) is definitely worth looking at.
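For the importing direction, even without SQLAlchemy, the basic shape is straightforward with the standard library: read a table and emit an HXLated CSV (column names on the first row, hashtags on the second). This is a sketch, not a real tool; the function name, the sample `places` table, and the explicit column-to-hashtag mapping are assumptions (in practice that mapping could come from something like the HXLMeta work in #9):

```python
import csv
import io
import sqlite3

def table_to_hxl_csv(conn, table, hashtags):
    """Read a database table and emit an HXLated CSV string.

    `hashtags` maps column name -> HXL hashtag; columns without a
    mapping get an empty hashtag cell."""
    cur = conn.execute('SELECT * FROM "%s"' % table)
    columns = [d[0] for d in cur.description]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)                              # header row
    writer.writerow([hashtags.get(c, "") for c in columns])  # HXL row
    writer.writerows(cur.fetchall())                      # data rows
    return out.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (adm1 TEXT, pop INTEGER)")
conn.execute("INSERT INTO places VALUES ('Coastal Region', 75000)")
print(table_to_hxl_csv(conn, "places",
                       {"adm1": "#adm1+name", "pop": "#population"}))
```

What SQLAlchemy would add over this is exactly the part worth having an abstraction for: connecting to many engines with one API and reflecting richer type information.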

SQLite as a potential alternative for a local collection of taxonomies

In addition to country/territory codes (and without resorting to loading the entire P-Codes for local usage), there are some taxonomies (at least the one for language codes) that I think would eventually be useful to have near the computer running complex inferences. On @HXL-CPLP we're already drafting taxonomies such as the words used to represent true/false in different languages, so maybe it would be possible to make some taxonomies important enough that the user could build their own cache. One good initial candidate could be booleans (using 2-letter ISO codes as namespace, something like +v_un_bool for the 6 UN languages and +v_eu_bool for a draft of 20+ European ones), plus some way for a person to "merge" more than one external source of reference.

I'm not fully sure this would really be necessary (and, in fact, for a few tables, even a folder of plain HXLated CSVs would work). But for cases like the booleans, a single canonical table would not be ideal (if not because of the user, then because it makes on-the-fly implementation harder).

But anyway, in both cases (local SQLite or CSVs), something that could "build" a local database that persists across executions (and also works offline, so that even under heavy use, in the worst-case scenario, not even Google Drive could get rate limited or blocked) seems a total win-win.
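A minimal sketch of such a persistent boolean-taxonomy cache. The table layout, the seed terms, and the function names are all hypothetical; a real cache would be seeded from external HXLated sources (merging more than one reference, as suggested above) and stored on disk so it survives across executions and works offline:

```python
import sqlite3

def build_bool_cache(path=":memory:"):
    """Create (or reopen) a small persistent lookup table mapping
    (language code, lexical form) -> boolean value.

    Hypothetical schema; pass a file path instead of ':memory:' to
    make the cache persist across executions."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS bool_terms
                    (lang TEXT, term TEXT, value INTEGER,
                     PRIMARY KEY (lang, term))""")
    # Illustrative seed data only; real data would come from the
    # HXL-CPLP taxonomy drafts mentioned above.
    seed = [("en", "yes", 1), ("en", "no", 0),
            ("pt", "sim", 1), ("pt", "não", 0)]
    conn.executemany("INSERT OR IGNORE INTO bool_terms VALUES (?,?,?)", seed)
    conn.commit()
    return conn

def to_bool(conn, lang, term):
    """Resolve a localized term to True/False, or None if unknown."""
    row = conn.execute(
        "SELECT value FROM bool_terms WHERE lang=? AND term=?",
        (lang, term.lower())).fetchone()
    return None if row is None else bool(row[0])

conn = build_bool_cache()
print(to_bool(conn, "pt", "Sim"))  # → True
```

The same `INSERT OR IGNORE` pattern also gives a cheap way to "merge" several external sources: later sources simply do not overwrite terms that an earlier, higher-priority source already defined.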

fititnt added a commit that referenced this issue Feb 19, 2021
@fititnt fititnt added the data-transformation https://en.wikipedia.org/wiki/Data_transformation label Mar 28, 2021