
Storage API design #913

Closed · pwalsh opened this issue Sep 23, 2021 · 4 comments · Fixed by #1304
Labels: general (General improvements)

Comments

@pwalsh
Member

pwalsh commented Sep 23, 2021

Overview

I've discussed with @roll how the additional work that @akariv and I are doing on Table Schema drivers (ref. https://github.com/frictionlessdata/tableschema-elasticsearch-py https://github.com/frictionlessdata/tableschema-sql-py https://github.com/frictionlessdata/tableschema-py ) could help bring this functionality forward in Frictionless, as well as in those older (but working and battle-tested) libraries.

In Frictionless the Storage API is "not finished":

The Storage concept is responsible for reading and writing data package from dataset source like CKAN, SQL, or others. Currently, the Storage API is not yet finished so you can try reading the codebase and implement your own storage but you need to be ready for some changes to the API that might come.

On reflection, I'd like to better understand the perceived shortcomings of the existing storage APIs, as implemented in the libraries above and a range of others.

The current Storage API has an interface like:

# pip install tableschema-sql
from tableschema import Storage

# Connect to a SQL backend, create a bucket from a Table Schema
# descriptor, then write and read rows
storage = Storage.connect('sql', **options)
storage.create('bucket', descriptor)
storage.write('bucket', rows)
storage.read('bucket')
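
The same interface also exposes describe and iter (per the tableschema-py docs), so metadata and rows can be pulled back out:

descriptor = storage.describe('bucket')  # Table Schema descriptor for the bucket
for row in storage.iter('bucket'):       # stream rows instead of loading them all
    print(row)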

This seems like a pretty reasonable interface. Based on my usage of these libraries, here are the things I'd like a better or more robust solution for:

  • Indexes, for storage backends that support them (there is a working implementation in Table Schema SQL)
  • Table update/upsert routines:
    • row identity for tests when updating (probably PK-based, but maybe not only)
    • update payloads (a subset of a row)
    • table migration (an update might add a field, or add an index)
  • More flexible field mapping: all Table Schema fields need to be mapped, and consumers need to be able to trivially override the default mappings (e.g., I know I have an array that is safe for a Postgres array field, so map to that and not JSONB; I have a field that I want to map in Elasticsearch to a keyword field and not a text field; and so on).
  • Comprehensive mapping of field constraints
  • Relational storage:
    • foreign key strategy (e.g., on SQL, normalize as an FK constraint, or flatten as an array field)
    • array field strategy (e.g., an option to normalize into a foreign key to a related table)

Not everything on this list is critical; it's just a set of things I've pondered recently. I don't think the existing storage API precludes any of these use cases, and it seems to me that the existing API is as good a starting point as any to iterate from.
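
To make the field-mapping point above concrete, here is a minimal sketch of what an overridable mapper could look like. Everything in it (DEFAULT_SQL_MAPPING, map_field, the overrides parameter) is hypothetical, not an existing API:

# Hypothetical sketch: consumers override only the mappings they care
# about and fall back to the backend defaults for everything else
DEFAULT_SQL_MAPPING = {
    'string': 'TEXT',
    'integer': 'INTEGER',
    'array': 'JSONB',
}

def map_field(field, overrides=None):
    """Resolve a Table Schema field to a storage-specific column type."""
    mapping = dict(DEFAULT_SQL_MAPPING, **(overrides or {}))
    return mapping[field['type']]

# "I know my array is safe for a Postgres array field":
column_type = map_field({'name': 'tags', 'type': 'array'},
                        overrides={'array': 'TEXT[]'})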



@roll added the general label Sep 23, 2021
@roll
Member

roll commented Sep 23, 2021

Thanks for creating it!

My latest research showed that the Storage API can't provide metadata without touching the data resources. For example, I want a package from a CKAN instance via package.from_ckan(...), but instead of creating a list of resources pointing at the CKAN dataset's resources, it opens HTTP connections straight after the storage.read_package call.
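
A sketch of the lazy behaviour I'd want instead (the class and method names are hypothetical; the point is that network I/O is deferred until rows are actually requested):

import urllib.request

class LazyResource:
    """Hypothetical: points at a remote CKAN resource and only opens
    an HTTP connection when rows are actually requested."""

    def __init__(self, url, descriptor):
        self.url = url                # remote location, no connection yet
        self.descriptor = descriptor  # metadata is available immediately

    def iter(self):
        # The HTTP request happens here, not at package-read time
        with urllib.request.urlopen(self.url) as response:
            for line in response:
                yield line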

@roll
Member

roll commented Sep 24, 2021

@pwalsh @akariv
I think it makes sense to write down how we use, and how we want to use, the storage API: which calls we use and which we want.

If I understand correctly, in dataflows you use it explicitly by creating a storage instance? In Frictionless, I just need a few methods that will allow doing things like package.from_sql / package.to_sql.

Also, in Frictionless resource/parsers are more developed so we can do something like:

from frictionless import Resource

resource = Resource('data/table.csv')
resource.write(database_url, dialect={'table': ...})

I'm curious what logic we can move to parsers. For example, using dialects, we can provide any SQL metadata like indexes, etc. Parsers might also be able to support upserts and the like.
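
For illustration, the kind of dialect payload this could mean. The table key mirrors the snippet above; indexes and updateKeys are hypothetical options a SQL parser could honour:

from frictionless import Resource

resource = Resource('data/table.csv')
resource.write(
    database_url,
    dialect={
        'table': 'my_table',
        'indexes': [['id'], ['created_at']],  # hypothetical: secondary indexes
        'updateKeys': ['id'],                 # hypothetical: upsert key
    },
)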

@akariv
Member

akariv commented Sep 24, 2021

One thing that I found really confining was the fact that there was no concept of a container; see this comment for example:
datahq/dataflows#138 (comment)

Some pseudo-code to illustrate a usage example:

>>> bucket = Storage('s3://<name-of-bucket>', **aws_creds)
>>> bucket.list()  # list files in bucket
['s3://<name-of-bucket>/my-dataset.tar.gz']
>>> tar_file = bucket.list()[0] # Actually a Storage instance, representing a tar.gz file in the S3 bucket
>>> tar_file.list()  # list files in tar file
['./datapackage.json', 'res1.csv', 'res2.csv']
>>> datapackage = tar_file.get('./datapackage.json') # Another Storage instance!
>>> datapackage.list() 
['res1', 'res2']
>>> res1 = datapackage.get('res1') # Yet another Storage instance
>>> res1.iter()
<iterator of rows>

The same concept works in a completely different scenario:

>>> instance = Storage('ckan://<host-of-ckan-instance>')
>>> dataset = instance.get('my-dataset')  # Storage instance for the CKAN dataset
>>> resources = dataset.list() # Lists all resources in the dataset
>>> resources
['datadump.csv', 'datadump.sqlite']
>>> db = resources[1] # A storage instance representing a CKAN resource
>>> db.describe()
{ ... schema of datafile ... }
>>> db.iter()
<iterator of rows>

And one last example:

>>> zipfile = Storage('./archive.zip')
>>> excels = zipfile.list('*.xlsx')
>>> excel_file = excels[0]
>>> report_sheets = excel_file.list('^[rR]eport.+$')
>>> report_sheet = report_sheets[0]
>>> report_sheet.options(headers=2)
>>> rows = list(report_sheet.iter())
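
One way to read these examples: every node is a Storage that can list children, get a named child, and iterate rows if it's a tabular leaf. A minimal sketch of that recursive protocol (hypothetical, just formalizing the pseudo-code above):

from abc import ABC, abstractmethod

class Storage(ABC):
    """Hypothetical container protocol: buckets, archives, datasets and
    resources all share one interface and nest arbitrarily."""

    @abstractmethod
    def list(self, pattern='*'):
        """Children of this node (files in a bucket, sheets in a workbook...)."""

    @abstractmethod
    def get(self, name):
        """A single named child, itself a Storage instance."""

    def describe(self):
        """Schema/metadata, mainly for leaf nodes."""
        raise NotImplementedError

    def iter(self):
        """Row iterator for tabular leaf nodes."""
        raise NotImplementedError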

@roll
Member

roll commented Sep 24, 2021

@akariv
Yes, exactly. We deal with containers. I've been running into these thoughts again and again. Even a file directory is a container of file resources.

The specs were initially created without taking this into account; the most "famous" example is Excel files.

#652

@roll roll moved this to Done in Open Knowledge Dec 26, 2021
@roll roll removed the status in Open Knowledge Dec 26, 2021
@roll roll added this to the v6 milestone Mar 21, 2022
@roll roll self-assigned this Apr 19, 2022
@roll roll moved this to Current in Open Knowledge Nov 18, 2022
Repository owner moved this from Current to Done in Open Knowledge Nov 18, 2022