
Storage API design #913

Closed · pwalsh opened this issue Sep 23, 2021 · 4 comments · Fixed by #1304
Labels: general (General improvements)

Comments

@pwalsh
Member

pwalsh commented Sep 23, 2021

Overview

I've discussed with @roll how the additional work that @akariv and I are doing on Table Schema drivers (ref. https://github.com/frictionlessdata/tableschema-elasticsearch-py https://github.com/frictionlessdata/tableschema-sql-py https://github.com/frictionlessdata/tableschema-py ) could help bring this functionality forward in Frictionless, as well as in those older (but working and battle-tested) libraries.

In Frictionless the Storage API is "not finished":

The Storage concept is responsible for reading and writing data package from dataset source like CKAN, SQL, or others. Currently, the Storage API is not yet finished so you can try reading the codebase and implement your own storage but you need to be ready for some changes to the API that might come.

On reflection, I'd like to better understand the perceived shortcomings of the existing storage APIs, as implemented in the libraries above and a range of others.

The current Storage API has an interface like:

# pip install tableschema-sql
from tableschema import Storage

# Connect to a SQL backend, create a bucket from a Table Schema
# descriptor, then write and read rows
storage = Storage.connect('sql', **options)
storage.create('bucket', descriptor)
storage.write('bucket', rows)
storage.read('bucket')
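
The same interface also exposes describe and iter (per the tableschema-py docs), so metadata and rows can be pulled back out:

descriptor = storage.describe('bucket')  # Table Schema descriptor for the bucket
for row in storage.iter('bucket'):       # stream rows instead of loading them all
    print(row)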

This seems like a pretty reasonable interface. Based on my usage of these libraries, here are the things I'd like a better or more robust solution for:

  • Indexes, for storage backends that support them (there is a working implementation in Table Schema SQL)
  • Table update/upsert routines:
    • row identity for tests when updating (probably PK-based, but maybe not only)
    • update payloads (a subset of a row)
    • table migration (an update might add a field, or add an index)
  • More flexible field mapping: all Table Schema fields need to be mapped, and consumers need to be able to trivially override the default mappings (e.g., I know I have an array that is safe for a Postgres array field, so map to that and not JSONB; I have a field that I want to map in Elasticsearch to a keyword field and not a text field; and so on).
  • Comprehensive mapping of field constraints
  • Relational storage:
    • foreign key strategy (e.g., on SQL, normalize as an FK constraint, or flatten as an array field)
    • array field strategy (e.g., an option to normalize into a foreign key to a related table)

Not everything on this list is critical; it's just a set of things I've pondered recently. I don't think the existing storage API precludes any of these use cases, and it seems to me that the existing API is as good a starting point as any to iterate from.
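
To make the field-mapping point above concrete, here is a minimal sketch of what an overridable mapper could look like. Everything in it (DEFAULT_SQL_MAPPING, map_field, the overrides parameter) is hypothetical, not an existing API:

# Hypothetical sketch: consumers override only the mappings they care
# about and fall back to the backend defaults for everything else
DEFAULT_SQL_MAPPING = {
    'string': 'TEXT',
    'integer': 'INTEGER',
    'array': 'JSONB',
}

def map_field(field, overrides=None):
    """Resolve a Table Schema field to a storage-specific column type."""
    mapping = dict(DEFAULT_SQL_MAPPING, **(overrides or {}))
    return mapping[field['type']]

# "I know my array is safe for a Postgres array field":
column_type = map_field({'name': 'tags', 'type': 'array'},
                        overrides={'array': 'TEXT[]'})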



@roll added the general label Sep 23, 2021
@roll
Member

roll commented Sep 23, 2021

Thanks for creating it!

My latest research showed that the Storage API can't provide metadata without touching the data resources. For example, I want a package from a CKAN instance via package.from_ckan(...), but instead of creating a list of resources pointing at the CKAN dataset's resources, it opens HTTP connections straight after the storage.read_package call.
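
A sketch of the lazy behaviour I'd want instead (the class and method names are hypothetical; the point is that network I/O is deferred until rows are actually requested):

import urllib.request

class LazyResource:
    """Hypothetical: points at a remote CKAN resource and only opens
    an HTTP connection when rows are actually requested."""

    def __init__(self, url, descriptor):
        self.url = url                # remote location, no connection yet
        self.descriptor = descriptor  # metadata is available immediately

    def iter(self):
        # The HTTP request happens here, not at package-read time
        with urllib.request.urlopen(self.url) as response:
            for line in response:
                yield line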

@roll
Member

roll commented Sep 24, 2021

@pwalsh @akariv
I think it makes sense to write down how we use, and how we want to use, the storage API: which calls we use and which we want.

If I understand correctly, in dataflows you use it explicitly by creating a storage instance? In Frictionless, I just need a few methods that will allow doing things like package.from_sql / package.to_sql.

Also, in Frictionless resource/parsers are more developed so we can do something like:

from frictionless import Resource

resource = Resource('data/table.csv')
resource.write(database_url, dialect={'table': ...})

I'm curious what logic we can move to parsers. For example, using dialects, we can provide any SQL metadata like indexes, etc. Parsers might also be able to support upserts and the like.
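
For illustration, the kind of dialect payload this could mean. The table key mirrors the snippet above; indexes and updateKeys are hypothetical options a SQL parser could honour:

from frictionless import Resource

resource = Resource('data/table.csv')
resource.write(
    database_url,
    dialect={
        'table': 'my_table',
        'indexes': [['id'], ['created_at']],  # hypothetical: secondary indexes
        'updateKeys': ['id'],                 # hypothetical: upsert key
    },
)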

@akariv
Member

akariv commented Sep 24, 2021

One thing that I found really confining was the fact that there was no concept of a container; see this comment for example:
datahq/dataflows#138 (comment)

Some pseudo-code to illustrate a usage example:

>>> bucket = Storage('s3://<name-of-bucket>', **aws_creds)
>>> bucket.list()  # list files in bucket
['s3://<name-of-bucket>/my-dataset.tar.gz']
>>> tar_file = bucket.list()[0] # Actually a Storage instance, representing a tar.gz file in the S3 bucket
>>> tar_file.list()  # list files in tar file
['./datapackage.json', 'res1.csv', 'res2.csv']
>>> datapackage = tar_file.get('./datapackage.json') # Another Storage instance!
>>> datapackage.list() 
['res1', 'res2']
>>> res1 = datapackage.get('res1') # Yet another Storage instance
>>> res1.iter()
<iterator of rows>

The same concept works in a completely different scenario:

>>> instance = Storage('ckan://<host-of-ckan-instance>')
>>> dataset = instance.get('my-dataset')  # Storage instance for the CKAN dataset
>>> resources = dataset.list() # Lists all resources in the dataset
>>> resources
['datadump.csv', 'datadump.sqlite']
>>> db = resources[1] # A storage instance representing a CKAN resource
>>> db.describe()
{ ... schema of datafile ... }
>>> db.iter()
<iterator of rows>

And one last example:

>>> zipfile = Storage('./archive.zip')
>>> excels = zipfile.list('*.xlsx')
>>> excel_file = excels[0]
>>> report_sheets = excel_file.list('^[rR]eport.+$')
>>> report_sheet = report_sheets[0]
>>> report_sheet.options(headers=2)
>>> rows = list(report_sheet.iter())
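
One way to read these examples: every node is a Storage that can list children, get a named child, and iterate rows if it's a tabular leaf. A minimal sketch of that recursive protocol (hypothetical, just formalizing the pseudo-code above):

from abc import ABC, abstractmethod

class Storage(ABC):
    """Hypothetical container protocol: buckets, archives, datasets and
    resources all share one interface and nest arbitrarily."""

    @abstractmethod
    def list(self, pattern='*'):
        """Children of this node (files in a bucket, sheets in a workbook...)."""

    @abstractmethod
    def get(self, name):
        """A single named child, itself a Storage instance."""

    def describe(self):
        """Schema/metadata, mainly for leaf nodes."""
        raise NotImplementedError

    def iter(self):
        """Row iterator for tabular leaf nodes."""
        raise NotImplementedError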

@roll
Member

roll commented Sep 24, 2021

@akariv
Yes, exactly. We deal with containers. I've been running into these thoughts again and again. Even a file directory is a container of file resources.

The specs were initially created without taking this into account; the most "famous" example is Excel files.

#652

@roll roll moved this to Done in Open Knowledge Dec 26, 2021
@roll roll removed the status in Open Knowledge Dec 26, 2021
@roll roll added this to the v6 milestone Mar 21, 2022
@roll roll self-assigned this Apr 19, 2022
@roll roll moved this to Current in Open Knowledge Nov 18, 2022
Repository owner moved this from Current to Done in Open Knowledge Nov 18, 2022