Storage API design #913
Thanks for creating it! My latest research showed that the Storage API can't provide metadata without touching data resources. For example, I want to get a package from a CKAN instance without reading the data itself.
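To make the limitation concrete, here is a hypothetical sketch of the metadata-only access we'd want. The `ckan://` scheme and the behavior of `describe()` follow the container-style pseudo-code later in this thread; none of this is the current Storage API:

```python
# Hypothetical: metadata without touching the data resources.
# Storage/get/describe follow the container pseudo-code below;
# this is an illustration of the desired behavior, not the current API.
instance = Storage('ckan://<host-of-ckan-instance>')
dataset = instance.get('my-dataset')
metadata = dataset.describe()  # ideally fetches only the descriptor,
                               # without downloading the data files
```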
@pwalsh @akariv If I understand correctly, in Frictionless resources/parsers are more developed, so we can do something like:

```python
resource = Resource('data/table.csv')
resource.write(database_url, dialect={'table': ...})
```

I'm curious what logic we can move to parsers. For example, using dialects we can provide any SQL metadata, like indexes, etc. Parsers might also support upserts, etc.
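As a hedged sketch of that idea: the `table` key mirrors the snippet above, while `indexes` is a hypothetical dialect option illustrating how backend-specific SQL metadata could ride along with a write:

```python
from frictionless import Resource

resource = Resource('data/table.csv')

# 'table' appears in the snippet above; 'indexes' is hypothetical,
# showing how SQL metadata such as indexes could be passed via the dialect.
resource.write(
    'postgresql://user:pass@localhost/db',
    dialect={'table': 'table1', 'indexes': [['id']]},
)
```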
One thing that I found really confining was the fact that there was no concept of a container - see this comment for an example. Some pseudo-code to illustrate a usage example:

```python
>>> bucket = Storage('s3://<name-of-bucket>', **aws_creds)
>>> bucket.list()  # list files in the bucket
['s3://<name-of-bucket>/my-dataset.tar.gz']
>>> tar_file = bucket.list()[0]  # actually a Storage instance, representing a tar.gz file in the S3 bucket
>>> tar_file.list()  # list files in the tar file
['./datapackage.json', 'res1.csv', 'res2.csv']
>>> datapackage = tar_file.get('./datapackage.json')  # another Storage instance!
>>> datapackage.list()
['res1', 'res2']
>>> res1 = datapackage.get('res1')  # yet another Storage instance
>>> res1.iter()
<iterator of rows>
```

The same concept works in a completely different scenario:

```python
>>> instance = Storage('ckan://<host-of-ckan-instance>')
>>> dataset = instance.get('my-dataset')  # Storage instance for the CKAN dataset
>>> resources = dataset.list()  # lists all resources in the dataset
>>> resources
['datadump.csv', 'datadump.sqlite']
>>> db = resources[1]  # a Storage instance representing a CKAN resource
>>> db.describe()
{ ... schema of datafile ... }
>>> db.iter()
<iterator of rows>
```

And one last example:

```python
>>> zipfile = Storage('./archive.zip')
>>> excels = zipfile.list('*.xlsx')
>>> excel_file = excels[0]
>>> report_sheets = excel_file.list('^[rR]eport.+$')
>>> report_sheet = report_sheets[0]
>>> report_sheet.options(headers=2)
>>> rows = list(report_sheet.iter())
```
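Reading these examples together, the implied interface could be captured as an abstract base class. This is only a sketch of what the pseudo-code suggests: `list`, `get`, `describe`, `iter`, and `options` come from the examples above, while the type signatures and docstrings are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional

class Storage(ABC):
    """A node in a storage hierarchy: bucket, archive, dataset, or table."""

    @abstractmethod
    def list(self, pattern: Optional[str] = None) -> List["Storage"]:
        """List child items, optionally filtered by a glob or regex pattern."""

    @abstractmethod
    def get(self, name: str) -> "Storage":
        """Return a named child item, itself a Storage instance."""

    def describe(self) -> dict:
        """Return metadata (e.g. a Table Schema) for this item."""
        raise NotImplementedError

    def iter(self) -> Iterator[list]:
        """Iterate over rows, for leaf items holding tabular data."""
        raise NotImplementedError

    def options(self, **opts) -> None:
        """Set read options (e.g. which row holds the headers)."""
        raise NotImplementedError
```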
Overview
I've discussed with @roll how the additional work that @akariv and I are doing on Table Schema drivers (ref. https://github.com/frictionlessdata/tableschema-elasticsearch-py, https://github.com/frictionlessdata/tableschema-sql-py, https://github.com/frictionlessdata/tableschema-py) could help bring this functionality forward in Frictionless, as well as in those older (but working and battle-tested) libraries.
In Frictionless, the Storage API is "not finished".
On reflection, I'd like to better understand what are perceived as the shortcomings of the existing Storage APIs, as implemented in the libraries above and a number of others.
The current Storage API has an interface like:
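For reference, a sketch of that interface as documented in tableschema-py; this is reconstructed from that library's docs, so the exact snippet may differ in detail:

```python
# Sketch of the tableschema-py Storage interface, reconstructed from
# that library's documentation; signatures may differ slightly.
class Storage:
    def __init__(self, **options): ...

    @property
    def buckets(self): ...           # list available buckets (tables)

    def create(self, bucket, descriptor, force=False): ...
    def delete(self, bucket=None, ignore=False): ...
    def describe(self, bucket, descriptor=None): ...
    def iter(self, bucket): ...      # iterate rows of a bucket
    def read(self, bucket): ...      # read all rows of a bucket
    def write(self, bucket, rows): ...
```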
This seems like a pretty reasonable interface. Based on my usage of these libraries, I've found the following things for which I'd like a "better" or more robust solution:
This list is not all critical, just things that I've pondered recently. I don't think the existing Storage API limits any such use cases, and it seems to me that the existing API is as good a starting point as any to iterate from.
Please preserve this line to notify @roll (lead of this repository)