Intake catalogs #7

Open
AlbertDeFusco opened this issue Aug 25, 2018 · 4 comments

Comments

@AlbertDeFusco
Member

from @jbednar

I can definitely imagine passing an Intake data catalog to a REST service, allowing that service to read the indicated file(s) and process them. That way the service could handle any data source that Intake can. E.g. an Intake catalog can set things up to define a visualization of a given data source, and a REST service could then plot and display what was specified...

@AlbertDeFusco
Member Author

@martindurant and @jbednar,

Would this make sense?

# the tranquilized function
import intake

@tranquilizer('post')
def process(catalog: intake.catalog.Catalog, source: str):
    ds = catalog.walk()[source].get()

    # do something

    return {'response':'Success'}

Interact with the REST API by sending a catalog.

# the catalog.yml file
sources:
  airline_flights:
    description: Airline Flight data
    driver: parquet
    args:
      urlpath: 's3://assets.holoviews.org/data/airline_flights.parq'
      storage_options: {'anon': True}

curl -X POST -F "catalog=@catalog.yml" -d '{"source":"airline_flights"}' http://localhost:8086/process

@martindurant

Yes, something like that exactly. Presumably then a dataframe would become available on the server, to be downloaded in whichever way tranquilizer handles dataframes (csv, msgpack serialized...)

ds = catalog.walk()[source].get() -> ds = catalog[source]()

(plus you may want to allow for user parameters to be passed in the parentheses for entries that allow them)
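As a rough illustration of how user parameters could flow from the request into the entry call, here is a minimal sketch. It does not use Intake itself: `FakeEntry`, `fake_catalog`, the `year` parameter, and the `example-bucket` path are hypothetical stand-ins, chosen only so the snippet runs without Intake installed. The one line taken from the suggestion above is `catalog[source](**kwargs)`.

```python
# Stand-in for a catalog entry; NOT the real Intake class.
class FakeEntry:
    def __init__(self, urlpath_template):
        self.urlpath_template = urlpath_template

    def __call__(self, **user_parameters):
        # A real entry would build and return a data source. Here we only
        # show that the user parameters arrive and are applied.
        return self.urlpath_template.format(**user_parameters)


def process(catalog, source, **kwargs):
    # Index the catalog by source name, then call the entry with any
    # user parameters supplied by the client.
    ds = catalog[source](**kwargs)
    # do something with ds
    return {'response': 'Success', 'resolved': ds}


# Hypothetical catalog with one parameterized entry.
fake_catalog = {
    'airline_flights': FakeEntry('s3://example-bucket/{year}/flights.parq'),
}
result = process(fake_catalog, 'airline_flights', year=2018)
```

With a real catalog, the same call would only accept parameters that the entry declares, so the service would probably need to validate `kwargs` before forwarding them.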

@AlbertDeFusco
Member Author

Noted. Supporting **kwargs should be easy to implement, if it is not already working.

In the example above I wrote a catalog that referenced a remote source, but the source could also be a file that is expected to be available to the server running the process() function. Would the caching capabilities of Intake be appropriate for the use case that "on first reference to a source in a catalog, download the file to a specified format"?

@martindurant

Caching is implemented by downloading the original source files, not by converting to a standard internal format (although we have thought about that too). It should be specified as part of the data source spec, with a cache: entry, since not all remote sources will have the same caching scheme. From this point of view, I'm not sure how useful caching would be for you.

Another thing to keep in mind in this context is that tranquilizer could be a server of such specs, rather than a consumer/processor. For example, it could act as a gateway to REST services, producing the equivalent of a YAML block for a set of input parameters. I raise this here because, for a remote source, it usually makes more sense for the client to access the data directly rather than to have a server do it and forward the data. The latter case, together with tranquilizer, can make sense when the client doesn't want to install Intake locally, or doesn't have access to credentials that the server does, similar to what the Intake server also does.
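To make the "server of specs" idea a little more concrete, here is a minimal sketch of a function that turns a set of input parameters into a structure shaped like the catalog.yml shown earlier in this thread. The helper name `make_source_spec` and its signature are hypothetical, not part of tranquilizer or Intake. JSON is used for the output only to avoid a YAML dependency; a YAML library could render the same dict as a YAML block.

```python
import json


def make_source_spec(name, urlpath, driver='parquet', description='',
                     **storage_options):
    """Return a dict shaped like a catalog with a single source (hypothetical helper)."""
    args = {'urlpath': urlpath}
    if storage_options:
        # Extra keyword arguments are passed through as storage options.
        args['storage_options'] = storage_options
    return {
        'sources': {
            name: {
                'description': description,
                'driver': driver,
                'args': args,
            }
        }
    }


# Rebuild the airline_flights entry from the example catalog.
spec = make_source_spec(
    'airline_flights',
    's3://assets.holoviews.org/data/airline_flights.parq',
    description='Airline Flight data',
    anon=True,
)
spec_text = json.dumps(spec, indent=2)
```

A REST endpoint could return `spec_text` to the client, which would then open the source directly instead of having the server read and forward the data.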
