Intake catalogs #7

AlbertDeFusco · 2018-08-25T16:16:02Z

I can definitely imagine passing an Intake data catalog to a REST service, allowing that service to read the indicated file(s) and process them. That way the service could handle any data source that Intake can. E.g. an Intake catalog can set things up to define a visualization of a given data source, and a REST service could then plot and display what was specified...

AlbertDeFusco · 2018-08-25T16:21:52Z

@martindurant and @jbednar,

Would this make sense?

# the tranquilized function
import intake

@tranquilizer('post')
def process(catalog: intake.catalog.Catalog, source: str):
    ds = catalog.walk()[source].get()

    # do something

    return {'response':'Success'}

Interact with the REST API by sending a catalog.

# the catalog.yml file
sources:
  airline_flights:
    description: Airline Flight data
    driver: parquet
    args:
      urlpath: 's3://assets.holoviews.org/data/airline_flights.parq'
      storage_options: {'anon': True}

curl -X POST -F "catalog=@catalog.yml" -F "source=airline_flights" http://localhost:8086/process

martindurant · 2018-08-25T16:27:21Z

Yes, something like that exactly. Presumably then a dataframe would become available on the server, to be downloaded in whichever way tranquilizer handles dataframes (csv, msgpack serialized...)

ds = catalog.walk()[source].get() -> ds = catalog[source]()

(plus you may want to allow user parameters to be passed in the parentheses, for entries that accept them)
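
A minimal sketch of that suggestion, assuming catalog entries are callable and take user parameters as keyword arguments. The helper name `open_source`, the stand-in catalog, and the `year` parameter are hypothetical and only there so the snippet runs without Intake installed; with a real catalog the returned object would be a data source whose `read()` loads the data.

```python
def open_source(catalog, source, **user_parameters):
    """Look up `source` in `catalog` and instantiate it with optional user parameters."""
    entry = catalog[source]           # the catalog[source] lookup
    return entry(**user_parameters)   # called with no arguments when none are given


# Hypothetical stand-in for a catalog: a mapping from entry names to callables.
# `year` is an invented user parameter, not part of the airline_flights entry above.
fake_catalog = {
    "airline_flights": lambda year=None: {"name": "airline_flights", "year": year},
}

default_source = open_source(fake_catalog, "airline_flights")
filtered_source = open_source(fake_catalog, "airline_flights", year=2008)
```

Inside `process()`, any extra request arguments could be forwarded the same way, so entries without parameters keep working unchanged.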

AlbertDeFusco · 2018-08-25T16:44:23Z

Noted, **kwargs should be easily implemented if not already working.

In the example above I wrote a catalog that referenced a remote source, but the source could also be a file that is expected to be available to the server running the process() function. Would the caching capabilities of Intake be appropriate for the use case of "on first reference to a source in a catalog, download the file to a specified format"?

martindurant · 2018-08-25T16:51:18Z

Caching is implemented by downloading the original source files, not by converting to a standard internal format (although we have thought about that too). It should be specified as part of the data source spec, with a cache: entry, since not all remote sources will have the same caching scheme. From this point of view, I'm not sure how useful caching would be for you.
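
To make the "cache: entry as part of the data source spec" idea concrete, here is a rough sketch that adds such an entry to the airline_flights spec from the catalog.yml above, written as a plain Python dict. The field names `argkey`, `regex`, and `type` are an assumption based on how the Intake caching docs described file caches at the time, so check them against the Intake version in use. `with_file_cache` is a hypothetical helper, not an Intake function.

```python
import copy

# Source spec mirroring the catalog.yml entry, as a plain dict.
airline_flights = {
    "description": "Airline Flight data",
    "driver": "parquet",
    "args": {
        "urlpath": "s3://assets.holoviews.org/data/airline_flights.parq",
        "storage_options": {"anon": True},
    },
}


def with_file_cache(source_spec, regex, argkey="urlpath"):
    """Return a copy of `source_spec` with a per-source cache entry added."""
    spec = copy.deepcopy(source_spec)
    # ASSUMPTION: these keys follow the file-cache layout in Intake's caching docs.
    spec["cache"] = [{"argkey": argkey, "regex": regex, "type": "file"}]
    return spec


cached = with_file_cache(airline_flights, regex="assets.holoviews.org/data")
```

Because the entry lives on the individual source, each source in a catalog can carry its own caching scheme or none at all.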

Another thing to keep in mind in this context is that tranquilizer could be a server of such specs, rather than a consumer/processor. For example, it could act as a gateway to REST services, producing the equivalent of a YAML block for a set of input parameters. I raise this here because, for a remote source, it usually makes more sense for the client to access the data directly rather than to have a server do it and forward the data. The latter case, together with tranquilizer, can make sense when the client doesn't want to install Intake locally, or doesn't have access to credentials that the server does - similar to what the Intake server also does.
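
As a sketch of the "server of specs" direction, the function below builds the dict equivalent of a one-entry catalog block from a few input parameters. `source_spec` and its signature are hypothetical, and the output is serialized as JSON only to stay within the standard library (most YAML loaders accept JSON-style mappings, but that is an assumption about the client, not something Intake guarantees).

```python
import json


def source_spec(name, driver, urlpath, description="", **driver_args):
    """Build the dict equivalent of a one-entry catalog YAML block."""
    args = {"urlpath": urlpath, **driver_args}
    return {
        "sources": {
            name: {"description": description, "driver": driver, "args": args}
        }
    }


# Reproduce the airline_flights entry from the catalog.yml example.
spec = source_spec(
    "airline_flights",
    "parquet",
    "s3://assets.holoviews.org/data/airline_flights.parq",
    description="Airline Flight data",
    storage_options={"anon": True},
)

# Text a gateway endpoint could return to the client.
body = json.dumps(spec, indent=2)
```

A client would then open the returned spec itself and read the remote data directly, so the server never has to forward the data.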
