Dataset types should stand alone as a module #12

mahiki · 2024-08-09T13:24:51Z

When you are working only with local datastores, it's a bit clunky to have to define the connection to the Prefect API and the names of the remote and local Prefect blocks.

Until you can define flow code in Julia scripts, there's no upside to the Prefect integration, since you are writing your flow code in Python and calling a Julia process.

Example local Julia exploratory use case:

using DataFrames, UnicodePlots, PrefectInterfaces
ENV["PREFECT_API_URL"] = "http://127.0.0.1:4204/api"    # dev environment

# need to define both of these to use the `read(Dataset)` function
ENV["PREFECT_DATA_BLOCK_LOCAL"] = "local-file-system/datastore"
ENV["PREFECT_DATA_BLOCK_REMOTE"] = "s3-bucket/datastore"


dsz = Dataset(dataset_name = "my_cool_data_extract", datastore_type = "local")
dfz = read(dsz)
#   404×4 DataFrame
#   ..etc

If a 'Dataset' module (the name is already taken) could stand alone from PrefectInterfaces, you could bring it in as an extension when needed. In standalone mode, you would define the filesystem block yourself instead of calling the API URL to get it:

# proposed standalone API; this module and type do not exist yet
using Dataset: LocalDatastore
dstore = LocalDatastore()
dstore.basepath = joinpath(homedir(), "toodata/templisher/dev")

And that's all you need to find datasets on your local system. You are working in Julia outside of any Prefect orchestration.
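
To make the standalone idea concrete, here is a minimal sketch using only Base Julia. Everything in it is hypothetical: `LocalDatastore`, `DatasetMeta`, `datapath`, and the one-file-per-dataset layout are illustrative assumptions, not the current PrefectInterfaces API.

```julia
# Hypothetical sketch: none of these names exist in PrefectInterfaces today.

# A local datastore is nothing more than a base path.
Base.@kwdef struct LocalDatastore
    basepath::String
end

# A dataset is a plain metadata reference, with no Prefect block attached.
Base.@kwdef struct DatasetMeta
    dataset_name::String
    datastore_type::String = "local"
    file_format::String = "csv"
end

# Resolve a dataset to a file path under the datastore's base path.
# Assumed layout: one file per dataset, named after the dataset.
datapath(store::LocalDatastore, ds::DatasetMeta) =
    joinpath(store.basepath, string(ds.dataset_name, ".", ds.file_format))

store = LocalDatastore(basepath = joinpath(homedir(), "toodata", "templisher", "dev"))
ds = DatasetMeta(dataset_name = "my_cool_data_extract")
println(datapath(store, ds))
```

No `PREFECT_API_URL` or block names are needed here; the base path is the only piece of configuration.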

mahiki · 2024-08-28T17:51:55Z

A key part of this is that read(::Dataset) is currently defined as a read_path function attached to a Prefect block, which has a very object-oriented structure.

The way I'm using Dataset, it's just a metadata reference, mostly carrying file path locations and local/remote labels.

I do not want to define a 'dataset' with a block; the only Prefect block reference needed is the base path to the datastore.

Remove that read_path/write_path functionality. read(::Dataset) should take the data type reader as an argument, and a dataset has "csv", for example, as its data type. So the default would be CSV.read, but it can be overridden with a keyword argument or dispatched on the Dataset type somehow.
prefect_block.block.read_path(path_key): this just sucks; the function is defined as struct.function as part of the block definition.

Again, this was borrowed from the way Prefect file blocks include a read_path/write_path object method, which creates too much linkage between Prefect's internal object-oriented structure and the structure of my data application.

I guess this is called a 'leaky abstraction'.
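
A rough sketch of the reader-as-argument idea, assuming a hypothetical `TypedDataset` type that carries its format as a type parameter so the default reader can be picked by dispatch. The names are illustrative only, and Base-only stand-in readers are used so the sketch runs without extra packages; a real `:csv` default would more likely be `path -> CSV.read(path, DataFrame)`.

```julia
# Hypothetical sketch; type and function names are illustrative only.

# F is a Symbol such as :csv, carried in the type so methods can dispatch on it.
struct TypedDataset{F}
    path::String
end
TypedDataset(path::AbstractString, format::Symbol) = TypedDataset{format}(String(path))

# Default reader per format (stand-ins for CSV.read and friends).
default_reader(::TypedDataset{:csv}) = path -> [split(line, ',') for line in eachline(path)]
default_reader(::TypedDataset) = path -> read(path, String)   # fallback: raw text

# read is an ordinary method on the dataset; no block.read_path involved.
# The reader can be overridden per call with a keyword argument.
Base.read(ds::TypedDataset; reader = default_reader(ds)) = reader(ds.path)

# Usage with a throwaway file
path = joinpath(mktempdir(), "demo.csv")
write(path, "a,b\n1,2\n")
ds = TypedDataset(path, :csv)

rows = read(ds)                                  # default chosen by dispatch on :csv
raw  = read(ds; reader = p -> read(p, String))   # explicit override
```

With this shape the dataset stays a metadata reference, and the only thing a Prefect block would need to supply is the base path used to build `path`.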
