Mini-Kedro makes it possible to use Kedro's `DataCatalog` functionality on its own, so you can configure and explore data sources in a Jupyter notebook.
The `DataCatalog` lets you specify the data sources you load from and save to using a YAML API. For example:
```yaml
# conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace
```
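
For comparison, here is a minimal sketch of the first entry above built programmatically rather than from YAML. It assumes a pre-0.19 Kedro release, where `CSVDataSet` is importable from `kedro.extras.datasets.pandas` (matching the dataset names used in the YAML):

```python
# Sketch only: the first catalog entry above, defined in Python.
# Assumes pre-0.19 Kedro, where CSVDataSet lives in kedro.extras.datasets.
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog(
    {
        "example_dataset_1": CSVDataSet(filepath="folder/filepath.csv"),
    }
)

df = catalog.load("example_dataset_1")  # reads folder/filepath.csv into a pandas DataFrame
```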
Later on, you can build a full pipeline using the same configuration, with the following steps:
- Create a new empty Kedro project in a new directory:

  ```bash
  kedro new
  ```

  Let's assume that the new project is created at `/path/to/your/project`.
- Copy the `conf/` and `data/` directories over to the new project:

  ```bash
  cp -fR {conf,data} /path/to/your/project
  ```
This makes it possible to interact with data from a Jupyter notebook, for example with `df = catalog.load("example_dataset_1")` to load a dataset and `catalog.save("example_dataset_2", df_2)` to save one.
The advantage of this approach is that you never need to specify file paths for loading or saving data in your Jupyter notebook.
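
Put together, a notebook session might look like the sketch below. It assumes the notebook was launched with `kedro jupyter notebook`, so that Kedro injects a ready-made `catalog` variable; the transformation shown is a placeholder:

```python
# `catalog` is pre-populated by Kedro when the notebook is started
# with `kedro jupyter notebook`.
df = catalog.load("example_dataset_1")  # pandas DataFrame from folder/filepath.csv

df_2 = df.head(10)  # placeholder transformation for illustration

# `catalog.save` takes the dataset name plus the data to write (it returns None).
# example_dataset_2 is a spark.SparkDataSet and therefore expects a Spark
# DataFrame; here we save back to the CSV entry, overwriting the source file.
catalog.save("example_dataset_1", df_2)
```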
Create a new project using this starter:

```bash
kedro new --starter=mini-kedro
```
The starter contains:
- A `conf/` directory, which contains an example `DataCatalog` configuration (`catalog.yml`)
- A `data/` directory, which contains an example dataset identical to the one used by the `kedro-starter-pandas-iris` starter
- An example notebook showing how to instantiate the `DataCatalog` and interact with the example dataset
- A `README.md`, which explains how to use the project created by the starter and how to transition to a full Kedro project afterwards
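
For reference, the core of the example notebook looks roughly like the sketch below. It assumes a pre-0.19 Kedro release (where `ConfigLoader` accepts a list of configuration directories) and that it runs from the project root; pass a credentials dictionary to `from_config` if any catalog entry references one (such as `dev_s3` in the earlier example):

```python
# Sketch: instantiate the DataCatalog from the starter's configuration.
# Assumes pre-0.19 Kedro and the project root as the working directory.
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

conf_loader = ConfigLoader(["conf/base"])
conf_catalog = conf_loader.get("catalog*", "catalog*/**")

# Add credentials=... here if catalog.yml references any credentials.
catalog = DataCatalog.from_config(conf_catalog)
print(catalog.list())  # names of the datasets declared in catalog.yml
```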