diff --git a/README.md b/README.md
index f58de2f9..bcee2022 100755
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@

- Yaetos Project
+ Yaetos Project

@@ -16,13 +16,26 @@ Yaetos is a framework to write data pipelines on top of Pandas and Spark, and de
 - In the simplest cases, pipelines consist of SQL files only. No need to know any programming. Suitable for business intelligence use cases.
 - In more complex cases, pipelines consist of python files, giving access to Pandas, Spark dataframes, RDDs and any python library (scikit-learn, tensorflow, pytorch). Suitable for AI use cases.
+It integrates several popular open source systems:
+
+

+ Airflow
+
+ Spark
+
+ DuckDB
+
+ Pandas
+

+
+
 Some features:
  * The ability to run jobs locally and on a cluster in the cloud without any changes.
  * The support for dependencies across jobs
  * The support for incremental jobs
  * The automatic creation of AWS clusters when needed.
  * The support for git and unit-tests
- * The integration with any python library to build machine learning or other pipelines.
+ * The ability to integrate any python library in the process (ex: machine learning libraries).

 ## To try
@@ -58,18 +71,24 @@ Then, open a browser, go to `http://localhost:8888/tree/notebooks`, open [inspe
 ## Development Flow

-To write a new ETL, create a new file in [ the `jobs/` folder](jobs/) or any subfolders, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run the AWS cluster or in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run locally. To run the jobs, execute the command lines following the same patterns as above:
+To write a new ETL, create a new file in [the `jobs/` folder](jobs/) or any subfolders, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml). To run the jobs, execute the command lines following the same patterns as above:

     python jobs/generic/launcher.py --job_name=examples/some_sql_file.sql
     # or
     python jobs/examples/some_python_file.py

-And add the `--deploy=EMR` to deploy and run on an AWS cluster.
-
-You can specify dependencies in the job registry, for local jobs or on AWS cluster.
+Extra arguments:
+ * To run the job with its dependencies: add `--dependencies`
+ * To run the job in the cloud: add `--deploy=EMR`
+ * To run the job in the cloud on a schedule: add `--deploy=airflow`

 Jobs can be unit-tested using `py.test`.
 For a given job, create a corresponding job in `tests/jobs/` folder and add tests that relate to the specific business logic in this job. See [tests/jobs/ex1_frameworked_job_test.py](tests/jobs/ex1_frameworked_job_test.py) for an example.
+Depending on the parameters chosen to load the inputs (`'df_type':'pandas'` in [conf/jobs_metadata.yml](conf/jobs_metadata.yml)), the job will use:
+ * Spark: for big-data use cases in SQL and python
+ * DuckDB and Pandas: for normal-data use cases in SQL
+ * Pandas: for normal-data use cases in python
+
 ## Unit-testing

 ... is done using `py.test`. Run them with:
@@ -108,7 +127,6 @@ The status of the job can be monitored in AWS in the EMR section.
 ## Potential improvements
  * more unit-testing
- * integration with other scheduling tools (airflow...)
  * integration with other resource provisioning tools (kubernetes...)
  * adding type annotations to code and type checks to CI
  * automatic pulling/pushing data from s3 to local (sampled) for local development
diff --git a/docs/images/AirflowLogo.png b/docs/images/AirflowLogo.png
new file mode 100644
index 00000000..4df5f392
Binary files /dev/null and b/docs/images/AirflowLogo.png differ
diff --git a/docs/images/Apache_Spark_logo.svg.png b/docs/images/Apache_Spark_logo.svg.png
new file mode 100644
index 00000000..d719358f
Binary files /dev/null and b/docs/images/Apache_Spark_logo.svg.png differ
diff --git a/docs/images/DuckDB_Logo.png b/docs/images/DuckDB_Logo.png
new file mode 100644
index 00000000..8691c6f5
Binary files /dev/null and b/docs/images/DuckDB_Logo.png differ
diff --git a/docs/images/Pandas_logo.svg.png b/docs/images/Pandas_logo.svg.png
new file mode 100644
index 00000000..15615146
Binary files /dev/null and b/docs/images/Pandas_logo.svg.png differ
diff --git a/docs/images/logo_full_2_transp.png b/docs/images/logo_full_2_transp.png
new file mode 100644
index 00000000..67c06373
Binary files /dev/null and b/docs/images/logo_full_2_transp.png differ
diff --git a/docs/index.rst b/docs/index.rst
index ce7fe830..4a0ce320 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,7 +1,7 @@
 Welcome to Yaetos' documentation!
 =================================
-.. image:: ./images/logo_full.jpeg
+.. image:: ./images/logo_full_2_transp.png
    :width: 300
    :alt: Logo
    :align: center
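The README hunk above says each job under `jobs/` gets a matching test module under `tests/jobs/` that exercises its business logic with `py.test`. A minimal sketch of such a test is below; the `transform` function and its columns are hypothetical stand-ins for a job's logic, not part of the Yaetos API:

```python
# Sketch of a business-logic unit test in the spirit of
# tests/jobs/ex1_frameworked_job_test.py. The transform() function is a
# hypothetical example of job logic; real tests would import it from the job.
import pandas as pd

def transform(df):
    # Hypothetical job logic: keep active rows, add a doubled-value column.
    out = df[df["active"]].copy()
    out["value_x2"] = out["value"] * 2
    return out

def test_transform():
    df_in = pd.DataFrame({"active": [True, False, True], "value": [1, 2, 3]})
    df_out = transform(df_in)
    assert list(df_out["value_x2"]) == [2, 6]  # inactive row dropped, values doubled
    assert len(df_out) == 2
```

Testing plain pandas logic this way keeps the tests fast and independent of Spark, matching the `'df_type':'pandas'` mode described above.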