Merge pull request #95 from arthurprevot/readme

Updated Readme
arthurprevot authored Dec 2, 2023
2 parents bc669fe + af50e35 commit 307df02
Showing 7 changed files with 26 additions and 8 deletions.
32 changes: 25 additions & 7 deletions README.md
@@ -1,5 +1,5 @@
<p align="center">
<img src="./docs/images/logo_full.jpeg" alt="Yaetos Project" width="300" height="auto"/>
<img src="./docs/images/logo_full_2_transp.png" alt="Yaetos Project" width="300" height="auto"/>
</p>

<div align="center">
@@ -16,13 +16,26 @@ Yaetos is a framework to write data pipelines on top of Pandas and Spark, and de
- In the simplest cases, pipelines consist of SQL files only. No need to know any programming. Suitable for business intelligence use cases.
- In more complex cases, pipelines consist of python files, giving access to Pandas, Spark dataframes, RDDs and any python library (scikit-learn, tensorflow, pytorch). Suitable for AI use cases.

It integrates several popular open-source systems:

<p align="center">
<img src="./docs/images/AirflowLogo.png" alt="Airflow" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/Apache_Spark_logo.svg.png" alt="Spark" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/DuckDB_Logo.png" alt="DuckDB" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/Pandas_logo.svg.png" alt="Pandas" style="width:15%" />
</p>
<!-- CSS options for above, to be tested: style="width:auto;height:50px;margin-right:20px" -->

Some features:
* The ability to run jobs locally and on a cluster in the cloud without any changes.
* The support for dependencies across jobs.
* The support for incremental jobs.
* The automatic creation of AWS clusters when needed.
* The support for git and unit-tests.
- * The integration with any python library to build machine learning or other pipelines.
+ * The ability to integrate any Python library in the process (e.g. machine learning libraries).

## To try

@@ -58,18 +71,24 @@ Then, open a browser, go to `http://localhost:8888/tree/notebooks`, open [inspe

## Development Flow

- To write a new ETL, create a new file in [ the `jobs/` folder](jobs/) or any subfolders, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run the AWS cluster or in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run locally. To run the jobs, execute the command lines following the same patterns as above:
+ To write a new ETL, create a new file in [the `jobs/` folder](jobs/) or any subfolder, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs, and its output path in [conf/jobs_metadata.yml](conf/jobs_metadata.yml). To run the job, execute commands following the same patterns as above:

```
python jobs/generic/launcher.py --job_name=examples/some_sql_file.sql
# or
python jobs/examples/some_python_file.py
```
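
For reference, a minimal Python job could look like the sketch below, modeled on the examples in the `jobs/` folder and assuming the `ETL_Base` and `Commandliner` helpers from `yaetos.etl_utils`; the input name and filter field are hypothetical and would need to match the job's entry in [conf/jobs_metadata.yml](conf/jobs_metadata.yml):

```python
from yaetos.etl_utils import ETL_Base, Commandliner

class Job(ETL_Base):
    def transform(self, some_events):
        # 'some_events' is loaded by the framework, based on the inputs
        # registered for this job in conf/jobs_metadata.yml.
        df = some_events[some_events['session_length'] > 0]  # hypothetical field
        return df  # written by the framework to the registered output location

if __name__ == '__main__':
    Commandliner(Job)
```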

- And add the `--deploy=EMR` to deploy and run on an AWS cluster.
- You can specify dependencies in the job registry, for local jobs or on AWS cluster.
+ Extra arguments:
+ * To run the job with its dependencies: add `--dependencies`
+ * To run the job in the cloud: add `--deploy=EMR`
+ * To run the job in the cloud on a schedule: add `--deploy=airflow`
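
For example, a hypothetical invocation combining these flags: `python jobs/examples/some_python_file.py --dependencies --deploy=EMR`.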

Jobs can be unit-tested using `py.test`. For a given job, create a corresponding test file in the `tests/jobs/` folder and add tests that relate to the specific business logic of that job. See [tests/jobs/ex1_frameworked_job_test.py](tests/jobs/ex1_frameworked_job_test.py) for an example.
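
A hedged sketch of such a test, exercising the transform as a pure function on a small in-memory Pandas dataframe (the job module, field name, and expected values are hypothetical; the existing tests in `tests/jobs/` show the exact setup used by the repo):

```python
import pandas as pd
from jobs.examples.some_python_file import Job  # hypothetical job module

class TestJob:
    def test_transform(self):
        # Small in-memory input, instead of the inputs registered in the job registry.
        some_events = pd.DataFrame({'session_length': [0, 5, 12]})
        # Assumption: the job can be instantiated without framework arguments
        # when transform() is a pure function of its input dataframes.
        df = Job().transform(some_events=some_events)
        assert len(df) == 2  # only the rows with session_length > 0 remain
```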

Depending on the parameters chosen to load the inputs (e.g. `'df_type':'pandas'` in [conf/jobs_metadata.yml](conf/jobs_metadata.yml)), the job will use:
* Spark: for big-data use cases in SQL and Python
* DuckDB and Pandas: for normal-data use cases in SQL
* Pandas: for normal-data use cases in Python
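
In practice this means a `.sql` job can typically run unchanged on either engine, with the `df_type` registry entry deciding whether Spark or DuckDB executes it, while Python jobs use the dataframe API of the chosen engine.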

## Unit-testing
... is done using `py.test`. Run the tests with:

@@ -108,7 +127,6 @@ The status of the job can be monitored in AWS in the EMR section.
## Potential improvements

* more unit-testing
- * integration with other scheduling tools (airflow...)
* integration with other resource provisioning tools (kubernetes...)
* adding type annotations to code and type checks to CI
* automatic pulling/pushing data from s3 to local (sampled) for local development
Binary file added docs/images/AirflowLogo.png
Binary file added docs/images/Apache_Spark_logo.svg.png
Binary file added docs/images/DuckDB_Logo.png
Binary file added docs/images/Pandas_logo.svg.png
Binary file added docs/images/logo_full_2_transp.png
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -1,7 +1,7 @@
Welcome to Yaetos' documentation!
=================================

- .. image:: ./images/logo_full.jpeg
+ .. image:: ./images/logo_full_2_transp.png
:width: 300
:alt: Logo
:align: center