
Updated Readme #95

Merged 6 commits on Dec 2, 2023
README.md (32 changes: 25 additions & 7 deletions)
@@ -1,5 +1,5 @@
<p align="center">
<img src="./docs/images/logo_full.jpeg" alt="Yaetos Project" width="300" height="auto"/>
<img src="./docs/images/logo_full_2_transp.png" alt="Yaetos Project" width="300" height="auto"/>
</p>

<div align="center">
@@ -16,13 +16,26 @@ Yaetos is a framework to write data pipelines on top of Pandas and Spark, and de
- In the simplest cases, pipelines consist of SQL files only. No need to know any programming. Suitable for business intelligence use cases.
- In more complex cases, pipelines consist of python files, giving access to Pandas, Spark dataframes, RDDs and any python library (scikit-learn, tensorflow, pytorch). Suitable for AI use cases.

+It integrates several popular open source systems:
+
+<p align="center">
+<img src="./docs/images/AirflowLogo.png" alt="Airflow" style="width:15%" />
+&nbsp;&nbsp;&nbsp;&nbsp;
+<img src="./docs/images/Apache_Spark_logo.svg.png" alt="Spark" style="width:15%" />
+&nbsp;&nbsp;&nbsp;&nbsp;
+<img src="./docs/images/DuckDB_Logo.png" alt="DuckDB" style="width:15%" />
+&nbsp;&nbsp;&nbsp;&nbsp;
+<img src="./docs/images/Pandas_logo.svg.png" alt="Pandas" style="width:15%" />
+</p>
+<!-- CSS options for above, to be tested: style="width:auto;height:50px;margin-right:20px" -->

Some features:
* The ability to run jobs locally and on a cluster in the cloud without any changes.
* The support for dependencies across jobs
* The support for incremental jobs
* The automatic creation of AWS clusters when needed.
* The support for git and unit-tests
-* The integration with any python library to build machine learning or other pipelines.
+* The ability to integrate any python library in the process (ex: machine learning libraries).

## To try

@@ -58,18 +71,24 @@ Then, open a browser, go to `http://localhost:8888/tree/notebooks`, open [inspe

## Development Flow

-To write a new ETL, create a new file in [ the `jobs/` folder](jobs/) or any subfolders, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run the AWS cluster or in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run locally. To run the jobs, execute the command lines following the same patterns as above:
+To write a new ETL, create a new file in [the `jobs/` folder](jobs/) or any subfolder, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs, and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml). To run the jobs, execute the command lines following the same patterns as above:

python jobs/generic/launcher.py --job_name=examples/some_sql_file.sql
# or
python jobs/examples/some_python_file.py
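
For the Python case, a job file typically subclasses the framework's ETL base class and implements a `transform` method. A minimal sketch, assuming the `ETL_Base`/`Commandliner` pattern from the repo's example jobs and a hypothetical input named `some_events`:

```python
from yaetos.etl_utils import ETL_Base, Commandliner


class Job(ETL_Base):
    def transform(self, some_events):
        # 'some_events' is loaded by the framework, per the inputs
        # registered for this job in conf/jobs_metadata.yml.
        df = some_events  # placeholder for the actual business logic
        return df  # the returned dataframe is saved to the registered output


if __name__ == "__main__":
    Commandliner(Job)  # parses CLI flags (e.g. --deploy=EMR) and runs the job
```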

-And add the `--deploy=EMR` to deploy and run on an AWS cluster.
-
-You can specify dependencies in the job registry, for local jobs or on AWS cluster.
+Extra arguments (combined in the example below):
+* To run the job with its dependencies: add `--dependencies`
+* To run the job in the cloud: add `--deploy=EMR`
+* To run the job in the cloud on a schedule: add `--deploy=airflow`
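
For instance, to run a hypothetical job with its dependencies resolved and deployed to an AWS cluster, the flags combine as:

python jobs/examples/some_python_file.py --dependencies --deploy=EMR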

Jobs can be unit-tested using `py.test`. For a given job, create a corresponding test file in the `tests/jobs/` folder and add tests that relate to the specific business logic in this job. See [tests/jobs/ex1_frameworked_job_test.py](tests/jobs/ex1_frameworked_job_test.py) for an example.
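
As an illustration, a minimal test can build inputs in memory and check the transform's output directly. A sketch, assuming a hypothetical job module and a Pandas-based `transform` (the framework's example tests may rely on dedicated helpers instead):

```python
import pandas as pd

from jobs.examples.some_python_file import Job  # hypothetical job module


def test_transform():
    # Build a small input frame inline instead of loading registered inputs.
    some_events = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
    actual = Job().transform(some_events)  # exact instantiation may differ
    expected = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
    pd.testing.assert_frame_equal(actual, expected)
```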

+Depending on the parameters chosen to load the inputs (`'df_type':'pandas'` in [conf/jobs_metadata.yml](conf/jobs_metadata.yml), as sketched below), the job will use:
+* Spark: for big-data use cases in SQL and python
+* DuckDB and Pandas: for normal-data use cases in SQL
+* Pandas: for normal-data use cases in python
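
For reference, a registry entry selecting the Pandas engine could look like the sketch below; the job name and paths are hypothetical, and all keys besides `df_type` are assumptions modeled on the repo's example entries in [conf/jobs_metadata.yml](conf/jobs_metadata.yml):

```yaml
examples/some_python_file.py:
  py_job: jobs/examples/some_python_file.py  # assumed key naming
  inputs:
    some_events: {path: data/some_events.csv, type: csv}
  output: {path: data/output/, type: csv}
  df_type: pandas  # engine selection, per the list above
```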

## Unit-testing
... is done using `py.test`. Run them with:

@@ -108,7 +127,6 @@ The status of the job can be monitored in AWS in the EMR section.
## Potential improvements

* more unit-testing
-* integration with other scheduling tools (airflow...)
* integration with other resource provisioning tools (kubernetes...)
* adding type annotations to code and type checks to CI
* automatic pulling/pushing data from s3 to local (sampled) for local development
Binary file added docs/images/AirflowLogo.png
Binary file added docs/images/Apache_Spark_logo.svg.png
Binary file added docs/images/DuckDB_Logo.png
Binary file added docs/images/Pandas_logo.svg.png
Binary file added docs/images/logo_full_2_transp.png
docs/index.rst (2 changes: 1 addition & 1 deletion)
@@ -1,7 +1,7 @@
Welcome to Yaetos' documentation!
=================================

-.. image:: ./images/logo_full.jpeg
+.. image:: ./images/logo_full_2_transp.png
:width: 300
:alt: Logo
:align: center