Merge pull request #95 from arthurprevot/readme

Updated Readme
arthurprevot authored Dec 2, 2023
2 parents bc669fe + af50e35 commit 307df02
Showing 7 changed files with 26 additions and 8 deletions.
32 changes: 25 additions & 7 deletions README.md
@@ -1,5 +1,5 @@
<p align="center">
<img src="./docs/images/logo_full.jpeg" alt="Yaetos Project" width="300" height="auto"/>
<img src="./docs/images/logo_full_2_transp.png" alt="Yaetos Project" width="300" height="auto"/>
</p>

<div align="center">
@@ -16,13 +16,26 @@ Yaetos is a framework to write data pipelines on top of Pandas and Spark, and de
- In the simplest cases, pipelines consist of SQL files only. No need to know any programming. Suitable for business intelligence use cases.
- In more complex cases, pipelines consist of python files, giving access to Pandas, Spark dataframes, RDDs and any python library (scikit-learn, tensorflow, pytorch). Suitable for AI use cases.

It integrates several popular open-source systems:

<p align="center">
<img src="./docs/images/AirflowLogo.png" alt="Airflow" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/Apache_Spark_logo.svg.png" alt="Spark" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/DuckDB_Logo.png" alt="DuckDB" style="width:15%" />
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="./docs/images/Pandas_logo.svg.png" alt="Pandas" style="width:15%" />
</p>
<!-- CSS options for above, to be tested: style="width:auto;height:50px;margin-right:20px" -->

Some features:
* The ability to run jobs locally and on a cluster in the cloud without any changes.
* The support for dependencies across jobs.
* The support for incremental jobs.
* The automatic creation of AWS clusters when needed.
* The support for git and unit-tests.
- * The integration with any python library to build machine learning or other pipelines.
+ * The ability to integrate any Python library in the process (e.g. machine learning libraries).

## To try

@@ -58,18 +71,24 @@ Then, open a browser, go to `http://localhost:8888/tree/notebooks`, open [inspe

## Development Flow

- To write a new ETL, create a new file in [ the `jobs/` folder](jobs/) or any subfolders, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs and output path locations in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run the AWS cluster or in [conf/jobs_metadata.yml](conf/jobs_metadata.yml) to run locally. To run the jobs, execute the command lines following the same patterns as above:
+ To write a new ETL, create a new file in [the `jobs/` folder](jobs/) or any subfolder, either a `.sql` file or a `.py` file, following the examples from that same folder, and register that job, its inputs, and its output path in [conf/jobs_metadata.yml](conf/jobs_metadata.yml). To run the job, execute commands following the same patterns as above:

```
python jobs/generic/launcher.py --job_name=examples/some_sql_file.sql
# or
python jobs/examples/some_python_file.py
```
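
For reference, a minimal Python job could look like the sketch below, modeled on the examples in the `jobs/` folder and assuming the `ETL_Base` and `Commandliner` helpers from `yaetos.etl_utils`; the input name and filter field are hypothetical and would need to match the job's entry in [conf/jobs_metadata.yml](conf/jobs_metadata.yml):

```python
from yaetos.etl_utils import ETL_Base, Commandliner

class Job(ETL_Base):
    def transform(self, some_events):
        # 'some_events' is loaded by the framework, based on the inputs
        # registered for this job in conf/jobs_metadata.yml.
        df = some_events[some_events['session_length'] > 0]  # hypothetical field
        return df  # written by the framework to the registered output location

if __name__ == '__main__':
    Commandliner(Job)
```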

- And add the `--deploy=EMR` to deploy and run on an AWS cluster.
- You can specify dependencies in the job registry, for local jobs or on AWS cluster.
+ Extra arguments:
+ * To run the job with its dependencies: add `--dependencies`
+ * To run the job in the cloud: add `--deploy=EMR`
+ * To run the job in the cloud on a schedule: add `--deploy=airflow`
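
For example, a hypothetical invocation combining these flags: `python jobs/examples/some_python_file.py --dependencies --deploy=EMR`.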

Jobs can be unit-tested using `py.test`. For a given job, create a corresponding test file in the `tests/jobs/` folder and add tests that relate to the specific business logic of that job. See [tests/jobs/ex1_frameworked_job_test.py](tests/jobs/ex1_frameworked_job_test.py) for an example.
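
A hedged sketch of such a test, exercising the transform as a pure function on a small in-memory Pandas dataframe (the job module, field name, and expected values are hypothetical; the existing tests in `tests/jobs/` show the exact setup used by the repo):

```python
import pandas as pd
from jobs.examples.some_python_file import Job  # hypothetical job module

class TestJob:
    def test_transform(self):
        # Small in-memory input, instead of the inputs registered in the job registry.
        some_events = pd.DataFrame({'session_length': [0, 5, 12]})
        # Assumption: the job can be instantiated without framework arguments
        # when transform() is a pure function of its input dataframes.
        df = Job().transform(some_events=some_events)
        assert len(df) == 2  # only the rows with session_length > 0 remain
```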

Depending on the parameters chosen to load the inputs (e.g. `'df_type':'pandas'` in [conf/jobs_metadata.yml](conf/jobs_metadata.yml)), the job will use:
* Spark: for big-data use cases in SQL and Python
* DuckDB and Pandas: for normal-data use cases in SQL
* Pandas: for normal-data use cases in Python
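
In practice this means a `.sql` job can typically run unchanged on either engine, with the `df_type` registry entry deciding whether Spark or DuckDB executes it, while Python jobs use the dataframe API of the chosen engine.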

## Unit-testing
... is done using `py.test`. Run the tests with:

@@ -108,7 +127,6 @@ The status of the job can be monitored in AWS in the EMR section.
## Potential improvements

* more unit-testing
- * integration with other scheduling tools (airflow...)
* integration with other resource provisioning tools (kubernetes...)
* adding type annotations to code and type checks to CI
* automatic pulling/pushing data from s3 to local (sampled) for local development
Binary file added docs/images/AirflowLogo.png
Binary file added docs/images/Apache_Spark_logo.svg.png
Binary file added docs/images/DuckDB_Logo.png
Binary file added docs/images/Pandas_logo.svg.png
Binary file added docs/images/logo_full_2_transp.png
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -1,7 +1,7 @@
Welcome to Yaetos' documentation!
=================================

- .. image:: ./images/logo_full.jpeg
+ .. image:: ./images/logo_full_2_transp.png
:width: 300
:alt: Logo
:align: center