The project uses Poetry to manage dependencies.
You should have Poetry installed on your local machine; you can follow the instructions at https://python-poetry.org.
Once Poetry is installed, install the project's dependencies by running:
poetry install
This project uses ruff for formatting and linting. Run
poetry run ruff format
to format the code, and
poetry run ruff check
to lint it.
Please install a PostgreSQL database locally. For example, on macOS, run:
brew install postgresql
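After installing, you may need to start the service before the application can connect (standard Homebrew usage):
brew services start postgresql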
Note: If any errors occur during pytest runs, please stop the local database service by executing:
brew services stop postgresql
Tests are written with pytest. Run them with:
poetry run pytest -v
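To run a subset of the tests, you can use pytest's standard -k filter (the test name below is a placeholder):
poetry run pytest -v -k "name_of_the_test"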
The OpenAPI specs are available at http://localhost:9000/docs.
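If you need the raw specification rather than the interactive page, services that serve their docs this way usually also expose it as JSON; assuming the conventional path (verify it for this project):
curl http://localhost:9000/openapi.json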
We use alembic to manage database migrations.
Database migrations are applied automatically by alembic through the docker-compose file: it connects to the database and applies the migrations defined in the alembic folder.
We use sqlalchemy to define the database tables. All table classes are stored in the ./app/db/tables/ folder.
If you have updated a class inside that folder, you can use the docker-compose file to generate the new migration file with the alembic command:
docker compose run migrations /bin/sh -c "alembic revision --autogenerate -m 'GIVE A NAME TO THIS REVISION'"
This will output a new migration file that will then be applied by the migrations Docker image.
If you need to create a new table in the database, create a new Python class in ./app/db/tables/ (a sketch is shown after the note below) and run the same command described above:
docker compose run migrations /bin/sh -c "alembic revision --autogenerate -m 'GIVE A NAME TO THIS REVISION'"
Note: alembic only picks up the BaseTables classes that are imported in the env.py file. If you add a new file for a table, be sure to import the class in env.py.
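For illustration, here is a minimal sketch of what such a table class might look like. The import path of the base class, the module name, and the columns are assumptions; match them to the existing files in ./app/db/tables/:

```python
# ./app/db/tables/widget.py -- hypothetical example table
from sqlalchemy import Column, Integer, String

# Assumed location of the project's declarative base; check the existing table files.
from app.db.tables.base import BaseTables


class Widget(BaseTables):
    __tablename__ = "widgets"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
```

And the corresponding import that makes the new class visible to alembic's autogenerate:

```python
# alembic/env.py -- import the new class so autogenerate can detect it
from app.db.tables.widget import Widget  # noqa: F401
```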
By default, the PostgreSQL schema used in the platform and in the migration files is public. If you want to store data in another schema, you need to modify the schema environment variables in the docker-compose.yaml accordingly, and you have to either manually modify the migration scripts or re-create the migrations with the commands above.
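As a sketch, the relevant part of docker-compose.yaml might look like the following; the variable name POSTGRES_SCHEMA is purely an assumption here, so use whichever schema variable the compose file actually defines:

```yaml
services:
  migrations:
    environment:
      # Hypothetical variable name -- match the one used in docker-compose.yaml
      POSTGRES_SCHEMA: my_schema
```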
When a reference or current dataset file is uploaded, the platform calculates all available metrics using a Spark job deployed in a Kubernetes cluster. The platform uses the spark-on-k8s library to launch Spark jobs.
A Spark job consists of two main components:
- Spark Driver: Manages the job execution
- Spark Executors: Instances that perform the actual calculations
In the platform, the default resources configuration for Spark jobs is as follows:
- Spark Driver CPUs: 1
- Spark Driver Memory: 1024 MB
- Spark Driver Memory overhead: 512 MB
- Spark Executor CPUs: 1
- Spark Executor Memory: 1024 MB
- Spark Executor Memory overhead: 512 MB
- Spark Executor Initial instances: 2
- Spark Executor Min instances: 2
- Spark Executor Max instances: 2
To handle larger files or to speed up computations, you can adjust the following environment variables in the backend container (the values shown are the defaults):
SPARK_ON_K8S_DRIVER_CPU: 1
SPARK_ON_K8S_DRIVER_MEMORY: 1024
SPARK_ON_K8S_DRIVER_MEMORY_OVERHEAD: 512
SPARK_ON_K8S_EXECUTOR_CPU: 1
SPARK_ON_K8S_EXECUTOR_MEMORY: 1024
SPARK_ON_K8S_EXECUTOR_MEMORY_OVERHEAD: 512
SPARK_ON_K8S_EXECUTOR_INITIAL_INSTANCES: 2
SPARK_ON_K8S_EXECUTOR_MIN_INSTANCES: 2
SPARK_ON_K8S_EXECUTOR_MAX_INSTANCES: 2
Adjust these variables as needed to allocate more resources or modify the number of executor instances for your specific use case.
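For example, assuming the backend service in docker-compose.yaml is named backend, a sketch that gives each executor more memory and lets the job scale out might look like this (the values are illustrative, not recommendations):

```yaml
services:
  backend:
    environment:
      SPARK_ON_K8S_EXECUTOR_MEMORY: "4096"           # 4 GB per executor
      SPARK_ON_K8S_EXECUTOR_MEMORY_OVERHEAD: "1024"
      SPARK_ON_K8S_EXECUTOR_MIN_INSTANCES: "2"
      SPARK_ON_K8S_EXECUTOR_MAX_INSTANCES: "6"       # allow scaling up to 6 executors
```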
If for some reason you need to add annotations to the driver and executor pods, you can set the following environment variables in the backend container, passing valid JSON as in the example:
SPARK_ON_K8S_SPARK_DRIVER_ANNOTATIONS: '{"my.annotation/driver": "my-value"}'
SPARK_ON_K8S_SPARK_EXECUTOR_ANNOTATIONS: '{"my.annotation/executor": "my-value"}'