Evaluation engine #6

Merged
merged 11 commits into from
Jan 17, 2025
8 changes: 8 additions & 0 deletions .gitignore
@@ -1,2 +1,10 @@
/.env
.idea
*.log
.cache
data/php/dataset_structured
data/php/run_structured
data/php/log

config/email-server/imapsql.db-shm
config/email-server/imapsql.db-wal
99 changes: 91 additions & 8 deletions README.md
@@ -6,12 +6,21 @@ Overview of all OpenML components including a docker-compose to run OpenML servi
![OpenML Component overview](https://raw.githubusercontent.com/openml/services/main/documentation/OpenML-overview.png)

## Prerequisites
- Linux/MacOS/Windows (should all work)
- Linux/MacOS with an Intel processor (because of our old ES version, this project currently does not support `arm` architectures)
- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/install/) version 2.21.0 or higher

## Usage

When using this project for the first time, run:
```bash
chown -R www-data:www-data data/php
# Or, if previous fails, for instance because `www-data` does not exist:
chmod -R 777 data/php
```
This ensures that you can upload datasets, tasks and runs. The dataset data is meant to be public anyway, so mode `777` should not be problematic. This step will no longer be necessary once the backend stores its files on MinIO.


You can run all OpenML services locally using
```bash
docker compose --profile all up -d
@@ -25,10 +34,11 @@ docker compose --profile all down
You can use different profiles:

- `[no profile]`: databases
- `"elasticsearch"`: databases + elasticsearch
- `"rest-api"`: databases + elasticsearch + REST API
- `"frontend"`: databases + elasticsearch + REST API + frontend + email-server
- `"minio"`: database + elasticsearch + REST APP + MinIO + parquet and croissant conversion
- `"elasticsearch"`: databases + nginx + elasticsearch
- `"rest-api"`: databases + nginx + elasticsearch + REST API
- `"frontend"`: databases + nginx + elasticsearch + REST API + frontend + email-server
- `"minio"`: databases + nginx + elasticsearch + REST API + MinIO + parquet and croissant conversion
- `"evaluation-engine"`: databases + nginx + elasticsearch + REST API + MinIO + evaluation engine
- `"all"`: everything
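
The profile names above map directly onto `docker compose` invocations. The following is a minimal sketch, not part of this repository, of a helper that builds the command for a given profile; `VALID_PROFILES` and `compose_command` are hypothetical names, and the sketch assumes that omitting `--profile` starts only the databases, as the list states.

```python
# Profile names taken from the list above; None stands for "[no profile]".
VALID_PROFILES = {
    "elasticsearch",
    "rest-api",
    "frontend",
    "minio",
    "evaluation-engine",
    "all",
}


def compose_command(profile=None, action="up"):
    """Build the docker compose argument list for a profile (hypothetical helper)."""
    if profile is not None and profile not in VALID_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    if action not in ("up", "down"):
        raise ValueError(f"unknown action: {action!r}")

    command = ["docker", "compose"]
    if profile is not None:
        command += ["--profile", profile]
    # "up" is started detached, matching `docker compose --profile all up -d`.
    command += ["up", "-d"] if action == "up" else ["down"]
    return command
```

The returned list can be passed to `subprocess.run(...)`; building a list rather than a string avoids shell-quoting issues.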

Usage examples:
Expand All @@ -53,12 +63,12 @@ docker exec -it openml-php-rest-api /bin/bash # go into the php rest api conta

## Endpoints
> [!TIP]
> If you change any port, make sure to change it for all services! The elasticsearch config, for instance, needs to know the port of the frontend (for CORS).
> If you change any port, make sure to change it for all services!

When you spin up the docker-compose, you'll get these endpoints:
- *Frontend*: localhost:5000
- *Frontend*: localhost:8000
- *Database*: localhost:3306, filled with test data.
- *ElasticSearch*: localhost:9200, filled with test data.
- *ElasticSearch*: localhost:9200 or localhost:8000/es, filled with test data.
- *Rest API*: localhost:8080
- *Minio*: console at localhost:9001, filled with test data.
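
To confirm that the services are actually listening after start-up, a quick TCP check can help. This is a minimal sketch under the assumption that the default ports listed above are unchanged; the names `ENDPOINTS`, `is_port_open` and `check_endpoints` are hypothetical and not part of this repository.

```python
import socket

# Host/port pairs as listed above; the keys are just labels for this sketch.
ENDPOINTS = {
    "frontend": ("localhost", 8000),
    "database": ("localhost", 3306),
    "elasticsearch": ("localhost", 9200),
    "rest-api": ("localhost", 8080),
    "minio-console": ("localhost", 9001),
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints=ENDPOINTS):
    """Map each endpoint label to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in endpoints.items()}
```

Calling `check_endpoints()` returns a dictionary of booleans. An open port only shows that something is listening, not that the service has finished loading its test data.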

@@ -104,5 +114,78 @@ FRONTEND_CODE_DIR=/path/to/openml.org # Python directory of https://githu
FRONTEND_APP=/app # Always set this to /app. Leave empty if you leave FRONTEND_CODE_DIR empty
```

### Python

You can run the openml-python code on your own local server now!

```bash
docker run --rm -it -v ./config/python/config:/root/.config/openml/config:ro --network openml-services openml/openml-python
```


For an example of manual tests, you can run:
```python

import openml
from openml.tasks import TaskType
from openml.datasets.functions import create_dataset
import pandas as pd
import numpy as np


df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
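# Note: np.random.randint(0, 1) always returns 0 (the upper bound is exclusive), so every row is labelled "test".
# This single-class target is consistent with the perfect scores in the expected output below.
df["class"] = ["test" if np.random.randint(0, 1) == 0 else "test2" for _ in range(100)]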
df["class"] = df["class"].astype("category")

dataset = create_dataset(
    name="test_dataset",
    description="test",
    creator="I",
    contributor=None,
    collection_date="now",
    language="en",
    attributes="auto",
    ignore_attribute=None,
    citation="citation",
    licence="BSD (from scikit-learn)",
    default_target_attribute="class",
    data=df,
    version_label="test",
    original_data_url="https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html",
    paper_url="url",
)
dataset.publish()

# Meanwhile you can admire your newly created dataset at http://localhost:8000/search?type=data&id=[dataset.id]
# Wait a minute until dataset is active

my_task = openml.tasks.create_task(
    task_type=TaskType.SUPERVISED_CLASSIFICATION,
    dataset_id=dataset.id,
    target_name="class",
    evaluation_measure="predictive_accuracy",
    estimation_procedure_id=1,
)
my_task.publish()

# Wait a minute so that the dataset and task are both processed by the evaluation engine.
# The evaluation engine runs every minute.
# Meanwhile, you can check out the newly created task at localhost:8000/search?type=task&id=[my_task.id]

my_task = openml.tasks.get_task(my_task.task_id)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
run = openml.runs.run_model_on_task(clf, my_task)
run.publish()

# Wait a minute so that the run is processed by the evaluation engine

run = openml.runs.get_run(run.id, ignore_cache=True)
run.evaluations

# Expected: {'average_cost': 0.0, 'f_measure': 1.0, 'kappa': 1.0, 'mean_absolute_error': 0.0, 'mean_prior_absolute_error': 0.0, 'number_of_instances': 100.0, 'precision': 1.0, 'predictive_accuracy': 1.0, 'prior_entropy': 0.0, 'recall': 1.0, 'root_mean_prior_squared_error': 0.0, 'root_mean_squared_error': 0.0, 'total_cost': 0.0}
```
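
The "wait a minute" steps in the example above can be replaced by polling. The following is a minimal sketch of a generic polling helper; `wait_until`, `timeout` and `interval` are hypothetical names introduced here, not part of openml-python.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 180.0, interval: float = 5.0) -> bool:
    """Call predicate repeatedly until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible usage with the calls from the example above (assumes `run` was published):
#   wait_until(lambda: bool(openml.runs.get_run(run.id, ignore_cache=True).evaluations))
```

Because the evaluation engine's cron jobs run once per minute, a timeout of a few minutes is a reasonable starting point; the defaults here are a guess, not a measured value.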


### Other services
If you want to develop a service that depends on any of the services in this docker-compose, just bring up this docker-compose and point your service to the correct endpoints.
1 change: 0 additions & 1 deletion config/arff-to-pq-converter/.env
@@ -2,5 +2,4 @@ MINIO_SERVER=minio:9000
MINIO_ACCESS_KEY=admin
MINIO_SECRET_KEY=adminpassword
MINIO_SECURE=False
OPENML_SERVER=http://php-api:80/api/v1/xml
OPENML_DATASET_OFFSET=0
2 changes: 1 addition & 1 deletion config/arff-to-pq-converter/Dockerfile
@@ -1,4 +1,4 @@
from openml/arff-to-pq-to-minio
FROM openml/openml-arff-to-pq:v1.0.20241110

COPY cron /etc/cron.d/openml
COPY run-cron.sh /run-cron.sh
1 change: 1 addition & 0 deletions config/arff-to-pq-converter/config
@@ -0,0 +1 @@
server=http://nginx:80/api/v1/xml
2 changes: 1 addition & 1 deletion config/arff-to-pq-converter/cron
@@ -1 +1 @@
* * * * * /usr/local/bin/python /app/main.py --latest >> /home/unprivileged-user/cron.log 2>&1
* * * * * /usr/local/bin/python /app/main.py --latest >> /home/unprivileged-user/logs/cron.log 2>&1
2 changes: 1 addition & 1 deletion config/arff-to-pq-converter/run-cron.sh
@@ -3,5 +3,5 @@
printenv | grep -v HOME >> /etc/environment

touch /home/unprivileged-user/cron.log
chown unprivileged-user:unprivileged-user /home/unprivileged-user/cron.log
chown -R unprivileged-user:unprivileged-user /home/unprivileged-user
/usr/sbin/cron -l 4 && tail -f /home/unprivileged-user/cron.log
2 changes: 1 addition & 1 deletion config/croissant-converter/Dockerfile
@@ -1,4 +1,4 @@
from openml/croissant-converter
FROM openml/croissant-converter:v1.1.20241018

COPY cron /etc/cron.d/openml
COPY run-cron.sh /run-cron.sh
2 changes: 1 addition & 1 deletion config/croissant-converter/cron
Original file line number Diff line number Diff line change
@@ -1 +1 @@
* * * * * /generate_croissants.sh >> /home/unprivileged-user/cron.log 2>&1
* * * * * /generate_croissants.sh >> /home/unprivileged-user/logs/cron.log 2>&1
24 changes: 22 additions & 2 deletions config/database/update.sh
@@ -2,7 +2,27 @@
# Change the filepath of openml.file
# from "https://www.openml.org/data/download/1666876/phpFsFYVN"
# to "http://minio:9000/datasets/0000/0001/phpFsFYVN"
mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1));'
mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1)) WHERE extension="arff";'

# Update openml.expdb.dataset with the same url
mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'





# Create the data_feature_description TABLE. TODO: can we make sure this table exists already?
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'CREATE TABLE IF NOT EXISTS `data_feature_description` (
`did` int unsigned NOT NULL,
`index` int unsigned NOT NULL,
`uploader` mediumint unsigned NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`description_type` enum("plain", "ontology") NOT NULL,
`value` varchar(256) NOT NULL,
KEY `did` (`did`,`index`),
CONSTRAINT `data_feature_description_ibfk_1` FOREIGN KEY (`did`, `index`) REFERENCES `data_feature` (`did`, `index`) ON DELETE CASCADE ON UPDATE CASCADE
)'

# SET dataset 1 to active (used in unittests java)
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'INSERT IGNORE INTO dataset_status VALUES (1, "active", "2024-01-01 00:00:00", 1)'
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'DELETE FROM dataset_status WHERE did = 2 AND status = "deactivated";'
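
The filepath rewrite in the first `UPDATE` statement above combines `CONCAT`, `LPAD` and `SUBSTRING_INDEX`. The following is a minimal Python sketch of the same string transformation, intended only to make the expected result easy to check; `minio_filepath` is a hypothetical name and the sketch does not touch the database.

```python
MINIO_PREFIX = "http://minio:9000/datasets/0000/"


def minio_filepath(file_id: int, filepath: str) -> str:
    """Mirror CONCAT(prefix, LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1))."""
    # LPAD(id, 4, "0"): left-pad with zeros to 4 characters. MySQL also shortens
    # longer values to the first 4 characters, which the slice reproduces.
    padded_id = str(file_id).rjust(4, "0")[:4]
    # SUBSTRING_INDEX(filepath, "/", -1): everything after the last "/".
    filename = filepath.rsplit("/", 1)[-1]
    return f"{MINIO_PREFIX}{padded_id}/{filename}"


print(minio_filepath(1, "https://www.openml.org/data/download/1666876/phpFsFYVN"))
# prints http://minio:9000/datasets/0000/0001/phpFsFYVN
```

Note that `LPAD` shortens ids with more than four digits, so the fixed `0000/` prefix plus a four-character id only yields distinct paths for ids below 10000.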
7 changes: 1 addition & 6 deletions config/elasticsearch/.env
@@ -1,8 +1,3 @@
ELASTIC_PASSWORD=default
discovery.type=single-node
xpack.security.enabled=false
http.cors.allow-origin=http://localhost:5000
http.cors.enabled=true
http.cors.allow-credentials=true
http.cors.allow-methods=OPTIONS, POST
http.cors.allow-headers=X-Requested-With, X-Auth-Token, Content-Type, Content-Length, Authorization, Access-Control-Allow-Headers, Accept
xpack.security.enabled=false
Binary file modified config/email-server/imapsql.db
Binary file not shown.
Binary file removed config/email-server/imapsql.db-shm
Binary file not shown.
Binary file removed config/email-server/imapsql.db-wal
Binary file not shown.
4 changes: 4 additions & 0 deletions config/evaluation-engine/.env
@@ -0,0 +1,4 @@
CONFIG=api_key=AD000000000000000000000000000000;server=http://php-api:80/
JAVA=/usr/local/openjdk-11/bin/java
JAR=/usr/local/lib/evaluation-engine.jar
LOG_DIR=/home/unprivileged-user/logs
11 changes: 11 additions & 0 deletions config/evaluation-engine/Dockerfile
@@ -0,0 +1,11 @@
FROM openml/evaluation-engine:v1.0.20241025

COPY cron /etc/cron.d/openml
COPY run-cron.sh /run-cron.sh

USER root
RUN apt update && apt upgrade -y
RUN apt -y install cron
RUN chmod +x /etc/cron.d/openml

RUN crontab -u unprivileged-user /etc/cron.d/openml
4 changes: 4 additions & 0 deletions config/evaluation-engine/cron
@@ -0,0 +1,4 @@
* * * * * $JAVA -jar $JAR -config $CONFIG -f process_dataset -v >> ${LOG_DIR}/process_dataset.log 2>&1
* * * * * $JAVA -jar $JAR -config $CONFIG -f evaluate_run -v >> ${LOG_DIR}/evaluate_run.log 2>&1
* * * * * $JAVA -jar $JAR -config $CONFIG -f extract_features_simple -v >> ${LOG_DIR}/extract_features_simple.log 2>&1
*/10 * * * * $JAVA -jar $JAR -config $CONFIG -f extract_features_all -v >> ${LOG_DIR}/extract_features_all.log 2>&1
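
The schedule above runs three functions every minute and `extract_features_all` every ten minutes. The following is a minimal sketch of how the minute field decides which jobs are due; it handles only the forms used here (`*`, `*/n`, and a plain number) and is not a full cron parser. `minute_matches`, `JOBS` and `jobs_due` are hypothetical names.

```python
# Minute field per evaluation-engine function, as in the cron file above.
JOBS = {
    "process_dataset": "*",
    "evaluate_run": "*",
    "extract_features_simple": "*",
    "extract_features_all": "*/10",
}


def minute_matches(field: str, minute: int) -> bool:
    """Check a cron minute field against a minute (0-59); supports '*', '*/n', 'n'."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return minute % int(field[2:]) == 0
    return int(field) == minute


def jobs_due(minute: int) -> list:
    """Return the names of the functions that would be started at this minute."""
    return [name for name, field in JOBS.items() if minute_matches(field, minute)]


print(jobs_due(7))   # prints ['process_dataset', 'evaluate_run', 'extract_features_simple']
print(jobs_due(20))  # prints all four job names
```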
7 changes: 7 additions & 0 deletions config/evaluation-engine/run-cron.sh
@@ -0,0 +1,7 @@
#!/bin/bash

printenv | grep -v HOME >> /etc/environment

touch /home/unprivileged-user/cron.log
chown unprivileged-user:unprivileged-user /home/unprivileged-user/cron.log
/usr/sbin/cron -l 4 && tail -f /home/unprivileged-user/cron.log
7 changes: 0 additions & 7 deletions config/frontend/.env
@@ -7,10 +7,3 @@ SMTP_PASS=test
SMTP_USE_TLS=True
DATABASE_URI="mysql+pymysql://root:ok@database:3306/openml"
TESTING=False






m2srmg=WD7-w52'
8 changes: 8 additions & 0 deletions config/nginx/Dockerfile
@@ -0,0 +1,8 @@
FROM nginx:alpine

WORKDIR /etc/nginx
COPY ./nginx.conf ./conf.d/default.conf
COPY ./shared.conf ./shared.conf
EXPOSE 80
ENTRYPOINT [ "nginx" ]
CMD [ "-g", "daemon off;" ]
44 changes: 44 additions & 0 deletions config/nginx/nginx.conf
@@ -0,0 +1,44 @@
# See https://dev.to/danielkun/nginx-everything-about-proxypass-2ona#let-nginx-start-even-when-not-all-upstream-hosts-are-available for reason of weird regexes:
# we want NGINX to go up even when some of the locations are not reachable, and we want the subpath to be given to the upstream.





server {
    listen 80;
    server_name localhost;
    resolver 127.0.0.11;

    location ~ ^/es(?:\/(.*))?$ {
        include /etc/nginx/shared.conf;
        set $upstream_es http://elasticsearch:9200;
        # $is_args only yields "?" (or nothing); $args carries the query string itself.
        proxy_pass $upstream_es/$1$is_args$args;
    }


    location ~ ^/api_splits/(.*)$ {
        include /etc/nginx/shared.conf;
        set $upstream_api http://php-api:80/api_splits;
        proxy_pass $upstream_api/$1$is_args$args;
    }

    location ~ ^/api(?:\/(.*))$ {
        include /etc/nginx/shared.conf;
        set $upstream_api http://php-api:80/api;
        proxy_pass $upstream_api/$1$is_args$args;
    }

    location ~ ^/data/(.*)$ {
        include /etc/nginx/shared.conf;
        set $upstream_data http://minio:9000;
        proxy_pass $upstream_data/$1$is_args$args;
    }

    location ~ ^(?:\/(.*))?$ {
        include /etc/nginx/shared.conf;
        set $upstream_f http://frontend:5000;
        proxy_pass $upstream_f/$1$is_args$args;
    }

}
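
NGINX tries regex locations in the order they appear and uses the first match, so the order of the blocks above matters. The following is a minimal Python sketch that reproduces the routing with the same patterns, to make it easy to see where a given path ends up. `ROUTES` and `route` are hypothetical names; query-string handling and headers are left out.

```python
import re

# Same patterns and upstreams as the location blocks above, in the same order.
ROUTES = [
    (re.compile(r"^/es(?:/(.*))?$"), "http://elasticsearch:9200"),
    (re.compile(r"^/api_splits/(.*)$"), "http://php-api:80/api_splits"),
    (re.compile(r"^/api(?:/(.*))$"), "http://php-api:80/api"),
    (re.compile(r"^/data/(.*)$"), "http://minio:9000"),
    (re.compile(r"^(?:/(.*))?$"), "http://frontend:5000"),
]


def route(path: str):
    """Return the upstream URL for a request path, or None if nothing matches."""
    for pattern, upstream in ROUTES:
        match = pattern.match(path)
        if match:
            # group(1) is None when the optional subpath is absent (e.g. "/es").
            return f"{upstream}/{match.group(1) or ''}"
    return None


print(route("/es/_search"))         # prints http://elasticsearch:9200/_search
print(route("/api/v1/xml/data/1"))  # prints http://php-api:80/api/v1/xml/data/1
print(route("/search"))             # prints http://frontend:5000/search
```

One consequence visible in the sketch: a bare `/api` without a trailing slash does not match the API pattern and falls through to the frontend.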
4 changes: 4 additions & 0 deletions config/nginx/shared.conf
@@ -0,0 +1,4 @@
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
1 change: 1 addition & 0 deletions config/php/.env
@@ -1,3 +1,4 @@
API_KEY=AD000000000000000000000000000000
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=database:3306
2 changes: 2 additions & 0 deletions config/python/config
@@ -0,0 +1,2 @@
apikey=AD000000000000000000000000000000
server=http://nginx:80/api/v1/xml
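
The config file above consists of plain `key=value` lines. As a minimal sketch for checking which server and key a container is pointed at, the following helper reads such lines into a dictionary. `parse_config` is a hypothetical name, and this is not how openml-python itself loads its configuration.

```python
def parse_config(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


example = "apikey=AD000000000000000000000000000000\nserver=http://nginx:80/api/v1/xml\n"
print(parse_config(example)["server"])  # prints http://nginx:80/api/v1/xml
```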
Empty file added data/php/.gitkeep
Empty file.