feat: api to s3 (#299)
* feat: api to s3

* gzip content

* check if already exist
polomarcus authored Dec 13, 2024
1 parent 6a08b74 commit fed88aa
Showing 9 changed files with 523 additions and 9 deletions.
13 changes: 9 additions & 4 deletions .github/workflows/deploy-main.yml
@@ -55,11 +55,16 @@ jobs:
- name: Push mediatree_import Image
run: docker push --all-tags ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/mediatree_import

- name: Build ingest_to_db image
run: docker build -f Dockerfile_ingest . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
- name: Push ingest_to_db Image
run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
# Not used anymore
# - name: Build ingest_to_db image
# run: docker build -f Dockerfile_ingest . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
# - name: Push ingest_to_db Image
# run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db

- name: Build s3 image
run: docker build -f Dockerfile_api_to_s3 . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3
- name: Push s3 Image
run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3
- name: update scaleway job definition with version mediatree_import
uses: jawher/[email protected]
env:
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
secrets/pwd_api.txt
secrets/username_api.txt
secrets/*
documents-experts/
cc-bio.json
*.xlsx
47 changes: 47 additions & 0 deletions Dockerfile_api_to_s3
@@ -0,0 +1,47 @@
#from https://medium.com/@albertazzir/blazing-fast-python-docker-builds-with-poetry-a78a66f5aed0
FROM python:3.12.7 as builder

ENV VIRTUAL_ENV=/app/.venv

ENV POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_IN_PROJECT=1 \
POETRY_VIRTUALENVS_CREATE=1 \
POETRY_CACHE_DIR=/tmp/poetry_cache

WORKDIR /app

COPY pyproject.toml poetry.lock ./

RUN pip install poetry==1.8.3

RUN poetry install

# The runtime image, used to just run the code provided its virtual environment
FROM python:3.12.7-slim as runtime

WORKDIR /app

ENV VIRTUAL_ENV=/app/.venv
ENV PATH="/app/.venv/bin:$PATH"
ENV PATH="$PYENV_ROOT/bin:$PATH"
ENV PYTHONPATH=/app

COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}

# App code is also included via docker-compose
COPY quotaclimat ./quotaclimat
COPY postgres ./postgres
COPY pyproject.toml pyproject.toml
COPY alembic/ ./alembic
COPY alembic.ini ./alembic.ini
COPY transform_program.py ./transform_program.py

# healthcheck
EXPOSE 5050

# Use a separate script to handle migrations and start the application
COPY docker-entrypoint.sh ./docker-entrypoint.sh
RUN chmod +x ./docker-entrypoint.sh


ENTRYPOINT ["python", "quotaclimat/data_processing/mediatree/s3/api_to_s3.py"]
10 changes: 10 additions & 0 deletions README.md
@@ -384,6 +384,16 @@ Program data will not be updated to avoid lock concurrent issues when using `UPD

**With the docker-entrypoint.sh this command is done automatically, so for production uses, you will not have to run this command.**

# Mediatree to S3
As a safety net, we have configured a data pipeline from the Mediatree API to S3 (Scaleway Object Storage).
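
The commit message mentions two behaviours for this pipeline: content is gzipped, and an object is skipped if it already exists. The `api_to_s3.py` source is not shown in this diff, so the following is only a minimal sketch of that idea. The key layout, the function names, and the use of a plain `dict` in place of a real bucket are all assumptions made for illustration.

```python
import gzip
import json


def object_key(channel: str, start_ts: int) -> str:
    """Build a hypothetical object key; the real layout is not shown in this commit."""
    return f"channel={channel}/start={start_ts}.json.gz"


def upload_if_missing(store: dict, key: str, records: list) -> bool:
    """Gzip `records` as JSON and store them under `key` unless it already exists.

    `store` is a plain dict standing in for an S3 bucket (assumption).
    Returns True when something was written, False when the key was present.
    """
    if key in store:  # "check if already exist": do not redo finished work
        return False
    payload = json.dumps(records).encode("utf-8")
    store[key] = gzip.compress(payload)  # "gzip content"
    return True


def download(store: dict, key: str) -> list:
    """Read back and decompress an object written by upload_if_missing."""
    return json.loads(gzip.decompress(store[key]).decode("utf-8"))
```

Checking for the key before calling the API again makes a re-run of the job idempotent, which is presumably why the check was added.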

Environment variables used:
* START_DATE (Unix timestamp, same as for the mediatree service)
* CHANNEL (same as for the mediatree service)
* BUCKET: Scaleway access key
* BUCKET_SECRET: Scaleway secret key
* BUCKET_NAME
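
As a rough illustration of how these variables could be consumed, the sketch below parses them into a small configuration dictionary. The function names and the returned keys are hypothetical. Treating an unset `START_DATE` or `CHANNEL` as "no filter" is also an assumption, not something this commit confirms.

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional


def parse_start_date(raw: Optional[str]) -> Optional[datetime]:
    """Convert a Unix timestamp string to an aware UTC datetime; None if unset."""
    if raw is None or raw.strip() == "":
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the variables listed above (hypothetical helper)."""
    return {
        "start_date": parse_start_date(env.get("START_DATE")),  # optional
        "channel": env.get("CHANNEL"),                          # optional
        "bucket_name": env["BUCKET_NAME"],                      # required
        "access_key": env["BUCKET"],                            # required
        "secret_key": env["BUCKET_SECRET"],                     # required
    }
```

Passing the environment as a parameter rather than reading `os.environ` directly keeps the helper easy to test with a plain dictionary.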

## Production monitoring
* Use scaleway
* Use [Ray dashboard] on port 8265
66 changes: 62 additions & 4 deletions docker-compose.yml
@@ -49,8 +49,18 @@ services:
POSTGRES_PORT: 5432
COMPARE_DURATION: "true"
MODIN_ENGINE: ray
MEDIATREE_USER : /run/secrets/username_api
MEDIATREE_PASSWORD: /run/secrets/pwd_api
BUCKET: /run/secrets/bucket
BUCKET_NAME: mediatree
BUCKET_SECRET: /run/secrets/bucket_secret
MODIN_CPUS: 4 # "https://modin.readthedocs.io/en/0.11.0/using_modin.html#reducing-or-limiting-the-resources-modin-can-use"
tty: true # colorize terminal
secrets:
- pwd_api
- username_api
- bucket
- bucket_secret
volumes:
- ./quotaclimat/:/app/quotaclimat/
- ./postgres/:/app/postgres/
@@ -176,6 +186,50 @@ services:
postgres_db:
condition: service_healthy

api_to_s3:
ports:
- 5666:5666
- 8265:8265
build:
context: ./
dockerfile: Dockerfile_api_to_s3
environment:
ENV: docker # change me to prod for real cases
LOGLEVEL: DEBUG # Change me to info (debug, info, warning, error) to have less log
PYTHONPATH: /app
PORT_HS: 5666 # healthcheck
HEALTHCHECK_SERVER: "0.0.0.0"
# SENTRY_DSN: prod_only
#END_DATE: "2024-02-29" # optional - otherwise end of the month
# START_DATE: 1727610071 # to test batch import
CHANNEL : fr3-idf # to reimport only one channel
MEDIATREE_USER : /run/secrets/username_api
MEDIATREE_PASSWORD: /run/secrets/pwd_api
BUCKET: /run/secrets/bucket
BUCKET_SECRET: /run/secrets/bucket_secret
BUCKET_NAME: mediatree
MEDIATREE_AUTH_URL: https://keywords.mediatree.fr/api/auth/token/
KEYWORDS_URL: https://keywords.mediatree.fr/api/subtitle/ # https://keywords.mediatree.fr/docs/#api-Subtitle-SubtitleList
MODIN_ENGINE: ray
MODIN_CPUS: 4 # "https://modin.readthedocs.io/en/0.11.0/using_modin.html#reducing-or-limiting-the-resources-modin-can-use"
MODIN_MEMORY: 1000000000 # 1Gb
RAY_memory_usage_threshold: 1
mem_limit: "1G"
volumes:
- ./quotaclimat/:/app/quotaclimat/
- ./postgres/:/app/postgres/
- ./test/:/app/test/
secrets:
- pwd_api
- username_api
- bucket
- bucket_secret
depends_on:
nginxtest:
condition: service_healthy
postgres_db:
condition: service_healthy

metabase:
container_name: metabase_barometre
image: metabase/metabase:latest
@@ -197,7 +251,11 @@ services:
condition: service_healthy

secrets: # https://docs.docker.com/compose/use-secrets/
pwd_api:
file: secrets/pwd_api.txt
username_api:
file: secrets/username_api.txt
pwd_api:
file: secrets/pwd_api.txt
username_api:
file: secrets/username_api.txt
bucket:
file: secrets/scw_bucket.txt
bucket_secret:
file: secrets/scw_bucket_secret.txt
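
In the compose file above, variables such as `BUCKET` and `BUCKET_SECRET` hold a path under `/run/secrets/` rather than the secret itself. One way to handle that is sketched below, assuming the application resolves the path at startup. `resolve_secret` is a hypothetical helper, and the project's actual handling is not visible in this diff.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the secret referenced by the environment variable `name`.

    If the value is a path to an existing file (as with Docker Compose
    secrets mounted under /run/secrets/), return the stripped file content.
    Otherwise treat the value itself as the secret.
    """
    value = env[name]
    path = Path(value)
    if path.is_file():
        return path.read_text(encoding="utf-8").strip()
    return value
```

Falling back to the raw value lets the same code run in environments that inject the key directly instead of mounting a file.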

1 comment on commit fed88aa

@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| **postgres** | | | | |
| insert_data.py | 43 | 7 | 84% | 36–38, 56–58, 63 |
| insert_existing_data_example.py | 19 | 3 | 84% | 25–27 |
| **postgres/schemas** | | | | |
| models.py | 168 | 11 | 93% | 137–144, 157, 159–160, 225–226, 240–241 |
| **quotaclimat/data_ingestion** | | | | |
| scrap_sitemap.py | 134 | 17 | 87% | 27–28, 33–34, 66–71, 95–97, 138–140, 202, 223–228 |
| **quotaclimat/data_ingestion/ingest_db** | | | | |
| ingest_sitemap_in_db.py | 55 | 37 | 33% | 21–42, 45–58, 62–73 |
| **quotaclimat/data_ingestion/scrap_html** | | | | |
| scrap_description_article.py | 36 | 3 | 92% | 19–20, 32 |
| **quotaclimat/data_processing/mediatree** | | | | |
| api_import.py | 213 | 133 | 38% | 44–48, 53–74, 78–81, 87, 90–132, 138–153, 158, 171–183, 187–193, 206–218, 221–225, 231, 269–270, 273–304, 307–309 |
| channel_program.py | 162 | 57 | 65% | 21–23, 34–36, 53–54, 57–59, 98–99, 108, 124, 175–216 |
| config.py | 15 | 2 | 87% | 7, 16 |
| detect_keywords.py | 252 | 16 | 94% | 111–118, 126–127, 271, 341–348, 390 |
| update_pg_keywords.py | 67 | 49 | 27% | 15–130, 154, 157, 164–179, 213–250, 257 |
| utils.py | 79 | 25 | 68% | 29–53, 56, 65, 86–87, 117–120 |
| **quotaclimat/data_processing/mediatree/s3** | | | | |
| api_to_s3.py | 133 | 79 | 41% | 68–86, 89–97, 100–149, 152–178, 181–183 |
| **quotaclimat/utils** | | | | |
| healthcheck_config.py | 29 | 14 | 52% | 22–24, 27–38 |
| logger.py | 24 | 11 | 54% | 22–24, 28–37 |
| sentry.py | 11 | 2 | 82% | 22–23 |
| **TOTAL** | 1467 | 466 | 68% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 104 | 0 💤 | 0 ❌ | 0 🔥 | 7m 45s ⏱️ |
