
[WIP] Celery spider #156

Draft · wants to merge 37 commits into master
Conversation

@cronosnull (Collaborator) commented on Feb 5, 2020

The Celery version of the spider is a rewrite of the current spider using Celery/Redis to make it scalable in a containerized environment (such as k8s).


The docker-compose file will create the appropriate services and run the application. Before running, you will need to adjust the paths in the secrets section.
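
For context, a minimal sketch of what the Celery/Redis wiring could look like; the app, broker URL, and task names here are assumptions, not the PR's actual identifiers:

```python
# Hypothetical sketch only; module, service, and task names are assumed.
from celery import Celery

app = Celery(
    "spider",
    broker="redis://redis:6379/0",   # "redis" service from docker-compose
    backend="redis://redis:6379/1",  # result backend on the same instance
)

@app.task
def query_schedd(schedd_name):
    """Query a single schedd; one task per schedd spreads work across pods."""
    ...
```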

Design decisions:
Each worker container runs a single Celery worker; scale out by creating new pods (see the sketch below).
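
A sketch of how that decision might be expressed, reusing the hypothetical app object above:

```python
# One worker process per container; horizontal scaling happens at the
# pod level, e.g. started as: celery -A spider worker --concurrency=1
app.conf.worker_concurrency = 1
```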

Work in progress, it still needs:

  • ES support (Done)
  • Update the documentation
  • Adjust logging
  • Additional testing
  • Parameter tuning
  • Move the validation schema to the shared volume

For testing purposes, using four spider workers and one instance of each of the other pods, we get execution times similar to the current spider.

Remove unused imports
A docker-compose configuration for a Celery-based spider.
The tests/celery_test.py script queries all the schedd queues and sends messages to the test broker.
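
A hedged outline of what such a test script might do (the task import path and names are assumptions):

```python
# Hypothetical outline of tests/celery_test.py; names are assumptions.
import htcondor
from tasks import query_schedd  # the Celery task defined by the spider

# Locate every schedd known to the collector...
schedds = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)

# ...and enqueue one query task per schedd on the (test) broker.
for ad in schedds:
    query_schedd.delay(ad["Name"])
```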
# How to run:

```bash 
docker-compose up --scale spider-worker=3
```
As we changed the user ID in the Dockerfile before the rebase, we should update the docker-compose file too.
Adding support for the history queries.
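
For reference, a minimal sketch of a history query with the htcondor Python bindings; the constraint and projection are illustrative only, the real ones live in the spider:

```python
# Illustrative only; the actual constraint/projection are the spider's.
import htcondor

schedd = htcondor.Schedd()
# Positional args: requirements expression, projection, match limit.
for job_ad in schedd.history("true", ["ClusterId", "JobStatus"], 100):
    print(job_ad.get("ClusterId"))
```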
These modules are being replaced by the Celery tasks.
* Adding support for ES

* [Celery]Adding some documentation and small style changes.

* [Celery][ES] Fixing the index assignment.

* [Celery][ES]Moving the post ads to a new task
To improve performance (related to the AMQ frequency), we can externalize the process that sends data to ES, as it can otherwise affect the rate at which we send data to AMQ.
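
A sketch of that decoupling, assuming the hypothetical task names used above:

```python
# Sketch only; task and queue names are assumptions.
@app.task
def post_ads_to_es(documents):
    """Index documents in ES on its own queue, off the AMQ critical path."""
    ...

@app.task
def process_schedd(schedd_name):
    docs = ...  # query the schedd and convert the ClassAds (omitted)
    # Hand ES posting off to a separate queue instead of doing it inline,
    # so indexing latency cannot throttle the AMQ send rate.
    post_ads_to_es.apply_async(args=[docs], queue="es_post")
```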

* [Celery][ES]Changing the format of the conf file
It turns out that, after a restart of the worker process, the first message would be serialized as JSON and fail, because ClassAds are not JSON serializable. Explicitly setting the serializer on the tasks prevents this.
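
A sketch of what forcing the serializer can look like; the choice of pickle here is an assumption, picked because it round-trips arbitrary Python objects:

```python
# Assumed serializer choice; the PR's actual setting may differ.
app.conf.accept_content = ["pickle", "json"]

# Set per task, as the commit message describes:
@app.task(serializer="pickle")
def post_ads(ads):
    ...
```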
K8s (and OpenStack) add some environment variables to the containers, so we need to change the naming to avoid conflicts (e.g., Flower uses a FLOWER_PORT variable, which conflicts with the {POD_NAME}_PORT variable that K8s injects).
Creation of the shared folder and cleanup of the environment in the affiliation cron job (as it will not need most of the secrets).
As it was before, the affiliation manager would be created on first module load (which would cause problems in the k8s setup, as it may not exist).
Reducing the required resources, adding shared Redis storage, and adding RuntimeError to the exceptions that trigger a retry in the schedd query.
It turns out that the problem in the worker was caused by the Flower configuration.
Using multiple queues, tasks of different types can be served in parallel.
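
One way to express that in Celery, with assumed task and queue names:

```python
# Route each task type to its own queue so workers can consume them
# independently (queue and task names are assumptions).
app.conf.task_routes = {
    "tasks.query_schedd":   {"queue": "schedd"},
    "tasks.post_ads_to_es": {"queue": "es_post"},
}
# A worker can then be pinned to one queue, e.g.:
#   celery -A spider worker -Q schedd
```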
Making part of the process synchronous (parallelizing only by schedd) creates fewer and smaller messages, and performs better with a small number of workers.
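
A sketch of that granularity, with a hypothetical helper:

```python
# One task per schedd; the per-ad work stays synchronous inside it.
@app.task
def process_schedd(schedd_name):
    ads = ...  # query the schedd inline (omitted)
    for ad in ads:
        convert_and_send(ad)  # hypothetical helper; direct call, no .delay()
```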
To remove the most common warning from the log, we can try to resolve it (and then ignore any remaining occurrences).
Rolling back to a prefetch multiplier of 1 (message waiting time is lower, as task durations are uneven).
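
In Celery configuration terms this is:

```python
# Each worker reserves only one message at a time, so a long-running task
# does not hold back queued messages that other workers could serve.
app.conf.worker_prefetch_multiplier = 1
```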
This will make it easier to update the schema in the Docker image, as it will be automatically updated on image build (which is triggered by a commit to the repository).