[WIP]Celery spider #156
Draft
cronosnull wants to merge 37 commits into dmwm:master from cronosnull:CelerySpider
Conversation
A docker-compose configuration for a Celery-based spider. The tests/celery_test.py script queries all the schedd queues and sends messages to the test broker.

# How to run:
```bash
docker-compose up --scale spider-worker=3
```
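A minimal sketch of what such a compose layout could look like. The service names, images, and command are illustrative assumptions, not the contents of the actual file:

```yaml
# Hypothetical docker-compose sketch for a Celery-based spider.
version: "3"
services:
  redis:                      # broker shared by all the workers
    image: redis:5
  spider-worker:              # scaled with --scale spider-worker=3
    build: .
    command: celery -A spider worker --loglevel=INFO
    environment:
      CELERY_BROKER_URL: redis://redis:6379/0
    depends_on:
      - redis
```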
As we changed the user id in the dockerfile before the rebase, we should update the docker-compose file too.
Adding support for the history queries.
These modules are being replaced by the Celery tasks.
* Adding support to ES
* [Celery] Adding some documentation and small style changes.
* [Celery][ES] Fixing the index assignment.
* [Celery][ES] Moving the post ads to a new task. In order to improve performance (related to the AMQ frequency), we can externalize the process that sends the data to ES, as it can otherwise affect the rate at which we send data to AMQ.
* [Celery][ES] Changing the format of the conf file
It turns out that, after a restart of the worker process, the first message was serialized as JSON and failed, because classads are not JSON serializable. Explicitly setting the serializer on the tasks prevents this.
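The failure mode can be reproduced with plain Python: an object that is not JSON serializable (standing in for a classad) breaks JSON serialization but round-trips fine through pickle. This is a sketch of the underlying issue, not the actual spider code:

```python
import json
import pickle

class ClassAd:
    """Stand-in for an HTCondor classad: an arbitrary object json can't handle."""
    def __init__(self, attrs):
        self.attrs = attrs

    def __eq__(self, other):
        return self.attrs == other.attrs

ad = ClassAd({"JobStatus": 2, "Owner": "cms"})

# JSON serialization fails for arbitrary objects...
try:
    json.dumps(ad)
    json_ok = True
except TypeError:
    json_ok = False

# ...while pickle round-trips them intact.
restored = pickle.loads(pickle.dumps(ad))
```

In Celery the equivalent fix is to pass the serializer explicitly on the task decorator (e.g. `@app.task(serializer="pickle")`), so the first task after a worker restart does not fall back to the default JSON serializer.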
K8S (and OpenStack) add some environment variables to the containers, so we need to change the naming to avoid conflicts (e.g. flower uses a FLOWER_PORT variable, which conflicts with the {POD_NAME}_PORT variable in K8S).
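One common way to avoid such collisions is to namespace the application's variables under a dedicated prefix, so Kubernetes' auto-injected `{POD_NAME}_PORT`-style variables cannot shadow them. A sketch under that assumption; the `SPIDER_` prefix and variable names are illustrative:

```python
import os

PREFIX = "SPIDER_"  # hypothetical application-specific prefix

def app_config(environ=os.environ):
    """Collect only the variables under our prefix, ignoring
    K8S-injected ones such as FLOWER_PORT / <POD_NAME>_PORT."""
    return {k[len(PREFIX):]: v
            for k, v in environ.items()
            if k.startswith(PREFIX)}

env = {"FLOWER_PORT": "tcp://10.0.0.1:5555",  # injected by K8S, ignored
       "SPIDER_FLOWER_PORT": "5555"}          # our setting, kept
cfg = app_config(env)
```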
cronosnull force-pushed the CelerySpider branch 4 times, most recently from e26d6dc to f0073b6 on May 28, 2020 18:57
(black applied)
cronosnull force-pushed the CelerySpider branch 4 times, most recently from 4020976 to 8099372 on June 2, 2020 09:45
Creation of the share folder and cleanup of the environment in the affiliation cron (as it will not need most of the secrets)
As it was before, the affiliation manager would be created on first module load (which will cause problems in the k8s setup, as it may not exist).
Reducing the required resources, adding a shared redis storage, and adding RuntimeError to the exceptions to retry on in the schedd query.
For development purposes this is convenient.
It turns out that the problem in the worker was caused by flower configuration.
cronosnull force-pushed the CelerySpider branch 3 times, most recently from 42fe429 to 25cf493 on June 4, 2020 14:33
Using multiple queues, tasks of different types can be served in parallel.
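Celery supports this through task routes: mapping task names to queue names lets workers subscribed to different queues consume in parallel. A sketch with hypothetical task names (the real task names in this PR may differ):

```python
# Hypothetical routing table: each task type gets its own queue, so
# workers subscribed to different queues can run in parallel.
task_routes = {
    "spider.query_schedd": {"queue": "queries"},
    "spider.post_ads_es":  {"queue": "es_upload"},
    "spider.send_to_amq":  {"queue": "amq"},
}

def queue_for(task_name, routes=task_routes, default="celery"):
    """Resolve a task name to its queue, falling back to Celery's default queue."""
    return routes.get(task_name, {}).get("queue", default)
```

With Celery this dict would be assigned to `app.conf.task_routes`, and each worker started with `-Q <queue>` consumes only its own queue.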
cronosnull force-pushed the CelerySpider branch 2 times, most recently from c969024 to 83516ca on June 4, 2020 21:30
Making part of the process synchronous (parallelizing only by schedd) will create fewer and smaller messages and, with a small number of workers, will perform better.
cronosnull force-pushed the CelerySpider branch 3 times, most recently from f562be7 to 920bb35 on June 9, 2020 14:03
In order to remove the most common warning from the log we can try to solve it (and then, ignore it).
Rolling back to prefetch multiplier 1 (message waiting time is lower, as task durations are uneven).
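In Celery this is the `worker_prefetch_multiplier` setting: with the default of 4, each worker process reserves several messages at once, so with uneven task durations a prefetched message can sit behind a slow task while another worker is idle. A minimal config sketch; in the real app these values would be applied to the Celery app (e.g. via `app.conf.update(**worker_settings)`):

```python
# Sketch of the relevant worker setting, not the actual spider configuration.
worker_settings = {
    # Reserve one message per worker process at a time: with uneven task
    # durations, prefetched messages would otherwise wait behind a slow task.
    "worker_prefetch_multiplier": 1,
}
```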
This will make it easier to update the schema in the docker image, as it will be automatically updated on image build (which is triggered by a commit to the repository).
The celery version of the spider is a rewrite of the current spider using Celery/Redis to make it scalable in a containerized environment (such as k8s).
The docker-compose file will create the appropriate services and run the application. Before running, you will need to adjust the paths in the secrets section.
Design decisions:
Each worker container will run only one Celery worker. Scale by creating new pods.
Work in progress, it still needs:
For testing purposes, using 4 spider workers and one instance of each of the other pods, we get execution times similar to the current spider.