Improve augment request handling to batch requests rather than running them serially #162

Open

seasidesparrow opened this issue Jun 28, 2021 · 1 comment

@seasidesparrow (Member)

Currently, the master pipeline receives and processes augment pipeline requests serially, so only one celery worker handles requests on both the augment and master pipelines. We should also use the load_only argument to avoid loading and sending the fulltext field.
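
A minimal sketch of the load_only idea in plain SQLAlchemy; the Records model, its column names, and the get_record signature below are illustrative placeholders, not the actual adsmp code:

```python
from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.orm import declarative_base, load_only, sessionmaker

Base = declarative_base()

class Records(Base):
    # stand-in for the adsmp records table; column names are assumed
    __tablename__ = 'records'
    bibcode = Column(String(19), primary_key=True)
    bib_data = Column(Text)
    fulltext = Column(Text)  # large column we want to leave in the database

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def get_record(bibcode, load_only_fields=None):
    """Fetch one record; when load_only_fields is given, defer every other column."""
    with Session() as session:
        query = session.query(Records).filter_by(bibcode=bibcode)
        if load_only_fields:
            cols = [getattr(Records, name) for name in load_only_fields]
            query = query.options(load_only(*cols))
        return query.first()

# the augment request only needs the bib data, so skip fulltext entirely
record = get_record('2021ApJ...999...99X', load_only_fields=['bibcode', 'bib_data'])
```

Deferring fulltext keeps the large text column on the database server, so each augment request only moves the bib data it actually needs.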

Discussion from Slack (SMD+MT):

SMD: I think we could easily speed up this process. It looks like bibcodes are sent to augment one at a time, which incurs the queueing overhead a huge number of times. If app.request_aff_augment could handle a list of bibcodes, it could package the requests into a list protobuf object:
https://github.com/adsabs/ADSMasterPipeline/blob/41f874a33915b1f972b938316954849e3f2f1070/adsmp/app.py#L486
https://github.com/adsabs/ADSPipelineMsg/blob/master/specs/augmentrecord.proto#L15
The call to get_record in app.request_aff_augment should also pass the optional load_only argument, since augment only needs the bib data and fulltext is big. If that doesn't help enough, we can request multiple database records at once. We can also have run.py simply queue batches of bibcodes and use workers to read the data from postgres and send off the augment requests.
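
A rough, self-contained sketch of the batching idea; the AugmentRequest / AugmentRequestList dataclasses and the get_record / send callables are stand-ins for the real protobuf messages in augmentrecord.proto and the corresponding app methods, whose names may differ:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class AugmentRequest:
    """Stand-in for a single augment request message (augmentrecord.proto)."""
    bibcode: str
    aff: List[str] = field(default_factory=list)


@dataclass
class AugmentRequestList:
    """Stand-in for the list-type message referenced at augmentrecord.proto#L15."""
    requests: List[AugmentRequest] = field(default_factory=list)


BATCH_SIZE = 100  # tune against broker/queueing overhead


def chunked(bibcodes: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield fixed-size slices of the input bibcode list."""
    for i in range(0, len(bibcodes), size):
        yield bibcodes[i:i + size]


def queue_augment_batches(bibcodes: List[str],
                          get_record: Callable[[str], Optional[Dict]],
                          send: Callable[[AugmentRequestList], None]) -> None:
    """Package each batch of bibcodes into one list message and queue it once,
    rather than queueing one message per bibcode."""
    for batch in chunked(bibcodes):
        msg = AugmentRequestList()
        for bibcode in batch:
            rec = get_record(bibcode)  # should load only bib data (see sketch above)
            if rec:
                msg.requests.append(AugmentRequest(bibcode=bibcode,
                                                   aff=rec.get('aff', [])))
        if msg.requests:
            send(msg)  # one broker round-trip per batch instead of per bibcode
```

With a batch size around 100, the queueing overhead is paid once per batch rather than once per bibcode, which is the cost SMD points at above.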

MT: That makes sense according to what I saw on the container: Without making use of the delay function in ADSAffil.tasks, the load was about 0.7, which sounds about right for single-threaded operation. With the delay function, load went up to about 2.2, which again makes sense if the receive, augment, and update queues are all running simultaneously. And it also makes sense that adjusting the number of workers within augment_pipeline makes no difference.

@seasidesparrow (Member Author)

Discussion with SBC & RC in early Feb 2022 suggests that not loading fulltext may provide a larger speed improvement than bundling multiple records into List protobufs.
