Improve augment request handling to batch requests rather than running them serially #162

Open

seasidesparrow opened this issue Jun 28, 2021 · 1 comment

@seasidesparrow (Member)

Currently, the master pipeline receives and processes augment pipeline requests serially, so only one celery worker handles requests on both the augment and master pipelines. We should also use the load_only argument to avoid loading and sending the fulltext field.
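
A minimal sketch of the load_only idea in plain SQLAlchemy; the Records model, its column names, and the get_record signature below are illustrative placeholders, not the actual adsmp code:

```python
from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.orm import declarative_base, load_only, sessionmaker

Base = declarative_base()

class Records(Base):
    # stand-in for the adsmp records table; column names are assumed
    __tablename__ = 'records'
    bibcode = Column(String(19), primary_key=True)
    bib_data = Column(Text)
    fulltext = Column(Text)  # large column we want to leave in the database

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def get_record(bibcode, load_only_fields=None):
    """Fetch one record; when load_only_fields is given, defer every other column."""
    with Session() as session:
        query = session.query(Records).filter_by(bibcode=bibcode)
        if load_only_fields:
            cols = [getattr(Records, name) for name in load_only_fields]
            query = query.options(load_only(*cols))
        return query.first()

# the augment request only needs the bib data, so skip fulltext entirely
record = get_record('2021ApJ...999...99X', load_only_fields=['bibcode', 'bib_data'])
```

Deferring fulltext keeps the large text column on the database server, so each augment request only moves the bib data it actually needs.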

Discussion from Slack (SMD+MT):

SMD: I think we could easily speed up this process. It looks like bibcodes are sent to augment one at a time, which incurs the queueing overhead a huge number of times. If app.request_aff_augment could handle a list of bibcodes, it could package the requests into a list protobuf object:
https://github.com/adsabs/ADSMasterPipeline/blob/41f874a33915b1f972b938316954849e3f2f1070/adsmp/app.py#L486
https://github.com/adsabs/ADSPipelineMsg/blob/master/specs/augmentrecord.proto#L15
The call to get_record in app.request_aff_augment should also pass the optional load_only argument, since augment only needs the bib data and fulltext is big. If that doesn't help enough, we can request multiple database records at once. We can also have run.py simply queue batches of bibcodes and use workers to read the data from postgres and send off the augment requests.
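
A rough, self-contained sketch of the batching idea; the AugmentRequest / AugmentRequestList dataclasses and the get_record / send callables are stand-ins for the real protobuf messages in augmentrecord.proto and the corresponding app methods, whose names may differ:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class AugmentRequest:
    """Stand-in for a single augment request message (augmentrecord.proto)."""
    bibcode: str
    aff: List[str] = field(default_factory=list)


@dataclass
class AugmentRequestList:
    """Stand-in for the list-type message referenced at augmentrecord.proto#L15."""
    requests: List[AugmentRequest] = field(default_factory=list)


BATCH_SIZE = 100  # tune against broker/queueing overhead


def chunked(bibcodes: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield fixed-size slices of the input bibcode list."""
    for i in range(0, len(bibcodes), size):
        yield bibcodes[i:i + size]


def queue_augment_batches(bibcodes: List[str],
                          get_record: Callable[[str], Optional[Dict]],
                          send: Callable[[AugmentRequestList], None]) -> None:
    """Package each batch of bibcodes into one list message and queue it once,
    rather than queueing one message per bibcode."""
    for batch in chunked(bibcodes):
        msg = AugmentRequestList()
        for bibcode in batch:
            rec = get_record(bibcode)  # should load only bib data (see sketch above)
            if rec:
                msg.requests.append(AugmentRequest(bibcode=bibcode,
                                                   aff=rec.get('aff', [])))
        if msg.requests:
            send(msg)  # one broker round-trip per batch instead of per bibcode
```

With a batch size around 100, the queueing overhead is paid once per batch rather than once per bibcode, which is the cost SMD points at above.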

MT: That makes sense according to what I saw on the container: Without making use of the delay function in ADSAffil.tasks, the load was about 0.7, which sounds about right for single-threaded operation. With the delay function, load went up to about 2.2, which again makes sense if the receive, augment, and update queues are all running simultaneously. And it also makes sense that adjusting the number of workers within augment_pipeline makes no difference.

@seasidesparrow (Member Author)

Discussion with SBC & RC in early Feb 2022 suggests that not loading fulltext may provide a larger speed improvement than bundling multiple records into List protobufs.
