
Add multi-threading in some capacity #154

Closed

mikix opened this issue Jan 20, 2023 · 2 comments
Assignees: mikix
Labels: enhancement (New feature or request)


mikix commented Jan 20, 2023

Right now, Cumulus ETL will only ever use one CPU core. This is criminally ill-performant.

  • I'm guessing it's easiest to just use Python's threading support, but maybe a job-queue approach makes more sense. (See the sketch after this list.)
  • Be careful when writing to shared state (most notably the codebook).
  • It would be easiest to start by just threading tasks against each other (i.e. not worrying about threading inside a task or non-task work like i2b2 transforms).
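
For illustration, here's a minimal sketch of what "threading tasks against each other" with a guarded codebook might look like. The `Codebook` class and task names here are hypothetical stand-ins, not the real Cumulus ETL API:

```python
import threading

class Codebook:
    """Hypothetical stand-in for the shared codebook -- writes must be serialized."""

    def __init__(self):
        self._mapping = {}
        self._lock = threading.Lock()

    def anonymize(self, real_id: str) -> str:
        # Guard the shared mapping, since multiple task threads may write at once.
        with self._lock:
            return self._mapping.setdefault(real_id, f"fake-{len(self._mapping)}")

def run_task(task_name: str, codebook: Codebook) -> None:
    # Stand-in for one ETL task: read resources, de-identify, write output.
    for i in range(3):
        codebook.anonymize(f"{task_name}-patient-{i}")

codebook = Codebook()
threads = [
    threading.Thread(target=run_task, args=(name, codebook))
    for name in ("condition", "observation", "patient")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```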

This is a break-out of the larger performance issue #109.

@mikix mikix added the enhancement New feature or request label Jan 20, 2023
@mikix mikix self-assigned this Jan 20, 2023

mikix commented Jan 31, 2023

PR #159 is aimed at solving this.


mikix commented Feb 2, 2023

OK, after much testing, I'm going to put this down as inconclusive.

Brief highlights of my testing (which focused on the non-NLP tasks, run over roughly 5M resources generated by `synthea -p 10000`):

  • There was no noticeable difference between using multiple processes (cores) and using multiple threads.
  • Throwing each task into its own thread saved about 10% of the wall-clock time.
  • Throwing batches into threads saved about another 10%. This is basically where we'd read the next few batches ahead of time but still send them to Delta Lake serially; see the sketch after this list. (Note that we couldn't actually send multiple batches at once for the same table: while Delta Lake technically handled that gracefully, it did so by throwing an error and telling the second batch to retry. But just by parallelizing the reads against the writes, we got some boost.)
  • However... both of those just traded CPU for memory, because each thread had its own giant batch. When I then tried to share the batch size among the threads, the performance gain got lost (presumably many tiny Delta Lake merges are not as efficient as one big one).
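
As a rough sketch of that "prefetch reads, serial writes" pattern: the names below are illustrative only, with `read_batch` and `write_batch` standing in for the real batch reader and the Delta Lake merge. The `PREFETCH` bound is what keeps the memory trade-off in check.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

PREFETCH = 3  # how many batches to read ahead (bounds memory use)

def read_batch(n: int) -> list[str]:
    # Stand-in for reading/deserializing one batch of resources.
    return [f"row-{n}-{i}" for i in range(1000)]

def write_batch(rows: list[str]) -> None:
    # Stand-in for a Delta Lake merge, which must stay serial per table.
    pass

with ThreadPoolExecutor(max_workers=PREFETCH) as pool:
    in_flight = collections.deque()
    for n in range(20):
        in_flight.append(pool.submit(read_batch, n))  # kick off the read early
        if len(in_flight) >= PREFETCH:
            write_batch(in_flight.popleft().result())  # write serially, in order
    while in_flight:  # drain the remaining prefetched batches
        write_batch(in_flight.popleft().result())
```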

At this point, I had spent so much time on this that I just threw the refactoring I had done into a PR (#160) and left out the actual multi-threading.

I think this could be revived in the future, but it is not as promising an avenue as I had hoped.

Parallelizing the NLP requests though... I think there's value there. But it doesn't have to be threading -- just sending more than one request to cTAKES at a time, probably. Anyway, that's a separate effort from this ticket, I think. I'll make a note of that in the parent performance investigation ticket and close this.
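
For what it's worth, a minimal sketch of fanning out NLP requests could look like the following. The URL and `run_nlp` body are placeholders (a real version would make an HTTP request to the cTAKES server instead of returning a dummy string):

```python
from concurrent.futures import ThreadPoolExecutor

CTAKES_URL = "http://localhost:8080/ctakes"  # hypothetical endpoint

def run_nlp(note_text: str) -> str:
    # Stand-in for one round trip to cTAKES; a real version might POST
    # note_text to CTAKES_URL and parse the annotations out of the response.
    return f"annotations for {note_text[:20]!r}"

notes = [f"Clinical note number {i}..." for i in range(50)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # Keep several requests in flight at once instead of waiting on each.
    results = list(pool.map(run_nlp, notes))
```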
