
Add multi-threading in some capacity #154

Closed

mikix opened this issue Jan 20, 2023 · 2 comments
Assignees: mikix
Labels: enhancement (New feature or request)


mikix commented Jan 20, 2023

Right now, Cumulus ETL will only ever use one CPU core. This is criminally ill-performant.

  • I'm guessing it's easiest to just use Python's threading support, but maybe a job-queue approach makes more sense. (See the sketch after this list.)
  • Be careful when writing to shared state (most notably the codebook).
  • It would be easiest to start by just threading tasks against each other (i.e. not worrying about threading inside a task or non-task work like i2b2 transforms).
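
For illustration, here's a minimal sketch of what "threading tasks against each other" with a guarded codebook might look like. The `Codebook` class and task names here are hypothetical stand-ins, not the real Cumulus ETL API:

```python
import threading

class Codebook:
    """Hypothetical stand-in for the shared codebook -- writes must be serialized."""

    def __init__(self):
        self._mapping = {}
        self._lock = threading.Lock()

    def anonymize(self, real_id: str) -> str:
        # Guard the shared mapping, since multiple task threads may write at once.
        with self._lock:
            return self._mapping.setdefault(real_id, f"fake-{len(self._mapping)}")

def run_task(task_name: str, codebook: Codebook) -> None:
    # Stand-in for one ETL task: read resources, de-identify, write output.
    for i in range(3):
        codebook.anonymize(f"{task_name}-patient-{i}")

codebook = Codebook()
threads = [
    threading.Thread(target=run_task, args=(name, codebook))
    for name in ("condition", "observation", "patient")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```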

This is a break-out of the larger performance issue #109.

@mikix mikix added the enhancement New feature or request label Jan 20, 2023
@mikix mikix self-assigned this Jan 20, 2023

mikix commented Jan 31, 2023

PR #159 is aimed at solving this.


mikix commented Feb 2, 2023

OK, after much testing, I'm going to put this down as inconclusive.

Brief highlights of my testing (which focused on the non-NLP tasks, run over roughly 5M resources generated by `synthea -p 10000`):

  • There was no noticeable difference between using multiple processes (cores) and using multiple threads.
  • Throwing each task into its own thread saved about 10% of the wall-clock time.
  • Throwing batches into threads saved about another 10%. This is basically where we'd read the next few batches ahead of time but still send them to Delta Lake serially; see the sketch after this list. (Note that we couldn't actually send multiple batches at once for the same table: while Delta Lake technically handled that gracefully, it did so by throwing an error and telling the second batch to retry. But just by parallelizing the reads against the writes, we got some boost.)
  • However... both of those just traded CPU for memory, because each thread had its own giant batch. When I then tried to share the batch size among the threads, the performance gain got lost (presumably many tiny Delta Lake merges are not as efficient as one big one).
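
As a rough sketch of that "prefetch reads, serial writes" pattern: the names below are illustrative only, with `read_batch` and `write_batch` standing in for the real batch reader and the Delta Lake merge. The `PREFETCH` bound is what keeps the memory trade-off in check.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

PREFETCH = 3  # how many batches to read ahead (bounds memory use)

def read_batch(n: int) -> list[str]:
    # Stand-in for reading/deserializing one batch of resources.
    return [f"row-{n}-{i}" for i in range(1000)]

def write_batch(rows: list[str]) -> None:
    # Stand-in for a Delta Lake merge, which must stay serial per table.
    pass

with ThreadPoolExecutor(max_workers=PREFETCH) as pool:
    in_flight = collections.deque()
    for n in range(20):
        in_flight.append(pool.submit(read_batch, n))  # kick off the read early
        if len(in_flight) >= PREFETCH:
            write_batch(in_flight.popleft().result())  # write serially, in order
    while in_flight:  # drain the remaining prefetched batches
        write_batch(in_flight.popleft().result())
```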

At this point, I had spent so much time on this that I just threw the refactoring I had done into a PR (#160) and left out the actual multi-threading.

I think this could be revived in the future, but it is not as promising an avenue as I had hoped.

Parallelizing the NLP requests though... I think there's value there. But it doesn't have to be threading -- just sending more than one request to cTAKES at a time, probably. Anyway, that's a separate effort from this ticket, I think. I'll make a note of that in the parent performance investigation ticket and close this.
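
For what it's worth, a minimal sketch of fanning out NLP requests could look like the following. The URL and `run_nlp` body are placeholders (a real version would make an HTTP request to the cTAKES server instead of returning a dummy string):

```python
from concurrent.futures import ThreadPoolExecutor

CTAKES_URL = "http://localhost:8080/ctakes"  # hypothetical endpoint

def run_nlp(note_text: str) -> str:
    # Stand-in for one round trip to cTAKES; a real version might POST
    # note_text to CTAKES_URL and parse the annotations out of the response.
    return f"annotations for {note_text[:20]!r}"

notes = [f"Clinical note number {i}..." for i in range(50)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # Keep several requests in flight at once instead of waiting on each.
    results = list(pool.map(run_nlp, notes))
```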
