
Investigate cumulus-etl performance #109

Open
mikix opened this issue Dec 22, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@mikix
Contributor

mikix commented Dec 22, 2022

Performance has not been a focus so far, but we should investigate for easy wins and to scope out larger tasks that would help.

From observing it run, it's mostly CPU-bound right now. The investigator should probably do some profiling to see where we are spending that time.

Some thoughts below.

Parallelizing

The ETL could probably be parallelized much more, to take advantage of the many cores typically available in cloud computing. Right now a task over a full set of data is quite slow.

Thoughts:

  • tasks could be run simultaneously (this didn't seem to speed us up as much as hoped, especially if we tried to keep memory usage constant)
  • converting i2b2 to FHIR has no inter-dependencies between rows, so rows could be converted in parallel (see the sketch below)
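
For the second bullet, here's a minimal sketch of what per-row parallelism could look like with a process pool. The `to_fhir_observation` helper and its field names are placeholders for whatever per-row conversion cumulus-etl actually does, not real project code:

```python
from concurrent.futures import ProcessPoolExecutor


def to_fhir_observation(i2b2_row: dict) -> dict:
    """Hypothetical per-row converter: an i2b2 observation_fact row to a FHIR dict."""
    return {
        "resourceType": "Observation",
        "subject": {"reference": f"Patient/{i2b2_row['patient_num']}"},
        "code": {"coding": [{"code": i2b2_row["concept_cd"]}]},
    }


def convert_rows(rows: list[dict]) -> list[dict]:
    # Rows have no inter-dependencies, so they can be fanned out across cores.
    # chunksize keeps per-task process overhead low for large inputs.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(to_fhir_observation, rows, chunksize=1000))
```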

I2b2 Conversion

Hopefully this isn't a huge factor going forward, but it's possible that creating all the FHIR objects and validating them is slowing us down.

This was true! We took out the FHIR object creation and got ~50% faster.

cTAKES

Look into improvements that Andy has for a rewritten cTAKES engine.

@mikix added the "enhancement" (New feature or request) label Dec 22, 2022
@mikix changed the title from "Make cumulus-etl more parallizable" to "Investigate cumulus-etl performance" Dec 23, 2022
@mikix
Contributor Author

mikix commented Jan 20, 2023

I started looking at where we spend time in the code, but realized that even if I optimize it, it's pointless if we're still just using one core. So that's an obvious first step: using multiple cores for tasks. Breakout issue: #154

@mikix
Contributor Author

mikix commented Jan 26, 2023

While investigating, I discovered that ever since we started using the MS de-id tool, we don't really need the internal validation provided by fhirclient. By skipping that de-serialization and re-serialization, the ETL now takes about 30% of the time it used to (for the core CPU-bound tables).

PR here: #157
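
For context, the change boils down to passing resource dicts straight through instead of round-tripping them through fhirclient models. Roughly (a sketch of the idea, not the actual PR code):

```python
from fhirclient.models.patient import Patient


def process_resource_old(resource: dict) -> dict:
    # Old path (sketch): deserialize into a fhirclient model, which validates,
    # then serialize right back out. Pure overhead once the MS de-id tool
    # already guarantees well-formed resources.
    return Patient(resource).as_json()


def process_resource_new(resource: dict) -> dict:
    # New path (sketch): let the dict flow through untouched.
    return resource
```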

This is such a change in our performance profile that I'm going to do more timing testing. But it no longer seems as clear that using a single core is our biggest issue. The biggest consumer of wall-clock time now seems to be Delta Lake, which does use multiple cores.

So the ETL is reading data as fast as it can and shipping it to Delta Lake, which uses multiple cores. In that sense we are already multi-processing. But it's still worth investigating whether more concurrency can improve things further.

@mikix
Contributor Author

mikix commented Jan 26, 2023

Just landed #158, which does the same fhirclient purge, but for i2b2.

@mikix
Contributor Author

mikix commented Feb 2, 2023

I finished up some investigation into multi-threading in #154 and came to the conclusion that it's not worth it right now (See #154 (comment)).

The most likely next win, I think, is sending multiple requests to cTAKES at once. My current thinking is that we could change ctakesclient to use asyncio and then leverage that in cumulus to send multiple requests and wait on them (see the sketch below). But I have not done any testing there.
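
To make the idea concrete, here's a rough sketch of firing several cTAKES requests concurrently with asyncio. The endpoint URL, payload shape, and httpx usage are all assumptions for illustration; the real ctakesclient API may look different:

```python
import asyncio

import httpx

# Placeholder URL: wherever the cTAKES REST service is listening.
CTAKES_URL = "http://localhost:8080/ctakes-web-rest/service/analyze"


async def extract_one(client: httpx.AsyncClient, note: str) -> dict:
    response = await client.post(CTAKES_URL, content=note)
    response.raise_for_status()
    return response.json()


async def extract_many(notes: list[str], limit: int = 8) -> list[dict]:
    # Cap the number of in-flight requests so we don't overwhelm the server.
    semaphore = asyncio.Semaphore(limit)

    async with httpx.AsyncClient(timeout=300) as client:
        async def bounded(note: str) -> dict:
            async with semaphore:
                return await extract_one(client, note)

        return await asyncio.gather(*(bounded(note) for note in notes))


# results = asyncio.run(extract_many(physician_notes))
```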

@mikix
Contributor Author

mikix commented Feb 2, 2023

Another idea: breakout ticket #164 to do bulk download requests in parallel.
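
A rough sketch of that idea, assuming the bulk export status response has already given us a list of ndjson output URLs (the URL handling and file naming here are illustrative only):

```python
import asyncio

import httpx


async def download_all(output_urls: list[str], token: str, dest_dir: str) -> None:
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/fhir+ndjson"}

    async with httpx.AsyncClient(timeout=None) as client:
        async def download(index: int, url: str) -> None:
            # Stream each export file to disk instead of holding it in memory.
            async with client.stream("GET", url, headers=headers) as response:
                response.raise_for_status()
                with open(f"{dest_dir}/export.{index:04}.ndjson", "wb") as file:
                    async for chunk in response.aiter_bytes():
                        file.write(chunk)

        # Fetch every output file concurrently.
        await asyncio.gather(*(download(i, url) for i, url in enumerate(output_urls)))
```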
