
Investigate cumulus-etl performance #109

Open
mikix opened this issue Dec 22, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@mikix
Contributor

mikix commented Dec 22, 2022

Performance has not been a focus so far, but we should investigate for easy wins and to scope out larger tasks that would help.

From observing it run, it's mostly CPU-bound right now. The investigator should probably do some profiling to see where we are spending that time.

Some thoughts below.

Parallelizing

The ETL could probably be parallelized much more, to take advantage of the many cores typically available in cloud computing. Right now a task over a full set of data is quite slow.

Thoughts:

  • tasks could be run simultaneously (this didn't seem to speed us up as much as hoped, especially if we tried to keep memory usage constant)
  • converting i2b2 to FHIR has no inter-dependencies between rows, so rows could be converted in parallel (see the sketch below)
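
For the second bullet, here's a minimal sketch of what per-row parallelism could look like with a process pool. The `to_fhir_observation` helper and its field names are placeholders for whatever per-row conversion cumulus-etl actually does, not real project code:

```python
from concurrent.futures import ProcessPoolExecutor


def to_fhir_observation(i2b2_row: dict) -> dict:
    """Hypothetical per-row converter: an i2b2 observation_fact row to a FHIR dict."""
    return {
        "resourceType": "Observation",
        "subject": {"reference": f"Patient/{i2b2_row['patient_num']}"},
        "code": {"coding": [{"code": i2b2_row["concept_cd"]}]},
    }


def convert_rows(rows: list[dict]) -> list[dict]:
    # Rows have no inter-dependencies, so they can be fanned out across cores.
    # chunksize keeps per-task process overhead low for large inputs.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(to_fhir_observation, rows, chunksize=1000))
```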

I2b2 Conversion

Hopefully this isn't a huge factor going forward, but it's possible that creating all the FHIR objects and validating them is slowing us down.

This was true! We took out the FHIR object creation and got ~50% faster.

cTAKES

Look into improvements that Andy has for a rewritten cTAKES engine.

@mikix added the "enhancement" (New feature or request) label Dec 22, 2022
@mikix changed the title from "Make cumulus-etl more parallizable" to "Investigate cumulus-etl performance" Dec 23, 2022
@mikix
Contributor Author

mikix commented Jan 20, 2023

I started looking at where we spend time in the code, but realized that even if I optimize it, it's pointless if we're still just using one core. So that's an obvious first step: using multiple cores for tasks. Breakout issue: #154

@mikix
Contributor Author

mikix commented Jan 26, 2023

While investigating, I discovered that ever since we started using the MS de-id tool, we don't really need the internal validation provided by fhirclient. By skipping that de-serialization and re-serialization, the ETL now takes about 30% of the time it used to (for the core CPU-bound tables).

PR here: #157
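
For context, the change boils down to passing resource dicts straight through instead of round-tripping them through fhirclient models. Roughly (a sketch of the idea, not the actual PR code):

```python
from fhirclient.models.patient import Patient


def process_resource_old(resource: dict) -> dict:
    # Old path (sketch): deserialize into a fhirclient model, which validates,
    # then serialize right back out. Pure overhead once the MS de-id tool
    # already guarantees well-formed resources.
    return Patient(resource).as_json()


def process_resource_new(resource: dict) -> dict:
    # New path (sketch): let the dict flow through untouched.
    return resource
```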

This is such a change in our performance profile that I'm going to do more timing testing. But it no longer seems as clear that using a single core is our biggest issue. The biggest consumer of wall-clock time now seems to be Delta Lake, which does use multiple cores.

So the ETL is reading data as fast as it can and shipping it to Delta Lake, which uses multiple cores. In that sense we are already multi-processing. But it's still worth investigating whether more concurrency can improve things further.

@mikix
Contributor Author

mikix commented Jan 26, 2023

Just landed #158, which does the same fhirclient purge, but for i2b2.

@mikix
Contributor Author

mikix commented Feb 2, 2023

I finished up some investigation into multi-threading in #154 and came to the conclusion that it's not worth it right now (See #154 (comment)).

The most likely next win, I think, is sending multiple requests to cTAKES at once. My current thinking is that we could change ctakesclient to use asyncio and then leverage that in cumulus to send multiple requests and wait on them (see the sketch below). But I have not done any testing there.
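
To make the idea concrete, here's a rough sketch of firing several cTAKES requests concurrently with asyncio. The endpoint URL, payload shape, and httpx usage are all assumptions for illustration; the real ctakesclient API may look different:

```python
import asyncio

import httpx

# Placeholder URL: wherever the cTAKES REST service is listening.
CTAKES_URL = "http://localhost:8080/ctakes-web-rest/service/analyze"


async def extract_one(client: httpx.AsyncClient, note: str) -> dict:
    response = await client.post(CTAKES_URL, content=note)
    response.raise_for_status()
    return response.json()


async def extract_many(notes: list[str], limit: int = 8) -> list[dict]:
    # Cap the number of in-flight requests so we don't overwhelm the server.
    semaphore = asyncio.Semaphore(limit)

    async with httpx.AsyncClient(timeout=300) as client:
        async def bounded(note: str) -> dict:
            async with semaphore:
                return await extract_one(client, note)

        return await asyncio.gather(*(bounded(note) for note in notes))


# results = asyncio.run(extract_many(physician_notes))
```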

@mikix
Contributor Author

mikix commented Feb 2, 2023

Another idea: breakout ticket #164 to do bulk download requests in parallel.
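
A rough sketch of that idea, assuming the bulk export status response has already given us a list of ndjson output URLs (the URL handling and file naming here are illustrative only):

```python
import asyncio

import httpx


async def download_all(output_urls: list[str], token: str, dest_dir: str) -> None:
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/fhir+ndjson"}

    async with httpx.AsyncClient(timeout=None) as client:
        async def download(index: int, url: str) -> None:
            # Stream each export file to disk instead of holding it in memory.
            async with client.stream("GET", url, headers=headers) as response:
                response.raise_for_status()
                with open(f"{dest_dir}/export.{index:04}.ndjson", "wb") as file:
                    async for chunk in response.aiter_bytes():
                        file.write(chunk)

        # Fetch every output file concurrently.
        await asyncio.gather(*(download(i, url) for i, url in enumerate(output_urls)))
```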
