-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate cumulus-etl performance #109
Comments
I started looking at where we spend time in the code, but realized that even if I optimize it, it's pointless if we're still just using one core. So that's an obvious first step: using multiple cores for tasks. Breakout issue: #154 |
While investigating, I discovered that ever since we started using the MS de-id tool, we don't really need the internal validation provided by PR here: #157 This is such a change in our performance profile, I'm going to do more timing testing. But it no longer seems as clear that using a single-core is our biggest issue. The biggest consumer of wall clock time seems to now be Delta Lake, which does use multiple cores. So ETL is reading data as fast as it can, and shipping that to Delta Lake, which uses multiple cores. So we are kind of multi-processing already. But still worth investigating if concurrency can improve us further. |
Just landed #158, which does the same fhirclient purge, but for i2b2. |
I finished up some investigation into multi-threading in #154 and came to the conclusion that it's not worth it right now (See #154 (comment)). The next win that I think is most likely is sending multiple requests to cTAKES at once. My current thinking is that we could change |
Another idea: breakout ticket #164 to do bulk download requests in parallel. |
Performance has not been a focus so far, but we should investigate for easy wins and to scope out larger tasks that would help.
From observing it run, it's mostly CPU-bound right now. The investigator should probably do some profiling to see where we are spending that time.
Some thoughts below.
Parallelizing
It can probably be a lot more parallelized, to take advantage of the many cores frequently available in cloud computing. Right now a task on a full set of data is quite slow.
Thoughts:
tasks could be run simultaneously(this didn't seem to speed us up as much as hoped, especially if we tried to keep to a constant memory usage)I2b2 Conversion
Hopefully this isn't a huge factor going forward, but it's possible that creating all the FHIR objects and validating them is slowing us down.This was true! We took out the FHIR object creation and got ~50% faster.
cTAKES
Look into improvements that Andy has for a rewritten cTAKES engine.
The text was updated successfully, but these errors were encountered: