-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-24.2.3-rc: changefeedccl: fix memory leak in cloud storage sink with fast gzip #130625
release-24.2.3-rc: changefeedccl: fix memory leak in cloud storage sink with fast gzip #130625
Conversation
This commit adds a new changefeed testing knob, AsyncFlushSync, which can be used to introduce a synchronization point between goroutines during an async flush. It's currently only used in the cloud storage sink. Epic: none Release note: none
Adds a test that reproduces a memory leak from pgzip, the library used for fast gzip compression for changefeeds using cloud storage sinks. The leak was caused by a race condition between Flush/flushTopicVerions and the async flusher: if the Flush clears files before the async flusher closes the compression codec as part of flushing the files, and the flush returns an error, the compression codec will not be closed properly. This test uses the AsyncFlushSync testing knob to introduce synchronization points between these two goroutines to trigger the regression. Co-authored by: wenyihu6 Epic: none Release note: none
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
b93ae37
to
0be2878
Compare
When using the cloud storage sink with fast gzip and async flush enabled, changefeeds could leak memory from the pgzip library if a write error to the sink occurred. This was due to a race condition when flushing, if the goroutine initiating the flush cleared the files before the async flusher had cleaned up the compression codec and received the error from the sink. This fix clears the files after waiting for the async flusher to finish flushing the files, so that if an error occurs the files can be closed when the sink is closed. Co-authored by: wenyihu6 Epic: none Fixes: cockroachdb#129947 Release note(bug fix): Fixes a potential memory leak in changefeeds using a cloud storage sink. The memory leak could occur if both changefeed.fast_gzip.enabled and changefeed.cloudstorage.async_flush.enabled are true and the changefeed received an error while attempting to write to the cloud storage sink.
0be2878
to
1ef47a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preemptive lgtm
TFTR! I confirmed that the data race seen in tests is pre-existing, see #130651 (comment). I added a skip race to the backport. |
4764661
into
cockroachdb:release-24.2.3-rc
Backport 3/3 commits from #130204.
/cc @cockroachdb/release
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine calling Flush cleared the files before the
async flusher had cleaned up the compression codec and received the
error from the sink.
This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.
See individual commits for more info.
Co-authored by: wenyihu6
Epic: none
Fixes: #129947
Release note(bug fix): Fixes a potential memory leak in changefeeds using a
cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
Release justification: Fixes a bug that could cause OOMs in changefeed cloud storage sinks.