ParallelIterable is deadlocking and is generally really complicated #11768
Labels: bug
@alexjo2144 had a fix that tries to work around this bug (trinodb/trino#23321), but it only mitigates the effects rather than fixing the core issue.
sopel39 added a commit to sopel39/iceberg that referenced this issue on Dec 13, 2024:
It was observed that in high-concurrency / high-workload scenarios the cluster deadlocks because manifest readers wait for a connection from the S3 pool. Specifically, ManifestGroup#plan will create a ManifestReader for every ParallelIterable.Task, and these readers effectively hold onto an S3 connection from the pool. When the ParallelIterable queue is full, the Task is tabled for later use.

Consider this scenario:

S3 connection pool size = 1
approximateMaxQueueSize = 1
workerPoolSize = 1

ParallelIterable1: starts TaskP1
ParallelIterable1: TaskP1 produces a result, the queue becomes full, TaskP1 is put on hold (still holding an S3 connection)
ParallelIterable2: starts TaskP2; TaskP2 is scheduled on the workerPool but is blocked on the S3 connection pool
ParallelIterable1: the result is consumed, TaskP1 is scheduled again
ParallelIterable1: TaskP1 waits for the workerPool to become free, but TaskP2 is waiting for TaskP1 to release its connection

The fix makes sure a Task is finished once it has started. This way limited resources like the connection pool are not put on hold. Queue size might exceed the strict limit, but it should still be bounded.

Fixes apache#11768
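For intuition on why the queue stays bounded under this change (rough reasoning of my own, not taken from the patch, and assuming the soft-limit check happens once per task before its reader is opened, as proposed later in this issue): at most `workerPoolSize` tasks can be past that check at any moment, so roughly `queueSize <= approximateMaxQueueSize + workerPoolSize * maxEntriesPerReader`, where `maxEntriesPerReader` (an illustrative name) is the largest number of entries a single reader can emit.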
Apache Iceberg version: 1.7.1 (latest release)
Query engine: Trino
Please describe the bug 🐞
The `ParallelIterable` implementation is really complicated and has subtle concurrency bugs.

Context #1
It was observed that in high-concurrency / high-workload scenarios cluster concurrency is reduced to 0 or 1 due to S3 `Timeout waiting for connection from pool` errors. Once that starts happening, it continues, effectively making the cluster unusable.

Context #2
`ManifestGroup#plan` will create a `ManifestReader` for every `ParallelIterable.Task`. These readers effectively hold onto an S3 connection from the pool. When the `ParallelIterable` queue is full, a `Task` is tabled for later use. The number of tasks is not bounded by the worker pool size, but rather by `X = num(ParallelIterable instances) * size(ParallelIterator#taskFutures)`. One can see that `X` can be significant with a high number of concurrent queries.
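For a sense of scale (illustrative numbers only, not taken from the report): with 50 concurrent queries, each planning 2 `ParallelIterable` instances that each keep 8 task futures, `X = 50 * 2 * 8 = 800` tasks may exist at once, each potentially holding an S3 connection, which dwarfs any reasonable connection pool size.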
Issue #1

`ParallelIterable` is not batch based. This means it produces read-ahead results even if the downstream consumer doesn't have slots for them. This can lead to subtle concurrency issues. For instance, consider two parallel iterables `P1` and `P2`, and assume a single-threaded reader consumes 500 elements from `P1`, then `P2`, then `P1`, and so on (these could be splits, for instance). If `P1` becomes full, it will no longer fetch more elements while holding onto its tasks (which in turn hold S3 connections). This prevents the fetching of tasks from `P2` from completing (because there are no "free" S3 slots).

Consider the following scenario:
S3 connection pool size=1
approximateMaxQueueSize=1
workerPoolSize=1
P1: starts `TaskP1`
P1: `TaskP1` produces a result, the queue becomes full, `TaskP1` is put on hold (still holding its S3 connection)
P2: starts `TaskP2`; `TaskP2` is scheduled on `workerPool` but is blocked on the S3 connection pool
P1: the result is consumed, `TaskP1` is scheduled again
P1: `TaskP1` waits for `workerPool` to become free, but `TaskP2` is waiting for `TaskP1` to release its connection
DEADLOCK
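The same circular wait can be modeled with plain JDK primitives. The sketch below is purely illustrative (a hypothetical class, not Iceberg code): one semaphore permit stands in for the S3 connection pool and a single-thread executor stands in for `workerPool`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class DeadlockSketch {
  public static void main(String[] args) throws InterruptedException {
    Semaphore s3Pool = new Semaphore(1);                          // S3 connection pool size = 1
    ExecutorService workerPool = Executors.newFixedThreadPool(1); // workerPoolSize = 1

    // TaskP1 has produced a result, the queue is full, and it is put on hold,
    // but it keeps its connection permit.
    s3Pool.acquire();

    // TaskP2 occupies the only worker thread and blocks on the connection pool.
    workerPool.submit(() -> {
      try {
        s3Pool.acquire(); // never succeeds
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // TaskP1 is rescheduled, but it is queued behind TaskP2, which is itself
    // waiting for TaskP1's permit: neither can make progress.
    workerPool.submit(s3Pool::release);

    Thread.sleep(1000);
    System.out.println("available permits: " + s3Pool.availablePermits()); // prints 0
    workerPool.shutdownNow();
  }
}
```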
Issue #2
Active waiting. This one is known. However, if one looks at `ParallelIterable.ParallelIterator#checkTasks`, there is a queue-fullness check that puts tasks on hold and resubmits them (see the sketch below), which means active waiting actually happens through `workerPool` (e.g. a task is started on the worker pool just to check that the queue is full and that it should be put on hold again).
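A sketch of the yield-on-full-queue pattern being described, with hypothetical names rather than the verbatim Iceberg source; the point is that a task can stop mid-iteration while its reader (and its S3 connection) stays open, and it then has to be resubmitted to `workerPool` just to re-check the queue.

```java
import java.util.Iterator;
import java.util.Queue;

// Sketch of the yield-on-full-queue pattern (hypothetical names; not the
// verbatim Iceberg source).
class YieldingTask<T> {
  private final Iterator<T> iterator;       // backed by an open reader holding an S3 connection
  private final Queue<T> queue;             // shared output queue
  private final int approximateMaxQueueSize;

  YieldingTask(Iterator<T> iterator, Queue<T> queue, int approximateMaxQueueSize) {
    this.iterator = iterator;
    this.queue = queue;
    this.approximateMaxQueueSize = approximateMaxQueueSize;
  }

  // Returns true when the task is fully consumed; false when it yields.
  boolean run() {
    while (iterator.hasNext()) {
      if (queue.size() >= approximateMaxQueueSize) {
        // Yield: the task is resubmitted to workerPool later just to re-check
        // the queue, while the open reader (and its connection) stays held.
        return false;
      }
      queue.add(iterator.next());
    }
    return true;
  }
}
```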
Short term fix?

Once a `ParallelIterable.Task` is started, it should continue until the entire task is consumed. This will prevent putting limited resources on hold. The `if (queue.size() >= approximateMaxQueueSize)` check should only happen once per task, before the iterator is created.
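A minimal sketch of that shape of fix, again with hypothetical names rather than the actual patch: the check runs once before the reader is opened, and a started task drains its iterator to completion.

```java
import java.util.Iterator;
import java.util.Queue;

// Sketch of the proposed short-term fix (hypothetical names; not the actual
// patch): the soft limit is checked once, before the reader is opened, and a
// started task always drains its iterator, so no parked task keeps a
// connection open.
class RunToCompletionTask<T> implements Runnable {
  private final Iterable<T> source;           // e.g. a reader-backed iterable holding an S3 connection while open
  private final Queue<T> queue;               // shared, approximately bounded output queue
  private final int approximateMaxQueueSize;

  RunToCompletionTask(Iterable<T> source, Queue<T> queue, int approximateMaxQueueSize) {
    this.source = source;
    this.queue = queue;
    this.approximateMaxQueueSize = approximateMaxQueueSize;
  }

  @Override
  public void run() {
    // Single check per task, before any resource is acquired.
    if (queue.size() >= approximateMaxQueueSize) {
      return; // can be resubmitted later; nothing is held in the meantime
    }

    // Past the check, run to completion: the queue may temporarily exceed the
    // soft limit, but the connection is released as soon as the task finishes.
    Iterator<T> iterator = source.iterator();
    while (iterator.hasNext()) {
      queue.add(iterator.next());
    }
  }
}
```

The trade-off is the one noted in the commit message above: the queue can temporarily overshoot the soft limit, but it stays bounded.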
Long term fix?

Perhaps the code can be refactored to be more readable and streamlined?
cc @findepi @raunaqmorarka
Willingness to contribute