Fix bmalloc hang with RT thread priorities #1408

Open · wants to merge 1 commit into base: wpe-2.38

Conversation


@filipe-norte-red filipe-norte-red commented Sep 24, 2024

Description:

When real time (RT) thread priorities are used for some of the gstreamer pipeline elements, we may run into a situation where several RT threads start spinning during a mutex acquisition process, leading to a system hang as most other threads won't be able to run.

Sequence of events leading up to the hang:

  1. A web process thread (non-RT priority) acquires the mutex lock for the heap, is then involuntarily descheduled, and does not run again
  2. vqueue:src (RT priority) enters the lockSlowCase and starts spinning in the while loop
  3. multiqueue0:src (instance 1, RT priority) enters the lockSlowCase and starts spinning in the while loop
  4. aqueue:src (RT priority) enters the lockSlowCase and starts spinning in the while loop
  5. multiqueue0:src (instance 2, RT priority) enters the lockSlowCase and starts spinning in the while loop

Once stage 5 is hit, the box is hung, as the only things that can run on a CPU core are:

  1. one of the above RT threads (aqueue, vqueue, or multiqueue)
  2. any other RT thread with a priority equal to or greater than that of the above RT threads
  3. any h/w irq

The use of usleep() allows the low-priority thread to run and release the mutex lock, avoiding the hang.

Author of issue analysis and fix proposal: Steven Webster

Proposed fix / analysis summary:

The proposed fix is to replace the sched_yield() call with a usleep() call. This will guarantee that the calling thread will deschedule for the specified time period, allowing the low priority thread to run and release the mutex lock, avoiding the hang.
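
For illustration, a minimal sketch of the change under discussion is shown below; the function signature and lock representation are simplified assumptions, not the actual bmalloc source:

```cpp
// Simplified sketch of a lockSlowCase()-style spin loop (illustrative only,
// not the real bmalloc implementation).
#include <atomic>
#include <unistd.h> // usleep

void lockSlowCase(std::atomic_flag& lockBit)
{
    // Keep retrying until the lock is observed free and acquired.
    while (lockBit.test_and_set(std::memory_order_acquire)) {
        // Previously: sched_yield(). With RT waiters and a non-RT holder on
        // the same core, an RT waiter is put straight back on the CPU and
        // the holder never gets to run.
        usleep(150); // removes the waiter from the runqueue for ~150us,
                     // letting the non-RT lock holder run and unlock
    }
}
```

Unlike sched_yield(), which lets an RT waiter be rescheduled immediately when no higher-priority task is runnable, usleep() takes the waiter off the runqueue for the full interval, so the lower-priority lock holder can make progress.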

This fix also has the benefit of reducing the CPU usage of the threads that enter the tight while() loop in lockSlowCase() and spin waiting for the mutex to be released.

An example of how much CPU runtime can be saved can be seen by comparing the kernelshark screenshots. The table below shows the actual thread execution time as a percentage of overall runtime:

| Description   | Runtime (us) | Execution time (us) | # loop iterations | % execution time of runtime |
|---------------|--------------|---------------------|-------------------|-----------------------------|
| sched_yield() | 4538         | 4538                | 1128              | 100                         |
| usleep(125)   | 4230         | 172                 | 27                | 4.089                       |
| usleep(150)   | 5830         | 223                 | 27                | 3.818                       |

The choice of the usleep value is a tradeoff between lowering the % execution time of runtime and having the usleep time exceed the time lockSlowCase() would normally run for.

Two values of usleep were measured:

  1. 125us – the overhead (the additional time over 125us that the syscall takes due to setup/latency, etc.) for this value is 22us, or 17.5% of the usleep time. The measured % of calls to lockSlowCase() where the runtime is < 125us is 26.6%
  2. 150us – the overhead for this value is 21us, or 14% of the usleep time. The measured % of calls to lockSlowCase() where the runtime is < 150us is 28.3%

NOTE: the % of calls to lockSlowCase() that were shorter than either of the usleep values was measured over a 30 min period, so the value could move up/down as the measurement period is increased.
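
As a point of reference, below is a minimal sketch of one way the usleep() overhead (the extra time beyond the requested interval) could be measured; this is an assumption about methodology, the figures above were obtained from kernelshark traces:

```cpp
// Measures the average usleep() overhead: observed sleep minus requested sleep.
#include <cstdio>
#include <time.h>
#include <unistd.h>

int main()
{
    constexpr useconds_t requestedUs = 150; // candidate sleep value
    constexpr int iterations = 1000;
    long long totalUs = 0;

    for (int i = 0; i < iterations; ++i) {
        timespec start {}, end {};
        clock_gettime(CLOCK_MONOTONIC, &start);
        usleep(requestedUs);
        clock_gettime(CLOCK_MONOTONIC, &end);
        totalUs += (end.tv_sec - start.tv_sec) * 1000000LL
                 + (end.tv_nsec - start.tv_nsec) / 1000;
    }

    printf("average overhead: %lld us\n", totalUs / iterations - requestedUs);
    return 0;
}
```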

The recommended usleep time is 150us, as this gives a lower overhead ratio and a lower execution-to-runtime ratio, for only a small increase in the number of times a thread may block for longer than it would originally have run.

Reproduction

To reproduce this issue, the easiest way is to patch gstreamer to set RT priority for created threads. The attached patch can be used for this purpose: gstreamer-priority.zip

To enable RT thread priorities, define the environment variable "FJN_T". To disable RT thread priorities, remove the "FJN_T" environment variable.
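
For context, the sketch below shows the kind of call such a patch would make on a streaming thread; the SCHED_RR policy and the priority value are assumptions, only the FJN_T gate comes from the description above:

```cpp
// Illustrative only; the attached gstreamer-priority.zip patch is the actual
// reproduction aid.
#include <cstdlib>
#include <pthread.h>
#include <sched.h>

static void maybeSetRealtimePriority()
{
    if (!std::getenv("FJN_T")) // RT priorities only when FJN_T is defined
        return;

    sched_param param {};
    param.sched_priority = 10; // example value, not taken from the patch
    // Give the calling (streaming) thread a real-time round-robin policy.
    pthread_setschedparam(pthread_self(), SCHED_RR, &param);
}
```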

The attached index.zip contains an HTML page that plays videos in a loop. Serve this file on a web server and open a browser instance at the corresponding URL with the above-mentioned env var defined. The issue should be reproducible in 10-30 min.

Internal Reference: LLAMA-15112

@filipe-norte-red filipe-norte-red marked this pull request as ready for review September 24, 2024 11:17
@filipe-norte-red filipe-norte-red force-pushed the wpe-2.38-fix-bmalloc-hang-with-rt-thread-priorities branch from 9ac2855 to a170a0a on September 25, 2024 11:49
@eocanha eocanha added the upstream (Related to an upstream bug (or should be at some point)) label on Oct 25, 2024