Fix bmalloc hang with RT thread priorities #1408
Open
Description:
When real-time (RT) thread priorities are used for some of the GStreamer pipeline elements, several RT threads can start spinning while trying to acquire a mutex. This leads to a system hang, as most other threads are no longer able to run.
Sequence of events leading up to the hang:
Once stage 5 is hit, the box is hung as the only thing that can run on a CPU core is:
Using usleep() allows the low-priority thread to run and release the mutex lock, avoiding the hang.
Author of issue analysis and fix proposal: Steven Webster
Proposed fix / analysis summary:
The proposed fix is to replace the sched_yield() call with a usleep() call. This guarantees that the calling thread is descheduled for the specified time period, allowing the low-priority thread to run and release the mutex lock, avoiding the hang.
This fix also reduces the CPU usage of threads that enter the tight while() loop in lockSlowCase() and spin waiting for the mutex to be released.
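The idea can be illustrated with a minimal sketch. This is not the bmalloc source: `SketchMutex`, `kSleepMicroseconds`, and `runContendedCounter` are hypothetical names introduced only for illustration, and the real lockSlowCase() may differ in detail. The sketch assumes a simple test-and-set lock whose slow path sleeps instead of calling sched_yield(). Under a real-time policy such as SCHED_FIFO, sched_yield() only hands the CPU to runnable threads of the same priority, so a lower-priority lock holder never gets to run; usleep() blocks the caller and lets it run.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <unistd.h>

// Hypothetical stand-in for a spin-then-wait mutex; not the actual bmalloc code.
class SketchMutex {
public:
    void lock()
    {
        // Fast path: the lock was free and we took it.
        if (!m_isLocked.exchange(true, std::memory_order_acquire))
            return;
        lockSlowCase();
    }

    void unlock() { m_isLocked.store(false, std::memory_order_release); }

private:
    void lockSlowCase()
    {
        while (m_isLocked.exchange(true, std::memory_order_acquire)) {
            // Sleeping (rather than sched_yield()) blocks this thread, so a
            // lower-priority lock holder can be scheduled and release the lock.
            usleep(kSleepMicroseconds);
        }
    }

    // Assumed value, taken from the 150us recommendation in this description.
    static constexpr useconds_t kSleepMicroseconds = 150;
    std::atomic<bool> m_isLocked { false };
};

// Small driver: several threads increment a shared counter under the lock.
long runContendedCounter(int threadCount, int iterationsPerThread)
{
    SketchMutex mutex;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterationsPerThread; ++i) {
                mutex.lock();
                ++counter;
                mutex.unlock();
            }
        });
    }
    for (auto& thread : threads)
        thread.join();
    return counter;
}
```

Calling `runContendedCounter(4, 500)` exercises both the fast path and the sleeping slow path. The driver only checks mutual exclusion; it does not reproduce the RT priority inversion itself.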
An example of how much CPU runtime can be saved is seen by comparing the kernelshark screenshots. The table below shows the actual thread execution time as a percentage of overall runtime:
The choice of the usleep value is a tradeoff: a longer sleep lowers the execution time as a percentage of runtime, but makes it more likely that the sleep is longer than lockSlowCase() would normally have run for.
Two values of usleep were measured:
NOTE: the percentage of calls to lockSlowCase() that were shorter than each of the usleep values was measured over a 30-minute period, so the figures could move up or down as the measurement period is increased.
The recommended usleep time is 150us, as it gives a lower overhead ratio and a lower execution-to-runtime ratio in exchange for a small increase in the number of times a thread may block for longer than it would originally have run.
Reproduction
The easiest way to reproduce this issue is to patch GStreamer to set RT priority on the threads it creates. The attached patch can be used for this purpose: gstreamer-priority.zip
To enable RT thread priorities, define the environment variable "FJN_T". To disable RT thread priorities, remove the "FJN_T" environment variable.
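The attached patch is not reproduced here. As a rough illustration of what an environment-variable gate for RT priority could look like, the sketch below checks whether "FJN_T" is defined and, if so, requests the SCHED_FIFO policy for the calling thread. The function names `rtPriorityRequested` and `applyRtPriorityIfRequested` are hypothetical, and the choice of SCHED_FIFO at its minimum priority is an assumption, not something taken from the patch.

```cpp
#include <cstdlib>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: RT priority is requested when FJN_T is defined,
// regardless of its value.
bool rtPriorityRequested()
{
    return std::getenv("FJN_T") != nullptr;
}

// Hypothetical helper: switches the calling thread to SCHED_FIFO when
// requested. Returns 0 on success or when nothing was requested, otherwise
// the error code from pthread_setschedparam() (for example EPERM when the
// process lacks the privilege to use a real-time policy).
int applyRtPriorityIfRequested()
{
    if (!rtPriorityRequested())
        return 0;

    sched_param param {};
    // Assumed priority level; the actual patch may use a different value.
    param.sched_priority = sched_get_priority_min(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}
```

A thread entry point would call `applyRtPriorityIfRequested()` once at startup and log a non-zero return value rather than treating it as fatal.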
The attached index.zip contains an HTML page that plays videos in a loop. Serve this file from a web server and open a browser instance on the corresponding URL with the above-mentioned environment variable defined. The issue should be reproducible in 10-30 min.
Internal Reference: LLAMA-15112