worker: fix deadlock when LoggingThread wrote into its own Queue #257
Conversation
Such failures were seen with …
I'm a little confused about exactly what conditions this occurs under. I don't suppose you have a backtrace from the deadlock case, i.e. how exactly did we arrive at the deadlock?

I'm asking because the first thing that comes to mind is that there's a logger attached to `sys.stdout` or `sys.stderr` which feeds back into the `LoggingThread`, so if such a logger is activated from within the `LoggingThread` it can get stuck waiting for space in the queue forever. But in this change, you've also added a logging statement in this case.

But if so, will that lead to infinite recursion? It seems like that log statement can bring us back into the `LoggingThread` again.
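As a rough illustration of the scenario described above, here is a minimal sketch with hypothetical names (`QueueWriter`, `upload`, `drain`), not the actual kobo classes: once `sys.stderr` is replaced by a proxy feeding a bounded queue, any write performed from inside the thread that drains that queue can block forever in `Queue.put()`.

```python
import sys
import threading
from queue import Queue


class QueueWriter:
    """File-like proxy that forwards every write into a bounded queue."""

    def __init__(self, queue):
        self._queue = queue

    def write(self, data):
        # Blocks when the queue is full -- harmless for ordinary producer
        # threads, fatal when the consumer thread itself ends up here.
        self._queue.put(data)

    def flush(self):
        pass


def upload(chunk):
    # Stand-in for an upload call: imagine it fails and reports the error
    # to sys.stderr, which is now the QueueWriter proxy.
    print("upload of %d bytes failed, retrying" % len(chunk), file=sys.stderr)


def drain(queue):
    # Consumer loop running in the logging thread.
    while True:
        chunk = queue.get()
        upload(chunk)   # re-enters QueueWriter.write() -> Queue.put(),
                        # which blocks forever once the queue is full


log_queue = Queue(maxsize=2)
sys.stderr = QueueWriter(log_queue)
threading.Thread(target=drain, args=(log_queue,), daemon=True).start()
```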
@rohanpm Good point. Here is the backtrace from the deadlock case:

```
Traceback (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 312, in wait
    waiter.acquire()
  File "/usr/lib64/python3.9/queue.py", line 140, in put
    self.not_full.wait()
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 96, in write
    self._queue.put(data)
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 119, in write
    self._thread.write(data)
  <built-in method print of module object at remote 0x7f8697bcfae0>
  File "/usr/lib/python3.9/site-packages/kobo/xmlrpc.py", line 487, in request
    print("XML-RPC connection to %s failed: %s, %s" % (args[0], ex.args[1:], retries), file=sys.stderr)
  File "/usr/lib64/python3.9/xmlrpc/client.py", line 1464, in __request
    response = self.__transport.request(
  File "/usr/lib64/python3.9/xmlrpc/client.py", line 1122, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.9/site-packages/kobo/client/__init__.py", line 510, in upload_task_log
    self._hub.worker.upload_task_log(task_id, remote_file_name, mode, chunk_start, chunk_len, chunk_checksum, encoded_chunk)
  File "/usr/lib/python3.9/site-packages/kobo/worker/logger.py", line 69, in run
    self._hub.upload_task_log(BytesIO(self._send_data), self._task_id, "stdout.log", append=True)
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
```
If `self._hub.upload_task_log()` called `self._queue.put()`, it would cause a deadlock because:

1. `self._queue` uses locks that are not reentrant.
2. `put()` will block if the `Queue` is already full.

Co-authored-by: Kamil Dudka <[email protected]>
Co-authored-by: Lukáš Zaoral <[email protected]>
Force-pushed from 2072322 to bc3df3e.
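Both reasons listed in the commit message can be seen with nothing but the standard library's `queue` module; the snippet below only demonstrates `queue.Queue` semantics and is not code from this patch.

```python
import threading
from queue import Queue

q = Queue(maxsize=1)

# (1) The queue's internal mutex is a plain, non-reentrant lock, so code
#     that already holds it must not call back into the queue.
assert isinstance(q.mutex, type(threading.Lock()))

# (2) put() blocks once maxsize is reached; if the only consumer is the very
#     thread doing the put(), no slot will ever be freed again.
q.put("first chunk")      # fills the queue
# q.put("second chunk")   # would block forever in not_full.wait()
```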
@rohanpm I've amended the commit to avoid infinite recursion when the logger has a handler attached to `sys.stdout` or `sys.stderr`. Note that it is not the case of the backtrace above, which comes from a plain `print()` call rather than a logger.
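For illustration, here is a minimal sketch of a `write()` that avoids both the deadlock and the recursion discussed above, assuming the fix (a) detects writes coming from the logging thread itself and (b) guards against re-entrant log records. `GuardedWriter`, the logger name and the timeout are hypothetical; the actual patch may be implemented differently.

```python
import logging
import sys
import threading
from queue import Full

logger = logging.getLogger("kobo.worker.logger")   # hypothetical logger name


class GuardedWriter:
    """File-like proxy that avoids blocking on, or recursing into, itself."""

    def __init__(self, queue, logging_thread):
        self._queue = queue
        self._thread = logging_thread           # the thread draining the queue
        self._in_write = threading.local()      # per-thread re-entrancy flag

    def write(self, data):
        if threading.current_thread() is self._thread:
            # Already inside the logging thread: a put() into our own queue
            # could block forever, so fall back to the real stderr instead.
            sys.__stderr__.write(data)
            return
        if getattr(self._in_write, "active", False):
            # A log record emitted from within write() came back here;
            # drop it rather than recurse without bound.
            return
        try:
            self._queue.put(data, timeout=30)
        except Full:
            self._in_write.active = True
            try:
                # Losing one log line is better than deadlocking the worker.
                logger.warning("log queue full, dropped %d bytes", len(data))
            finally:
                self._in_write.active = False
```

The essential point either way is that the thread draining the queue must never block on its own queue, and any diagnostic logging performed inside `write()` must be prevented from re-entering `write()`.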