Skip to content

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Aug 28, 2025

This is a first step towards ContextVar based LoggingContext. This PR only goes as far as to store the LoggingContext in a ContextVar instead of thread-local. But this still gives us the benefit of being able to remove the painful log context rule complexity around needing to make sure the thread-local is set correctly as awaitables are suspended and resumed in the Twisted reactor.

Part of #10342 (previously matrix-org/synapse#10342)

This is purely based on @sandhose's branch which I've just picked up, kicked the tires, and brought forward to propose and merge.

This is spawning from adding server_name to the LoggingContext and finding that we use the sentinel LoggingContext in many places (which means the server_name isn't tracked in those places). After removing the sentinel LoggingContext from a few places, it uncovered some places where we don't seem to be following the log context rules so things are getting messed up. Instead of trying to adapt a bunch of tricky areas to follow the rules, I decided to just try removing the need for the log context rules and just refactor to the ContextVar based LoggingContext.

Testing strategy

  1. Run Synapse normally and with daemonize: true: poetry run synapse_homeserver --config-path homeserver.yaml
  2. Execute some requests
  3. Shutdown the server
  4. Look for any bad log entries in your homeserver logs:
    • Expected logging context sentinel but found main
    • Expected logging context main was lost
    • Expected previous context
    • utime went backwards!/stime went backwards!
    • Called stop on logcontext POST-0 without recording a start rusage

Todo

  • Update docs/log_contexts.md
  • Fix failing tests in tests/util/caches/test_descriptors.py (synapse/util/patch_inline_callbacks.py)
  • Handle log contexts across Rust
    • For example, this spot says it doesn't handle ContextVars:
      /// Creates a twisted deferred from the given future, spawning the task on the
      /// tokio runtime.
      ///
      /// Does not handle deferred cancellation or contextvars.
      fn create_deferred<'py, F, O>(

Future:

Dev notes

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.util.caches.test_descriptors

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.util.test_logcontext

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

Comment on lines +852 to 853
# TODO: This function is a no-op now and should be removed in a follow-up PR.
def make_deferred_yieldable(deferred: "defer.Deferred[T]") -> "defer.Deferred[T]":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_deferred_yieldable no longer does anything (no-op) but there a lot of references to clean-up. I think it would be better to do this in a follow-up PR than bulk up with this diff with changes that will cloud the main change we're trying to introduce.

@github-actions github-actions bot deployed to PR Documentation Preview August 28, 2025 03:48 Active
Comment on lines +62 to +64
Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
context as we want to know where the logs came from. In practice, this is not always the
case yet especially outside of request handling.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over time, we can remove PreserveLoggingContext from many scenarios that cause the sentinel context to be used.

I've already started this separately in #18870

@github-actions github-actions bot deployed to PR Documentation Preview September 2, 2025 19:08 Active
@@ -1,250 +0,0 @@
#
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, these checks provide no value to us anymore. We don't have specific log rules to worry about anymore and the ContextVar properly follows the context regardless.

self.assertEqual(
current_context(),
SENTINEL_CONTEXT,
c1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, these new context values make sense. We're in context c1, so current_context() should be c1.

Comment on lines -89 to -94
if bool(os.environ.get("SYNAPSE_TEST_PATCH_LOG_CONTEXTS", False)):
# We import here so that we don't have to install a bunch of deps when
# running the packaging tox test.
from synapse.util.patch_inline_callbacks import do_patch

do_patch()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #18871 (comment) for why we've removed the patch_inline_callbacks

@github-actions github-actions bot deployed to PR Documentation Preview September 2, 2025 19:29 Active
When we `daemonize`, we fork the process and cputime metrics get confused
about the per-thread resource usage appearing to go backwards because we're
comparing the resource usage (`rusage`) from the original process to the
forked process.

We now kick off the background tasks (`run_as_background_process`) after we
have forked the process so the `rusage` we record when we `start` is in the
same thread when we `stop`.

Bad log examples from before:
```
synapse.logging.context - ERROR - _schedule_next_expiry-0 - utime went backwards! 0.050467 < 0.886526
synapse.logging.context - ERROR - _schedule_db_events-0 - stime went backwards! 0.009941 < 0.155106
synapse.logging.context - ERROR - wake_destinations_needing_catchup-0 - stime went backwards! 0.010175 < 0.130923
synapse.logging.context - ERROR - resume_sync_partial_state_room-0 - utime went backwards! 0.052898 < 0.886526
```

Testing strategy:

 1. Run with `daemonize: true` in your `homeserver.yaml`
 1. `poetry run synapse_homeserver --config-path homeserver.yaml`
 1. Shutdown the server
 1. Look for any bad log entries in your homeserver logs:
    - `Expected logging context sentinel but found main`
    - `Expected logging context main was lost`
    - `utime went backwards!`/`stime went backwards!`
@github-actions github-actions bot deployed to PR Documentation Preview September 2, 2025 20:20 Active
return _current_context.get(SENTINEL_CONTEXT)


def set_current_context(context: LoggingContextOrSentinel) -> LoggingContextOrSentinel:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: The docstring needs to be updated


_thread_local = threading.local()
_thread_local.current_context = SENTINEL_CONTEXT
_current_context: ContextVar[LoggingContextOrSentinel] = ContextVar("current_context")
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error: Called stop on logcontext POST-0 without recording a start rusage

There is a problem where the POST-0 LoggingContext is somehow becoming the current context without a corresponding set_current_context(POST-0) call.

See the lines marked red in the snippet below.

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.rest.client.test_rooms.RoomStateTestCase.test_get_state_event_cancellation

_trial_temp/test.log

  2025-09-02 19:12:06-0500 [-] synapse.http.site - 304 - INFO - sentinel - asdf SynapseRequest render
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 662 - INFO - sentinel - asdf PreserveLoggingContext(POST-0).__enter__ nonce=meZAG
+ 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(POST-0) (previous=sentinel)
+ 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - POST-0 - asdf LoggingContext(POST-0).start  usage_start=True
+ 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 675 - INFO - POST-0 - asdf PreserveLoggingContext(POST-0).__exit__ nonce=meZAG restoring old_context=sentinel
+ 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - POST-0 - asdf set_current_context(sentinel) (previous=POST-0)
+ 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - POST-0 - asdf LoggingContext(POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 662 - INFO - sentinel - asdf PreserveLoggingContext(sentinel).__enter__ nonce=xQsLm
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(sentinel) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(_handle_new_device_update_async-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - _handle_new_device_update_async-0 - asdf LoggingContext(_handle_new_device_update_async-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 675 - INFO - sentinel - asdf PreserveLoggingContext(sentinel).__exit__ nonce=xQsLm restoring old_context=sentinel
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(sentinel) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-_handle_new_device_update_async-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-_handle_new_device_update_async-0 - asdf LoggingContext(db-_handle_new_device_update_async-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-_handle_new_device_update_async-0 - asdf set_current_context(sentinel) (previous=db-_handle_new_device_update_async-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-_handle_new_device_update_async-0 - asdf LoggingContext(db-_handle_new_device_update_async-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-_handle_new_device_update_async-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-_handle_new_device_update_async-0 - asdf LoggingContext(db-_handle_new_device_update_async-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-_handle_new_device_update_async-0 - asdf set_current_context(sentinel) (previous=db-_handle_new_device_update_async-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-_handle_new_device_update_async-0 - asdf LoggingContext(db-_handle_new_device_update_async-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - _handle_new_device_update_async-0 - asdf set_current_context(sentinel) (previous=_handle_new_device_update_async-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - _handle_new_device_update_async-0 - asdf LoggingContext(_handle_new_device_update_async-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - sentinel - asdf set_current_context(db-POST-0) (previous=sentinel)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 454 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).start  usage_start=True
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - db-POST-0 - asdf set_current_context(sentinel) (previous=db-POST-0)
  2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - db-POST-0 - asdf LoggingContext(db-POST-0).stop usage_start=True rusage=True
- 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 662 - INFO - POST-0 - asdf PreserveLoggingContext(sentinel).__enter__ nonce=JvLjU
- 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 723 - INFO - POST-0 - asdf set_current_context(sentinel) (previous=POST-0)
- 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 471 - INFO - POST-0 - asdf LoggingContext(POST-0).stop usage_start=False rusage=True
- 2025-09-02 19:12:06-0500 [-] synapse.logging.context - 488 - ERROR - POST-0 - asdf Called stop on logcontext POST-0 without recording a start rusage

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this is a Twisted bug?

I was under the impression that Twisted supported ContextVar's but now I'm not sure. All of the issues mentioned in matrix-org/synapse#10342 are resolved but there are other things in the Twisted tracker:

Unresolved issues:

Resolved issues:

And we could even be running into something unreported 🤷 Need to investigate more.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From some more debugging, I think the ContextVar is acting normally.

And this may just be the case that the LoggingContext start/stop pattern isn't compatible with the ContextVar we're using now. We'd have to maintain the log context rules 🤔.

In this case, it's the SynapseRequest.logcontext where stop is called because we have a PreserveLoggingContext around SynapseRequest.render which only kicks off the render and doesn't wait for it to finish so we stop way before the request is done. And it's never re-started. So when other LoggingContext utilities are used in the downstream code to set_current_context, it will stop the already stopped SynapseRequest.logcontext.

I go back and forth on whether we can update things to work correctly. If I naively try to manage the lifetime myself by calling self.logcontext.__enter__ manually in SynapseRequest.render, it still doesn't work out.

Comment on lines +604 to +608
# Register background tasks required by this server. This must be done
# somewhat manually due to the background tasks not being registered
# unless handlers are instantiated.
if hs.config.worker.run_background_tasks:
hs.start_background_tasks()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out this change to #18886 since it seems good in any case

And this PR may get stale

MadLittleMods added a commit that referenced this pull request Sep 9, 2025
Spawning from #18871

[This change](6ce2f3e)
was originally used to fix CPU time going backwards when we `daemonize`.

While, we don't seem to run into this problem on `develop`, I still
think this is a good change to make. We don't need background tasks
running on a process that will soon be forcefully exited and where the
reactor isn't even running yet. We now kick off the background tasks
(`run_as_background_process`) after we have forked the process and
started the reactor.

Also as simple note, we don't need background tasks running in both halves of a fork.
@MadLittleMods
Copy link
Contributor Author

Closing as I've decided to continue trudging in the LoggingContext logcontext mud (see #18905)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants