Disabling NR APM causes trace concatenation in Datadog #692

Open · 5 of 18 tasks
timmc-edx opened this issue Jun 18, 2024 · 4 comments
timmc-edx commented Jun 18, 2024

Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace-related issues in Datadog (DD) when first attempting to disable NR APM. We later caused the same issue in Edge by simply disabling NR Tracing.

This bug has been observed in edxapp (LMS and CMS), enterprise-catalog, and registrar. It can be identified by searching for spans matching `operation_name:django.request -@_top_level:*`.

Acceptance criteria

Things to try

  • On the Celery hypothesis:
    • Add a workaround: when middleware discovers that the root span is already finished, walk towards the root and call `finish()` on all `celery.apply` spans (see the sketch after this list).
    • Try to reproduce in devstack with a Celery broker in place -- we think tasks need to run async for a repro. May need to try it in registrar rather than edxapp, since registrar-worker already exists. Could point lms at registrar-worker. Check whether `celery.apply` spans are created, even if we don't see anomalous traces.
  • Create a forked ddtrace where `tracer.start_span` checks `child_of` for a finished span. Subject to Waffle flags, log the situation and/or set `child_of` to None.
  • Hold off on these, as we have a lead on the Celery side of things:
    • Downgrade to ddtrace 2.7.9 to see if lazy tracing is the culprit
    • Disable distributed tracing via `ddtrace.config.django['distributed_tracing_enabled'] = False`
    • Add more details to anomalous trace logging -- current span ID, parent span (ID, operation, resource -- maybe the whole thing)
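
A minimal sketch of that middleware workaround, assuming ddtrace's `tracer.current_root_span()` / `tracer.current_span()` helpers and the internal `Span._parent` link (the latter is not a public API and may change between ddtrace releases):

```python
from ddtrace import tracer

def finish_stale_celery_spans():
    """Workaround sketch: if the root span has already finished, walk
    from the current span towards the root and finish any unfinished
    celery.apply spans, so new requests can start a fresh trace."""
    root = tracer.current_root_span()
    if root is None or not root.finished:
        return  # nothing anomalous to clean up
    span = tracer.current_span()
    while span is not None:
        if span.name == "celery.apply" and not span.finished:
            span.finish()
        span = span._parent  # internal parent link (assumption)
```

In practice this would run early in a Django middleware, before the request span is created.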

Things we have already tried

These should be checked off once they have been either reverted or made permanent:

Details

When we disabled NR APM in edxapp on June 6 we observed two anomalies with traces:

  • At about 17:34, the rate of service entry spans (and traces) in `service:edx-edxapp-lms env:prod` dropped precipitously, by 2-3x.
  • At the same time, the rate of all spans increased by about 5x. (This anomaly appears to have disappeared, retroactively!)

However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.

Our current understanding is that the majority of traced Django web requests are not recorded as service entry spans, but are instead parented to a different trace (the sketch after this list illustrates the suspected mechanism). This causes several problems:

  • Monitors and dashboards that are based on traces or service entry spans are unreliable.
  • It's difficult or impossible to view most of our traces because they become concatenated into 50,000-span-long monsters that can't be loaded in the browser.
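
A deliberately contrived, same-thread ddtrace snippet illustrating the suspected mechanism: a span that is started but never finished remains the active context, so spans for later requests are parented under it and inherit its trace_id rather than starting new traces.

```python
from ddtrace import tracer

# A span that is started but never finished stays active...
leaked = tracer.trace("celery.apply")

# ...so subsequent "request" spans are parented under it and all
# share one trace, instead of each becoming a service entry span.
first = tracer.trace("django.request")
second = tracer.trace("django.request")
assert first.trace_id == leaked.trace_id
assert second.trace_id == leaked.trace_id
```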


We can also reproduce this issue by setting "Tracing type: None" in the application settings in NR (usually set to Distributed Tracing).

Links

robrap commented Jun 18, 2024

Other notes:

robrap commented Jun 21, 2024

Additional thoughts, questions, ideas:

  • We should double-check the tracing type in the distributed tracing settings (we want the updated Distributed Tracing option).
  • If we decide to try to use the NR test account, we should ensure it is safe. We can remove other users (if they are still there).
  • What does nginx set in the request headers? Could that be interfering?
  • APM Profiling caused a different problem with additional spans, but is it possible it could heal this issue without NR?
  • If it comes to this, can we find a way to replicate locally?
  • Could there be important information buried in the logs that we are no longer seeing? See Determine if Datadog trace-debug fix needs replacement #610.
  • Could the invalid traceparent log messages be related? See Ensure "received invalid w3c traceparent" has resolved #665.

robrap mentioned this issue Jun 24, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Jun 25, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Jun 25, 2024
robrap commented Jun 25, 2024

Currently, we're investigating whether using an NR Free Tier account for edxapp is enough to get DD traces working.

Other possibilities may include trying to get tracing (or APM) disabled everywhere in Edge. These are the Edge services where spans were found in the last day:

| Service | Spans (last day) |
| --- | --- |
| prod-edge-edxapp-cms | 9.28 M |
| prod-edge-analytics-api | 2.35 M |
| prod-edge-notes | 2.3 M |
| prod-edge-edxapp-workers-lms | 755 k |
| prod-edge-forum | 675 k |
| prod-edge-edxapp-workers-cms | 188 k |

timmc-edx added a commit to edx/newrelic-python-agent that referenced this issue Jun 26, 2024
…tation) (#1)

If the Django setting `EDX_NEWRELIC_NO_REPORT` is present and enabled, the agent will not talk to New Relic's servers and will instead use a set of previously captured responses from our sandbox account.

Instrumentation (tracing, etc.) will still be in place, but the data will be discarded rather than being reported.

See edx/edx-arch-experiments#692
robrap commented Jun 27, 2024

[idea] We might want 3 modes for our hacked NR agent:

  1. Send no data to NR, but add tracing info (fixes DD traces).
  2. Send no data to NR, and fake a bad account id so we don't even take a performance hit (breaks DD traces).
  3. Send data to NR (costs money, but if we want to verify anything temporarily, we can).
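
A hypothetical sketch of how those three modes could be selected via a Django setting (the setting name and values here are invented for illustration; the actual hacked agent may look quite different):

```python
import enum

from django.conf import settings

class NewRelicMode(enum.Enum):
    TRACE_ONLY = "trace_only"  # mode 1: no data sent to NR, tracing info kept (fixes DD traces)
    FULLY_OFF = "fully_off"    # mode 2: no data sent, fake bad account id (breaks DD traces)
    REPORTING = "reporting"    # mode 3: data sent to NR (costs money; for temporary verification)

def current_mode() -> NewRelicMode:
    # EDX_NEWRELIC_MODE is an invented setting name for this sketch.
    return NewRelicMode(getattr(settings, "EDX_NEWRELIC_MODE", NewRelicMode.TRACE_ONLY.value))
```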

timmc-edx added a commit that referenced this issue Jul 10, 2024
See #692

Testing setup:
https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack

And then in lms-shell:

```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```

Expect to see this log message:
`Attached MissingSpanProccessor for Datadog diagnostics`
timmc-edx added a commit that referenced this issue Jul 10, 2024
See #692

Testing setup:
https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack

And then in lms-shell:

```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```

Expect to see this log message:
`Attached MissingSpanProccessor for Datadog diagnostics`

NOTE: This prints "Spans created = 0; spans finished = 0" in devstack when
shut down with ctrl-c, but not when restarted due to autoreload (where it
prints correct info). Something is initializing Django twice and one span
processor is getting span info while the other is printing at shutdown.
There's more to debug here, but it seems stable enough to at least try
deploying it.
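
For reference, a sketch of the kind of diagnostic span processor the log message above refers to, assuming ddtrace's internal `SpanProcessor` interface (`on_span_start` / `on_span_finish` / `shutdown`); that interface is internal to ddtrace and may differ between versions:

```python
import logging

from ddtrace.internal.processor import SpanProcessor  # internal API (assumption)

log = logging.getLogger(__name__)

class MissingSpanDiagnostics(SpanProcessor):
    """Count created vs. finished spans so leaked (never-finished)
    spans show up as a mismatch in the shutdown log line."""

    def __init__(self):
        self.spans_started = 0
        self.spans_finished = 0

    def on_span_start(self, span):
        self.spans_started += 1

    def on_span_finish(self, span):
        self.spans_finished += 1

    def shutdown(self, timeout):
        log.info("Spans created = %d; spans finished = %d",
                 self.spans_started, self.spans_finished)
```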
timmc-edx added a commit that referenced this issue Jul 24, 2024
Adds logging diagnostics for traces in Datadog.

See #692
robrap removed their assignment Aug 8, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Aug 16, 2024
timmc-edx added a commit to edx/configuration that referenced this issue Aug 20, 2024
timmc-edx added a commit to openedx/edx-platform that referenced this issue Sep 13, 2024
- Convert `/heartbeat` view into a celery test
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals

See edx/edx-arch-experiments#692
timmc-edx added a commit to openedx/edx-platform that referenced this issue Sep 13, 2024
- Add `/celery_repro` URL to run a sample task
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals

See edx/edx-arch-experiments#692
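
A sketch of that repro setup (the task and its wiring are invented here for illustration; the actual edx-platform branch may differ). Logging every Celery lifecycle signal makes it possible to correlate span lifecycles with task lifecycles:

```python
import logging

from celery import shared_task, signals

log = logging.getLogger(__name__)

@shared_task
def sample_task():
    """Hypothetical task; the repro just needs something to send through the broker."""
    log.info("sample_task ran")

def make_signal_logger(signal_name):
    def handler(*args, **kwargs):
        log.info("celery signal fired: %s", signal_name)
    return handler

# Connect a logger to each lifecycle signal of interest.
for _name in ("before_task_publish", "after_task_publish",
              "task_prerun", "task_postrun", "task_success", "task_failure"):
    getattr(signals, _name).connect(make_signal_logger(_name), weak=False)
```

A `/celery_repro` view would then call `sample_task.delay()` so the task actually goes through the broker instead of running in-process.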