Failure to trace state when preparing for reboot #723

Open
inglesp opened this issue Apr 9, 2024 · 2 comments

@inglesp
Contributor

inglesp commented Apr 9, 2024

I ran just jobrunner/stop and then just jobrunner/prepare-for-reboot at the start of the maintenance window for opensafely-core/sysadmin#168.

Several tracebacks were logged to the screen. Unfortunately I didn't capture them before the server was rebooted, and so I do not have a complete record.

As far as I could tell, there was one traceback per job. The tracebacks were caught and logged from finish_current_state:

def finish_current_state(job, timestamp_ns, error=None, results=None, **attrs):
    """Record a span representing the state we've just exited."""
    if not _traceable(job):
        return
    # allow them to be filtered out from tracking spans
    attrs["is_state"] = True
    try:
        name = job.status_code.name
        start_time = job.status_code_updated_at
        record_job_span(job, name, start_time, timestamp_ns, error, results, **attrs)
    except Exception:
        # make sure trace failures do not error the job
        logger.exception(f"failed to trace state for {job.id}")

And the exception message was: AttributeError: 'NonRecordingSpan' object has no attribute 'name'.

However I don't have a record of where the exception was raised from.

As far as I can tell, the logs do not indicate a problem with stopping the jobs or changing their state, only that the change of state could not be traced.

@inglesp
Contributor Author

inglesp commented Apr 9, 2024

The only place we look up .name on a span in our own code is here:

logger.info(
    f"Trace span {span.name} attribute {k} was set invalid type: {v}, type {type(v)}"
)
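
If this log line is indeed where the AttributeError comes from, one way to be defensive at the call site is to fall back to a placeholder when the span has no name attribute. The following is only a minimal sketch: the two stand-in span classes and the span_name and describe_invalid_attribute helpers are hypothetical names for illustration, and are not part of jobrunner or OpenTelemetry.

```python
class FakeNonRecordingSpan:
    """Hypothetical stand-in for a span object that has no `name` attribute."""


class FakeRecordingSpan:
    """Hypothetical stand-in for a real span that carries a name."""

    def __init__(self, name):
        self.name = name


def span_name(span):
    # getattr with a default avoids AttributeError on spans without a name.
    return getattr(span, "name", "<unnamed span>")


def describe_invalid_attribute(span, k, v):
    # Same message as the log line above, but safe for nameless spans.
    return (
        f"Trace span {span_name(span)} attribute {k} "
        f"was set invalid type: {v}, type {type(v)}"
    )
```

With this in place, the message can still be logged when tracing has not been configured, instead of raising while trying to report a different problem.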

@inglesp
Contributor Author

inglesp commented Apr 9, 2024

I think this is probably caused by the prepare_for_reboot script not setting up tracing, meaning that we don't have a real tracer object. The jobrunner service does this by calling jobrunner.tracing.setup_default_tracing:

tracing.setup_default_tracing()

We should ensure that tracing is set up by this script (and any others), and we should consider being defensive against it not being set up, perhaps by writing a wrapper for get_tracer.
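
A wrapper along those lines might look something like the sketch below. It is an illustration under stated assumptions, not the jobrunner implementation: ensure_tracing, the module-level flag, and the setup and tracer_factory parameters are all hypothetical. In real code, setup would presumably be tracing.setup_default_tracing and tracer_factory would be opentelemetry.trace.get_tracer.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical module-level flag recording whether setup has already run.
_tracing_ready = False


def ensure_tracing(setup):
    """Run setup() the first time this is called; later calls do nothing.

    Returns True if setup ran on this call, False otherwise.
    """
    global _tracing_ready
    if _tracing_ready:
        return False
    logger.warning("tracing was not set up; setting it up now")
    setup()
    _tracing_ready = True
    return True


def get_tracer(name, setup, tracer_factory):
    """Hypothetical wrapper: make sure tracing is configured, then get a tracer."""
    ensure_tracing(setup)
    return tracer_factory(name)
```

Scripts such as prepare_for_reboot could then obtain tracers only through the wrapper, so that forgetting to call the setup function produces a warning rather than a span that cannot be traced.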
