Failure to trace state when preparing for reboot #723

inglesp · 2024-04-09T14:23:53Z

I ran just jobrunner/stop and then just jobrunner/prepare-for-reboot at the start of the maintenance window for opensafely-core/sysadmin#168.

Several tracebacks were logged to the screen. Unfortunately I didn't capture them before the server was rebooted, and so I do not have a complete record.

As far as I could tell, there was one traceback per job. The tracebacks were caught and logged from finish_current_state:

job-runner/jobrunner/tracing.py

Lines 117 to 130 in 22f9fd5

    
    def finish_current_state(job, timestamp_ns, error=None, results=None, **attrs):
        """Record a span representing the state we've just exited."""
        if not _traceable(job):
            return
        # allow them to be filtered out from tracking spans
        attrs["is_state"] = True
        try:
            name = job.status_code.name
            start_time = job.status_code_updated_at
            record_job_span(job, name, start_time, timestamp_ns, error, results, **attrs)
        except Exception:
            # make sure trace failures do not error the job
            logger.exception(f"failed to trace state for {job.id}")

And the exception message was: AttributeError: 'NonRecordingSpan' object has no attribute 'name'.

However, I don't have a record of where the exception was raised from.

As far as I can tell, the logs do not indicate a problem with stopping the job or changing its state, only that the change of state could not be traced.

inglesp · 2024-04-09T14:24:41Z

The only place we look up .name on a span in our own code is here:

job-runner/jobrunner/tracing.py

Lines 261 to 263 in 22f9fd5

    
    logger.info(
        f"Trace span {span.name} attribute {k} was set invalid type: {v}, type {type(v)}"
    )
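If this log call is where the AttributeError comes from, one defensive option is to look the name up with a fallback. The following is only a sketch: the two span classes are hypothetical stand-ins for a real recording span and for NonRecordingSpan, and describe_span is an illustrative helper, not something that exists in jobrunner.

```python
class RecordingSpanStandIn:
    """Hypothetical stand-in for a real (recording) span, which has a name."""

    def __init__(self, name):
        self.name = name


class NonRecordingSpanStandIn:
    """Hypothetical stand-in for NonRecordingSpan, which has no name attribute."""


def describe_span(span):
    """Return the span's name, falling back to its class name if it has none."""
    return getattr(span, "name", type(span).__name__)


# Both kinds of span can now be safely interpolated into a log message.
print(describe_span(RecordingSpanStandIn("EXECUTING")))
print(describe_span(NonRecordingSpanStandIn()))
```

With a helper like this, the log message could use describe_span(span) in place of span.name, so a missing tracer would degrade the log line rather than raise.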

inglesp · 2024-04-09T16:30:09Z

I think this is probably caused by the prepare_for_reboot script not setting up tracing, meaning that we don't have a real tracer object. The jobrunner service does this by calling jobrunner.tracing.setup_default_tracing:

job-runner/jobrunner/service.py

Line 33 in 22f9fd5

tracing.setup_default_tracing()

We should ensure that tracing is set up by this script (and any others), and we should consider being defensive against it not being set up, perhaps by writing a wrapper for get_tracer.
