
Agent job refactor #371 (Draft)

boukeas wants to merge 14 commits into main

Conversation

@boukeas (Contributor) commented on Oct 4, 2024

Description

This PR refactors the main Testflinger agent loop, i.e. the TestflingerAgent.process_jobs method, which controls how an agent picks up a job and executes its individual phases.

Much of that functionality has been abstracted out of the TestflingerAgent.process_jobs method and into the TestflingerJob class and the newly introduced JobPhase class and its derived classes.
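For illustration only, here is a minimal sketch of what this kind of phase abstraction could look like; the class names, constructor signatures and phase names below are assumptions used to convey the idea, not the actual code in this PR:

```python
from abc import ABC, abstractmethod


class JobPhase(ABC):
    """One phase of a Testflinger job (e.g. setup, provision, test)."""

    def __init__(self, job, client):
        self.job = job
        self.client = client

    @abstractmethod
    def run(self) -> int:
        """Execute the phase and return its exit code."""


class ProvisionPhase(JobPhase):
    def run(self) -> int:
        # Placeholder: run the provision command from the agent config
        # and report the result back through the client.
        return 0


class TestflingerJob:
    """Owns the phases of one job, so the agent loop only has to iterate."""

    def __init__(self, job_data, phases):
        self.job_data = job_data
        self.phases = phases

    def run_phases(self) -> None:
        for phase in self.phases:
            # Each phase encapsulates its own execution and error handling;
            # a non-zero exit code stops the remaining phases.
            if phase.run() != 0:
                break
```

The point of the pattern is that process_jobs only drives the loop, while each phase carries its own execution logic.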

Note to reviewers

The structural changes are significant, so reviewing this refactoring by focusing narrowly on the diff would be disorienting and make little sense. Please read and review TestflingerAgent.process_jobs and the job.py module in their entirety.

Resolved issues

Documentation

Web service API changes

Tests


@plars (Collaborator) left a comment

This is one spot that's been begging for a refactor for a while now. I really like your idea and where you are taking it, thanks for taking this on!

I know it's still a draft and there are some obvious things left to do. I was actually able to get it running through some phases in local testing: I set up some fake commands for it to run, had it attach to a server I was running locally in Docker, and then submitted a job to that server for it to pick up. I did find one corner case: because it was a new agent, it didn't have the directory structure (results, logs, run...) in place. There's an os.makedirs() you removed that I think we'll either need to keep, or handle somewhere else, to ensure that the run dir exists before we try to drop the job file into it.

Looking good so far!
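For reference, a minimal sketch of the kind of guard being described; the function, the job file name and the exact call site are assumptions for illustration, not the PR's code:

```python
import json
import os


def write_job_file(rundir, job_data):
    # A freshly set-up agent won't have the results/logs/run tree yet, so
    # make sure the run directory exists before dropping the job file into it.
    os.makedirs(rundir, exist_ok=True)
    with open(os.path.join(rundir, "testflinger.json"), "w") as job_file:
        json.dump(job_data, job_file)
```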

Resolved review threads:
agent/testflinger_agent/agent.py
agent/testflinger_agent/tests/test_agent.py (outdated)
agent/testflinger_agent/job.py (outdated)
agent/testflinger_agent/job.py (outdated)
@boukeas requested a review from plars on October 28, 2024 at 15:44

@plars (Collaborator) left a comment

A few minor comments. I did a bit of testing with this and it mostly worked for me, except for one thing that was totally my fault (I forgot that the server revision currently deployed is the one from Varun's branch).

The main thing to fix here is documentation, and possibly renaming "unpack", but see what you think. I'm not tied to a specific word there as long as it's documented.

But there's an opportunity to fix another bug here too. I think it might be an easy opportunistic fix, and related enough to include it here. But if not, then feel free to kick it back and we can file it as a bug and handle it in a separate PR.

Also, please go ahead and take this out of draft mode so that everyone knows it's something you think is ready for review and merging as soon as it's good :)


```diff
 try:
-    self.client.transmit_job_outcome(rundir)
+    self.client.transmit_job_outcome(str(job.params.rundir))
```

So I forgot that I had the edge revision of serve from Varun's branch deployed when I was testing this today, and predictably hit a 422 error when it tried to transmit the outcome of a job containing attachments. But then I realized the job was never marked complete, because that information is conveyed through transmit_job_outcome. The result is that the CLI is just left in a "stuck" state, waiting for the job to complete.

Of course, this isn't really from your patch, but it gave me an "aha" moment for this pathological case that could absolutely occur. I wonder if, instead of just failing here, we could go ahead and try to at least do a client.post_job_state() with the state that we intend to transmit (cancelled or complete) if we fall into the exception block here? That would help with things like this in the future.
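Roughly what that fallback could look like; the helper name, the post_job_state() signature and the job attributes here are assumptions, sketched only to illustrate the suggestion:

```python
import logging

logger = logging.getLogger(__name__)


def finish_job(client, job, final_state):
    """Transmit the job outcome, falling back to posting the final state."""
    rundir = str(job.params.rundir)
    try:
        client.transmit_job_outcome(rundir)
    except Exception:
        logger.exception("Unable to transmit outcome for job %s", job.job_id)
        # Even if the outcome can't be transmitted (e.g. a 422 from the
        # server), report the intended final state ("cancelled" or
        # "complete") so the CLI isn't left waiting forever.
        client.post_job_state(job.job_id, final_state)
```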

```diff
@@ -22,6 +22,7 @@
 class JobState(StrEnum):
     WAITING = "waiting"
+    UNPACK = "unpack"
```

Should we call this "unpack" or some variation of "attachments"? I think the latter sounds more awkward, but it also makes it more obvious what's happening in this phase.

Either way, we should make sure the documentation is updated to reflect this new phase, in particular docs/reference/test-phases.rst and perhaps other pages that describe the results schema.
