Skip to content

Conversation

@machshev
Copy link
Collaborator

@machshev machshev commented Nov 6, 2025

The Deploy classes are coupled to the scheduler and also the flow objects in a way that is not clear. It's also not clear which of the public attributes are depended on externally and which are internal working out.

The intention is to make clear what the interface is that the scheduler depends on and what the flow object depends on as a result of the job runs. Initially I tried to refactor the Deploy objects themselves to make the existing objects clear, however this failed due to too many moving parts and the level of coupling.

This commit introduces a JobSpec pydantic data model that gives a fully typed and runtime validated model containing everything the scheduler needs to run the job. The scheduler no longer receives Deploy objects, but instead JobSpec objects created from Deploy objects. Which means Deploy objects become JobSpec factories.

In addition, the obvious dependencies on Deploy objects within the flow objects have been replaced with an expanded CompletedJobStatus pydantic model. There may still be dependencies that exist that have been missed, but the main ones have been removed in the results processing path.

flowchart LR
    subgraph flowS [flows]
        F1[Flow 1]
        F2[Flow 2]
    end

    subgraph Deploy
        D1(Deploy 1)
        D2(Deploy 2)
        D3(Deploy 3)
        J1(JobSpec 1)
        J2(JobSpec 2)
        J3(JobSpec 3)
    end

    subgraph Results
      C1
      C2
      C3
    end

    subgraph flowE [flows]
        F1e[Flow 1]
        F2e[Flow 2]
    end

    F1 --> D1 --> J1 --> S[Scheduler] --> C1(CompletedJobStatus) --> F1e
    F1 --> D2 --> J2 --> S --> C2(CompletedJobStatus) --> F1e
    F2 --> D3 --> J3 --> S --> C3(CompletedJobStatus) --> F2e
Loading

This opens the way for writing tests for the scheduler, and potentially the launchers, against clear interfaces using data classes as inputs and outputs.

There is room for improvement in the data classes, as some attributes can potentially be removed with some refactoring.

Implements #46

Copy link
Contributor

@rswarbrick rswarbrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general point, I'm very excited about this sort of refactor: it makes things much cleaner (and less circular!)

BUT: I've spent 15 minutes reading and I'm half-way through the first commit, in which I've found at least three orthogonal changes.

Please could you split up the commit? And the "basic cleanup" stuff is perfectly sensible, but it can be in parallel PRs.


try:
os.makedirs(arg_scratch_root, exist_ok=True)
Path(arg_scratch_root).mkdir(exist_ok=True, parents=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal, but things like the changes to this particular file could definitely have been broken out into their own commit, I think?

Indeed (reading further...) the are several other parts of this commit that are all about converting to using Path. It's completely reasonable to do so, but please can you avoid polluting an otherwise-nontrivial commit with extra gubbins?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's not crucial, but is just a bit easier on reviewers when non-functional changes are split up into a separate commit marked as such. I personally find it easier to scrutinize the important changes with less noise, partially contributed to by the GitHub UI being challenging at times.
Also, this change is like the first thing that appears in the list, so it's easier to nitpick about on that basis too :)

Comment on lines +595 to +596
job_time_s = tr.job_runtime
sim_time_us = tr.simulated_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change part of switching to the JobSpec abstraction? I suspect not, so this should be a separate commit. Also, aren't we switching to using JobTime objects? Probably drop the units from the variable names?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is part of the JobSpec abstraction, due to the coupled nature of the original code. The JobSpec objects needs to be immutable and serialisable as pydantic model, JobTime objects are mutable and are not serialisable.
So the JobTime can't be stored within a JobSpec.

The only place I can see in the codebase where the JobTime is used downstream is this place, and they convert to fixed units. So rather than using JobTime, I've reverted to just using simple integers.

Long term we could fix up JobTime so it's immutable and serialisable, although I'm not sure if there is an actual use-case for this? Python has built-in datetime data types, so why don't we just use those?

Either way, I think this is a separate PR. This one is already too big as you correctly point out.
I've tried to make minimal changes (although did mix some linting fixes in).

scratch_path: Path


class JobSpec(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add docstrings to the various members that this adds? In many cases, this will just be a matter of moving the comment to come just after the name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docstrings for the class attributes as best as I can.


# TODO: Migrate to Time instead of a custom implementation
class JobTime: # noqa: PLW1641 # Muitable object should not implement __hash__
class JobTime: # noqa: PLW1641 Muitable object should not implement __hash__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the cleanup, but....

  1. It should be a separate commit
  2. "Muitable" still isn't a word :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted my change... it can be fixed in another commit as you ask (at some other time).

@machshev
Copy link
Collaborator Author

As a general point, I'm very excited about this sort of refactor: it makes things much cleaner (and less circular!)

BUT: I've spent 15 minutes reading and I'm half-way through the first commit, in which I've found at least three orthogonal changes.

Please could you split up the commit? And the "basic cleanup" stuff is perfectly sensible, but it can be in parallel PRs.

Other than mixing linting with refactoring, I can’t see a way of splitting this up. It might not be obvious why the changes are needed, and often it seems unrelated. But this is because of the coupled nature of the code.
This is the 3rd attempt to refactor this part of the code... the other two attempts failed because changing one part causes unintended side effects in other areas of the code.

I can try and pull out the linting fixes now there is a working end state, but I don't think I can split up the main changes more than they currently can (without creating commits with broken states).

@rswarbrick
Copy link
Contributor

Really?! Are you really saying that e.g. the change to to using Path requires you to implement the JobSpec class that's in the first commit? That seems extremely surprising to me.

@machshev machshev force-pushed the scheduler_deploy_refactor branch from ec408d8 to 4f03c0d Compare November 10, 2025 17:31
@machshev
Copy link
Collaborator Author

As a general point, I'm very excited about this sort of refactor: it makes things much cleaner (and less circular!)

BUT: I've spent 15 minutes reading and I'm half-way through the first commit, in which I've found at least three orthogonal changes.

Please could you split up the commit? And the "basic cleanup" stuff is perfectly sensible, but it can be in parallel PRs.

Other than mixing linting with refactoring, I can’t see a way of splitting this up. It might not be obvious why the changes are needed, and often it seems unrelated. But this is because of the coupled nature of the code.
This is the 3rd attempt to refactor this part of the code... the other two attempts failed because changing one part causes unintended side effects in other areas of the code.

I can try and pull out the linting fixes now there is a working end state, but I don't think I can split up the main changes more than they currently can (without creating commits with broken states).

Really?! Are you really saying that e.g. the change to to using Path requires you to implement the JobSpec class that's in the first commit? That seems extremely surprising to me.

No, as per the description of the PR, the point of this PR is to implement the JobSpec class. There are some necessary Path changes, due to the attributes on the JobSpec becoming Path objects rather than str... these are inseparable and necessary (Path cannot be concatenated with str).

There are also a hand full that are not necessary for the purpose of this PR, which I made on the way to fix linting warnings over the several days of effort it took to make these changes. My comment about "Other than mixing linting with refactoring" includes the Path change you give as an example.

@machshev machshev force-pushed the scheduler_deploy_refactor branch from 4f03c0d to 7b8f61c Compare November 11, 2025 11:33
Copy link
Contributor

@hcallahan-lowrisc hcallahan-lowrisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks @machshev! This feels like a nice abstraction layer to expand into.


try:
os.makedirs(arg_scratch_root, exist_ok=True)
Path(arg_scratch_root).mkdir(exist_ok=True, parents=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's not crucial, but is just a bit easier on reviewers when non-functional changes are split up into a separate commit marked as such. I personally find it easier to scrutinize the important changes with less noise, partially contributed to by the GitHub UI being challenging at times.
Also, this change is like the first thing that appears in the list, so it's easier to nitpick about on that basis too :)

# Licensed under the Apache License, Version 2.0, see LICENSE for details.
# SPDX-License-Identifier: Apache-2.0

"""Job data models."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this might not be the final refactoring, but could you expand the description here with some more context for how these objects are used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with some context.

self.prepare_workspace_for_cfg(workspace_cfg)
Launcher.workspace_prepared_for_cfg.add(project)

# Store the deploy object handle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. stale comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot

]

deploy.cov_results_dict = {k: f"{random() * 100:.2f} %" for k in keys}
# TODO: this hack doesn't work any more and needs implementing by writing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does removing this leave us broken in some way? If so, I would delete the code and create an issue to track. If not, maybe just delete the code entirely?

Copy link
Collaborator Author

@machshev machshev Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't leave us in a broken state, just means the fake data from the fake launcher doesn't contain coverage data... so report template generation is not quite as nice. It still creates random pass/fail results for the report.

I'm sure we can add the functionality back in, just not quite sure how at the moment.

The `Deploy` classes are coupled to the scheduler and also the flow
objects in a way that is not clear. It's also not clear which of the
public attributes are depended on externally and which are internal
working out.

The intention is to make clear what the interface is that the scheduler
depends on and what the flow object depends on as a result of the job
runs. Initially I tried to refactor the `Deploy` objects themselves to
make the existing objects clear, however this failed due to too many
moving parts and the level of coupling.

This commit introduces a `JobSpec` pydantic data model that gives a fully
typed and runtime validated model containing everything the scheduler
needs to run the job. The scheduler no longer receives `Deploy` objects,
but instead `JobSpec` objects created from `Deploy` objects. Which means
`Deploy` objects become `JobSpec` factories.

In addition, the obvious dependencies on `Deploy` objects within the
flow objects have been replaced with an expanded `CompletedJobStatus`
pydantic model. There may still be dependencies that exist that have
been missed, but the main ones have been removed in the results
processing path.

This opens the way for writing tests for the scheduler, and potentially
the launchers, against clear interfaces using data classes as inputs and
outputs.

There is room for improvement in the data classes, as some attributes can
potentially be removed with some refactoring.

Signed-off-by: James McCorrie <[email protected]>
Signed-off-by: James McCorrie <[email protected]>
Now we have a Pydantic model that represents the full data requirement
of the scheduler, we can use the builtin `JobSpec.model_dump()` instead of
the custom `Deploy.dump()`. At the moment this model has to contain a
couple of callback functions which cannot be serialised, so these
attributes are excluded.

This commit does change the format of the dumped "deployment" objects
file. So if this is being used to check for breaking changes, then
hashes need to be compared against one generated from this commit from
now on.

Signed-off-by: James McCorrie <[email protected]>
@machshev machshev force-pushed the scheduler_deploy_refactor branch from 7b8f61c to 7bd5fba Compare November 11, 2025 16:02
@machshev machshev added this pull request to the merge queue Nov 11, 2025
Merged via the queue into lowRISC:master with commit 91ac90e Nov 11, 2025
6 checks passed
@machshev machshev deleted the scheduler_deploy_refactor branch November 11, 2025 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants