refactor: add JobSpec common abstraction #49

machshev · 2025-11-06T14:26:05Z

The Deploy classes are coupled to the scheduler and also to the flow objects in a way that is not clear. It's also not clear which of the public attributes are depended on externally and which are internal implementation details.

The intention is to make clear which interface the scheduler depends on, and what the flow objects depend on as a result of the job runs. Initially I tried to refactor the Deploy objects themselves to make the existing objects clearer; however, this failed because there were too many moving parts and too much coupling.

This commit introduces a JobSpec pydantic data model that gives a fully typed and runtime validated model containing everything the scheduler needs to run the job. The scheduler no longer receives Deploy objects, but instead JobSpec objects created from Deploy objects, which means Deploy objects become JobSpec factories.

In addition, the obvious dependencies on Deploy objects within the flow objects have been replaced with an expanded CompletedJobStatus pydantic model. There may still be dependencies that exist that have been missed, but the main ones have been removed in the results processing path.
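As a rough illustration of the "Deploy objects become JobSpec factories" idea, here is a minimal sketch. It uses a frozen stdlib dataclass as a stand-in for the pydantic model so that it runs without third-party packages. The field names, the `Deploy` constructor and the `get_job_spec` method are all hypothetical and are not taken from the dvsim code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class JobSpec:
    """Everything the scheduler needs to run one job (hypothetical fields)."""

    name: str
    cmd: str
    scratch_path: Path

    def __post_init__(self) -> None:
        # Hand-written stand-in for the runtime validation pydantic would do.
        if not isinstance(self.scratch_path, Path):
            raise TypeError("scratch_path must be a pathlib.Path")


class Deploy:
    """Hypothetical stand-in for a Deploy class that acts as a JobSpec factory."""

    def __init__(self, name: str, cmd: str, scratch_root: Path) -> None:
        self.name = name
        self.cmd = cmd
        self.scratch_root = scratch_root

    def get_job_spec(self) -> JobSpec:
        # The scheduler only ever sees the immutable JobSpec, never the Deploy.
        return JobSpec(
            name=self.name,
            cmd=self.cmd,
            scratch_path=self.scratch_root / self.name,
        )


spec = Deploy("build", "make build", Path("scratch")).get_job_spec()
```

Because the spec is frozen, the scheduler cannot mutate it behind the flow's back, which is the decoupling the description is aiming for.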

flowchart LR
    subgraph flowS [flows]
        F1[Flow 1]
        F2[Flow 2]
    end

    subgraph Deploy
        D1(Deploy 1)
        D2(Deploy 2)
        D3(Deploy 3)
        J1(JobSpec 1)
        J2(JobSpec 2)
        J3(JobSpec 3)
    end

    subgraph Results
      C1
      C2
      C3
    end

    subgraph flowE [flows]
        F1e[Flow 1]
        F2e[Flow 2]
    end

    F1 --> D1 --> J1 --> S[Scheduler] --> C1(CompletedJobStatus) --> F1e
    F1 --> D2 --> J2 --> S --> C2(CompletedJobStatus) --> F1e
    F2 --> D3 --> J3 --> S --> C3(CompletedJobStatus) --> F2e

This opens the way for writing tests for the scheduler, and potentially the launchers, against clear interfaces using data classes as inputs and outputs.
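A sketch of what such a test could look like, assuming a scheduler entry point that takes job specs plus a launch function and returns completed-job records. `run_jobs`, the `"P"`/`"F"` status strings and the fields on both data classes are assumptions made for this example only; the real scheduler and models will differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class JobSpec:
    name: str
    cmd: str


@dataclass(frozen=True)
class CompletedJobStatus:
    name: str
    status: str  # "P" (pass) or "F" (fail) in this sketch


def run_jobs(
    specs: Iterable[JobSpec],
    launch: Callable[[JobSpec], int],
) -> list[CompletedJobStatus]:
    """Toy scheduler: run each job in order and record pass/fail."""
    results = []
    for spec in specs:
        exit_code = launch(spec)
        results.append(CompletedJobStatus(spec.name, "P" if exit_code == 0 else "F"))
    return results


def test_scheduler_reports_pass_and_fail() -> None:
    # A fake launcher keeps the test free of real tools and subprocesses.
    def fake_launch(spec: JobSpec) -> int:
        return 0 if spec.cmd == "ok" else 1

    results = run_jobs([JobSpec("a", "ok"), JobSpec("b", "boom")], fake_launch)

    assert results == [CompletedJobStatus("a", "P"), CompletedJobStatus("b", "F")]


test_scheduler_reports_pass_and_fail()
```

The point is only that inputs and outputs are plain data, so the test needs no Deploy or flow objects at all.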

There is room for improvement in the data classes, as some attributes can potentially be removed with some refactoring.

Implements #46

rswarbrick

As a general point, I'm very excited about this sort of refactor: it makes things much cleaner (and less circular!)

BUT: I've spent 15 minutes reading and I'm half-way through the first commit, in which I've found at least three orthogonal changes.

Please could you split up the commit? And the "basic cleanup" stuff is perfectly sensible, but it can be in parallel PRs.

rswarbrick · 2025-11-10T13:26:44Z

src/dvsim/cli.py


    try:
-        os.makedirs(arg_scratch_root, exist_ok=True)
+        Path(arg_scratch_root).mkdir(exist_ok=True, parents=True)


No big deal, but things like the changes to this particular file could definitely have been broken out into their own commit, I think?

Indeed (reading further...) there are several other parts of this commit that are all about converting to using Path. It's completely reasonable to do so, but please can you avoid polluting an otherwise-nontrivial commit with extra gubbins?
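For reference, the two calls in the diff above should behave the same for this use: both create missing parent directories and both tolerate the directory already existing. A small self-contained check, using a temporary directory rather than the real scratch root (the directory names are made up):

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    old_style = os.path.join(tmp, "old", "scratch")
    new_style = Path(tmp) / "new" / "scratch"

    # Both calls create the intermediate directories as needed.
    os.makedirs(old_style, exist_ok=True)
    new_style.mkdir(exist_ok=True, parents=True)

    # Calling them again is a no-op rather than an error.
    os.makedirs(old_style, exist_ok=True)
    new_style.mkdir(exist_ok=True, parents=True)

    both_created = os.path.isdir(old_style) and new_style.is_dir()
```

That equivalence is why the change is non-functional and could sit in its own commit.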

rswarbrick · 2025-11-10T13:34:27Z

src/dvsim/flow/sim.py

+            job_time_s = tr.job_runtime
+            sim_time_us = tr.simulated_time


Is this change part of switching to the JobSpec abstraction? I suspect not, so this should be a separate commit. Also, aren't we switching to using JobTime objects? Probably drop the units from the variable names?

Yes, this is part of the JobSpec abstraction, due to the coupled nature of the original code. JobSpec objects need to be immutable and serialisable as a pydantic model, whereas JobTime objects are mutable and not serialisable.
So the JobTime can't be stored within a JobSpec.

The only place I can see in the codebase where JobTime is used downstream is here, where the values are converted to fixed units. So rather than using JobTime, I've reverted to just using simple integers.

Long term we could fix up JobTime so it's immutable and serialisable, although I'm not sure if there is an actual use-case for this? Python has built-in datetime data types, so why don't we just use those?

Either way, I think this is a separate PR. This one is already too big as you correctly point out.
I've tried to make minimal changes (although did mix some linting fixes in).
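On the question of using Python's built-in time types: a minimal sketch of what that might look like with `datetime.timedelta`, which is immutable. The `RunTimes` class and its values are hypothetical; only the variable names `job_time_s` and `sim_time_us` come from the diff above. One caveat is that `timedelta` has microsecond resolution, which may be too coarse if simulated time is tracked in nanoseconds or picoseconds.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RunTimes:
    """Hypothetical immutable record of the two times reported for a job."""

    job_runtime: timedelta
    simulated_time: timedelta


times = RunTimes(
    job_runtime=timedelta(minutes=1, seconds=30),
    simulated_time=timedelta(microseconds=2500),
)

# Convert to fixed units only at the point of reporting.
job_time_s = times.job_runtime.total_seconds()
sim_time_us = times.simulated_time / timedelta(microseconds=1)
```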

rswarbrick · 2025-11-10T13:37:25Z

src/dvsim/job/data.py

+    scratch_path: Path
+
+
+class JobSpec(BaseModel):


Could you add docstrings to the various members that this adds? In many cases, this will just be a matter of moving the comment to come just after the name.

Added docstrings for the class attributes as best as I can.
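For anyone unfamiliar with the convention being asked for, an attribute docstring is a string literal placed directly after the attribute. A sketch with a hypothetical class follows; `scratch_path` appears in the diff, the other names are made up. These strings are picked up by documentation tools and editors, but Python does not store them on the class at runtime.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExampleConfig:
    """Hypothetical container, named here only for illustration."""

    project: str
    """Name of the project the job belongs to."""

    scratch_path: Path
    """Directory in which the job writes its outputs."""


cfg = ExampleConfig(project="demo", scratch_path=Path("scratch") / "demo")
```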

rswarbrick · 2025-11-10T13:38:41Z

src/dvsim/job/time.py


 # TODO: Migrate to Time instead of a custom implementation
-class JobTime:  # noqa: PLW1641 # Muitable object should not implement __hash__
+class JobTime:  # noqa: PLW1641 Muitable object should not implement __hash__


Thanks for the cleanup, but....

It should be a separate commit

"Muitable" still isn't a word :-)

Reverted my change... it can be fixed in another commit as you ask (at some other time).

machshev · 2025-11-10T14:48:10Z

Other than mixing linting with refactoring, I can’t see a way of splitting this up. It might not be obvious why the changes are needed, and often it seems unrelated. But this is because of the coupled nature of the code.
This is the 3rd attempt to refactor this part of the code... the other two attempts failed because changing one part causes unintended side effects in other areas of the code.

I can try to pull out the linting fixes now that there is a working end state, but I don't think I can split up the main changes more than they currently are (without creating commits with broken states).

rswarbrick · 2025-11-10T15:18:58Z

Really?! Are you really saying that e.g. the change to using Path requires you to implement the JobSpec class that's in the first commit? That seems extremely surprising to me.

machshev · 2025-11-10T17:49:38Z

No, as per the description of the PR, the point of this PR is to implement the JobSpec class. There are some necessary Path changes, due to the attributes on the JobSpec becoming Path objects rather than str... these are inseparable and necessary (Path cannot be concatenated with str).

There are also a handful that are not necessary for the purpose of this PR, which I made along the way to fix linting warnings over the several days of effort it took to make these changes. My comment about "Other than mixing linting with refactoring" includes the Path change you give as an example.
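To make the "Path cannot be concatenated with str" point concrete, a short sketch with made-up path names: code that built paths with `+` on strings raises `TypeError` once the attribute becomes a `Path`, so those call sites have to change in the same commit.

```python
from pathlib import Path

scratch = Path("scratch")

# String-style concatenation no longer works once the attribute is a Path.
try:
    scratch + "/build.log"
    concat_failed = False
except TypeError:
    concat_failed = True

# The Path-native replacement uses the "/" operator instead.
log_path = scratch / "build.log"
```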

hcallahan-lowrisc

Overall LGTM, thanks @machshev! This feels like a nice abstraction layer to expand into.

hcallahan-lowrisc · 2025-11-11T11:47:27Z

src/dvsim/cli.py


    try:
-        os.makedirs(arg_scratch_root, exist_ok=True)
+        Path(arg_scratch_root).mkdir(exist_ok=True, parents=True)


Yeah it's not crucial, but is just a bit easier on reviewers when non-functional changes are split up into a separate commit marked as such. I personally find it easier to scrutinize the important changes with less noise, partially contributed to by the GitHub UI being challenging at times.
Also, this change is like the first thing that appears in the list, so it's easier to nitpick about on that basis too :)

hcallahan-lowrisc · 2025-11-11T12:07:09Z

src/dvsim/job/data.py

+# Licensed under the Apache License, Version 2.0, see LICENSE for details.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Job data models."""


I know this might not be the final refactoring, but could you expand the description here with some more context for how these objects are used?

Updated with some context.

hcallahan-lowrisc · 2025-11-11T12:14:16Z

src/dvsim/launcher/base.py

            self.prepare_workspace_for_cfg(workspace_cfg)
            Launcher.workspace_prepared_for_cfg.add(project)

        # Store the deploy object handle.


Nit. stale comment

hcallahan-lowrisc · 2025-11-11T15:15:10Z

src/dvsim/launcher/fake.py

-    ]
-
-    deploy.cov_results_dict = {k: f"{random() * 100:.2f} %" for k in keys}
+    # TODO: this hack doesn't work any more and needs implementing by writing


Does removing this leave us broken in some way? If so, I would delete the code and create an issue to track. If not, maybe just delete the code entirely?

It doesn't leave us in a broken state, just means the fake data from the fake launcher doesn't contain coverage data... so report template generation is not quite as nice. It still creates random pass/fail results for the report.

I'm sure we can add the functionality back in, just not quite sure how at the moment.

Now that we have a Pydantic model that represents the full data requirement of the scheduler, we can use the built-in `JobSpec.model_dump()` instead of the custom `Deploy.dump()`. At the moment this model has to contain a couple of callback functions which cannot be serialised, so these attributes are excluded.

This commit does change the format of the dumped "deployment" objects file. If that file is being used to check for breaking changes, hashes need to be compared against one generated from this commit from now on.

Signed-off-by: James McCorrie <[email protected]>
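A rough sketch of the idea of dumping a spec while excluding its callbacks, then hashing the result to detect changes. This uses stdlib dataclasses as a stand-in. With pydantic the dump itself would presumably be a call like `model_dump(exclude={...})`. The field names and the `dump_spec` and `spec_hash` helpers are hypothetical, not dvsim's.

```python
import hashlib
import json
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class JobSpec:
    name: str
    scratch_path: Path
    on_finish: Callable[[int], None]  # callback: cannot be serialised


def dump_spec(spec: JobSpec) -> dict:
    """Return a JSON-friendly dict of the spec, skipping callback attributes."""
    data = {}
    for field in fields(spec):
        value = getattr(spec, field.name)
        if callable(value):
            continue
        data[field.name] = value.as_posix() if isinstance(value, Path) else value
    return data


def spec_hash(spec: JobSpec) -> str:
    """Hash a deterministic JSON rendering of the dumped spec."""
    payload = json.dumps(dump_spec(spec), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = JobSpec("build", Path("scratch/build"), on_finish=lambda rc: None)
b = JobSpec("build", Path("scratch/build"), on_finish=print)
```

Two specs that differ only in their callbacks dump and hash identically here, which is the behaviour you would want when the hash is used to spot unintended changes to the job data.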

machshev requested a review from hcallahan-lowrisc November 6, 2025 14:26

machshev force-pushed the scheduler_deploy_refactor branch from 3eb2132 to d881ab0 Compare November 6, 2025 14:34

machshev requested a review from rswarbrick November 6, 2025 14:35

machshev mentioned this pull request Nov 6, 2025

[Deploy] Refactor deployment objects to use pydantic models #46

Open

machshev force-pushed the scheduler_deploy_refactor branch from d881ab0 to 1604d40 Compare November 6, 2025 17:18

rswarbrick reviewed Nov 10, 2025

machshev force-pushed the scheduler_deploy_refactor branch from ec408d8 to 4f03c0d Compare November 10, 2025 17:31

machshev force-pushed the scheduler_deploy_refactor branch from 4f03c0d to 7b8f61c Compare November 11, 2025 11:33

hcallahan-lowrisc approved these changes Nov 11, 2025

machshev added 3 commits November 11, 2025 16:02

chore: nix flake update

ebccaed

Signed-off-by: James McCorrie <[email protected]>

machshev force-pushed the scheduler_deploy_refactor branch from 7b8f61c to 7bd5fba Compare November 11, 2025 16:02

machshev added this pull request to the merge queue Nov 11, 2025

Merged via the queue into lowRISC:master with commit 91ac90e Nov 11, 2025
6 checks passed

machshev deleted the scheduler_deploy_refactor branch November 11, 2025 16:14

machshev mentioned this pull request Nov 12, 2025

Coverage reporting broken #50

Closed
