Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with Flow's __deepcopy__ method #398

Closed
gpetretto opened this issue Aug 15, 2023 · 4 comments · Fixed by #399
Closed

Problem with Flow's __deepcopy__ method #398

gpetretto opened this issue Aug 15, 2023 · 4 comments · Fixed by #399

Comments

@gpetretto
Copy link
Contributor

PR #369 introduced the __deepcopy__ magic method for Flow, but there are a few issues with it.

The most important is that it breaks when the output of the Flow is defined. A minimal example leading to the error:

from copy import deepcopy
from jobflow import job, Flow

@job
def add(a, b):
    return a + b

add_first = add(1, 5)
add_second = add(add_first.output, 3)
flow = Flow([add_first, add_second], output=add_second.output)

copied_flow = deepcopy(flow)

giving:

Traceback (most recent call last):
File "problem_output_ref.py", line 16, in
copied_flow = deepcopy(flow)
File "/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/jobflow/src/jobflow/core/flow.py", line 230, in deepcopy
new_flow = Flow(jobs=[], **kwds)
File "/jobflow/src/jobflow/core/flow.py", line 147, in init
self.output = output
File "/jobflow/src/jobflow/core/flow.py", line 303, in output
raise ValueError(
ValueError: jobs array does not contain all jobs needed for flow output

The reason being the fact the Flow is created with no jobs but with the outputs:

new_flow = Flow(jobs=[], **kwds)

(note that this breaks some atomate2 tests)

The second issue that I identified is that the hosts are not handled correctly. Here an example:

from copy import deepcopy
from jobflow import job, Flow


@job
def add(a, b):
    return a + b

add_first = add(1, 5)
add_second = add(add_first.output, 3)
flow = Flow(Flow([add_first, add_second]))
print(flow.jobs[0].jobs[0].hosts)

copied_flow = deepcopy(flow)
print(copied_flow.jobs[0].jobs[0].hosts)

Whose output is:

['ea7c4ab0-3dd6-49e8-abcf-087502dd36cf', '115a9824-0ada-4567-b3d5-1c96de672c53']
['ea7c4ab0-3dd6-49e8-abcf-087502dd36cf']

showing that one of the host ids is lost.
A point that is not really clear is why the hosts are overwritten in the first place:

for job in jobs:
job.hosts = [new_flow.uuid]

I am under the impression that the original idea was that the copied flow should have a different uuid from the original one, but this is not the case, since one of the kwds passed to Flow is the uuid.

In the end, I am not sure which benefit brings implementing __deepcopy__ instead of using the deepcopy from the standard library. @janosh, can you comment on this?

@utf
Copy link
Member

utf commented Aug 15, 2023

Thanks for noticing this. I would be in favour of reverting the deepcopy implementation. It seems very specific and if we add new fields then we will also have to implement them in __deepcopy__?

@janosh
Copy link
Member

janosh commented Aug 15, 2023

Good catch. I didn't have any specific use case in mind, just wanted to use it in a test but in hindsight, using copy.deepcopy would have been the better choice.

@gpetretto
Copy link
Contributor Author

Thanks for the quick fix!

@utf
Copy link
Member

utf commented Aug 16, 2023

Just pushed a new version with the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants