Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External references #392

Merged
merged 4 commits into from
Aug 14, 2023
Merged

External references #392

merged 4 commits into from
Aug 14, 2023

Conversation

gpetretto
Copy link
Contributor

Summary

Currently Jobflow prevents using an OutputReference as an input for a Job/Flow if the Job being referred does not belong to the same Flow.

However, I think there are cases where it could be useful to pass the reference to the output of a Job/Flow that has finished previously.
Of course the use of the external reference could be avoided by fetching the output of the first Flow and pass it as an input of the new Flow. But what if the output is very large and does not fit in the MongoDB size limit? Allowing external references will allow to avoid this kind of issues, avoid repetition of the data in the DB and potentially allow to reconstruct connections between different flows.

I thought it would have been easier to test this option and directly open a PR rather than discussing it in an issue beforehand. Is this change acceptable?
An obvious downside would be that a user may mistakenly end up with connected Flows without the guarantee of the correct order of execution. Do you think this would be a blocking issue? Or any suggestion about how to better handle it?

Here an example of usage of the code with an external reference:

from jobflow import job, Flow, OutputReference
from jobflow.managers.local import run_locally

@job
def add(a, b):
    return a + b

add_first = add(1, 5)
add_second = add(add_first.output, 3)

flow1 = Flow([add_first, add_second])
responses1 = run_locally(flow1)

add_third = add(OutputReference(add_second.uuid), 5)

flow2 = Flow([add_third])
responses2 = run_locally(flow2)

TODO

Add specific tests if the change is acceptable.

@codecov
Copy link

codecov bot commented Aug 8, 2023

Codecov Report

Merging #392 (10ed364) into main (0fbabf3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #392   +/-   ##
=======================================
  Coverage   99.86%   99.86%           
=======================================
  Files          19       19           
  Lines        1505     1512    +7     
  Branches      374      378    +4     
=======================================
+ Hits         1503     1510    +7     
  Misses          1        1           
  Partials        1        1           
Files Changed Coverage Δ
src/jobflow/core/flow.py 100.00% <100.00%> (ø)
src/jobflow/managers/fireworks.py 100.00% <100.00%> (ø)
src/jobflow/managers/local.py 100.00% <100.00%> (ø)

@Andrew-S-Rosen
Copy link
Member

@janosh: Sounds like this might address a stumbling block you ran into earlier.

@janosh
Copy link
Member

janosh commented Aug 12, 2023

Good question, not 100% sure. Could be that external references were the problem Aaron and I ran into when calculating kspacing based on the previous job's bandgap. We spent some time fruitlessly debugging that until @utf's showed how to do it in a much better way. But if that was our issue, I don't understand why we didn't run into any errors like this:

    # ensure that we have all the jobs needed to resolve the reference connections
    job_references = find_and_get_references(flow.jobs)
    job_reference_uuids = {ref.uuid for ref in job_references}
    missing_jobs = job_reference_uuids.difference(set(flow.job_uuids))
    if len(missing_jobs) > 0:
        raise ValueError(
            "The following jobs were not found in the jobs array and are needed to "
            f"resolve output references:\n{list(missing_jobs)}"
        )

@utf
Copy link
Member

utf commented Aug 13, 2023

Could be that external references were the problem Aaron and I ran into when calculating kspacing based on the previous job's bandgap.

Could you have been trying to use an output reference in the maker class kwargs rather than in the make function? Currently the former is not supported.

However, I think there are cases where it could be useful to pass the reference to the output of a Job/Flow that has finished previously.

I agree that this is a useful feature and looks like the implementation is nice. However, I don't think the default behaviour should change. In most cases, people will not be using external references, so if there are missing references this will likely be a bug that should be caught and a proper error message presented.

Currently, it looks like there isn't a way to override the allow_external_references within run_locally or flow_to_workflow. My suggestion would be to set the default to False and then add override settings to these two functions.

@gpetretto
Copy link
Contributor Author

Thanks for the comments. I made the change that you suggested.
I had set allow_external_references to True because it means that if a workflow that makes use of an external reference is generated, the user should also be aware of the need of setting that variable to True when converting it. And this might not be so straightforward.
I also added some more tests, so this should be ready to be merged.

@utf
Copy link
Member

utf commented Aug 14, 2023

Great, thank you!

@utf utf merged commit c9d9e81 into materialsproject:main Aug 14, 2023
8 checks passed
@utf utf added the enhancement New feature or request label Sep 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants