External references #392
Codecov Report
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #392   +/-   ##
=======================================
  Coverage   99.86%   99.86%
=======================================
  Files          19       19
  Lines        1505     1512    +7
  Branches      374      378    +4
=======================================
+ Hits         1503     1510    +7
  Misses          1        1
  Partials        1        1
@janosh: Sounds like this might address a stumbling block you ran into earlier.
Good question, not 100% sure. Could be that external references were the problem Aaron and I ran into when calculating

    # ensure that we have all the jobs needed to resolve the reference connections
    job_references = find_and_get_references(flow.jobs)
    job_reference_uuids = {ref.uuid for ref in job_references}
    missing_jobs = job_reference_uuids.difference(set(flow.job_uuids))
    if len(missing_jobs) > 0:
        raise ValueError(
            "The following jobs were not found in the jobs array and are needed to "
            f"resolve output references:\n{list(missing_jobs)}"
        )
Could you have been trying to use an output reference in the maker class kwargs rather than in the make function? Currently the former is not supported.
I agree that this is a useful feature, and the implementation looks nice. However, I don't think the default behaviour should change. In most cases, people will not be using external references, so a missing reference will most likely be a bug that should be caught and reported with a proper error message. Currently, it looks like there isn't a way to override the
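The opt-in behaviour discussed here can be sketched in plain Python. This is only an illustration under stated assumptions, not Jobflow's implementation: `Ref`, `check_references`, and the `allow_external` flag are hypothetical names, and only the error message is taken from the snippet quoted above. By default a missing reference still raises; the caller has to ask explicitly for external references to be tolerated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an output reference; only the uuid matters here."""

    uuid: str


def check_references(job_uuids, references, allow_external=False):
    """Return the uuids that are referenced but not present in job_uuids.

    Raises ValueError when any are missing, unless allow_external is True
    (a hypothetical opt-in flag for this sketch).
    """
    missing = {ref.uuid for ref in references}.difference(job_uuids)
    if missing and not allow_external:
        raise ValueError(
            "The following jobs were not found in the jobs array and are needed to "
            f"resolve output references:\n{sorted(missing)}"
        )
    return missing


flow_job_uuids = {"job-a", "job-b"}
refs = [Ref("job-a"), Ref("job-from-earlier-flow")]

# Default: a reference to a job outside the flow is treated as a bug.
try:
    check_references(flow_job_uuids, refs)
    default_raised = False
except ValueError:
    default_raised = True

# Opt-in: the caller accepts that some references point outside the flow.
external = check_references(flow_job_uuids, refs, allow_external=True)
```

Keeping the strict check as the default means users who never touch external references still get the error message when a job is accidentally left out of a flow.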
Thanks for the comments. I made the change that you suggested.
Great, thank you!
Summary
Currently Jobflow prevents using an OutputReference as an input for a Job/Flow if the Job being referred to does not belong to the same Flow. However, I think there are cases where it would be useful to pass a reference to the output of a Job/Flow that has already finished.
Of course, the external reference could be avoided by fetching the output of the first Flow and passing it as an input to the new Flow. But what if the output is very large and does not fit within the MongoDB size limit? Allowing external references would avoid this kind of issue, avoid duplicating data in the DB, and potentially make it possible to reconstruct the connections between different Flows.
I thought it would be easier to test this option and open a PR directly rather than discuss it in an issue beforehand. Is this change acceptable?
An obvious downside is that a user may mistakenly end up with connected Flows that have no guarantee of being executed in the correct order. Do you think this would be a blocking issue? Do you have any suggestions for handling it better?
Here is an example of using the code with an external reference:
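A toy model of this usage in plain Python, deliberately not using Jobflow's own API: `ToyStore`, `ToyReference`, and `run_job` are hypothetical names introduced only for this sketch. It shows a second run that carries only a reference to an output stored by an earlier run, and what happens when the referenced job has not run yet, which is the ordering risk mentioned above.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a job store, keyed by job uuid."""

    def __init__(self):
        self._outputs = {}

    def save(self, uuid, output):
        self._outputs[uuid] = output

    def load(self, uuid):
        if uuid not in self._outputs:
            raise KeyError(f"no output stored for job {uuid!r}; has it run yet?")
        return self._outputs[uuid]


class ToyReference:
    """Hypothetical reference to the output of a job identified by uuid."""

    def __init__(self, uuid):
        self.uuid = uuid

    def resolve(self, store):
        return store.load(self.uuid)


def run_job(uuid, func, args, store):
    """Resolve any references in args from the store, run func, save the output."""
    resolved = [a.resolve(store) if isinstance(a, ToyReference) else a for a in args]
    output = func(*resolved)
    store.save(uuid, output)
    return output


store = ToyStore()

# First run: produces a large output, which is stored exactly once.
run_job("first", lambda n: list(range(n)), [100_000], store)

# Second run, started later: its input is only a reference, not a copy of the data.
total = run_job("second", sum, [ToyReference("first")], store)

# Ordering hazard: the referenced job has not run, so resolution fails.
try:
    run_job("third", sum, [ToyReference("not-yet-run")], store)
    ordering_error = None
except KeyError as exc:
    ordering_error = exc
```

In this sketch the failure surfaces only when the reference is resolved at run time, so nothing guarantees the order of execution beforehand.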
TODO
Add specific tests if the change is acceptable.