[batch] Create new job groups without ability to add jobs #14018
Conversation
315d291 to 68168ca (compare)
This might not cover everything because I think some new commits came in while I was reviewing. The majority of lines changed here look great; I just have a couple of core concerns with the API changes, which I think are too strongly driven by the `*-fast` endpoints. In particular, I think `update-fast` is doing a lot here. `update-fast` is a convenience for bundling up the following steps:
- Create an update of jobs
- Submit jobs
- Commit the update of those jobs
I might be wrong, but it seems to me that this is now the de-facto way in `aioclient` to submit just a job group with no jobs, and that seems wrong. I think we would be better served by starting simpler and adding the ability to bundle operations once the right "primitives" are established. I think we would not be losing much if we left all the `update`-related operations alone and added a route for creating job groups, and we would reduce the complexity greatly. We can then revisit by adding a way to "create a job group and a jobs update in one request".
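To make that concrete, here is a rough sketch of the kind of dedicated route I have in mind. The path, handler name, `insert_job_groups` helper, and response shape are all made up for illustration; this is not the actual Batch front-end API.

```python
from aiohttp import web

routes = web.RouteTableDef()


async def insert_job_groups(batch_id: int, job_group_specs: list) -> int:
    # Placeholder for the database insert; would return the first assigned
    # absolute job group ID.
    raise NotImplementedError


# Hypothetical dedicated route: it creates job groups and nothing else. Jobs
# keep flowing through the existing, unchanged update routes
# (create update -> submit jobs -> commit update).
@routes.post('/api/v1alpha/batches/{batch_id}/job-groups/create')
async def create_job_groups(request: web.Request) -> web.Response:
    batch_id = int(request.match_info['batch_id'])
    job_group_specs = (await request.json())['job_groups']
    start_job_group_id = await insert_job_groups(batch_id, job_group_specs)
    return web.json_response({'start_job_group_id': start_job_group_id})
```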
'callback': nullable(str_type),
'cancel_after_n_failures': nullable(numeric(**{"x > 0": lambda x: isinstance(x, int) and x > 0})),
'absolute_parent_id': nullable(int_type),
'in_update_parent_id': nullable(int_type),
I think the semantics of this PR are sound, but this name trips me up, especially above when the job spec contains the `in_update_job_group_id`. I think there are two distinct concepts here that are both being given the name "update", and I want to make sure they're not conflated. Am I correct that this field can only refer to job groups that are submitted together within the same HTTP request to the front-end? Because we are not reserving ranges of job group IDs in the database, there is no way to resolve a relative ID across different requests. So this is more like an `in_request_parent_id`.
I see why you initially had this as part of the `batch_updates` table, so that you could use relative IDs in this way, but I don't see this being a heavily used feature right away, and I hesitate to introduce changes that are not heavily used and therefore not heavily tested. For QOB, we are basically never going to submit nested job groups in the same request. The absolute parent job group ID is always known because it is the job group the query driver is running in.
Additionally, wasn't there a conversation about limiting the depth of a job group tree? If so, there would be a small upper bound on the number of requests necessary to submit job group specs one layer of the tree at a time, which I think disincentivizes this feature even more. What do you think about removing `in_update_parent_id` for now?
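In other words, the spec I would expect QoB (and most clients) to send only ever names its parent by absolute ID. A sketch, where `absolute_parent_id` comes from the hunk above and the other field names are made up:

```python
from typing import Dict, Optional


def job_group_spec(relative_id: int, absolute_parent_id: Optional[int],
                   attributes: Optional[Dict[str, str]] = None) -> dict:
    # No in_update_parent_id: nesting within a single request is not supported,
    # so the parent is always an already-known absolute job group ID.
    return {
        'job_group_id': relative_id,               # relative to this request
        'absolute_parent_id': absolute_parent_id,  # resolvable server-side
        'attributes': attributes or {},
    }


# QoB-style usage: the query driver already knows the job group it runs in.
spec = job_group_spec(relative_id=1, absolute_parent_id=5)
```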
def _submit(self, in_update_start_job_group_id: int):
    self._raise_if_submitted()
    self._job_group_id = AbsoluteJobId(in_update_start_job_group_id + self._job_group_id - 1)
Suggested change:
- self._job_group_id = AbsoluteJobId(in_update_start_job_group_id + self._job_group_id - 1)
+ self._job_group_id = AbsoluteJobGroupId(in_update_start_job_group_id + self._job_group_id - 1)
right?
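Just to spell out the arithmetic I'm reading here (my understanding, not necessarily the intended semantics): relative job group IDs are 1-based within the update, so the conversion works out like this.

```python
# If the server assigns start_job_group_id = 10 to this update and this
# group's relative ID is 3, its absolute ID should be 10 + 3 - 1 = 12.
in_update_start_job_group_id = 10
relative_job_group_id = 3
absolute_job_group_id = in_update_start_job_group_id + relative_job_group_id - 1
assert absolute_job_group_id == 12
```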
    cancel_after_n_failures=cancel_after_n_failures)

self._job_groups.append(jg)
self._job_group_specs.append(jg_spec)
Instead of keeping these in sync, can you just have the spec be a field of the job group? You could then also avoid duplicating fields like `attributes` that are set on both objects.
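Roughly what I'm picturing, as a sketch with made-up class names:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobGroupSpec:
    job_group_id: int  # relative ID within this submission
    absolute_parent_id: Optional[int] = None
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class JobGroup:
    spec: JobGroupSpec  # single source of truth; no parallel spec list to sync


class BatchBuilder:
    def __init__(self) -> None:
        self._job_groups: List[JobGroup] = []

    def create_job_group(self, spec: JobGroupSpec) -> JobGroup:
        jg = JobGroup(spec=spec)
        self._job_groups.append(jg)
        return jg

    def _job_group_specs(self) -> List[JobGroupSpec]:
        # derived on demand instead of maintained as a second list
        return [jg.spec for jg in self._job_groups]
```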
return int(update_json['start_job_id'])

start_job_group_id = update_json['start_job_group_id']
if start_job_group_id is not None:
This is None if we don't submit any job groups?
json_resp = await resp.json()
start_job_group_id = json_resp['start_job_group_id']
if start_job_group_id is not None:
How would this be `None` here?
return start_job_id
if n_job_bunches == 0 and n_job_group_bunches == 0:
    log.warning('Tried to submit an update with 0 jobs and 0 job groups. Doing nothing.')
    return (None, None)
I think the old way of structuring this method is not scaling very well to also include job groups. It's muddling the logic of what to do with just jobs/updates, which I think is introducing subtle bugs. If someone tries to create a job group and no jobs, I think it will still call `self._create_update`, which is illegal with no jobs, right?
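To make the case analysis explicit, something like this is what I have in mind (a sketch with made-up names); the point is just that the update machinery is only planned when there is at least one job:

```python
from typing import List


def plan_submission(n_job_bunches: int, n_job_group_bunches: int) -> List[str]:
    # Decide up front which operations a submission needs, so an "update with
    # no jobs" can never be created by accident.
    actions: List[str] = []
    if n_job_group_bunches > 0:
        actions.append('create_job_groups')
    if n_job_bunches > 0:
        actions.extend(['create_update', 'submit_jobs', 'commit_update'])
    return actions


assert plan_submission(0, 0) == []                      # warn and do nothing
assert plan_submission(0, 2) == ['create_job_groups']   # no illegal empty update
assert plan_submission(3, 0) == ['create_update', 'submit_jobs', 'commit_update']
```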
if len(self._job_groups) > 0:
    await self._submit_job_group_bunches(byte_job_group_specs_bunches, job_group_bunch_sizes, job_group_progress_task)
    # we need to recompute the specs now that the job group ID is absolute
    job_specs = [spec.to_dict() for spec in self._job_specs]
Is there some hidden mutation here that means running this code again produces a different result? That's pretty spooky and tells me that this should be restructured.
start_job_id = await self._commit_update(update_id)
self._submission_info = BatchSubmissionInfo(used_fast_path=False)
log.info(f'updated batch {self.id}')
return (start_job_id, start_job_group_id)
I think I need to stew a bit on this `_submit` method. It seems like a lot of complexity, most of which we will possibly never use in QoB, which makes it hard for me to reason about. For context, the Scala side of things will just need to add a single piece of information: "put all these jobs in a new job group that is a child of my current job group".
It really feels to me like the `*-fast` methods are driving the whole design when they're really just an optimization to batch server operations into fewer HTTP requests. Maybe we should start by resigning ourselves to two HTTP requests and have `_submit` be two stages:
- Create job group(s)
- Do everything like it did before
I recognize that it will be nice to have one HTTP request for small submissions, but to me that feels like a much easier optimization to make once the existing operations are in place.
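For what it's worth, a minimal sketch of what those two stages could look like on the client, assuming a hypothetical job-groups creation route and a placeholder wrapper around the existing create/submit/commit update flow:

```python
from typing import List, Optional, Tuple

import aiohttp


async def submit_jobs_via_update(session: aiohttp.ClientSession, deploy_url: str,
                                 batch_id: int, job_specs: List[dict]) -> int:
    # Placeholder for the existing, unchanged flow:
    # create update -> submit jobs -> commit update; returns start_job_id.
    raise NotImplementedError


async def submit(session: aiohttp.ClientSession, deploy_url: str, batch_id: int,
                 job_group_specs: List[dict],
                 job_specs: List[dict]) -> Tuple[Optional[int], Optional[int]]:
    # Stage 1: create the job group(s) in their own request (hypothetical route).
    start_job_group_id: Optional[int] = None
    if job_group_specs:
        async with session.post(
            f'{deploy_url}/api/v1alpha/batches/{batch_id}/job-groups/create',
            json={'job_groups': job_group_specs},
        ) as resp:
            start_job_group_id = (await resp.json())['start_job_group_id']

    # Stage 2: submit jobs exactly as before.
    start_job_id: Optional[int] = None
    if job_specs:
        start_job_id = await submit_jobs_via_update(session, deploy_url, batch_id, job_specs)

    return (start_job_id, start_job_group_id)
```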
In the spirit of spreading out review burden a bit, I'm gonna pick up review for this PR. I'll be sure to read up on all the stacked conversations so I'm cached in before I review this one. I'll stick a review on here and you can dismiss it when this is ready for a look :)
dismiss when ready for review
Closing in favor of #14170
Stacked on #14016. This PR needs to have the client/server protocol for creating job groups for the four types of creation/update events hashed out and implemented. Basic tests are there. We still need tests for billing and cancellation to make sure the aggregation and cancellation operations work properly.