
[batch] Create new job groups without ability to add jobs #14018

Closed
wants to merge 35 commits into from

Conversation


@jigold jigold commented Nov 16, 2023

Stacked on #14016. This PR needs to have the client/server protocol for creating job groups for the four types of creation/update events hashed out and implemented. Basic tests are there. We still need tests for billing and cancellation to make sure the aggregation and cancellation operations work properly.

@jigold jigold force-pushed the create-job-groups branch from 315d291 to 68168ca Compare December 1, 2023 00:15
@jigold jigold changed the title [batch] WIP -- Create job groups [batch] Create new job groups without ability to add jobs Dec 1, 2023
@jigold jigold marked this pull request as ready for review December 1, 2023 00:16
@daniel-goldstein daniel-goldstein left a comment

This might not cover everything because I think some new commits came in while I was reviewing. The majority of lines changed here look great; I just have a couple of core concerns with the API changes, which I think are too strongly driven by the *-fast endpoints. In particular, I think update-fast is doing a lot here. update-fast is a convenience for bundling up the following steps:

  1. Create an update of jobs
  2. Submit jobs
  3. Commit the update of those jobs

I might be wrong, but it seems to me like that is now the de facto way in aioclient to also submit just a job group with no jobs, and that seems wrong. I think we would be better served by starting simpler and adding the ability to bundle operations once the right "primitives" are established. I think we would not be losing much if we just left all the update-related operations alone and added a route for creating job groups, and we would reduce the complexity greatly. We can then revisit by adding a way to "create a job group and a jobs update in one request".
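A minimal sketch of the simpler client surface described above: a dedicated call for creating job groups, kept separate from the jobs-update flow. All names here (`create_job_groups`, `JobGroupSpec`, the route path in the docstring) are illustrative assumptions, not Hail's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobGroupSpec:
    # relative ID within this request; the server assigns absolute IDs
    job_group_id: int
    absolute_parent_id: Optional[int] = None
    attributes: Optional[dict] = None


def create_job_groups(specs: List[JobGroupSpec]) -> List[int]:
    """Stub for a dedicated route, e.g. POST .../job-groups/create.

    Returns the absolute IDs the server would assign; here we pretend
    the server hands out sequential IDs starting at 1.
    """
    start_job_group_id = 1
    return [start_job_group_id + spec.job_group_id - 1 for spec in specs]
```

Under this stub, `create_job_groups([JobGroupSpec(1), JobGroupSpec(2)])` yields `[1, 2]`; the point is only that job-group creation stands alone as a primitive, with bundling layered on later.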

'callback': nullable(str_type),
'cancel_after_n_failures': nullable(numeric(**{"x > 0": lambda x: isinstance(x, int) and x > 0})),
'absolute_parent_id': nullable(int_type),
'in_update_parent_id': nullable(int_type),

I think the semantics of this PR are sound but this name trips me up, especially above when the job spec contains the in_update_job_group_id. I think there are two distinct concepts here that are both being given the name "update" and I want to make sure they're not conflated. Am I correct that this field can only refer to job groups that are submitted together within the same HTTP request to the front-end? Because we are not reserving ranges of job group IDs in the database there is no way to resolve a relative ID across different requests. So this is more like an in_request_parent_id.

I see why you initially had this as part of the batch_updates table, so that you could use relative IDs in this way, but I don't see this being a heavily used feature right away and I hesitate to introduce changes that are not heavily used and therefore heavily tested. For QOB, we are basically never going to submit nested job groups in the same request. The absolute parent job group ID is always known because it is the job group the query driver is running in.

Additionally, wasn't there a conversation of limiting the depth of a job group tree? If so, there would be a small upper bound on the number of requests necessary to submit job group specs one layer of the tree at a time, which I think disincentivizes this feature even more. What do you think about removing in_update_parent_id for now?
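To make the distinction concrete, here is a hedged sketch of how an in-request (relative) parent reference differs from an absolute one: the relative ID can only be resolved once the server returns `start_job_group_id` for that same request. The function and field names are hypothetical.

```python
from typing import Optional


def resolve_parent_id(start_job_group_id: int,
                      absolute_parent_id: Optional[int] = None,
                      in_request_parent_id: Optional[int] = None) -> int:
    """Resolve a job group's parent reference to an absolute ID.

    Exactly one of the two parent fields must be set. The relative form
    only makes sense for groups submitted in the same HTTP request,
    because ranges of job group IDs are not reserved ahead of time.
    """
    if (absolute_parent_id is None) == (in_request_parent_id is None):
        raise ValueError('set exactly one of absolute or in-request parent id')
    if absolute_parent_id is not None:
        return absolute_parent_id
    return start_job_group_id + in_request_parent_id - 1
```

For example, with `start_job_group_id=5`, an in-request parent of 2 resolves to 6, while an absolute parent of 3 is passed through unchanged.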


def _submit(self, in_update_start_job_group_id: int):
    self._raise_if_submitted()
    self._job_group_id = AbsoluteJobId(in_update_start_job_group_id + self._job_group_id - 1)

Suggested change
self._job_group_id = AbsoluteJobId(in_update_start_job_group_id + self._job_group_id - 1)
self._job_group_id = AbsoluteJobGroupId(in_update_start_job_group_id + self._job_group_id - 1)

right?

cancel_after_n_failures=cancel_after_n_failures)

self._job_groups.append(jg)
self._job_group_specs.append(jg_spec)

Instead of keeping these in sync, can you just have the spec be a field of the job group? You could also then not deduplicate these fields like attributes that are set on both objects.
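A sketch of the restructuring suggested here, using hypothetical class names: the spec lives on the job group object itself, so there is no second list to keep in sync and no fields duplicated across the two objects.

```python
class JobGroup:
    def __init__(self, job_group_id: int, spec: dict):
        self.job_group_id = job_group_id
        self.spec = spec  # single source of truth for attributes etc.


class Batch:
    def __init__(self):
        self._job_groups = []  # no parallel _job_group_specs list

    def create_job_group(self, job_group_id: int, **spec) -> JobGroup:
        jg = JobGroup(job_group_id, spec)
        self._job_groups.append(jg)  # one append, nothing to drift apart
        return jg

    def _job_group_specs(self):
        # derive the specs on demand instead of maintaining them separately
        return [jg.spec for jg in self._job_groups]
```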

return int(update_json['start_job_id'])

start_job_group_id = update_json['start_job_group_id']
if start_job_group_id is not None:

This is None if we don't submit any job groups?


json_resp = await resp.json()
start_job_group_id = json_resp['start_job_group_id']
if start_job_group_id is not None:

How would this be None here?

return start_job_id
if n_job_bunches == 0 and n_job_group_bunches == 0:
log.warning('Tried to submit an update with 0 jobs and 0 job groups. Doing nothing.')
return (None, None)

I think the old way of structuring this method is not scaling very well to also include job groups. It's muddling the logic of what to do with just jobs/updates, which I think is introducing subtle bugs. If someone tries to create a job group and no jobs, I think it will still call self._create_update, which is illegal with no jobs, right?
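One way to structure the guard this comment describes, sketched with stubbed internals (all method names and return values below are placeholders for the real client/server interactions): job-group creation is handled independently, and the jobs update is only ever created when there are jobs.

```python
class BatchClientSketch:
    # Stubs standing in for real HTTP calls to the batch front-end.
    def _create_job_groups(self):
        return 1   # pretend server-assigned start_job_group_id

    def _create_update(self):
        return 42  # pretend update_id

    def _submit_jobs(self, update_id):
        pass

    def _commit_update(self, update_id):
        return 7   # pretend start_job_id

    def submit(self, n_jobs: int, n_job_groups: int):
        if n_jobs == 0 and n_job_groups == 0:
            return (None, None)
        start_job_group_id = self._create_job_groups() if n_job_groups > 0 else None
        start_job_id = None
        if n_jobs > 0:  # never create an update with zero jobs
            update_id = self._create_update()
            self._submit_jobs(update_id)
            start_job_id = self._commit_update(update_id)
        return (start_job_id, start_job_group_id)
```

With this shape, submitting only job groups skips the update path entirely instead of threading through job-bunch logic.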

if len(self._job_groups) > 0:
await self._submit_job_group_bunches(byte_job_group_specs_bunches, job_group_bunch_sizes, job_group_progress_task)
# we need to recompute the specs now that the job group ID is absolute
job_specs = [spec.to_dict() for spec in self._job_specs]

Is there some hidden mutation here that means running this code again produces a different result? That's pretty spooky and tells me that this should be restructured.

start_job_id = await self._commit_update(update_id)
self._submission_info = BatchSubmissionInfo(used_fast_path=False)
log.info(f'updated batch {self.id}')
return (start_job_id, start_job_group_id)

I think I need to stew a bit on this _submit method. It seems like a lot of complexity, most of which we will possibly never use in QoB, which makes it hard for me to reason about. For context, the Scala side of things will just need to add a single piece of information that is "put all these jobs in a new job group that is a child of my current job group".

It really feels to me like the *-fast methods are driving the whole design when they're really just an optimization to batch server operations into fewer HTTP requests. Maybe we should start with just resigning ourselves to two HTTP requests and have _submit be two stages:

  1. Create job group(s)
  2. Do everything like it did before

I recognize that it will be nice to have one HTTP request for small submissions but to me that feels like a much easier optimization to make once the existing operations are in place.
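The two-stage flow could look roughly like the following sketch, with the HTTP layer injected as a plain callable so the protocol shape is visible. The route paths and field names are assumptions for illustration, not Hail's actual endpoints.

```python
def submit_two_stage(post, job_group_specs, job_specs):
    """post(path, payload) -> dict is a stand-in for an HTTP POST helper.

    Stage 1 creates the job group(s); stage 2 submits the jobs exactly
    as before, now that every in-request job group reference can be
    rewritten to an absolute ID.
    """
    # Stage 1: create job groups; the server returns the absolute start ID.
    start = post('/job-groups/create', job_group_specs)['start_job_group_id']

    # Stage 2: make relative references absolute, then submit jobs.
    for spec in job_specs:
        rel = spec.pop('in_request_job_group_id', None)
        if rel is not None:
            spec['absolute_job_group_id'] = start + rel - 1
    return post('/updates/submit', job_specs)['start_job_id']
```

Bundling both stages into one request (the *-fast path) can then be layered on as an optimization once these primitives exist.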

@danking danking self-assigned this Jan 11, 2024

danking commented Jan 11, 2024

In the spirit of spreading out review burden a bit, I'm gonna pick up review for this PR. I'll be sure to read up on all the stacked conversations so I'm cached in before I review this one. I'll stick a review on here and you can dismiss when this is ready for a look :)

@danking danking left a comment

dismiss when ready for review


jigold commented Feb 1, 2024

Closing in favor of #14170

@jigold jigold closed this Feb 1, 2024