
Create dispatch system for executors #3263

Status: Open · wants to merge 41 commits into base: main
Conversation

csarofeen (Collaborator) commented Oct 24, 2024

Separate out `ExprEvalExecutor` and `HostIrExecutor` from what's now called `KernelExecutor`. Create a dispatch system for them, as compile and run are simpler for the former two.

Also renamed variable instances of `FusionExecutorCache` to `executor_cache`, `KernelExecutor` to `ke`, `ExprEvalExecutor` to `eee`, and `HostIrExecutor` to `hire`. That makes this PR large, but it was critical to refactor all the instances of these classes.

For review, focus on the following files:
csrc/host_ir/executor.[cpp,h]
csrc/runtime/executor.[cpp,h]
csrc/runtime/executor_abstract.h
csrc/runtime/executor_dispatch.[cpp,h]
csrc/runtime/fusion_executor_cache.cpp
csrc/runtime/fusion_kernel_runtime.[cpp,h]

The remaining files only contain renames. I would break this into multiple PRs, but it would be difficult to do at this point.

@@ -845,13 +845,6 @@ bool Fusion::hasDynamicTransform() {
return !ir_utils::getTVsWithDynamicTransform(this).empty();
}

csarofeen (Collaborator, Author):

Just moved this function to executor.cpp as it wasn't used anywhere else.

@@ -326,17 +326,15 @@ SegmentProfiler::SegmentProfiler(uint32_t id, bool cupti_disabled)
output_bytes_(0),
kernel_profile_state_(ProfilerState::Ready) {}

void SegmentProfiler::startCompile(int device) {
device_ = device;
void SegmentProfiler::startCompile() {
csarofeen (Collaborator, Author):

Separated out setting the device into its own function. KernelExecutor knows the device at compilation, since runtime information is needed for it; the other executors set it in run().

@@ -22,6 +22,7 @@ namespace nvfuser {
//! \enum ProfilerState
//! \brief An enum used to represent the state of a profiling state machine
enum class ProfilerState {
None,
csarofeen (Collaborator, Author):

Just added this to initialize the state on construction.

Collaborator:

I doubt this is needed. ProfilerState::Ready seems to be a good initial state already -- all reset* functions set the state to that. cc @kevinstephano

inputs,
user_sched.fusion_id_,
user_sched.device_id_);
user_sched.scheduled_fusion.get(), inputs
csarofeen (Collaborator, Author):

Need Ryan's advice here.

csarofeen (Collaborator, Author):

@rdspring1 another place I could use your help, please see the comment below.

Collaborator:

@rdspring1 Could you take a look here?

…idevice/executor.[cpp,h] and rename to HostIrExecutor.
samnordmann (Collaborator) commented Nov 6, 2024

> One detailed challenge, regardless of one-pass or two-pass, is a sum along a host dimension. It can't run entirely on host (the addition would be super slow) or entirely on device (would disable overlapping). Host IR lowering has to turn that into a for-loop of additions and fuse the addition into the previous kernel.

I am not sure I understand why there is a specific challenge here. Don't we just need to accumulate across the host for-loop iterations? HostIrExecutor can support that through a HostUnit with aliased I/O, or we could also easily add support for at::sum_out and at::add_out.

naoyam added a commit that referenced this pull request Nov 7, 2024
…he (#3349)

This is just a mechanical name change, intended to simplify #3263.
naoyam (Collaborator) commented Nov 7, 2024

@csarofeen I merged #3349. I did `git merge -s ours`, so nothing should be overwritten by the merge.

naoyam (Collaborator) commented Nov 7, 2024

!test

naoyam added a commit that referenced this pull request Nov 7, 2024
Follow-up to #3349 

`KernelExecutor::compileFusion` -> `KernelExecutor::compile`
`KernelExecutor::runFusion` -> `KernelExecutor::run`
naoyam (Collaborator) commented Nov 7, 2024

!test

wujingyue (Collaborator):

> For example, if we want to pipeline Allgather + GEMM for fine grain overlap, we might want to schedule the Host IR program leaving the GEMM as a "bulked" HostUnit; but then scheduling the GEMM might need a further segmentation into several kernels. Does it make sense?

Maybe -- I'm unsure what it buys us to "leave the GEMM as a bulked HostUnit". Before host IR lowering, we already know which IterDomains in loop domains are host-parallelized. So I believe host IR lowering can run a compute-at-map analysis to generate a host for-loop regardless of how many HostUnits/segments the loop body contains.

wujingyue (Collaborator):

> HostIrExecutor can support that through a HostUnit with aliased I/O

Yes. I was talking about the mechanism to generate that HostUnit/Fusion with I/O aliases. It's easy to device-lower a sum into a loop of additions, because it's put in one IrContainer, the kernel. However, host IR lowering has to deal with multiple containers. (I wouldn't be surprised at all if you know how to implement this -- I am just unsure myself)

> or we could also easily add support for at::sum_out and at::add_out

Yes but suboptimal -- inputs to sum_out and add_out would have to be materialized to global memory.

samnordmann (Collaborator) commented Nov 7, 2024

> For example, if we want to pipeline Allgather + GEMM for fine grain overlap, we might want to schedule the Host IR program leaving the GEMM as a "bulked" HostUnit; but then scheduling the GEMM might need a further segmentation into several kernels. Does it make sense?
>
> Maybe -- I'm unsure what it buys us to "leave the GEMM as a bulked HostUnit". Before host IR lowering, we already know which IterDomains in loop domains are host-parallelized. So I believe host IR lowering can run a compute-at-map analysis to generate a host for-loop regardless of how many HostUnits/segments the loop body contains.

I mean that the scheduler will be applied in a hierarchical way. We will need to "host"-schedule "AG+GEMM" on the one hand, and to schedule the GEMM on the other. So the two schedulers need to be composable and applied one after the other on overlapping segments. As long as segmentation and scheduling are tied together, this also holds for segmentation.

> HostIrExecutor can support that through a HostUnit with aliased I/O
>
> Yes. I was talking about the mechanism to generate that HostUnit/Fusion with I/O aliases. It's easy to device-lower a sum into a loop of additions, because it's put in one IrContainer, the kernel. However, host IR lowering has to deal with multiple containers. (I wouldn't be surprised at all if you know how to implement this -- I am just unsure myself)
>
> or we could also easily add support for at::sum_out and at::add_out
>
> Yes but suboptimal -- inputs to sum_out and add_out would have to be materialized to global memory.

By definition, if it's a host operation, we need to produce the data in global memory. That is true for any host op, including the case of a HostUnit with aliased I/O. I think the HostUnit alternative will not achieve anything more than what add_out and sum_out can, so I am not sure I understand what you are suggesting.

There is no host lowering today, so creating this HostUnit with aliased I/O is not implemented. However, it doesn't seem hard to implement, unless I'm missing something.

wujingyue (Collaborator):

> I think that the HostUnit alternative will not achieve anything more than what add_out and sum_out can achieve.

I think the HostUnit alternative gives fewer global reads/writes when the addition can be fused to the preceding kernel.

c = sum_H(a*b)  # H means along a host-parallel dimension

With the HostUnit alternative, each iteration reads a, reads b, reads c, computes c+a*b and writes that as the updated c.

With sum_out or add_out, each iteration reads a, reads b, writes a*b, reads a*b, reads c, computes c+a*b, and writes the updated c.

samnordmann (Collaborator):

> I think that the HostUnit alternative will not achieve anything more than what add_out and sum_out can achieve.
>
> I think the HostUnit alternative gives fewer global reads/writes when the addition can be fused to the preceding kernel.
>
> c = sum_H(a*b)  # H means along a host-parallel dimension
>
> With the HostUnit alternative, each iteration reads a, reads b, reads c, computes c+a*b and writes that as the updated c.
>
> With sum_out or add_out, each iteration reads a, reads b, writes a*b, reads a*b, reads c, computes c+a*b, and writes the updated c.

However, here the accumulation across iterations is still done in a globally allocated buffer (c).

I think what you are describing corresponds to fusing the operations within the host for-loop's body, say into one HostUnit, which gives the classical benefit of kernel fusion. However, my point is that fusing across iterations is not possible, and the I/O of the for-loop's body (here a, b, and c, but indeed not a*b) must be in global memory.

wujingyue (Collaborator):

> However, my point is that fusing across iterations is not possible

Agreed! I wasn't trying to argue about that.

csarofeen (Collaborator, Author):

@samnordmann @wujingyue this conversation seems pretty great, it'd be wonderful if you could capture it in a design doc.

std::vector<at::Tensor> outputs) {
FUSER_PERF_SCOPE("ExprEvalExecutor::run");

if (isProfilerEnabled()) {
Collaborator:

@csarofeen Don't we need to set the current device here like line 242?

csarofeen (Collaborator, Author):

The FusionProfiler is just a logging system; it coordinates/accumulates information based on group_id_. As long as we set the device correctly once, it's fine.

Collaborator:

I understand that, but when this run function is first called, is the profiler guaranteed to already have the correct device set?

Comment on lines +241 to +242
FusionProfiler::segment(group_id_).stopKernel();
FusionProfiler::segment(group_id_).setDevice(args.getDeviceIndex());
Collaborator:

@csarofeen Why is setDevice done after the kernel execution?

csarofeen (Collaborator, Author):

No particular reason, either way should be functional.

naoyam (Collaborator) commented Nov 8, 2024

@Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there's a lot of code changes, we should make sure everything works as expected.

Priya2698 (Collaborator):

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there's a lot of code changes, we should make sure everything works as expected.

Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI.
CC: @xwang233

naoyam (Collaborator) commented Nov 8, 2024

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there's a lot of code changes, we should make sure everything works as expected.
>
> Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI. CC: @xwang233

Please do a complete sweep just in case. Either A100 or H100. Not necessary to check both.

Priya2698 (Collaborator):

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there's a lot of code changes, we should make sure everything works as expected.
>
> Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI. CC: @xwang233
>
> Please do a complete sweep just in case. Either A100 or H100. Not necessary to check both.

Got it, we will need to use CI resources then, since the runs time out due to dlcluster time limits. @xwang233 will be able to help. We would preferably run on A100 due to better availability.

xwang233 (Collaborator) commented Nov 8, 2024

!test --pybench-full

…r user scheduling. (#3357)

The goal is to set `fusion_id` and `device_id` when creating `KernelExecutor` for `UserSchedule`. Previously, they were set during `FusionExecutor::compileFusion`. This PR is stacked on `executor_dispatch`.

**Changes to the `UserSchedule` cache system:**

**Current:** The map key is the integer value of the input arguments, and the vector is indexed by device id:
`std::unordered_map<size_t, std::vector<UserSchedule>> user_def_schedules;`

**New:** The key of the first map is the integer value of the input arguments; the key of the second map is the device.

**Why?** We can set the `fusion_id` and `device_id` in the constructors of `UserSchedule` and `KernelExecutor`.