We could add utilities here for constructing a DAG of the broadcast expressions (similar to how TensorFlow works) and dispatching into specialized kernels (see the sketch after this list):

- If the DAG shows that all broadcast expressions can be executed in parallel, we can launch a single kernel with an additional CUDA thread dimension, so that all of the broadcast expressions are computed concurrently.
- If the DAG shows heavy dependence between broadcast expressions, simply fusing the kernel allows reads and writes to be hoisted.
- If it's something in between, we could break the fused object up into multiple specialized kernels.
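Here is a minimal CUDA sketch of the first two dispatch strategies, just to make the idea concrete. The kernel names, expression bodies, and array layout are hypothetical stand-ins for whatever the DAG analysis would actually produce; it assumes two simple elementwise expressions over the same operands.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Case 1: the DAG shows the broadcast expressions are independent.
// A single launch uses blockIdx.y as an extra grid dimension to select
// which expression a block computes, so both run concurrently.
__global__ void independent_broadcasts(const float* x, const float* y,
                                       float* out_a, float* out_b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.y == 0) {
        out_a[i] = x[i] + y[i];   // hypothetical expression A
    } else {
        out_b[i] = x[i] * y[i];   // hypothetical expression B
    }
}

// Case 2: the DAG shows expression B depends on expression A.
// Fusing them into one kernel lets each thread load x[i] and y[i] once
// (hoisted reads) and write only the final result (hoisted write).
__global__ void fused_broadcasts(const float* x, const float* y,
                                 float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];              // loads hoisted into registers
    float yi = y[i];
    float a  = xi + yi;           // intermediate never touches global memory
    out[i]   = a * yi;
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *a, *b;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x, 2);  // y-dimension = one slot per expression
    independent_broadcasts<<<grid, block>>>(x, y, a, b, n);

    fused_broadcasts<<<dim3((n + block.x - 1) / block.x), block>>>(x, y, b, n);
    cudaDeviceSynchronize();
    printf("independent a[0]=%f, fused b[0]=%f\n", a[0], b[0]);

    cudaFree(x); cudaFree(y); cudaFree(a); cudaFree(b);
    return 0;
}
```

The trade-off is roughly: the extra grid dimension keeps independent expressions in a single launch, while the fused variant gives up that parallelism in exchange for register-level reuse of the hoisted operands.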