
Drafting polymorphic value support for dtensors. #3940

Draft · wants to merge 1 commit into main
Conversation

@csarofeen (Collaborator) commented on Feb 21, 2025

I was giving this a quick shot but ran into some difficulty. Just posting as a heads-up for @wujingyue @syed-ahmed

I would prefer not to modify polymorphic_value.h so heavily (it looks like I also removed some functions that are needed). To avoid modifying PolymorphicValue so much, we'd need a simpler DistributedTensor implementation that doesn't depend on DeviceMesh and ParallelType.

Otherwise we just end up with circular dependencies involving type.h.

A bit more context:
I intend to change all of nvFuser's return types to KernelArgumentHolder, which holds a vector of PolymorphicValue. So if we want to return distributed tensors, they should work with PolymorphicValue.
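
To make the decoupling idea above concrete, here is a minimal sketch (not part of this PR) of a DistributedTensor that stores only the local shard plus plain-integer sharding metadata, so polymorphic_value.h would not need DeviceMesh or ParallelType at all; the class and member names are hypothetical.

#include <ATen/ATen.h>
#include <cstdint>
#include <unordered_map>
#include <utility>

namespace nvfuser {

// Hypothetical simplified DistributedTensor: only the local shard and a plain
// map from mesh axis to tensor axis, so there is no dependency on DeviceMesh
// or ParallelType (and hence no include cycle through type.h).
class SimpleDistributedTensor {
 public:
  explicit SimpleDistributedTensor(at::Tensor local)
      : local_(std::move(local)) {}

  // Record that mesh axis `mesh_axis` shards tensor axis `tensor_axis`.
  void setAxisShardedOn(int64_t mesh_axis, int64_t tensor_axis) {
    axis_sharded_on_[mesh_axis] = tensor_axis;
  }

  const at::Tensor& local() const {
    return local_;
  }

 private:
  at::Tensor local_;
  // Mesh axis -> tensor axis, as plain integers instead of ParallelType keys.
  std::unordered_map<int64_t, int64_t> axis_sharded_on_;
};

} // namespace nvfuser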

@csarofeen csarofeen marked this pull request as draft February 21, 2025 16:57

Description

  • Added support for holding a DistributedTensor in a PolymorphicValue.

  • Implemented isSame and toTensor functions in polymorphic_value.cpp (see the sketch after this list).

  • Enhanced DistributedTensor class with copy constructor and assignment operator.

  • Updated Val class to use PolymorphicValue_functions::castToDtype.
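
As a reference for the isSame item above, here is a minimal sketch of the kind of NaN-sensitive comparison isSame needs for at::Tensor payloads; the helper name and dispatch are illustrative, not the PR's actual implementation.

#include <ATen/ATen.h>

namespace {

// Hypothetical helper: compare two at::Tensor payloads the way a NaN-sensitive
// isSame might, mirroring isSameNanSensitive for scalars (NaNs at matching
// positions compare equal).
bool tensorsAreSame(const at::Tensor& a, const at::Tensor& b) {
  if (!a.defined() || !b.defined()) {
    return a.defined() == b.defined();
  }
  if (a.scalar_type() != b.scalar_type() || !a.sizes().equals(b.sizes())) {
    return false;
  }
  if (c10::isFloatingType(a.scalar_type())) {
    // rtol/atol of 0 demands exact equality; equal_nan=true makes NaN == NaN.
    return at::allclose(a, b, /*rtol=*/0.0, /*atol=*/0.0, /*equal_nan=*/true);
  }
  // Exact elementwise comparison for the remaining dtypes.
  return a.equal(b);
}

} // namespace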


Changes walkthrough 📝

Relevant files (Enhancement, 9 files):

  • device_mesh.cpp (+3/-2): Include additional headers for polymorphic value support.
  • polymorphic_value.cpp (+91/-5): Implement `isSame` and `toTensor` functions.
  • distributed_tensor.cpp (+31/-1): Add copy constructor, assignment operator, and destructor for DistributedTensor.
  • fusion_definition.cpp (+1/-1): Update `FusionDefinition::execute` to use pointer for `DeviceMesh`.
  • base_nodes.h (+1/-1): Update `Val` class to use `PolymorphicValue_functions::castToDtype`.
  • polymorphic_value.h (+19/-169): Add DistributedTensor to PolymorphicValue and remove redundant functions.
  • distributed_tensor.h (+12/-14): Update `DistributedTensor` to use `std::unique_ptr` for `DeviceMesh`.
  • fusion_record.h (+3/-1): Update ScalarRecord to use PolymorphicValue_functions::castToDtype.
  • type.h (+0/-23): Remove redundant `castToDtype` function.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Issue

The removal of several functions from polymorphic_value.h might lead to missing functionality. Ensure that the removed functions are not needed elsewhere in the codebase.

template <typename T>
inline bool isSameNanSensitive(const T& a, const T& b) {
  if (isNan(a) && isNan(b)) {
    return true;
  }
  return a == b;
}

bool isSame(const PolymorphicValue& a, const PolymorphicValue& b);

// Convert scalars, vector of scalars, vector of vector of scalars, etc., into
// an at::Tensor. device argument allows for the creation of CPU Scalars.
PolymorphicValue toTensor(
    const PolymorphicValue& x,
    at::DeviceType device_type = at::kCUDA,
    int8_t device_index = 0);

constexpr bool isScalar(const PolymorphicValue& x) {
  return x.is<int64_t>() || x.is<double>() || x.is<bool>() ||
      x.is<std::complex<double>>();
}

// Convert PolymorphicValue to c10::Scalar.
inline c10::Scalar toScalar(const PolymorphicValue& x) {
  if (x.is<double>()) {
    return (c10::Scalar)x.as<double>();
  } else if (x.is<int64_t>()) {
    return (c10::Scalar)x.as<int64_t>();
  } else if (x.is<bool>()) {
    return (c10::Scalar)x.as<bool>();
  } else if (x.is<std::complex<double>>()) {
    return (c10::complex<double>)x.as<std::complex<double>>();
  }
  NVF_THROW("Cannot convert ", x, " to a scalar.");
}

PolymorphicValue IValueToPolymorphicValue(const c10::IValue& val);

c10::IValue toIValue(const PolymorphicValue& x);

PolymorphicValue castToDtype(PolymorphicValue value, const DataType& dtype);

} // namespace PolymorphicValue_functions

} // namespace nvfuser

#include <struct.inl>
Circular Dependency

Including DeviceMesh and ParallelType in distributed_tensor.h might introduce circular dependencies. Ensure that the inclusion is necessary and does not cause issues.

namespace nvfuser {
class DeviceMesh;
enum class ParallelType;
namespace python_frontend {
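
A standard way to keep this from becoming a cycle is exactly the pattern hinted at by these forward declarations: the header never includes device_mesh.h or type.h, and only the .cpp sees the complete types. A hedged sketch of that split follows; the class name, members, and include paths are illustrative, and a stub DeviceMesh stands in so the sketch compiles on its own.

// distributed_tensor.h (sketch): forward declarations only, so this header
// cannot re-enter the include cycle through device_mesh.h or type.h.
#include <memory>
#include <utility>
#include <ATen/core/Tensor.h>

namespace nvfuser {

class DeviceMesh;        // complete definition lives in device_mesh.h
enum class ParallelType; // scoped enums may be forward-declared

class DistributedTensorSketch {
 public:
  explicit DistributedTensorSketch(at::Tensor local);
  ~DistributedTensorSketch(); // defined where DeviceMesh is complete

 private:
  at::Tensor local_;
  std::unique_ptr<DeviceMesh> mesh_; // fine with an incomplete type here
};

} // namespace nvfuser

// distributed_tensor.cpp (sketch): the only translation unit that needs the
// complete DeviceMesh definition (in the real code this would come from
// including device_mesh.h; the stub below keeps the sketch self-contained).
namespace nvfuser {

class DeviceMesh {}; // stub standing in for the real device_mesh.h definition

DistributedTensorSketch::DistributedTensorSketch(at::Tensor local)
    : local_(std::move(local)) {}

// With DeviceMesh complete, the defaulted destructor can destroy mesh_.
DistributedTensorSketch::~DistributedTensorSketch() = default;

} // namespace nvfuser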
Memory Management

The use of std::make_unique in the copy constructor and assignment operator might lead to memory management issues. Ensure that the ownership semantics are correct and that there are no memory leaks.

DistributedTensor::DistributedTensor(const DistributedTensor& other)
    : local_(other.local_), mesh_(std::make_unique<DeviceMesh>(*other.mesh_)) {
  axis_sharded_on_ = other.axis_sharded_on_;
}

DistributedTensor& DistributedTensor::operator=(
    const DistributedTensor& other) {
  if (this != &other) {
    local_ = other.local_;
    mesh_ = std::make_unique<DeviceMesh>(*other.mesh_);
    axis_sharded_on_ = other.axis_sharded_on_;
  }
  return *this;
}

DistributedTensor::~DistributedTensor() = default;
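
If reviewers want stronger guarantees here, one option is a null-safe copy plus copy-and-swap assignment, which is self-assignment-safe and leaves *this untouched if the copy throws. This is a sketch only; it assumes mesh_ may be null, DeviceMesh is copyable, and the operator= declaration in the header is changed to take its argument by value.

// Sketch of a null-safe, copy-and-swap alternative; not the PR's code.
DistributedTensor::DistributedTensor(const DistributedTensor& other)
    : local_(other.local_),
      mesh_(other.mesh_ ? std::make_unique<DeviceMesh>(*other.mesh_) : nullptr),
      axis_sharded_on_(other.axis_sharded_on_) {}

DistributedTensor& DistributedTensor::operator=(DistributedTensor other) {
  // The by-value parameter already made the copy; swapping into *this makes
  // assignment self-assignment-safe and exception-safe.
  std::swap(local_, other.local_);
  std::swap(mesh_, other.mesh_);
  std::swap(axis_sharded_on_, other.axis_sharded_on_);
  return *this;
}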

@wujingyue (Collaborator) commented

I think it'll help if I do https://github.com/NVIDIA/Fuser/blob/main/csrc/python_frontend/fusion_definition.h#L210. I can do that today. I didn't do that before because I like to bundle up the mesh, shardings, and tensor for clarity.

Can you remind me why returning PolymorphicValue all the way to the Python frontend helps latency? Thunder talks in at::Tensor, so at some point we'll have to extract the at::Tensor out of the PolymorphicValue.
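
For context, the extraction itself is cheap and can live right at the Python-frontend boundary. Here is a sketch using the is<>()/as<>() accessors visible in the snippet above; namespaces are abbreviated, and the local() accessor on DistributedTensor is an assumption rather than the PR's actual API.

// Sketch: turn one PolymorphicValue result into the at::Tensor that Thunder
// consumes. The DistributedTensor branch and its local() accessor are
// illustrative only.
at::Tensor extractTensor(const PolymorphicValue& v) {
  if (v.is<at::Tensor>()) {
    return v.as<at::Tensor>();
  }
  if (v.is<DistributedTensor>()) {
    // Hand back the local shard; mesh/sharding metadata stays available on
    // the DistributedTensor for callers that need it.
    return v.as<DistributedTensor>().local();
  }
  NVF_THROW("Expected a tensor result, got: ", v);
}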
