Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Tmem tiled copy with non power-of-2 size fails to compile #2094

Open
tridao opened this issue Feb 10, 2025 · 3 comments
Open

[BUG] Tmem tiled copy with non power-of-2 size fails to compile #2094

tridao opened this issue Feb 10, 2025 · 3 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@tridao
Copy link
Contributor

tridao commented Feb 10, 2025

using namespace cute;
auto tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int<65536>{}, _1{}));
Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
auto load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

This fails to compile, with shape_div error. This is because make_tmem_copy calls make_cotile_copy, which calls left_inverse on the data layout. In this case left_inverse gives shape (65440, 1) to make it divisible by 160, but that leads to layout mismatch.

Is there another way I should be constructing the tmem tiled copy with non power-of-2 size?

Cc @thakkarV

@tridao tridao added ? - Needs Triage bug Something isn't working labels Feb 10, 2025
@ccecka
Copy link

ccecka commented Feb 10, 2025

Hmmmm, this is a really interesting bug, thanks Tri.

The core issue seems to be in my lazy implementation of left_inverse, which is simply incorrect in that case.

Indeed, when I hack in the correct left-inverse layouts to make_cotiled_copy, a valid partitioner (the same partitioner) is returned that performs exactly as we would expect

  {
  Layout tmem_layout = make_layout(make_shape(_128{}, _128{}), make_stride(Int<65536>{}, _1{}));
  Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
  TiledCopy load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

  print(tmem_layout); print("\n");
  print(complement(tmem_layout)); print("\n");
  print(left_inverse(tmem_layout)); print("\n");
  print(load);
  print(load.get_slice(0).partition_S(A));  print("\n");
  }

  {
  Layout tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int<65536>{}, _1{}));
  Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
  TiledCopy load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

  print(tmem_layout); print("\n");
  print(complement(tmem_layout)); print("\n");
  print(left_inverse(tmem_layout)); print("\n");  // XXX Wrong
  print(load);
  print(load.get_slice(0).partition_S(A));   print("\n");
  }

prints

(_128,_128):(_65536,_1)
_512:_128
(_65536,_128):(_128,_1)
TiledCopy
  Tiler_MN:       ((_4,_32):(_32,_1),_1:_0)
  TiledLayout_TV: ((_32,_4),_32):((_0,_1),_4)
Copy_Atom
  ThrID:        _32:_1
  ValLayoutSrc: (_32,_32):(_0,_1)
  ValLayoutDst: (_32,_1):(_1,_1)
  ValLayoutRef: (_32,_32):(_0,_1)
  ValueType:    32b
tmem_[32b](0x0000.0000) o ((_32,_1),_1,_128):((_65536,_0),_0,_1)
(_128,_160):(_65536,_1)
_409:_160
_65440:_128
TiledCopy
  Tiler_MN:       ((_4,_32):(_32,_1),_1:_0)
  TiledLayout_TV: ((_32,_4),_32):((_0,_1),_4)
Copy_Atom
  ThrID:        _32:_1
  ValLayoutSrc: (_32,_32):(_0,_1)
  ValLayoutDst: (_32,_1):(_1,_1)
  ValLayoutRef: (_32,_32):(_0,_1)
  ValueType:    32b
tmem_[32b](0x0000.0000) o ((_32,_1),_1,_160):((_65536,_0),_0,_1)

Give me a little time to correct the implementation of left_inverse -- I'll put the patch here and get it merged asap.

@ccecka
Copy link

ccecka commented Feb 10, 2025

In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM tensor.

@tridao
Copy link
Contributor Author

tridao commented Feb 10, 2025

In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM tensor.

Yup that's the workaround i'm using right now: creating a fake tmem tensor of size (128, 128) to get the tiled tmem_copy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants