Hi,
I might have misunderstood the API, but with the following script I'd expect the output to be two identical output_states (in ragged format). But when I run it, the first output state (when unpacked from the ragged format) is non-zero, while the second is all zeros.
As I increase the number of pages per sequence (i.e. change paged_kv_indptr to [0, 4, 8], for example) while keeping the page size constant, the two outputs converge. But if I keep a single (incomplete) page per sequence, I always get this non-zero/zero behaviour for all page sizes.
The same behaviour persists with torch.float16 vs. torch.bfloat16.
Are sequences shorter than a single page size not supported for batched prefills?
Hi @fergusfinn, I checked your script carefully, and it turns out paged_kv_last_page_len = tensor([1, 1], device="cuda:0") is a tensor with data type int64, and it was reinterpreted as int32 inside the kernel (we should improve the error message).
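For reference, a minimal sketch of the fix being described: torch.tensor(...) built from Python ints defaults to int64, so the index/length tensors need an explicit dtype=torch.int32. Only paged_kv_last_page_len = [1, 1] is taken from the thread; the qo_indptr / paged_kv_indptr / paged_kv_indices values below are illustrative guesses for a two-sequence, one-page-each setup, not the original script.

```python
import torch

# torch.tensor([...]) with Python ints defaults to torch.int64; the kernel
# reinterprets that buffer as int32, silently corrupting the values.
paged_kv_last_page_len = torch.tensor([1, 1], device="cuda:0")  # int64 -> buggy

# Fix: build all of the "pointer"-style tensors with an explicit int32 dtype.
qo_indptr              = torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda:0")
paged_kv_indptr        = torch.tensor([0, 1, 2], dtype=torch.int32, device="cuda:0")
paged_kv_indices       = torch.tensor([0, 1],    dtype=torch.int32, device="cuda:0")
paged_kv_last_page_len = torch.tensor([1, 1],    dtype=torch.int32, device="cuda:0")
# These tensors are then handed to the BatchPrefillWithPagedKVCacheWrapper setup call.
```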
Wow, thank you! I should have realised; I did the same thing previously for the qo_indptr, but got an illegal memory access instead.
Would it be possible to add something to the docs regarding the accepted integer datatypes? (I'm guessing it's int32 for all of these 'pointer' tensors across the library?) Happy to open a PR if that's useful.
I'm guessing it's int32 for all of these 'pointer' tensors across the library
On the kernel side we support idtype=int64 as well (though those kernels are not compiled ahead-of-time as part of the wheel), but I think int32 is still the common practice at this moment.
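As a possible starting point for such a docs/PR note, here is a hypothetical helper that normalizes index-style tensors before they reach the wrapper. The ensure_int32 name and the surrounding usage are my own sketch, not part of the flashinfer API; the int32 default for AOT-compiled kernels is the behaviour discussed above.

```python
import torch

def ensure_int32(t: torch.Tensor, name: str) -> torch.Tensor:
    """Hypothetical helper: cast index-style tensors to int32, failing loudly on overflow.

    The ahead-of-time compiled kernels expect int32 index tensors by default;
    passing int64 silently reinterprets the buffer (see the discussion above).
    """
    if t.dtype == torch.int32:
        return t
    if t.numel() > 0 and int(t.max()) > torch.iinfo(torch.int32).max:
        raise ValueError(f"{name} contains values that do not fit in int32")
    return t.to(torch.int32)

# Example: normalize index tensors before handing them to the prefill wrapper.
qo_indptr = ensure_int32(torch.tensor([0, 1, 2], device="cuda:0"), "qo_indptr")
paged_kv_last_page_len = ensure_int32(
    torch.tensor([1, 1], device="cuda:0"), "paged_kv_last_page_len"
)
```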