Shard input_mask for Llama #905

Draft · stbaione wants to merge 1 commit into main

Conversation

@stbaione (Contributor) commented Feb 3, 2025

Currently, the 405b model OOMs when using long input prompts (issue here).

This PR implements the suggested fix of sharding the input_mask, which makes sense since the OOM depends on the length of the input.

I'm new to sharktank and am still seeing the issue with the current implementation, so I wanted to have it double-checked to make sure it's properly implemented.

MLIR here
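
For context, a minimal sketch of what replicating the mask across devices could look like; the helper name, the use of ops.replicate, and the shard-count plumbing are assumptions for illustration, not taken from this PR:

```python
import torch
from sharktank import ops
from sharktank.types import ReplicatedTensor


# Hypothetical helper (not from this PR): build the boolean padding mask on the
# host, then replicate it so each shard holds its own copy of the
# [bs, batch_seq_len] mask rather than pulling it through a single device.
def make_replicated_input_mask(
    seq_lens: torch.Tensor, batch_seq_len: int, shard_count: int
) -> ReplicatedTensor:
    positions = torch.arange(batch_seq_len).unsqueeze(0)  # [1, batch_seq_len]
    mask = positions >= seq_lens.unsqueeze(1)             # [bs, batch_seq_len], True = padding
    return ops.replicate(mask, count=shard_count)
```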

@sogartar (Contributor) left a comment


One thing that seems unaddressed is scaled_dot_product_attention, which also applies the mask. How would that be reconciled? If is_causal=False and attention_mask=None, does it then apply no mask at all?
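
For reference, in torch's F.scaled_dot_product_attention (shown standalone here, independent of any sharktank wrapper), passing is_causal=False with attn_mask=None applies no mask at all, so every query position attends to every key position, padding included. A small illustration with arbitrary shapes:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 4, 8, 16)  # [bs, heads, seq_len, head_dim]
k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)

# No mask of any kind: attention is a plain softmax over all 8 key positions.
unmasked = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)

# The usual alternative: an additive bias with -inf at positions to exclude,
# e.g. treating the last two key positions as padding.
bias = torch.zeros(1, 1, 8, 8)
bias[..., 6:] = float("-inf")
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```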

@@ -129,20 +129,25 @@ def prefill(
# [bs, batch_seq_len]
tokens: Union[torch.Tensor, ReplicatedTensor],
*,
# [bs, batch_seq_len]
input_mask: Union[torch.Tensor, ReplicatedTensor],

This input needs a check so that it is mutually exclusive with the attention_mask arg.
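
One possible shape for that guard, sketched with this diff's argument names (the error type and message are assumptions):

```python
# Hypothetical guard (not from this PR): reject calls that pass both masks,
# since it would be ambiguous which one drives the attention computation.
if input_mask is not None and attention_mask is not None:
    raise ValueError(
        "input_mask and attention_mask are mutually exclusive; pass only one."
    )
```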

@@ -166,6 +171,8 @@ def decode(
# [bs, 1]
tokens: Union[torch.Tensor, ReplicatedTensor],
*,
# [bs, 1]
input_mask: Union[torch.Tensor, ReplicatedTensor],

This input needs a check so that it is mutually exclusive with the attention_mask arg.

self._assert_device(attention_mask, dtype=self.activation_dtype)
self._assert_device(seq_block_ids)
self._assert_device(*cache_state, dtype=self.activation_dtype)

h = self.token_embedding(tokens)
self.trace_tensor("llama.token_embedding", h)

h *= input_mask.unsqueeze(-1)

I don't think the math checks out if we assume this is meant to substitute for the attention mask.
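
Concretely, zeroing the embeddings at padded positions is not equivalent to masking those positions in attention: a zeroed key still receives exp(0)/Z softmax weight, which dilutes the attention paid to real tokens (and later bias/normalization layers reintroduce nonzero activations anyway). A standalone torch sketch of the gap:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 4, 8)  # one head, 4 positions; last two are "padding"
k = torch.randn(1, 1, 4, 8)
v = torch.randn(1, 1, 4, 8)

# "Masking" by zeroing the padded key/value vectors, analogous to zeroing h:
k0, v0 = k.clone(), v.clone()
k0[..., 2:, :] = 0
v0[..., 2:, :] = 0
zeroed = F.scaled_dot_product_attention(q, k0, v0)

# A real attention mask removes those positions from the softmax entirely:
bias = torch.zeros(1, 1, 1, 4)
bias[..., 2:] = float("-inf")
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

print(torch.allclose(zeroed, masked))  # False: the zeroed keys still get weight
```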
