
[llama] Store KV Cache on CPU and Use PyTorch SDPA for Next token generation #1182

Open
wants to merge 8 commits into base: main

Conversation


@zhentaoyu zhentaoyu commented Aug 2, 2024

What does this PR do?

[Pipeline diagram: prefill runs on the HPU, the KV cache is stored on the CPU, and next-token SDPA runs on the CPU]

Results

python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --max_new_tokens 4096 --bf16 --use_kv_cache --attn_softmax_bf16 --reuse_cache --do_sample --prompt "Tell me somethings about Intel"

  • with --kv_cache_on_host
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 2.132539697795915 tokens/second
Number of HPU graphs                = 14
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.77 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 5842.699780527037 seconds
--------------------------------------------------------------------------------------------------------------

Update (commit 4b0fa1a):

Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 12.22449896564133 tokens/second
Number of HPU graphs                = 0
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 1010.5770402610069 seconds
--------------------------------------------------------------------------------------------------------------
  • without --kv_cache_on_host
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 31.41817953959749 tokens/second
Number of HPU graphs                = 11
Memory allocated                    = 14.68 GB
Max memory allocated                = 14.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 397.36551256105304 seconds
--------------------------------------------------------------------------------------------------------------

Limitations

  • cannot generate correct results with --use_hpu_graphs because there are host-device memory transfers in the self-attention forward pass.

cc @airMeng and @luoyu-intel

Update

Yi-34B-Chat on Gaudi2 with ~11k input + 5k output tokens
command:

python run_generation.py \
--model_name_or_path 01-ai/Yi-34B-Chat \
--use_kv_cache \
--bf16 \
--attn_softmax_bf16 \
--reuse_cache \
--do_sample \
--dataset_name emozilla/pg19-test \
--batch_size 1 \
--max_input_tokens 11200 \
--column_name "text" \
--dataset_max_samples 1 \
--warmup 0 \
--n_iterations 1 \
--max_new_tokens 5000 \
--kv_cache_on_host
  • without kv_cache_on_host:
 09/18/2024 05:28:11 - INFO - __main__ - Graph compilation...
Traceback (most recent call last):
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 707, in <module>
    main()
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 655, in main
    generate_dataset(batch)
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 633, in generate_dataset
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1299, in generate
    result = self._sample(
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2239, in _sample
    self.htcore_generation.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 26, in wrapper
    func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 66, in mark_step
    htcore._mark_step(device_str, sync)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::28127918336 failed!
  • with kv_cache_on_host:
Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127 seconds
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------
  • enlarged output token number with kv_cache_on_host:
    --max_input_tokens 11200 --max_new_tokens 10000
Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127 seconds
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------

@airMeng

airMeng commented Aug 2, 2024

@hshen14 @luoyu-intel for awareness

@airMeng

airMeng commented Aug 7, 2024

@mandy-li @libinta @dvarshney-habana This is the first system-optimization PR from the Intel Neural Compressor (INC) team, could you give it a review?

Experiments with Llama2 on a single Gaudi2 card with a Xeon 8380 host. By offloading the KV cache and SDPA to the CPU, we improve the context limit from 26k (input: 10k + output: 16k) to 310k (input: 10k + output: 300k).

| Config | Context | HPU Memory (GB, steady/peak) | CPU Memory (GB) |
|---|---|---|---|
| KV cache on HPU | 10k+16k | ~90 | N/A |
| KV cache on HPU | 10k+100 | 83.36/84.11 | 4.4 |
| KV cache on HPU | 12k+100 | 91.78/92.72 | 5.03 |
| KV cache on HPU | 12k+10k | 92.06/93.0 | 7.68 |
| KV cache on HPU | 12k+100k | OOM | N/A |
| KV cache on HPU | 10k+100k | 86.22/86.97 | 31 |
| KV cache on HPU | 10k+300k | 91.94/92.70 | 85 |

@zhentaoyu zhentaoyu marked this pull request as ready for review August 8, 2024 01:20
@zhentaoyu zhentaoyu requested a review from a user August 8, 2024 01:20
@zhentaoyu zhentaoyu requested a review from regisss as a code owner August 8, 2024 01:20
@emascarenhas
Contributor

Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

@zhentaoyu
Author

Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

done.

@imangohari1
Contributor

imangohari1 commented Sep 10, 2024

@zhentaoyu
Thanks for the PR and the results in the description.
Do I read this correctly that using the KV cache on the host degrades the throughput, while also not generating correct answers with HPU graphs? If so, what's the use of this option?

This PR also has merge conflicts with main, could you please take a look at the differences?
We need to test this PR with the CI system to make sure it is not breaking anything and not impacting any performance.

@zhentaoyu
Author

@zhentaoyu Thanks for the PR and the results in the description. Do I read this correctly that using the KV cache on the host degrades the throughput, while also not generating correct answers with HPU graphs? If so, what's the use of this option?

This PR also has merge conflicts with main, could you please take a look at the differences? We need to test this PR with the CI system to make sure it is not breaking anything and not impacting any performance.

  1. Yes. It's an option for long-context inference or generation when a single HPU card runs out of memory. In this PR, I just use torch.Tensor.to to transfer the KV-cache-related tensors between the CPU and Gaudi2 and make the next-token SDPA happen on the CPU to save data-transfer time (see the sketch after this list). However, it cannot generate the right answer with --use_hpu_graphs. I'm not familiar with the Habana Synapse graph, so please tell me if you have any insights; I'm happy to try to fix it.
  2. OK, I have rebased the PR.
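A minimal sketch of that idea, with illustrative names and shapes (not the PR's exact code in modeling_llama.py): the KV cache stays resident on the CPU, only the single-token query is moved over for each decode step, attention runs through torch.nn.functional.scaled_dot_product_attention on the CPU, and the result is sent back to the HPU.

```python
import torch
import torch.nn.functional as F

def decode_step_attention(query_hpu, key_cache_cpu, value_cache_cpu):
    # query_hpu:           [bs, num_heads, 1, head_dim], produced on the HPU for the new token
    # key/value_cache_cpu: [bs, num_heads, kv_len, head_dim], kept on the CPU
    query_cpu = query_hpu.to("cpu")              # move only the small query tensor
    attn_out = F.scaled_dot_product_attention(   # CPU SDPA over the host-resident cache
        query_cpu, key_cache_cpu, value_cache_cpu
    )
    return attn_out.to(query_hpu.device)         # copy the one-token result back to the HPU
```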

@zhentaoyu
Author

Hi @imangohari1, I have updated the PR (see the description). Could you please take another look when you have free time? Please let me know if you have more comments or need more tests. Thanks a lot.
cc @hshen14

else:
unwrap_deepspeed_model(self).allocate_kv_cache(
bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens
)
Collaborator

For lines 1096 to 1107, I would suggest changing it like this.

if not is_greedy_or_beam_and_bucket:
    cache_device = "hpu"
    if generation_config.kv_cache_on_host and self.config.model_type in ["llama"]:
        print("Allocate KV Cache on CPU...")
        cache_device = "cpu"
    unwrap_deepspeed_model(self).allocate_kv_cache(
        bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens,
        device=cache_device
    )

Author

Thanks, I have updated it in 74e94ff. However, I cannot remove the else line because I only modified modeling_llama.py for this experimental feature.

@yeonsily
Collaborator

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single HPU card runs out of memory"?
The README example is Llama 7b and we don't see an advantage for this run. It would be good if we could put in a real example.

@zhentaoyu
Author

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single HPU card runs out of memory"? The README example is Llama 7b and we don't see an advantage for this run. It would be good if we could put in a real example.

Hi @yeonsily, thanks for your comment. Yes, I added a case to the README and updated the results in the PR description.

else:
with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
else:
if kv_cache_on_host:
Collaborator

Can you please explain in what case the kv_cache device is switched? I thought line 656 applies only in the case of line 658.

Author

In this PR, we store the KV cache on the CPU and do CPU SDPA only when generating the next token. The first-token (prefill) stage is performed on the HPU because of its powerful compute in long-context scenarios (a long prompt in most cases). The full pipeline diagram is shown in the PR description.
So line 658 tells the machine it can do PyTorch CPU SDPA (flash attention) only when kv_cache_on_host is set, we are generating the next token, and we are at the inference stage; otherwise, it will transfer the KV cache to the HPU device if needed for the original operations (see the sketch below).
Please let me know if you need more explanation or have any suggestions. Thanks.
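A hedged sketch of that condition; the function and variable names below are illustrative, not the exact ones used in modeling_llama.py.

```python
def should_run_cpu_sdpa(kv_cache_on_host: bool, q_len: int, training: bool) -> bool:
    # CPU SDPA is used only for the decode step (a single new query token) at
    # inference time, and only when the cache was allocated on the host; prefill,
    # training, and HPU-resident caches keep the original HPU attention path.
    return kv_cache_on_host and q_len == 1 and not training
```

When this condition is false, the host-resident cache tensors would be moved back to the HPU (for example with torch.Tensor.to) before the usual attention path runs, matching the fallback described above.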

@airMeng

airMeng commented Oct 29, 2024

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single HPU card runs out of memory"? The README example is Llama 7b and we don't see an advantage for this run. It would be good if we could put in a real example.

@yeonsily a similar feature is already available in TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/kv_cache_reuse.html#offloading-to-host-memory

@yeonsily
Collaborator

@zhentaoyu Can you please rebase your change? We can merge this change after that.

@zhentaoyu
Author

zhentaoyu commented Nov 22, 2024

@zhentaoyu Can you please rebase your change? We can merge this change after that.

Yes, rebased. Thanks a lot.

Collaborator

@yeonsily yeonsily left a comment

Please resolve conflicts also.

@airMeng

airMeng commented Nov 24, 2024

@yeonsily Anything else needed before merging?

@yeonsily
Collaborator

@airMeng Can you please run all of the llama CI tests to make sure this doesn't impact the current numbers? You can check the llama CI cases in the https://github.com/huggingface/optimum-habana/tree/main/tests folder. Thanks.

@airMeng

airMeng commented Nov 26, 2024

Hi @yeonsily, can't the CI/CD set up on GitHub be triggered?

@regisss
Collaborator

regisss commented Nov 26, 2024

Hi @yeonsily, can't the CI/CD set up on GitHub be triggered?

I can trigger it but the PR CI won't test what @yeonsily is suggesting, you'll have to run these tests manually.

@zhentaoyu
Author

Hi @yeonsily, I have run the llama model test cases in tests/test_text_generation_example.py with this PR; see the code below:

if os.environ.get("GAUDI2_CI", "0") == "1":
    # Gaudi2 CI baselines
    MODELS_TO_TEST = {
        "bf16_1x": [
            ("meta-llama/Llama-2-7b-hf", 1, True, 141.25776956002076, True),
            ("meta-llama/Meta-Llama-3-8B", 1, True, 129, False),
            ("meta-llama/Llama-2-7b-hf", 512, True, 12808, False),
            ("meta-llama/Llama-2-7b-hf", 512, False, 8711, False),  # in some cases like TGI, reuse_cache isnt used
        ],
        "fp8": [
            ("meta-llama/Llama-2-7b-hf", 1, 1230, False, 128, 128, 13152.7),
            ("meta-llama/Llama-2-7b-hf", 1, 163, False, 128, 2048, 4774.7),
            ("meta-llama/Llama-2-7b-hf", 1, 94, False, 2048, 128, 1293.3),
            ("meta-llama/Llama-2-7b-hf", 1, 81, False, 2048, 2048, 1942.9),
        ],
        "load_quantized_model_with_autogptq": [
            ("TheBloke/Llama-2-7b-Chat-GPTQ", 1, 10, False, 128, 2048, 456.7),
        ],
        "torch_compile": [
            ("meta-llama/Llama-2-7b-hf", 102.27823420713148),
        ],
        "torch_compile_distributed": [
            ("meta-llama/Llama-2-7b-hf", 39.72973199515235),
        ],
        "distributed_tp": [
            ("meta-llama/Llama-2-7b-hf", 1345.2369318328463),
        ],
    }

The running command is GAUDI2_CI=1 RUN_SLOW=true python test_text_generation_example.py 2>&1 | tee pytest_log.txt
My local machine's driver version is 1.18.0-ee698fb and the Docker image is vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

Results:
[screenshot of the test results]

the whole test log is here:
pytest_log.txt

@libinta libinta added the run-test Run CI for PRs from external contributors label Dec 3, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss
Collaborator

regisss commented Dec 3, 2024

@yeonsily
Collaborator

yeonsily commented Dec 3, 2024

@zhentaoyu You will need this change to fix the CI failure.

diff --git a/optimum/habana/transformers/models/llama/modeling_llama.py b/optimum/habana/transformers/models/llama/modeling_llama.py
index 075edc8..d8866d7 100755
--- a/optimum/habana/transformers/models/llama/modeling_llama.py
+++ b/optimum/habana/transformers/models/llama/modeling_llama.py
@@ -443,7 +443,8 @@ class KVCache(torch.nn.Module):
     @staticmethod
     def update(prev, cur, dim, idx, inp_seq_len):
         cur = cur.to(prev.device)
-        idx = idx.to(prev.device)
+        if idx is not None:
+            idx = idx.to(prev.device)
         orig_cur = cur
         if prev.shape == cur.shape:
             prev.copy_(cur)

Meanwhile, I think you should also try the llama training cases to check that your change doesn't affect the perf numbers. It seems you ran only the inference cases.

@zhentaoyu
Author

@zhentaoyu You will need this change to fix the CI failure.

diff --git a/optimum/habana/transformers/models/llama/modeling_llama.py b/optimum/habana/transformers/models/llama/modeling_llama.py
index 075edc8..d8866d7 100755
--- a/optimum/habana/transformers/models/llama/modeling_llama.py
+++ b/optimum/habana/transformers/models/llama/modeling_llama.py
@@ -443,7 +443,8 @@ class KVCache(torch.nn.Module):
     @staticmethod
     def update(prev, cur, dim, idx, inp_seq_len):
         cur = cur.to(prev.device)
-        idx = idx.to(prev.device)
+        if idx is not None:
+            idx = idx.to(prev.device)
         orig_cur = cur
         if prev.shape == cur.shape:
             prev.copy_(cur)

Meanwhile, I think you should also try the llama training cases to check that your change doesn't affect the perf numbers. It seems you ran only the inference cases.

Fixed, thanks.

As for the training test, I tested the function test_multiple_peft_adapters locally, since it uses the llama model with test_trainer.py. Here is the result:
[screenshot of the test result]

@regisss
Collaborator

regisss commented Dec 4, 2024

@zhentaoyu Can you also run the Llama training regression tests with

GAUDI2_CI=1 RUN_SLOW=1 pytest tests/test_examples.py -v -s -k "llama"

please?

@zhentaoyu
Author

@zhentaoyu Can you also run the Llama training regression tests with

GAUDI2_CI=1 RUN_SLOW=1 pytest tests/test_examples.py -v -s -k "llama"

please?

OK. I will update here once I get the result.

@zhentaoyu
Author

zhentaoyu commented Dec 5, 2024

Hi @regisss, I hit the error below when using GAUDI2_CI=1 RUN_SLOW=1 pytest test_examples.py -v -s -k "llama":
[screenshot of the error]
The complete log file is here:
train_log.txt

pip list:
[screenshot of the pip list output]
Do you have any ideas about it? Thanks.


github-actions bot commented Dec 5, 2024

The code quality check failed, please run make style.

@libinta libinta removed the run-test Run CI for PRs from external contributors label Dec 6, 2024