
speed and memory plots #10

Merged: 7 commits into main from max/update_speed_memory on Nov 26, 2024

Conversation

maxjeblick (Collaborator)

This PR updates the speed_and_memory.ipynb notebook.

The notebook now plots

  • generation speed vs. compression rate
  • peak memory vs. compression rate
  • cache size vs. compression rate

for both the default and the quantized cache.

I did not include prefilling speed, as this is mostly independent of the cache size (and the repo isn't designed to optimize this part).
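For reference, here is a rough sketch of the kind of measurement loop behind these curves. It is not the notebook's exact code: the `with press(model):` usage follows the kvpress README, and peak memory is read with `torch.cuda.max_memory_allocated`.

```python
import time
import torch

def benchmark(model, inputs, press_cls, compression_ratios, max_new_tokens=256):
    """Rough sketch: time generation and record peak memory for each compression rate."""
    results = []
    for ratio in compression_ratios:
        press = press_cls(compression_ratio=ratio)  # e.g. ExpectedAttentionPress
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        with press(model):  # kvpress compresses the KV cache during prefilling
            model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        results.append({
            "compression_ratio": ratio,
            "generation_time_s": time.perf_counter() - start,
            "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
        })
    return results
```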

Apart from that, the spelling of "Hugging Face" was changed in various files.

SimJeg (Collaborator) commented Nov 25, 2024

Could you add a .gitattributes file with:

*.ipynb linguist-documentation

and check that the GitHub page stops displaying "74% Jupyter" in the languages section?

SimJeg (Collaborator) commented Nov 25, 2024

(I can't manage to add inline comments in the notebook on the GitHub UI, so I'll add them in this main thread.)

Curiosity question: is get_size_of_cache necessary? We know the formula for it (per cached token): $\text{size} = 2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d \cdot \text{precision}$, with $\text{precision}=2$ bytes for float16 and $\text{precision}=0.5$ for int4. The quantized cache adds some parameters with _scale and _shift; does that make a big difference?
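For concreteness, here is that formula spelled out as a small helper (the Llama-3.1-8B-style config values are just an example, d is the per-head dimension, and the _scale/_shift overhead is ignored):

```python
def cache_size_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    # factor 2 accounts for keys and values; bytes_per_value = 2 for (b)float16, 0.5 for int4
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value / 1e9

# Example: 32 layers, 8 KV heads, head_dim 128 (Llama-3.1-8B-style config)
print(cache_size_gb(128_000, 32, 8, 128, 2))    # ~16.8 GB in bfloat16
print(cache_size_gb(128_000, 32, 8, 128, 0.5))  # ~4.2 GB in int4, before _scale/_shift
```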

SimJeg (Collaborator) commented Nov 25, 2024

Maybe use a global variable for attn_implementation after ckpt=..., device=..., with a comment # use attn_implementation="sdpa" if no access to flash attention
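Something like this at the top of the notebook, i.e. a minimal sketch of the setup cell (the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM

ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example checkpoint
device = "cuda:0"
attn_implementation = "flash_attention_2"  # use attn_implementation="sdpa" if no access to flash attention

model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, attn_implementation=attn_implementation
).to(device)
```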

SimJeg (Collaborator) commented Nov 25, 2024

About the plots:

  • it's great to display the 3 metrics (peak memory, cache size and generation time)
  • I would remove the dashed line because it's not very clear what it is. For instance, if it's max memory, why isn't it always reached?
  • generation with the quantized cache is twice as slow as with float16, is that expected? According to the HF blog post it's not supposed to be slower. See also this issue
  • could you use the same y-axis for both bfloat16 and int4?
  • maybe remove the 400 tokens curve?

SimJeg (Collaborator) commented Nov 25, 2024

It would be great to add a single plot that illustrates why one should use kvpress. I propose recycling the data you created in the notebook into what I believe will interest users the most:

  • x-axis: context length
  • y-axis: peak memory usage
  • 4 curves:
    • bfloat16 cache
    • int4 cache
    • bfloat16 cache + 50% compression
    • int4 cache + 50% compression

Also, I would add a horizontal dashed line at X and clip all curves below this value, to clearly show that with compression/quantization you can fit a longer context on your GPU.

The plot could be saved in an assets directory at the root of the repo, along with the kvpress.jpg file. I would display it at the end of the intro.
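To make the layout concrete, here is a rough matplotlib sketch. The curves are derived from the closed-form cache-size estimate above plus an assumed fixed model footprint, only to illustrate the layout; the real figure should use the measured data from the notebook.

```python
import os
import matplotlib.pyplot as plt

weights_gb = 16        # assumed model + activation footprint
gpu_memory_gb = 24     # the horizontal "X" line, assumed GPU budget
context_lengths = [8_000, 16_000, 32_000, 64_000, 128_000]

def est_peak_gb(n_tokens, bytes_per_value, compression_ratio=0.0):
    # illustrative estimate: weights + KV cache (32 layers, 8 KV heads, head_dim 128)
    cache = 2 * 32 * 8 * 128 * n_tokens * bytes_per_value * (1 - compression_ratio) / 1e9
    return weights_gb + cache

curves = {
    "bfloat16 cache": [est_peak_gb(n, 2) for n in context_lengths],
    "int4 cache": [est_peak_gb(n, 0.5) for n in context_lengths],
    "bfloat16 cache + 50% compression": [est_peak_gb(n, 2, 0.5) for n in context_lengths],
    "int4 cache + 50% compression": [est_peak_gb(n, 0.5, 0.5) for n in context_lengths],
}

plt.axhline(gpu_memory_gb, linestyle="--", color="gray", label="GPU memory budget")
for label, memory in curves.items():
    kept = [(x, y) for x, y in zip(context_lengths, memory) if y <= gpu_memory_gb]
    if kept:  # clip: only show the points that fit in the budget
        plt.plot(*zip(*kept), marker="o", label=label)
plt.xlabel("Context length (tokens)")
plt.ylabel("Peak memory (GB)")
plt.legend()
os.makedirs("assets", exist_ok=True)
plt.savefig("assets/peak_memory.png", bbox_inches="tight")
```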

SimJeg (Collaborator) commented Nov 25, 2024

Something I don't understand about memory usage:

  • bfloat16: with 80% compression, the cache size goes from ~16 GB to ~3 GB and peak memory from ~45 GB to ~32 GB, so perfect: ~13 GB saved in both cases
  • int4: the cache size goes from ~8 GB to ~2 GB and peak memory from ~33 GB to ~30 GB. Two questions:
    • why do we go from 6 GB saved in cache size to only 3 GB saved in peak memory?
    • why is the cache size only 2x smaller than with float16?

I guess it's related to the _scale and _shift parameters. It would be great to add a comment on this, otherwise it's a bit confusing.
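A quick back-of-the-envelope check of those numbers (all values are read off the plots, so only indicative):

```python
# Observed cache sizes at 0% compression, in GB (read off the plots)
bf16_cache_gb, int4_cache_gb = 16, 8
print(bf16_cache_gb / int4_cache_gb)  # observed ratio: 2x

# Ratio expected from the per-value precision alone (2 bytes vs 0.5 bytes)
print(2 / 0.5)  # 4x, ignoring _scale/_shift tensors and any values kept in full precision
```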

maxjeblick (Collaborator, Author)

Thanks for the feedback!

  1. attn_implementation: IMO, it's OK to assume flash attention is available when running the benchmarks.
  2. As you pointed out, get_size_of_cache is not strictly necessary. I added the function as it is more explicit, IMO.
  3. "I would remove the dashed line": makes sense, thanks for spotting it.
  4. "the generation time for quantized cache is twice slower than with float16, is it expected?": interestingly, the prefilling time is similar for both methods, so there seems to be a difference between the prefilling and the generation phase (see the timing sketch below).
  5. "could you use the same y-axis for both bfloat16 and int4": yeah, I can change that.
  6. "maybe remove the 400 tokens curve?": makes sense, I'll replot starting at 8,000 tokens.
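A quick sketch of how prefilling and generation could be timed separately to pin this down (the helper name and structure are illustrative, not the notebook's code; run it once with the default cache and once with a quantized cache to compare):

```python
import time
import torch

def time_prefill_and_decode(model, inputs, max_new_tokens=128):
    """Time the prefill (first forward pass) and the decode loop separately,
    so a quantized-cache slowdown can be attributed to one phase or the other."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(**inputs, use_cache=True)  # prefill: builds the KV cache
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - start

    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(-1)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(max_new_tokens):  # decode: one token per forward pass
            out = model(input_ids=next_token, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1:].argmax(-1)
    torch.cuda.synchronize()
    return prefill_s, time.perf_counter() - start
```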

SimJeg (Collaborator) commented Nov 25, 2024

Thanks for the updates and cool plots. Could you add the last plot to the main README? (And maybe create an assets dir for the image and kvpress.jpg.)

maxjeblick (Collaborator, Author)

"Could you add the last plot to the main README?"

It is in the README under the evaluation tab (I can also move it somewhere else, or create a new section). I placed the image under evaluation/assets, but I can also move it to a new assets folder, probably together with the repo logo.

SimJeg merged commit c38d169 into main on Nov 26, 2024
2 checks passed
SimJeg deleted the max/update_speed_memory branch on November 26, 2024 at 07:50