
speed and memory plots #10

Merged: 7 commits into main from max/update_speed_memory on Nov 26, 2024

Conversation

maxjeblick (Collaborator)

This PR updates the speed_and_memory.ipynb notebook.

The notebook now plots

  • generation speed vs. compression rate
  • peak memory vs. compression rate
  • cache size vs. compression rate

for both the default and the quantized cache.

I did not include prefilling speed, as this is mostly independent of the cache size (and the repo isn't designed to optimize this part).
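For reference, here is a rough sketch of the kind of measurement loop behind these curves. It is not the notebook's exact code: the `with press(model):` usage follows the kvpress README, and peak memory is read with `torch.cuda.max_memory_allocated`.

```python
import time
import torch

def benchmark(model, inputs, press_cls, compression_ratios, max_new_tokens=256):
    """Rough sketch: time generation and record peak memory for each compression rate."""
    results = []
    for ratio in compression_ratios:
        press = press_cls(compression_ratio=ratio)  # e.g. ExpectedAttentionPress
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        with press(model):  # kvpress compresses the KV cache during prefilling
            model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        results.append({
            "compression_ratio": ratio,
            "generation_time_s": time.perf_counter() - start,
            "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
        })
    return results
```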

Apart from that, the spelling of "Hugging Face" was changed in various files.

SimJeg (Collaborator) commented Nov 25, 2024

Could you add a .gitattributes file with:

*.ipynb linguist-documentation

and check that the GitHub page stops displaying "74% Jupyter" in the languages section?

SimJeg (Collaborator) commented Nov 25, 2024

(I can't manage to add inline comments in the notebook on the GitHub UI, so I'll add them in this main thread.)

Curiosity question: is get_size_of_cache necessary? We know the formula for it (per cached token): $\text{size} = 2 \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d \cdot \text{precision}$, with $\text{precision}=2$ bytes for float16 and $\text{precision}=0.5$ for int4. The quantized cache adds some parameters with _scale and _shift; does that make a big difference?
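For concreteness, here is that formula spelled out as a small helper (the Llama-3.1-8B-style config values are just an example, d is the per-head dimension, and the _scale/_shift overhead is ignored):

```python
def cache_size_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    # factor 2 accounts for keys and values; bytes_per_value = 2 for (b)float16, 0.5 for int4
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value / 1e9

# Example: 32 layers, 8 KV heads, head_dim 128 (Llama-3.1-8B-style config)
print(cache_size_gb(128_000, 32, 8, 128, 2))    # ~16.8 GB in bfloat16
print(cache_size_gb(128_000, 32, 8, 128, 0.5))  # ~4.2 GB in int4, before _scale/_shift
```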

SimJeg (Collaborator) commented Nov 25, 2024

Maybe use a global variable for attn_implementation after ckpt=..., device=..., with a comment # use attn_implementation="sdpa" if no access to flash attention
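Something like this at the top of the notebook, i.e. a minimal sketch of the setup cell (the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM

ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example checkpoint
device = "cuda:0"
attn_implementation = "flash_attention_2"  # use attn_implementation="sdpa" if no access to flash attention

model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, attn_implementation=attn_implementation
).to(device)
```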

SimJeg (Collaborator) commented Nov 25, 2024

About the plots:

  • it's great to display the 3 metrics (peak memory, cache size and generation time)
  • I would remove the dashed line because it's not very clear what it is. For instance, if it's max memory, why isn't it always reached?
  • generation with the quantized cache is twice as slow as with float16, is that expected? According to the HF blog post it's not supposed to be slower. See also this issue
  • could you use the same y-axis for both bfloat16 and int4?
  • maybe remove the 400 tokens curve?

SimJeg (Collaborator) commented Nov 25, 2024

It would be great to add a single plot that illustrates why one should use kvpress. I propose recycling the data you created in the notebook into what I believe will interest users the most:

  • x-axis: context length
  • y-axis: peak memory usage
  • 4 curves:
    • bfloat16 cache
    • int4 cache
    • bfloat16 cache + 50% compression
    • int4 cache + 50% compression

Also, I would add a horizontal dashed line at X and clip all curves below this value, to clearly show that with compression/quantization you can fit a longer context on your GPU.

The plot could be saved in an assets directory at the root of the repo, along with the kvpress.jpg file. I would display it at the end of the intro.
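To make the layout concrete, here is a rough matplotlib sketch. The curves are derived from the closed-form cache-size estimate above plus an assumed fixed model footprint, only to illustrate the layout; the real figure should use the measured data from the notebook.

```python
import os
import matplotlib.pyplot as plt

weights_gb = 16        # assumed model + activation footprint
gpu_memory_gb = 24     # the horizontal "X" line, assumed GPU budget
context_lengths = [8_000, 16_000, 32_000, 64_000, 128_000]

def est_peak_gb(n_tokens, bytes_per_value, compression_ratio=0.0):
    # illustrative estimate: weights + KV cache (32 layers, 8 KV heads, head_dim 128)
    cache = 2 * 32 * 8 * 128 * n_tokens * bytes_per_value * (1 - compression_ratio) / 1e9
    return weights_gb + cache

curves = {
    "bfloat16 cache": [est_peak_gb(n, 2) for n in context_lengths],
    "int4 cache": [est_peak_gb(n, 0.5) for n in context_lengths],
    "bfloat16 cache + 50% compression": [est_peak_gb(n, 2, 0.5) for n in context_lengths],
    "int4 cache + 50% compression": [est_peak_gb(n, 0.5, 0.5) for n in context_lengths],
}

plt.axhline(gpu_memory_gb, linestyle="--", color="gray", label="GPU memory budget")
for label, memory in curves.items():
    kept = [(x, y) for x, y in zip(context_lengths, memory) if y <= gpu_memory_gb]
    if kept:  # clip: only show the points that fit in the budget
        plt.plot(*zip(*kept), marker="o", label=label)
plt.xlabel("Context length (tokens)")
plt.ylabel("Peak memory (GB)")
plt.legend()
os.makedirs("assets", exist_ok=True)
plt.savefig("assets/peak_memory.png", bbox_inches="tight")
```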

SimJeg (Collaborator) commented Nov 25, 2024

Something I don't understand about memory usage:

  • bfloat16: with 80% compression, the cache size goes from ~16 GB to ~3 GB and peak memory from ~45 GB to ~32 GB, so perfect: ~13 GB saved in both cases
  • int4: the cache size goes from ~8 GB to ~2 GB and peak memory from ~33 GB to ~30 GB. Two questions:
    • why do we go from 6 GB saved in cache size to only 3 GB saved in peak memory?
    • why is the cache size only 2x smaller than with float16?

I guess it's related to the _scale and _shift parameters. It would be great to add a comment on this, otherwise it's a bit confusing.
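A quick back-of-the-envelope check of those numbers (all values are read off the plots, so only indicative):

```python
# Observed cache sizes at 0% compression, in GB (read off the plots)
bf16_cache_gb, int4_cache_gb = 16, 8
print(bf16_cache_gb / int4_cache_gb)  # observed ratio: 2x

# Ratio expected from the per-value precision alone (2 bytes vs 0.5 bytes)
print(2 / 0.5)  # 4x, ignoring _scale/_shift tensors and any values kept in full precision
```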

maxjeblick (Collaborator, Author)

Thanks for the feedback!

  1. attn_implementation: IMO, it's OK to assume flash attention is available when running the benchmarks.
  2. As you pointed out, get_size_of_cache is not strictly necessary. I added the function as it is more explicit, IMO.
  3. "I would remove the dashed line": makes sense, thanks for spotting it.
  4. "the generation time for quantized cache is twice slower than with float16, is it expected?": interestingly, the prefilling time is similar for both methods, so there seems to be a difference between the prefilling and the generation phase (see the timing sketch below).
  5. "could you use the same y-axis for both bfloat16 and int4": yeah, I can change that.
  6. "maybe remove the 400 tokens curve?": makes sense, I'll replot starting at 8,000 tokens.
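A quick sketch of how prefilling and generation could be timed separately to pin this down (the helper name and structure are illustrative, not the notebook's code; run it once with the default cache and once with a quantized cache to compare):

```python
import time
import torch

def time_prefill_and_decode(model, inputs, max_new_tokens=128):
    """Time the prefill (first forward pass) and the decode loop separately,
    so a quantized-cache slowdown can be attributed to one phase or the other."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(**inputs, use_cache=True)  # prefill: builds the KV cache
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - start

    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(-1)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(max_new_tokens):  # decode: one token per forward pass
            out = model(input_ids=next_token, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1:].argmax(-1)
    torch.cuda.synchronize()
    return prefill_s, time.perf_counter() - start
```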

SimJeg (Collaborator) commented Nov 25, 2024

Thanks for the updates and cool plots. Could you add the last plot to the main README? (And maybe create an assets dir for the image and kvpress.jpg.)

maxjeblick (Collaborator, Author)

"Could you add the last plot to the main README?"

It is in the README under the evaluation tab (I can also move it somewhere else, or create a new section). I placed the image under evaluation/assets, but I can also move it to a new assets folder, probably together with the repo logo.

SimJeg merged commit c38d169 into main on Nov 26, 2024
2 checks passed
SimJeg deleted the max/update_speed_memory branch on November 26, 2024 at 07:50