[REQUEST] Synthetic Data generation features #669

Open
3 tasks done
AstrisCantCode opened this issue Nov 3, 2024 · 3 comments

@AstrisCantCode

Problem

I've been working on generating completions for the LLaVA-Instruct dataset. Putting aside the need for multimodal support (which I'm jankily hacking together on my end), it got me wondering whether there are alternate decoding strategies that could take advantage of all the requests already being aggregated.

Solution

Consider implementing functionality that supports the following:

  1. Pre-tokenize the dataset and sort prompts by token count to minimize padding (this doesn't need to be implemented in exllama itself, but it's still an important step).
  2. Dynamically add prompts of a similar length to the current batch whenever one of its sequences completes, again to minimize padding.
  3. Offload layers and their corresponding KV cache to CPU memory, to run large models and batch sizes on a limited VRAM budget.
  4. Use very large batch sizes that maximize memory usage while leaving enough space for the current layer and hidden states.
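
As a rough illustration of step 1, here is a minimal sketch of length-sorted batching. The `whitespace_tokenize` function is a hypothetical stand-in for a real tokenizer, and none of these names come from exllamav2.

```python
from typing import Callable, List, Tuple

Tokenized = Tuple[int, List[str]]  # (original prompt index, tokens)


def whitespace_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer (assumption for this sketch).
    return text.split()


def length_sorted_batches(
    prompts: List[str],
    batch_size: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    sort: bool = True,
) -> List[List[Tokenized]]:
    """Tokenize once, optionally sort by token count, then cut into batches."""
    tokenized = [(i, tokenize(p)) for i, p in enumerate(prompts)]
    if sort:
        tokenized.sort(key=lambda item: len(item[1]))
    return [
        tokenized[start:start + batch_size]
        for start in range(0, len(tokenized), batch_size)
    ]


def padding_tokens(batches: List[List[Tokenized]]) -> int:
    """Count pad tokens needed if each batch is padded to its longest prompt."""
    total = 0
    for batch in batches:
        longest = max(len(tokens) for _, tokens in batch)
        total += sum(longest - len(tokens) for _, tokens in batch)
    return total
```

The original index is kept alongside each token list so that completions can be matched back to their prompts after sorting.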

Alternatives

I know that exllamav2 already supports things like paged attention and dynamic batching. There's a good chance I'm totally overthinking this problem, and the aforementioned features address the concerns better. I just don't know if cache fragmentation is more detrimental to performance than a few tokens of padding.

Explanation

My understanding is that larger batch sizes generally improve throughput at the cost of memory usage. For ordinary chat use, layer offloading is too slow to make sense, but for generating synthetic data, time to first token (TTFT) doesn't matter, so you could theoretically make the batch size significantly larger. The time per batch is higher, but the larger batch size (ideally) more than makes up for it.

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@turboderp
Member

I think you may be overthinking it. There's an example here for using the dynamic generator to do bulk inference on many sequences while taking advantage of deduplication and batching automatically, up to the limits imposed by whatever cache size you can fit in VRAM. There is no need for padding this way.
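
To illustrate why refilling batch slots removes the need for padding, here is a toy model that counts forward passes. It is only a sketch under simplifying assumptions (each sequence needs a known number of decode steps, and a fixed number of slots is available); it is not how the dynamic generator is actually implemented.

```python
from collections import deque
from typing import List


def static_passes(lengths: List[int], batch_size: int) -> int:
    """Forward passes when each fixed batch runs until its longest sequence ends."""
    return sum(
        max(lengths[start:start + batch_size])
        for start in range(0, len(lengths), batch_size)
    )


def dynamic_passes(lengths: List[int], batch_size: int) -> int:
    """Forward passes when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    active: List[int] = []  # remaining steps for each running sequence
    passes = 0
    while queue or active:
        # Fill any free slots before the next pass.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        passes += 1
        # Every running sequence advances one step; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return passes
```

With lengths `[8, 1, 1, 1, 8, 1, 1, 1]` and four slots, the static version needs 16 passes while the refilling version needs 9, because short sequences no longer wait on the longest member of their batch.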

Cache fragmentation shouldn't be an issue, though I wasn't entirely sure about this so I added a defragmenter that automatically limits how much of an impact it might have, if it is an issue.

I'm not sure at what point it would make sense to start offloading the model to system RAM. Perhaps at some extreme batch size (10k or whatever?), but generally the overhead of offloading layers is huge. There's about a 100x bandwidth difference between PCIe and VRAM. What's more, the benefits of batching only extend to the point where the memory bus is no longer saturated. After that point, a pass at bsz 1000 (or whatever) has twice as much latency as a pass at bsz 500.
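
The saturation point can be pictured with a simple roofline-style estimate. This is a sketch, not a measurement: it assumes a pass takes the larger of the time to read the weights once and the time to do the arithmetic for the whole batch, and all the numbers in the usage note are made up for illustration.

```python
def pass_latency_s(
    batch_size: int,
    weight_bytes: float,
    mem_bandwidth: float,    # bytes per second
    flops_per_token: float,  # FLOPs needed per sequence per pass
    compute_flops: float,    # FLOPs per second
) -> float:
    """Estimate one forward pass as max(memory-bound time, compute-bound time)."""
    memory_time = weight_bytes / mem_bandwidth
    compute_time = batch_size * flops_per_token / compute_flops
    return max(memory_time, compute_time)
```

With, say, 10e9 bytes of weights, 1e12 bytes/s of bandwidth, 2e10 FLOPs per token, and 1e14 FLOP/s, the estimate is flat at 0.01 s up to a batch size of 50. Beyond that it grows linearly, so batch size 1000 takes twice as long as batch size 500 and per-sequence throughput stops improving.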

@AstrisCantCode
Author

AstrisCantCode commented Nov 9, 2024

Oh, I see. I guess I'm just wondering whether it would be feasible to increase the batch size to the point where the time a layer takes to run is roughly equal to the time it takes to transfer a layer and its associated KV cache. Then you could overlap inference and data transfer, so you wouldn't have to worry about the PCIe bottleneck or about model size (in terms of the number of layers), as long as those layers fit in CPU memory, which is comparatively cheap and abundant.
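
A back-of-the-envelope version of that break-even point might look like the sketch below. It assumes compute time scales linearly with batch size, that the per-layer transfer is the layer's weights plus one KV slice per sequence in one direction over PCIe, and that transfer and compute overlap perfectly. All parameter names and numbers are hypothetical.

```python
from typing import Optional


def break_even_batch_size(
    layer_weight_bytes: float,
    kv_bytes_per_seq: float,  # KV cache bytes per sequence for one layer
    flops_per_token: float,   # FLOPs per sequence for one layer
    compute_flops: float,     # FLOPs per second
    pcie_bandwidth: float,    # bytes per second
) -> Optional[float]:
    """Batch size at which layer compute time equals layer transfer time.

    Solves bsz * c = W / P + bsz * k for bsz, where c is compute seconds per
    sequence, k is KV transfer seconds per sequence, and W / P is the weight
    transfer time. Returns None when k >= c, since the transfer then grows at
    least as fast as the compute and is never hidden.
    """
    compute_per_seq = flops_per_token / compute_flops
    transfer_per_seq = kv_bytes_per_seq / pcie_bandwidth
    if compute_per_seq <= transfer_per_seq:
        return None
    weight_transfer = layer_weight_bytes / pcie_bandwidth
    return weight_transfer / (compute_per_seq - transfer_per_seq)
```

Under this model the KV cache matters more than the weights: because it also scales with batch size, a large enough per-sequence cache means no batch size hides the transfer.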

@turboderp
Member

It's certainly possible to do inference layer by layer on a huge batch size. In fact this script does it already, to measure the difference in hidden states between a quantized model and the unquantized version loaded layer by layer.
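
For readers unfamiliar with the idea, layer-by-layer inference over a large batch can be sketched roughly as follows. This is a toy illustration in plain Python, not the script mentioned above; `load_layer` is a hypothetical callback that returns one layer at a time.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def layerwise_forward(
    hidden_states: List[List[float]],
    load_layer: Callable[[int], Layer],
    num_layers: int,
    chunk_size: int,
) -> List[List[float]]:
    """Run every sequence through layer 0, then layer 1, and so on."""
    states = [list(h) for h in hidden_states]  # leave the caller's data untouched
    for index in range(num_layers):
        layer = load_layer(index)  # only one layer is held at a time
        # Process the batch in chunks so activations for a chunk fit in memory.
        for start in range(0, len(states), chunk_size):
            chunk = states[start:start + chunk_size]
            states[start:start + chunk_size] = [layer(h) for h in chunk]
        del layer  # drop the reference before loading the next layer
    return states
```

Each layer is loaded once per pass regardless of how many sequences there are, which is what makes very large batches attractive in this setup.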

There isn't currently a mechanism for doing so with a cache, though. And for efficiency I guess you'd need a triple-buffered approach where you have one layer of keys/values being swapped to system RAM, one being worked on by the GPU, and then a third being loaded for the next layer. And weights would need to be double-buffered.
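
One way to picture the buffering scheme is as a rotating assignment of buffer slots per layer. The sketch below only computes which slot would be used for what at each step. It does no actual transfers, the field names are made up for illustration, and it ignores the wrap-around from the last layer back to the first for the next token.

```python
from typing import Dict, List, Optional


def buffer_schedule(num_layers: int) -> List[Dict[str, Optional[int]]]:
    """Slot assignments for triple-buffered KV and double-buffered weights."""
    steps: List[Dict[str, Optional[int]]] = []
    for layer in range(num_layers):
        prev_layer = layer - 1 if layer > 0 else None
        next_layer = layer + 1 if layer + 1 < num_layers else None
        steps.append({
            "compute_layer": layer,
            # KV: one slot in use by the GPU, one being written back to
            # system RAM, one being filled for the next layer.
            "kv_compute_buf": layer % 3,
            "kv_store_buf": None if prev_layer is None else prev_layer % 3,
            "kv_load_buf": None if next_layer is None else next_layer % 3,
            # Weights are read-only, so they need no write-back slot.
            "w_compute_buf": layer % 2,
            "w_load_buf": None if next_layer is None else next_layer % 2,
        })
    return steps
```

Because consecutive layer indices differ modulo 3 (and modulo 2 for weights), the slots in flight at any step never collide.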

Bulk inference with the dynamic generator is already kind of efficient, especially if you have some shared prefix for multiple sequences in a batch, or sequences of dissimilar length, but I guess this could be worth trying out. Not sure how much of a priority I could make it at the moment, though.


2 participants