[REQUEST] Synthetic Data generation features #669
Comments
I think you may be overthinking it. There's an example here for using the dynamic generator to do bulk inference on many sequences, taking advantage of deduplication and batching automatically, up to the limits imposed by whatever cache size you can fit in VRAM. No padding is needed this way. Cache fragmentation shouldn't be an issue, though I wasn't entirely sure about this, so I added a defragmenter that automatically limits how much of an impact it can have, in case it is. I'm not sure at what point it would make sense to start offloading the model to system RAM. Perhaps at some extreme batch size (10k or whatever?), but generally the overhead of offloading layers is huge: there's about a 100x bandwidth difference between PCIe and VRAM. What's more, the benefits of batching only extend to the point where the memory bus is no longer saturated. After that point, latency grows in proportion to batch size, so a pass at bsz 1000 (or whatever) has twice as much latency as a pass at bsz 500.
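To make the saturation point concrete, here is a minimal sketch of a roofline-style cost model. It is not taken from exllamav2, and every number in it (weight size, VRAM bandwidth, FLOPs per token, peak compute) is a made-up assumption chosen only for illustration. It treats one forward pass as costing whichever is larger: the time to read the weights once, or the time to do the arithmetic for every sequence in the batch.

```python
def pass_latency_s(bsz, weight_bytes, mem_bw, flops_per_token, peak_flops):
    """Latency of one forward pass under a toy roofline-style model."""
    # Weights are read from VRAM once per pass, regardless of batch size.
    memory_time = weight_bytes / mem_bw
    # Arithmetic grows linearly with the number of sequences in the batch.
    compute_time = bsz * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Hypothetical hardware and model numbers (assumptions, not measurements).
WEIGHT_BYTES = 40e9       # quantized weights resident in VRAM
MEM_BW = 1e12             # VRAM bandwidth, bytes/s
FLOPS_PER_TOKEN = 140e9   # roughly 2 FLOPs per parameter
PEAK_FLOPS = 100e12       # usable compute, FLOP/s


def latency(bsz):
    return pass_latency_s(bsz, WEIGHT_BYTES, MEM_BW, FLOPS_PER_TOKEN, PEAK_FLOPS)


# Batch size at which compute time catches up with the weight-read time.
saturation_bsz = (WEIGHT_BYTES / MEM_BW) * PEAK_FLOPS / FLOPS_PER_TOKEN

if __name__ == "__main__":
    print(f"saturation at roughly bsz {saturation_bsz:.0f}")
    for bsz in (1, 16, 500, 1000):
        print(f"bsz {bsz:>4}: {latency(bsz) * 1e3:8.1f} ms per pass")
```

Below the saturation point, extra sequences are close to free in this model; above it, doubling the batch doubles the latency, so throughput stops improving.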
Oh, I see. I guess I'm just wondering whether it'd be feasible to increase the batch size to the point where the time it takes for a layer to finish running is roughly equal to the time it takes to transfer a layer and its associated KV cache. Then you could perform inference and data transfer simultaneously, and you wouldn't have to worry about either a PCIe bottleneck or model size (in terms of the number of layers), as long as those layers fit in CPU memory, which is comparatively cheap and abundant.
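The break-even point described here can be estimated with a rough sketch like the one below. All names and numbers are hypothetical, and the model is deliberately crude: it assumes compute-bound decoding of one token per sequence per pass, ignores attention FLOPs, and assumes each sequence's per-layer KV cache is moved over PCIe once per pass. Note that the KV transfer grows with batch size too, so a break-even only exists if per-sequence compute time exceeds per-sequence KV transfer time.

```python
def kv_bytes_per_seq(ctx_len, kv_dim, bytes_per_elem):
    """Per-layer KV cache size for one sequence (keys plus values)."""
    return 2 * ctx_len * kv_dim * bytes_per_elem


def break_even_bsz(layer_weight_bytes, kv_bytes, flops_per_token_layer,
                   peak_flops, pcie_bw):
    """Batch size where layer compute time equals layer + KV transfer time.

    Solves  bsz * c = (W + bsz * k) / B  for bsz, where c is compute seconds
    per sequence, W the layer weight bytes, k the KV bytes per sequence and
    B the PCIe bandwidth. Returns None if no batch size can hide the transfer.
    """
    compute_s_per_seq = flops_per_token_layer / peak_flops
    kv_transfer_s_per_seq = kv_bytes / pcie_bw
    margin = compute_s_per_seq - kv_transfer_s_per_seq
    if margin <= 0:
        return None  # KV traffic alone already outpaces compute
    return (layer_weight_bytes / pcie_bw) / margin


# Hypothetical numbers (assumptions, not measurements).
LAYER_WEIGHT_BYTES = 0.5e9   # one quantized layer
FLOPS_PER_TOKEN_LAYER = 2e9  # roughly 2 FLOPs per parameter in the layer
PEAK_FLOPS = 100e12          # usable compute, FLOP/s
PCIE_BW = 25e9               # host-to-device bandwidth, bytes/s
KV_DIM = 1024                # e.g. 8 KV heads x 128 dims
BYTES_PER_ELEM = 2           # FP16 cache

short_ctx = break_even_bsz(LAYER_WEIGHT_BYTES,
                           kv_bytes_per_seq(64, KV_DIM, BYTES_PER_ELEM),
                           FLOPS_PER_TOKEN_LAYER, PEAK_FLOPS, PCIE_BW)
long_ctx = break_even_bsz(LAYER_WEIGHT_BYTES,
                          kv_bytes_per_seq(4096, KV_DIM, BYTES_PER_ELEM),
                          FLOPS_PER_TOKEN_LAYER, PEAK_FLOPS, PCIE_BW)

if __name__ == "__main__":
    print("break-even bsz, 64-token context:", short_ctx)
    print("break-even bsz, 4096-token context:", long_ctx)
```

Under these invented numbers a break-even batch size exists for the short context but not for the long one, which suggests the outcome depends heavily on context length and cache precision. Real measurements would be needed before drawing any conclusion.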
It's certainly possible to do inference layer by layer on a huge batch size. In fact this script does it already, to measure the difference in hidden states between a quantized model and the unquantized version loaded layer by layer. There isn't currently a mechanism for doing so with a cache, though. And for efficiency I guess you'd need a triple-buffered approach, where one layer of keys/values is being swapped out to system RAM, one is being worked on by the GPU, and a third is being loaded for the next layer. The weights would need to be double-buffered. Bulk inference with the dynamic generator is already fairly efficient, especially if you have some shared prefix for multiple sequences in a batch, or sequences of dissimilar length, but I guess this could be worth trying out. Not sure how much of a priority I could make it at the moment, though.
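The buffering scheme described above can be sketched as a plain scheduling table. This is an illustrative simulation only: it does not touch the GPU or any exllamav2 API, and the `Step` and `build_schedule` names are hypothetical. It covers a single forward pass and shows which buffer slot each layer would use, on the assumption that the KV cache needs three slots because it is both read and written back, while weights are read-only and need only two.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

KV_SLOTS = 3      # one offloading, one in use, one prefetching
WEIGHT_SLOTS = 2  # one in use, one prefetching (weights are never written back)


@dataclass
class Step:
    layer: int
    kv_compute_slot: int
    kv_offload: Optional[Tuple[int, int]]       # (layer, slot) going to system RAM
    kv_prefetch: Optional[Tuple[int, int]]      # (layer, slot) coming from system RAM
    weight_compute_slot: int
    weight_prefetch: Optional[Tuple[int, int]]  # (layer, slot) coming from system RAM


def build_schedule(num_layers: int) -> List[Step]:
    """Assign rotating buffer slots to each layer of one forward pass."""
    steps = []
    for layer in range(num_layers):
        prev_layer, next_layer = layer - 1, layer + 1
        has_prev = prev_layer >= 0
        has_next = next_layer < num_layers
        steps.append(Step(
            layer=layer,
            kv_compute_slot=layer % KV_SLOTS,
            kv_offload=(prev_layer, prev_layer % KV_SLOTS) if has_prev else None,
            kv_prefetch=(next_layer, next_layer % KV_SLOTS) if has_next else None,
            weight_compute_slot=layer % WEIGHT_SLOTS,
            weight_prefetch=(next_layer, next_layer % WEIGHT_SLOTS) if has_next else None,
        ))
    return steps


if __name__ == "__main__":
    for step in build_schedule(5):
        print(step)
```

In a real implementation each of the three activities would run on its own stream so the copies overlap with compute; the table only demonstrates that the slot rotation never asks two activities to use the same buffer at the same time.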
Problem
I've been working on generating completions for the LLaVA-Instruct dataset. Putting aside the need for multimodal support (which is something I'm jankily hacking together on my end), it got me wondering whether there are alternative decoding strategies that could take advantage of all the requests already being aggregated.
Solution
Consider implementing functionality that supports the following:
Alternatives
I know that exllamav2 already supports things like paged attention and dynamic batching. There's a good chance I'm totally overthinking this problem, and the aforementioned features address the concerns better. I just don't know if cache fragmentation is more detrimental to performance than a few tokens of padding.
Explanation
It was my understanding that using larger batch sizes generally improves performance, but at the cost of memory usage. For ordinary chat use, layer offloading is just too slow to make sense. But for generating synthetic data, time to first token (TTFT) doesn't matter, so you could theoretically make the batch size significantly larger. The time per batch is higher, but the larger batch size (ideally) more than makes up for it.
Examples
No response
Additional context
No response
Acknowledgements