Is AirLLM faster than llama.cpp? #206

Open
Lizonghang opened this issue Nov 18, 2024 · 1 comment
Lizonghang commented Nov 18, 2024

Dear Lyogavin,

Thanks for your wonderful work. I have a question: does AirLLM run faster than llama.cpp? Do you have any data on that?

As far as I know, llama.cpp uses mmap to manage memory. When computation hits a page fault, the OS automatically loads the tensor weights from disk into memory and computation continues; it also evicts less-used tensor weights when memory pressure is high. All of this is managed by the OS, so llama.cpp also supports very large LLMs, similar to the feature AirLLM provides.
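To make the mmap behaviour described above concrete, here is a minimal sketch using only Python's standard library. It is not llama.cpp's code: the file layout, `write_weights`, and `layer_sum` are hypothetical stand-ins. The point is only that mapping a file reserves address space, and the OS reads pages from disk when a layer's bytes are first touched.

```python
import mmap
import os
import struct
import tempfile

FLOATS_PER_LAYER = 1024  # assumed toy layer size, not a real model shape
N_LAYERS = 4


def write_weights(path, n_layers, floats_per_layer):
    """Write a toy weight file: layer i holds floats_per_layer copies of float(i)."""
    with open(path, "wb") as f:
        for layer in range(n_layers):
            values = [float(layer)] * floats_per_layer
            f.write(struct.pack(f"<{floats_per_layer}f", *values))


def layer_sum(mm, layer, floats_per_layer):
    """Touch one layer's bytes in the mapping and return the sum of its weights.

    The first access to these pages may trigger page faults; the OS then
    loads them from disk (or serves them from the page cache).
    """
    offset = layer * floats_per_layer * 4  # 4 bytes per float32
    values = struct.unpack_from(f"<{floats_per_layer}f", mm, offset)
    return sum(values)


path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(path, N_LAYERS, FLOATS_PER_LAYER)

with open(path, "rb") as f:
    # Length 0 maps the whole file; nothing is copied into memory up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        sums = [layer_sum(mm, i, FLOATS_PER_LAYER) for i in range(N_LAYERS)]

print(sums)
```

In this sketch the program never issues an explicit read for a layer; paging in and eviction are left to the OS, which is the property the paragraph above attributes to llama.cpp.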

I noticed that AirLLM uses prefetching to overlap disk I/O latency with computation. Will this be faster than llama.cpp (with mmap enabled)? If so, how large is the improvement?
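For readers unfamiliar with the prefetching idea in the question, the following is a rough sketch of overlapping layer loading with computation. It is an illustration under assumptions, not AirLLM's implementation: `load_layer` and `compute` are hypothetical stand-ins that use `time.sleep` to simulate disk and compute latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(i):
    """Hypothetical stand-in for reading layer i's weights from disk."""
    time.sleep(0.01)  # simulated disk latency
    return [float(i)] * 4


def compute(x, weights):
    """Hypothetical stand-in for running one layer's forward pass."""
    time.sleep(0.01)  # simulated compute time
    return x + sum(weights)


def run_sequential(n_layers, x=0.0):
    """Baseline: load each layer, then compute, with no overlap."""
    for i in range(n_layers):
        x = compute(x, load_layer(i))
    return x


def run_with_prefetch(n_layers, x=0.0):
    """Start loading layer i+1 in a background thread while layer i computes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(n_layers):
            weights = pending.result()  # wait for the current layer's weights
            if i + 1 < n_layers:
                pending = pool.submit(load_layer, i + 1)  # prefetch the next layer
            x = compute(x, weights)
    return x


if __name__ == "__main__":
    for fn in (run_sequential, run_with_prefetch):
        start = time.perf_counter()
        result = fn(8)
        print(fn.__name__, result, f"{time.perf_counter() - start:.3f}s")
```

Both functions produce the same result; only the scheduling differs. Whether explicit prefetching like this beats OS-managed paging through mmap depends on the access pattern, the storage device, and available memory, so the sketch says nothing about the actual AirLLM versus llama.cpp comparison asked about here.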

@Xingwei-Tan

Based on my experience, AirLLM is much slower. Its VRAM usage is low, so it does not fully utilize the available resources. I haven't found any way to change that so far.
