
Scaling performance as sequences exceed core count #115

Open
AaronFriel opened this issue Sep 12, 2024 · 3 comments

Comments

@AaronFriel

From #84:

Just as a general heads up - the problem we ran into with AICI in production is the case where there are more sequences in a batch (and thus parallel controller processes) than cores. This is because I spin for a while on futexes (to minimize latency), and this kills performance when we're out of cores. This would need to be fixed somehow. The latency minimization was mostly there when we still had post/pre_process(); for mid_process() it shouldn't matter that much.

Originally posted by @mmoskal in #84 (comment)
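
The failure mode described above can be illustrated with a minimal sketch (not AICI's actual code) of the spin-then-block wait pattern: each sequence's worker spins on a flag for a short budget to keep wake-up latency low, then falls back to blocking. With more workers than cores, the spinning phase steals cycles from workers that actually have work. The spin budget and the fallback (a sleep loop standing in for a real futex wait) are purely illustrative.

```rust
// Minimal sketch of spin-then-block waiting, not aicirt's actual code.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

const SPIN_ITERS: u32 = 10_000; // hypothetical spin budget

fn wait_for_work(ready: &AtomicBool) {
    // Phase 1: spin, hoping the host posts work within a few microseconds.
    // This is the part that hurts once there are more workers than cores.
    for _ in 0..SPIN_ITERS {
        if ready.load(Ordering::Acquire) {
            return;
        }
        std::hint::spin_loop();
    }
    // Phase 2: give up the core; on Linux a real futex wait would go here.
    while !ready.load(Ordering::Acquire) {
        thread::sleep(Duration::from_micros(50));
    }
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let r = ready.clone();
    let worker = thread::spawn(move || wait_for_work(&r));
    thread::sleep(Duration::from_millis(1));
    ready.store(true, Ordering::Release); // host posts work
    worker.join().unwrap();
}
```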

@mmoskal
Member

mmoskal commented Sep 13, 2024

I wonder if the streaming protocol of WASI helps here - instead of using a futex, doing IPC with efficient reads and writes on shared circular buffers?

I guess regardless of how the wasm code communicates with the host, the host has to organize the threads/processes for each sequence. In AICI this is done with separate processes, which means you can kill a process and limit execution time without overhead (if you put a time limit in wasmtime it seems to have significant overhead; I don't remember exactly, but I recall around 30% in the inner bias-computation loop).
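
For context on the "time limit in wasmtime" remark, here is a hedged sketch of the two standard wasmtime mechanisms for bounding execution: per-instruction fuel metering (which adds per-instruction bookkeeping and could plausibly account for overhead of the magnitude mentioned) and epoch-based interruption (which only checks a counter at loop back-edges and function entries and is typically much cheaper). The module, deadline, and timing below are illustrative, not aicirt's actual setup.

```rust
// Sketch of wasmtime execution limits: epoch interruption vs. fuel metering.
use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Cheaper option: epoch-based interruption (deadline checked at back-edges).
    config.epoch_interruption(true);
    // More expensive option: per-instruction fuel accounting.
    // config.consume_fuel(true);

    let engine = Engine::new(&config)?;
    let module = Module::new(
        &engine,
        r#"(module (func (export "busy") (loop br 0)))"#, // infinite loop
    )?;

    let mut store = Store::new(&engine, ());
    store.set_epoch_deadline(1); // trap once the epoch advances by 1

    // A watchdog thread advances the epoch after the time budget expires.
    let engine2 = engine.clone();
    std::thread::spawn(move || {
        std::thread::sleep(std::time::Duration::from_millis(10));
        engine2.increment_epoch();
    });

    let instance = Instance::new(&mut store, &module, &[])?;
    let busy = instance.get_typed_func::<(), ()>(&mut store, "busy")?;
    // The call traps when the deadline passes instead of hanging forever.
    assert!(busy.call(&mut store, ()).is_err());
    Ok(())
}
```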

The processes can also fork, though how needed that is for AICI functionality is another question.

Threads would potentially be easier to synchronize (though one has to be careful with things like rayon - it seems to have issues when you have more than 50 cores or so).

@AaronFriel
Author

AaronFriel commented Sep 13, 2024

I did some cursory research into the libraries available for cross-process IPC in Rust, and, well, essentially all of them either add a significant degree of complexity to the implementation or aren't as latency-optimized (e.g. using eventfd with a ring buffer to signal).
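
For illustration, a minimal, Linux-only sketch of the eventfd-signalled approach mentioned above: the producer appends to an in-process queue standing in for a shared ring buffer and bumps an eventfd counter; the consumer blocks in the kernel on the eventfd instead of spinning. In a real cross-process setup the ring would live in shared memory and the eventfd would be passed between processes; only the signalling shape is shown here.

```rust
// Sketch of eventfd-based wakeups (Linux only; uses the libc crate).
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

fn main() {
    // eventfd(initval = 0, flags = 0): readable once the counter is non-zero.
    let efd = unsafe { libc::eventfd(0, 0) };
    assert!(efd >= 0);

    let ring: Arc<Mutex<VecDeque<u64>>> = Arc::new(Mutex::new(VecDeque::new()));
    let ring_producer = ring.clone();

    let producer = std::thread::spawn(move || {
        for token in 0..4u64 {
            ring_producer.lock().unwrap().push_back(token);
            // Bump the eventfd counter by 1 to wake the consumer.
            let one: u64 = 1;
            unsafe { libc::write(efd, &one as *const u64 as *const _, 8) };
        }
    });

    let mut received = 0;
    while received < 4 {
        // Blocks in the kernel until the counter is non-zero; no spinning.
        let mut count: u64 = 0;
        unsafe { libc::read(efd, &mut count as *mut u64 as *mut _, 8) };
        for _ in 0..count {
            let token = ring.lock().unwrap().pop_front().unwrap();
            println!("got token {token}");
            received += 1;
        }
    }
    producer.join().unwrap();
    unsafe { libc::close(efd) };
}
```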

What considerations went into using aicirt as a Rust binary as opposed to a library? I can guess one of those would be integration with vLLM.

I ask because I'm wondering if it might make more sense to export a library interface to it, allowing the host to decide how to handle concurrency - e.g. to oxidize it as a Python library for integration with vLLM using PyO3?
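
To make the suggestion concrete, here is a hedged sketch of what a PyO3 binding could look like; the names (SequenceController, start_controller, aicirt_py) and signatures are hypothetical, not aicirt's real API. The point is only that the host drives per-sequence controllers from Python while the Rust side keeps the concurrency details internal.

```rust
// Hypothetical PyO3 surface for an aicirt-like library (names are made up).
use pyo3::prelude::*;

/// Hypothetical handle for one sequence's controller.
#[pyclass]
struct SequenceController {
    #[pyo3(get)]
    seq_id: u64,
}

#[pymethods]
impl SequenceController {
    /// Placeholder for the mid_process() step; would return logit biases.
    fn mid_process(&self, _tokens: Vec<u32>) -> PyResult<Vec<f32>> {
        Ok(vec![0.0; 8]) // dummy bias vector for illustration
    }
}

/// Hypothetical entry point; the runtime would spawn its workers internally.
#[pyfunction]
fn start_controller(seq_id: u64) -> PyResult<SequenceController> {
    Ok(SequenceController { seq_id })
}

#[pymodule]
fn aicirt_py(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<SequenceController>()?;
    m.add_function(wrap_pyfunction!(start_controller, m)?)?;
    Ok(())
}
```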

@mmoskal
Member

mmoskal commented Sep 16, 2024

Indeed, exposing aicirt as a library might be better from a Python standpoint. However, there is still internal concurrency within aicirt - that is, running multiple sequence controllers in parallel. I think you don't want to expose that to Python, as it would have performance implications. Either way, that can be handled with processes (as done now) or threads.
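
A small sketch of what the threads variant could look like, with per-sequence fan-out kept inside the library behind a single host-facing call per step; the function names are hypothetical, and aicirt currently uses separate processes rather than threads:

```rust
// Sketch: internal per-sequence concurrency behind one host-facing call.
use std::thread;

fn mid_process(seq_id: usize, tokens: &[u32]) -> Vec<f32> {
    // Placeholder for the per-sequence controller work.
    vec![seq_id as f32; tokens.len()]
}

/// One host-facing call per step; fan-out is an internal detail.
fn step_all(sequences: &[Vec<u32>]) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        let handles: Vec<_> = sequences
            .iter()
            .enumerate()
            .map(|(id, toks)| s.spawn(move || mid_process(id, toks)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let seqs = vec![vec![1, 2, 3], vec![4, 5]];
    let biases = step_all(&seqs);
    println!("{biases:?}");
}
```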
