
Scaling performance as sequences exceed core count #115

Open
AaronFriel opened this issue Sep 12, 2024 · 3 comments

Comments

@AaronFriel

From #84:

Just as a general heads up - the problem we ran into with AICI in production is the case where there are more sequences in a batch (and thus parallel controller processes) than cores. This is because I spin for a while on futexes (to minimize latency), and this kills performance when we're out of cores. This would need to be fixed somehow. The latency minimization was mostly there when we still had post/pre_process(); for mid_process() it shouldn't matter that much.

Originally posted by @mmoskal in #84 (comment)
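
The failure mode described above can be illustrated with a minimal sketch (not AICI's actual code) of the spin-then-block wait pattern: each sequence's worker spins on a flag for a short budget to keep wake-up latency low, then falls back to blocking. With more workers than cores, the spinning phase steals cycles from workers that actually have work. The spin budget and the fallback (a sleep loop standing in for a real futex wait) are purely illustrative.

```rust
// Minimal sketch of spin-then-block waiting, not aicirt's actual code.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

const SPIN_ITERS: u32 = 10_000; // hypothetical spin budget

fn wait_for_work(ready: &AtomicBool) {
    // Phase 1: spin, hoping the host posts work within a few microseconds.
    // This is the part that hurts once there are more workers than cores.
    for _ in 0..SPIN_ITERS {
        if ready.load(Ordering::Acquire) {
            return;
        }
        std::hint::spin_loop();
    }
    // Phase 2: give up the core; on Linux a real futex wait would go here.
    while !ready.load(Ordering::Acquire) {
        thread::sleep(Duration::from_micros(50));
    }
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let r = ready.clone();
    let worker = thread::spawn(move || wait_for_work(&r));
    thread::sleep(Duration::from_millis(1));
    ready.store(true, Ordering::Release); // host posts work
    worker.join().unwrap();
}
```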

@mmoskal
Member

mmoskal commented Sep 13, 2024

I wonder if the streaming protocol of WASI helps here - instead of using a futex, doing IPC with efficient reads and writes on shared circular buffers?

I guess regardless of how the wasm code communicates with the host, the host has to organize the threads/processes for each sequence. In AICI this is done with separate processes, which means you can kill a process and limit execution time without overhead (if you put a time limit in wasmtime it seems to have significant overhead; I don't remember exactly, but I recall around 30% in the inner bias-computation loop).
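
For context on the "time limit in wasmtime" remark, here is a hedged sketch of the two standard wasmtime mechanisms for bounding execution: per-instruction fuel metering (which adds per-instruction bookkeeping and could plausibly account for overhead of the magnitude mentioned) and epoch-based interruption (which only checks a counter at loop back-edges and function entries and is typically much cheaper). The module, deadline, and timing below are illustrative, not aicirt's actual setup.

```rust
// Sketch of wasmtime execution limits: epoch interruption vs. fuel metering.
use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Cheaper option: epoch-based interruption (deadline checked at back-edges).
    config.epoch_interruption(true);
    // More expensive option: per-instruction fuel accounting.
    // config.consume_fuel(true);

    let engine = Engine::new(&config)?;
    let module = Module::new(
        &engine,
        r#"(module (func (export "busy") (loop br 0)))"#, // infinite loop
    )?;

    let mut store = Store::new(&engine, ());
    store.set_epoch_deadline(1); // trap once the epoch advances by 1

    // A watchdog thread advances the epoch after the time budget expires.
    let engine2 = engine.clone();
    std::thread::spawn(move || {
        std::thread::sleep(std::time::Duration::from_millis(10));
        engine2.increment_epoch();
    });

    let instance = Instance::new(&mut store, &module, &[])?;
    let busy = instance.get_typed_func::<(), ()>(&mut store, "busy")?;
    // The call traps when the deadline passes instead of hanging forever.
    assert!(busy.call(&mut store, ()).is_err());
    Ok(())
}
```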

The processes can also fork, though how needed that is for AICI functionality is another question.

Threads would potentially be easier to synchronize (though one has to be careful with things like rayon - it seems to have issues when you have more than 50 cores or so).

@AaronFriel
Author

AaronFriel commented Sep 13, 2024

I did some cursory research into the libraries available for cross-process IPC in Rust, and, well, essentially all of them either add a significant degree of complexity to the implementation or aren't as latency-optimized (e.g. using eventfd with a ring buffer to signal).
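
For illustration, a minimal, Linux-only sketch of the eventfd-signalled approach mentioned above: the producer appends to an in-process queue standing in for a shared ring buffer and bumps an eventfd counter; the consumer blocks in the kernel on the eventfd instead of spinning. In a real cross-process setup the ring would live in shared memory and the eventfd would be passed between processes; only the signalling shape is shown here.

```rust
// Sketch of eventfd-based wakeups (Linux only; uses the libc crate).
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

fn main() {
    // eventfd(initval = 0, flags = 0): readable once the counter is non-zero.
    let efd = unsafe { libc::eventfd(0, 0) };
    assert!(efd >= 0);

    let ring: Arc<Mutex<VecDeque<u64>>> = Arc::new(Mutex::new(VecDeque::new()));
    let ring_producer = ring.clone();

    let producer = std::thread::spawn(move || {
        for token in 0..4u64 {
            ring_producer.lock().unwrap().push_back(token);
            // Bump the eventfd counter by 1 to wake the consumer.
            let one: u64 = 1;
            unsafe { libc::write(efd, &one as *const u64 as *const _, 8) };
        }
    });

    let mut received = 0;
    while received < 4 {
        // Blocks in the kernel until the counter is non-zero; no spinning.
        let mut count: u64 = 0;
        unsafe { libc::read(efd, &mut count as *mut u64 as *mut _, 8) };
        for _ in 0..count {
            let token = ring.lock().unwrap().pop_front().unwrap();
            println!("got token {token}");
            received += 1;
        }
    }
    producer.join().unwrap();
    unsafe { libc::close(efd) };
}
```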

What considerations went into using aicirt as a Rust binary as opposed to a library? I can guess one of those would be integration with vLLM.

I ask because I'm wondering if it might make more sense to export a library interface to it, allowing the host to decide how to handle concurrency - e.g. to oxidize it as a Python library for integration with vLLM using PyO3?
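
To make the suggestion concrete, here is a hedged sketch of what a PyO3 binding could look like; the names (SequenceController, start_controller, aicirt_py) and signatures are hypothetical, not aicirt's real API. The point is only that the host drives per-sequence controllers from Python while the Rust side keeps the concurrency details internal.

```rust
// Hypothetical PyO3 surface for an aicirt-like library (names are made up).
use pyo3::prelude::*;

/// Hypothetical handle for one sequence's controller.
#[pyclass]
struct SequenceController {
    #[pyo3(get)]
    seq_id: u64,
}

#[pymethods]
impl SequenceController {
    /// Placeholder for the mid_process() step; would return logit biases.
    fn mid_process(&self, _tokens: Vec<u32>) -> PyResult<Vec<f32>> {
        Ok(vec![0.0; 8]) // dummy bias vector for illustration
    }
}

/// Hypothetical entry point; the runtime would spawn its workers internally.
#[pyfunction]
fn start_controller(seq_id: u64) -> PyResult<SequenceController> {
    Ok(SequenceController { seq_id })
}

#[pymodule]
fn aicirt_py(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<SequenceController>()?;
    m.add_function(wrap_pyfunction!(start_controller, m)?)?;
    Ok(())
}
```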

@mmoskal
Member

mmoskal commented Sep 16, 2024

Indeed, exposing aicirt as a library might be better from a Python standpoint. However, there is still internal concurrency within aicirt - that is, running multiple sequence controllers in parallel. I think you don't want to expose that to Python, as it would have performance implications. Either way, that can be handled with processes (as done now) or threads.
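
A small sketch of what the threads variant could look like, with per-sequence fan-out kept inside the library behind a single host-facing call per step; the function names are hypothetical, and aicirt currently uses separate processes rather than threads:

```rust
// Sketch: internal per-sequence concurrency behind one host-facing call.
use std::thread;

fn mid_process(seq_id: usize, tokens: &[u32]) -> Vec<f32> {
    // Placeholder for the per-sequence controller work.
    vec![seq_id as f32; tokens.len()]
}

/// One host-facing call per step; fan-out is an internal detail.
fn step_all(sequences: &[Vec<u32>]) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        let handles: Vec<_> = sequences
            .iter()
            .enumerate()
            .map(|(id, toks)| s.spawn(move || mid_process(id, toks)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let seqs = vec![vec![1, 2, 3], vec![4, 5]];
    let biases = step_all(&seqs);
    println!("{biases:?}");
}
```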
