
Conversation

@finbarrtimbers (Collaborator) commented Sep 16, 2025

By moving to the async engine, we get a much cleaner central processing loop: each request is handled in a dedicated function, and we rely on vLLM to do the batching.

SGLang and TensorRT-LLM both use async engines, so moving to vLLM's async API will make it much easier to test out those engines.
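
For illustration, here is a minimal sketch of the per-request pattern described above, assuming vLLM's `AsyncLLMEngine` API (exact signatures vary by vLLM version); the model name and the `process_request` helper are placeholders, not this PR's code:

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Each request gets its own coroutine; vLLM batches concurrent requests internally.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))


async def process_request(prompt: str) -> str:
    request_id = str(uuid.uuid4())
    final_output = None
    async for output in engine.generate(prompt, SamplingParams(max_tokens=32), request_id):
        final_output = output  # stream of partial RequestOutputs; keep the last one
    return final_output.outputs[0].text


async def main():
    prompts = ["Hello", "World"]
    results = await asyncio.gather(*(process_request(p) for p in prompts))
    print(results)


asyncio.run(main())
```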

Runs with inflight_updates=True:

  1. Single GPU debug run: Beaker (??% faster: 0m -> 1m).
  2. Single GPU debug run w/ tool use: Beaker (1.69x faster: 27m -> 16m).
  3. Multi-node test script: Beaker (1.9x faster: 19m -> 10m).

And with inflight_updates=False:

  1. Single GPU: Beaker
  2. Multi-node: Beaker

```diff
 refs = [
     engine.update_weight.remote(
-        name, dtype=param.dtype, shape=shape, empty_cache=count == num_params
+        name, dtype=str(param.dtype), shape=shape, empty_cache=count == num_params
```
@finbarrtimbers (Collaborator, Author) commented:

This is needed to support the async engine's serialization, which, for some unknown reason, goes through MessagePack rather than pickle.
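
To illustrate the constraint (an assumption about the transport, not vLLM internals): a MessagePack-style serializer only handles primitive types, so a `torch.dtype` has to cross the RPC boundary as its string form and be recovered from that string on the other side. The helper names below are hypothetical:

```python
import torch


def encode_dtype(dtype: torch.dtype) -> str:
    return str(dtype)  # e.g. torch.bfloat16 -> "torch.bfloat16"


def decode_dtype(name: str) -> torch.dtype:
    # str(torch.bfloat16) == "torch.bfloat16", so strip the prefix and look it up on torch.
    return getattr(torch, name.removeprefix("torch."))


assert decode_dtype(encode_dtype(torch.bfloat16)) is torch.bfloat16
```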


```diff
-    assert dtype == self.model_config.dtype, f"mismatch dtype: src {dtype}, dst {self.model_config.dtype}"
+    assert dtype == str(self.model_config.dtype), (
+        f"mismatch dtype: src {dtype}, dst {self.model_config.dtype}"
+    )
     weight = torch.empty(shape, dtype=dtype, device="cuda")
```
@finbarrtimbers (Collaborator, Author) commented:

As discussed earlier, this is done because of the weird serialization format used by vLLM's async engine.
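
For context, a hedged sketch of what a worker-side `update_weight` might look like under this scheme; the class, the config wrapper, and the choice to allocate with the config's own `torch.dtype` are my assumptions, not this PR's exact code:

```python
from dataclasses import dataclass

import torch


@dataclass
class ModelConfig:
    dtype: torch.dtype


class Worker:
    def __init__(self, model_config: ModelConfig):
        self.model_config = model_config

    def update_weight(self, name: str, dtype: str, shape: tuple, empty_cache: bool = False):
        # dtype arrives as a string (e.g. "torch.bfloat16") because it crossed the
        # MessagePack boundary; validate it against the model config.
        assert dtype == str(self.model_config.dtype), (
            f"mismatch dtype: src {dtype}, dst {self.model_config.dtype}"
        )
        # The assert guarantees the string names the model's dtype, so the config's
        # torch.dtype object can be used for the receive buffer.
        weight = torch.empty(shape, dtype=self.model_config.dtype, device="cuda")
        # ... receive the broadcast tensor into `weight` and load it into the model ...
        if empty_cache:
            torch.cuda.empty_cache()
```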
