[Core] Optimize request output tokens putting back implementation to reduce overhead #45

s5u13b · 2024-10-08T05:37:19Z

Previously, we created a separate thread with an asynchronous event loop to put request output tokens back through ZeroMQ. However, we found that this could seriously interfere with the performance of the instance's step under high request load. So we fall back to using a separate async actor to put request output tokens back instead.
We also found that a non-blocking Ray remote call has millisecond-level overhead that lies on the critical path of the llumlet step. So we start a separate thread to make the remote call, which overlaps the remote call overhead with the llumlet step.
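A minimal sketch of the overlap idea, using only the standard library. It is not the code in this PR: `OutputForwarder`, `submit_fn`, and `demo` are hypothetical names, and `slow_submit` stands in for the non-blocking remote call by sleeping for an assumed 2 ms submission cost.

```python
import queue
import threading
import time


class OutputForwarder:
    """Hands request outputs to a background thread so that the step loop
    only pays for an in-process enqueue, not for submitting the call."""

    def __init__(self, submit_fn):
        # submit_fn is a hypothetical stand-in for the remote call.
        self._submit_fn = submit_fn
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def put_nowait(self, outputs):
        # Called from the step loop: enqueue and return immediately.
        self._pending.put(outputs)

    def _run(self):
        # Background thread: absorbs the per-call submission overhead.
        while True:
            outputs = self._pending.get()
            if outputs is None:  # sentinel pushed by close()
                break
            self._submit_fn(outputs)

    def close(self):
        # Flush everything queued so far, then stop the thread.
        self._pending.put(None)
        self._thread.join()


def demo(num_steps=5, submit_cost_s=0.002):
    received = []

    def slow_submit(outputs):
        time.sleep(submit_cost_s)  # assumed ms-level submission cost
        received.append(outputs)

    forwarder = OutputForwarder(slow_submit)
    start = time.perf_counter()
    for step in range(num_steps):
        forwarder.put_nowait([f"token-{step}"])  # the "step" only enqueues
    step_loop_s = time.perf_counter() - start
    forwarder.close()
    return received, step_loop_s
```

Calling `demo()` returns the forwarded outputs in submission order together with the time spent in the step loop. Because the sleep happens on the background thread, the loop time should normally stay well below `num_steps * submit_cost_s`.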
We treat multiple request output tokens as a single item of the asyncio queue in the ZeroMQ server, which reduces the overhead the API server incurs when awaiting request output tokens.
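The batching idea can be illustrated with a plain `asyncio.Queue`, independent of ZeroMQ. This is a sketch under assumptions, not the server code in this PR: `consume_batched`, `demo_batched`, and `run_demo` are hypothetical names, and the strings stand in for real request outputs.

```python
import asyncio


async def consume_batched(out_queue, expected_total):
    """Drain a queue whose items are lists of outputs, counting awaits."""
    collected = []
    wakeups = 0
    while len(collected) < expected_total:
        batch = await out_queue.get()  # one await yields a whole batch
        wakeups += 1
        collected.extend(batch)
    return collected, wakeups


async def demo_batched(num_outputs=6, batch_size=3):
    out_queue = asyncio.Queue()
    outputs = [f"output-{i}" for i in range(num_outputs)]
    # Producer side: enqueue one list per batch instead of one item per output.
    for i in range(0, num_outputs, batch_size):
        out_queue.put_nowait(outputs[i:i + batch_size])
    return await consume_batched(out_queue, num_outputs)


def run_demo():
    return asyncio.run(demo_batched())
```

With six outputs and a batch size of three, the consumer awaits the queue twice instead of six times, which is the kind of saving the description above refers to.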

llumnix/backends/vllm/llm_engine.py

llumnix/queue/queue_client_base.py

llumnix/config/default.py

github-actions · 2024-10-10T09:24:31Z

prefill	p25	p50	p75	p95	p99	mean
latency(ms)	28552.00	108443.50	183949.50	243163.10	265314.81	109545.74

decode	p25	p50	p75	p95	p99	mean
latency(ms)	52.87	59.64	76.85	163.74	508.52	79.84

github-actions · 2024-10-10T09:44:06Z

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	192.00 MB	200.00 MB	216.00 MB	232.00 MB	312.00 MB	328.00 MB	416.00 MB	424.00 MB	432.00 MB	472.00 MB	480.00 MB	536.00 MB	544.00 MB	560.00 MB	696.00 MB	1008.00 MB
rpc_speed(GB/s)	1.06	1.54	1.79	1.95	2.04	2.16	2.16	2.21	2.28	2.31	2.37	2.33	2.43	2.42	2.45	2.47	2.51	2.53	2.54	2.53	2.39	2.52	2.47	2.41	2.51	2.66	2.84	2.79	3.25	3.00	3.28	3.06	3.05	3.15	3.11	3.29	3.13	3.65

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	216.00 MB	224.00 MB	232.00 MB	264.00 MB	280.00 MB	312.00 MB	320.00 MB	368.00 MB	384.00 MB	432.00 MB	440.00 MB	488.00 MB	536.00 MB
gloo_speed(GB/s)	0.92	1.56	2.04	2.26	2.32	2.53	2.75	2.66	2.85	2.88	2.98	2.89	2.70	2.83	3.71	2.58	2.35	2.57	2.49	2.43	2.24	2.54	2.43	2.68	2.69	2.65	2.79	2.73	1.07	2.71	2.47	2.43	2.49	2.44	2.42	2.68	2.59	2.78

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	232.00 MB	312.00 MB	320.00 MB	352.00 MB	368.00 MB	416.00 MB	424.00 MB	456.00 MB	464.00 MB	480.00 MB	488.00 MB	536.00 MB
nccl_speed(GB/s)	0.19	0.47	0.66	0.85	1.05	1.25	1.35	1.53	1.84	2.21	2.24	2.77	2.42	2.87	2.39	2.78	2.87	3.18	3.55	3.70	3.30	3.70	4.14	3.39	2.79	4.08	4.80	5.36	5.12	5.50	5.26	5.72	5.63	7.05	5.78	4.54	4.13

github-actions · 2024-10-10T10:06:19Z

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	208.00 MB	216.00 MB	224.00 MB	232.00 MB	240.00 MB	272.00 MB	312.00 MB	320.00 MB	344.00 MB	368.00 MB	384.00 MB	416.00 MB	424.00 MB	448.00 MB	464.00 MB	472.00 MB	496.00 MB	536.00 MB	560.00 MB
rpc_speed(GB/s)	1.04	1.51	1.74	1.88	1.98	2.08	2.11	2.18	2.17	2.27	2.25	2.26	2.39	2.38	2.39	2.46	2.51	2.37	2.49	2.56	2.42	2.38	2.48	2.56	2.50	2.60	2.46	2.23	2.42	2.65	2.78	2.64	2.55	2.90	2.79	2.86	2.82	3.05	2.87	2.76	3.35	3.03	3.17

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	224.00 MB	240.00 MB	264.00 MB	312.00 MB	400.00 MB	416.00 MB	432.00 MB	464.00 MB	480.00 MB	536.00 MB
gloo_speed(GB/s)	0.93	1.53	1.98	2.19	2.26	2.56	2.60	2.86	2.90	3.15	2.61	3.09	3.28	2.76	3.25	2.59	2.92	2.66	2.53	2.56	2.58	2.61	2.73	2.49	3.47	2.78	3.59	2.74	2.44	2.35	2.92	2.68	2.52	2.70	2.54

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	312.00 MB	320.00 MB	352.00 MB	416.00 MB	424.00 MB	464.00 MB	480.00 MB	488.00 MB	536.00 MB	712.00 MB
nccl_speed(GB/s)	0.19	0.48	0.66	0.95	0.95	1.40	1.45	1.78	1.99	1.83	2.10	2.32	2.14	2.77	2.56	3.16	2.79	2.77	3.86	3.77	3.52	3.86	3.45	2.65	5.03	5.33	3.47	6.59	5.09	5.40	5.52	2.69	4.92

github-actions · 2024-10-10T10:16:46Z

prefill	p25	p50	p75	p95	p99	mean
latency(ms)	25436.50	91076.50	186233.00	290558.65	323059.03	111883.88

decode	p25	p50	p75	p95	p99	mean
latency(ms)	54.26	60.76	79.69	137.60	373.60	77.96

github-actions · 2024-10-10T10:48:40Z

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	232.00 MB	240.00 MB	312.00 MB	320.00 MB	416.00 MB	424.00 MB	480.00 MB	536.00 MB	912.00 MB
rpc_speed(GB/s)	1.03	1.52	1.75	1.87	2.01	2.11	2.14	2.22	2.23	2.31	2.32	2.45	2.24	2.43	2.47	2.47	2.42	2.52	2.50	2.47	2.59	2.52	2.50	2.28	2.56	2.69	2.68	2.77	2.66	3.15	2.96	2.97	2.89	3.40

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	216.00 MB	312.00 MB	320.00 MB	384.00 MB	416.00 MB	432.00 MB	480.00 MB	560.00 MB	696.00 MB
gloo_speed(GB/s)	0.95	1.58	1.92	2.17	2.49	2.45	2.65	2.80	2.90	2.86	2.57	2.86	3.00	3.04	2.56	2.67	2.62	2.76	2.80	2.31	2.50	2.70	2.43	2.72	2.42	1.97	1.92	2.69	2.61	2.07	2.76	1.76	2.56	2.33

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	192.00 MB	200.00 MB	232.00 MB	280.00 MB	312.00 MB	328.00 MB	416.00 MB	424.00 MB	432.00 MB	448.00 MB	464.00 MB	472.00 MB	480.00 MB	528.00 MB	536.00 MB	696.00 MB
nccl_speed(GB/s)	0.20	0.51	0.74	0.93	1.12	1.22	1.40	1.57	1.90	1.83	2.06	2.60	2.48	2.28	2.24	2.90	2.41	3.03	3.40	3.54	3.43	3.84	3.03	2.37	4.46	2.61	4.40	4.39	6.50	5.17	5.72	5.93	5.68	6.11	5.42	4.47	4.33	4.98

github-actions · 2024-10-10T11:04:30Z

prefill	p25	p50	p75	p95	p99	mean
latency(ms)	25736.25	108441.00	196899.75	232071.45	244992.40	111121.44

decode	p25	p50	p75	p95	p99	mean
latency(ms)	51.61	56.68	71.35	135.43	488.73	74.40

github-actions · 2024-10-10T12:26:19Z

prefill	p25	p50	p75	p95	p99	mean
latency(ms)	28228.50	97536.50	205628.50	236255.10	240796.99	112317.38

decode	p25	p50	p75	p95	p99	mean
latency(ms)	51.53	57.60	72.66	118.24	205.46	69.30

github-actions · 2024-10-10T12:46:38Z

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	224.00 MB	240.00 MB	280.00 MB	288.00 MB	296.00 MB	312.00 MB	416.00 MB	424.00 MB	432.00 MB	488.00 MB	704.00 MB
rpc_speed(GB/s)	1.04	1.53	1.80	1.90	2.00	2.06	2.19	2.12	2.22	2.29	2.27	2.26	2.32	2.40	2.45	2.43	2.55	2.50	2.41	2.44	2.54	2.51	2.22	2.43	2.59	2.56	2.80	2.55	2.72	2.78	2.77	2.87	2.90	2.99	3.13	3.23

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	200.00 MB	208.00 MB	224.00 MB	280.00 MB	312.00 MB	320.00 MB	368.00 MB	416.00 MB	440.00 MB	472.00 MB	480.00 MB	560.00 MB	568.00 MB	912.00 MB
gloo_speed(GB/s)	0.96	1.51	1.84	2.17	2.35	2.30	2.66	2.76	2.71	2.66	2.96	2.90	3.00	3.19	3.34	2.61	2.46	2.65	2.56	2.48	2.19	3.02	2.41	3.12	2.67	2.53	2.30	2.44	2.49	2.64	2.52	1.90	2.80	2.69	2.74	2.56	2.60	2.31

migration_size	8.00 MB	16.00 MB	24.00 MB	32.00 MB	40.00 MB	48.00 MB	56.00 MB	64.00 MB	72.00 MB	80.00 MB	88.00 MB	96.00 MB	104.00 MB	112.00 MB	120.00 MB	128.00 MB	136.00 MB	144.00 MB	152.00 MB	160.00 MB	168.00 MB	176.00 MB	184.00 MB	192.00 MB	208.00 MB	216.00 MB	224.00 MB	240.00 MB	280.00 MB	312.00 MB	352.00 MB	416.00 MB	440.00 MB	480.00 MB	488.00 MB	536.00 MB
nccl_speed(GB/s)	0.20	0.44	0.63	0.85	1.01	1.21	1.49	1.63	1.87	1.93	1.87	2.34	2.16	2.40	2.59	2.76	2.78	3.06	3.61	3.11	2.94	3.75	3.55	3.42	5.22	4.07	4.65	4.73	2.94	4.65	5.28	4.68	5.67	5.98	5.64	3.96

s5u13b requested review from zhypku, KuilongCui and ZeldaHuang October 8, 2024 05:37

s5u13b changed the title ~~[Misc] Use async actor to put request output tokens back through zeromq~~ [Core] Use async actor to put request output tokens back through zeromq Oct 8, 2024

s5u13b changed the title ~~[Core] Use async actor to put request output tokens back through zeromq~~ [Core] Refine request output tokens putting back implementation to reduce overhead Oct 9, 2024

KuilongCui reviewed Oct 9, 2024

llumnix/backends/vllm/llm_engine.py (outdated, resolved)

KuilongCui approved these changes Oct 9, 2024

zhypku reviewed Oct 10, 2024

llumnix/backends/vllm/llm_engine.py (resolved)

s5u13b changed the title ~~[Core] Refine request output tokens putting back implementation to reduce overhead~~ [Core] Optimize request output tokens putting back implementation to reduce overhead Oct 10, 2024

zhypku approved these changes Oct 10, 2024

s5u13b added 6 commits October 10, 2024 07:36

Batched request output in zeromq

1f33743

Use actor to async put request output

d27b722

Fix lint

d4c784e

Use thread to overlap the put queue remote call overhead

bb8c951

Fix

aab3695

Fix

70f425a

s5u13b force-pushed the zeromq-refine branch from 3649b02 to 70f425a Compare October 10, 2024 09:06

Fix

9e8e579

KuilongCui approved these changes Oct 10, 2024

llumnix/queue/queue_client_base.py (outdated, resolved)

llumnix/config/default.py (outdated, resolved)

s5u13b added 2 commits October 10, 2024 09:34

Fix

dcbf02d

Fix

987f5fe

Pylint

800fcf1

s5u13b added 2 commits October 10, 2024 11:20

Fix offline inference

282b0b3

Minor

b336780

s5u13b added 3 commits October 10, 2024 11:50

Add TODO

46fe9e9

Fix ray request output queue failover

3e24831

Fix pylint

aee1868

s5u13b merged commit 6de3bd7 into main Oct 11, 2024
14 checks passed

s5u13b deleted the zeromq-refine branch October 17, 2024 02:43
