Open
400 commits
4dd25d6
Attempt at fixing issue
finbarrtimbers Sep 12, 2025
703599e
Moves delete
finbarrtimbers Sep 12, 2025
2c63fe7
Fixed bug in tool code.
finbarrtimbers Sep 12, 2025
78dd885
Ran linter. Tool script seems to run fine.
finbarrtimbers Sep 12, 2025
aa26e41
Fixed bug in non-tool use.
finbarrtimbers Sep 12, 2025
fcd4d78
Updated code to fix index bug.
finbarrtimbers Sep 12, 2025
02a403b
Undid host network change to tool_grpo_fast.sh.
finbarrtimbers Sep 12, 2025
80194ec
Fixed no host networking command from tool_grpo_fast.sh.
finbarrtimbers Sep 12, 2025
22882af
Fixed incorrect index field.
finbarrtimbers Sep 12, 2025
c15f081
Fixed index
finbarrtimbers Sep 13, 2025
02fa45b
Fix request tracking and training_step propagation for non-tool use
finbarrtimbers Sep 13, 2025
5543750
Fix KeyError by accessing training_step before cleanup
finbarrtimbers Sep 13, 2025
293b498
Add assertion to verify all n sub-requests are tracked
finbarrtimbers Sep 13, 2025
96ef484
Fix index field timing issue in non-tool mode with n>1
finbarrtimbers Sep 13, 2025
1092363
Fix CompletionOutput index for non-tool mode with n>1
finbarrtimbers Sep 13, 2025
812699b
Changed loop
finbarrtimbers Sep 14, 2025
790b2cc
Set verbose to true
finbarrtimbers Sep 14, 2025
b5d7567
Fixed tracking key bug and added a health check for prefetch worker t…
finbarrtimbers Sep 14, 2025
13db536
changed removal from vllm_active_requests to occur after _finalize
finbarrtimbers Sep 14, 2025
55dfcc9
Fix sub-request tracking race condition in non-tools mode
finbarrtimbers Sep 14, 2025
fb9344b
Removes debugging code from `backfill-prompts` (#1008)
finbarrtimbers Sep 15, 2025
afbc8ab
Merge branch 'main' into backfill-prompts
finbarrtimbers Sep 15, 2025
cd2c7d8
Fixed linter error.
finbarrtimbers Sep 15, 2025
91352b2
Add inflight_updates argument to enable quick pausing/resumption
finbarrtimbers Sep 15, 2025
9562997
Refactor process_from_queue loop to use _should_exit method
finbarrtimbers Sep 15, 2025
9111672
Removed comment.
finbarrtimbers Sep 15, 2025
3279680
Longer benchmark
finbarrtimbers Sep 15, 2025
0fa431d
undid changes to benchmark
finbarrtimbers Sep 15, 2025
10c2b10
Now, uses the async engine.
finbarrtimbers Sep 16, 2025
6279598
Fixed errors
finbarrtimbers Sep 16, 2025
b6f576d
Removed host networking from single
finbarrtimbers Sep 16, 2025
02db12e
much simpler flow
finbarrtimbers Sep 16, 2025
f89ecbe
Removed tracking variable.
finbarrtimbers Sep 16, 2025
3f24d9e
Cleans up lifecycle.
finbarrtimbers Sep 16, 2025
9aec035
Updated generate_one_completion.
finbarrtimbers Sep 16, 2025
787b6bc
Cleaned up main loop.
finbarrtimbers Sep 16, 2025
af08dc1
Refactored _process_request significantly
finbarrtimbers Sep 16, 2025
ea42341
Simplified process_from_queue
finbarrtimbers Sep 16, 2025
0fefb63
Updates code
finbarrtimbers Sep 16, 2025
4ed483d
Updated clusters
finbarrtimbers Sep 16, 2025
a342d68
updated script
finbarrtimbers Sep 16, 2025
e83fb31
updated script
finbarrtimbers Sep 16, 2025
459c9b0
updated script
finbarrtimbers Sep 16, 2025
383ad6c
Merge branch 'main' into async-engine
finbarrtimbers Sep 16, 2025
a7cfcee
Updated script
finbarrtimbers Sep 16, 2025
cd05a9c
Updated script to match launch_benchmark.sh
finbarrtimbers Sep 16, 2025
1e76887
Fixed bug.
finbarrtimbers Sep 16, 2025
7f41a14
updated priority
finbarrtimbers Sep 16, 2025
c58e055
Fixed kv_cache_specs
finbarrtimbers Sep 16, 2025
b1e6167
Fixed kv_cache_specs
finbarrtimbers Sep 16, 2025
9e6920b
Added logging
finbarrtimbers Sep 16, 2025
5028e72
Fixed methods
finbarrtimbers Sep 16, 2025
dd37616
Ran linter.
finbarrtimbers Sep 16, 2025
43dfecb
Fix blocking ray.get in async actor
finbarrtimbers Sep 16, 2025
076d310
Improve async _should_stop to prevent blocking and duplicate requests
finbarrtimbers Sep 16, 2025
7831265
Added timeouts for all the scripts
finbarrtimbers Sep 17, 2025
706264b
Now, we always run the engine loop.
finbarrtimbers Sep 17, 2025
bb0b08f
Fixed engine initialization.
finbarrtimbers Sep 17, 2025
4a4715c
added await to generator.
finbarrtimbers Sep 17, 2025
007f089
Changed logging for vllm
finbarrtimbers Sep 17, 2025
6658cc1
Fixed logging levels
finbarrtimbers Sep 17, 2025
8195d65
Added more timing
finbarrtimbers Sep 17, 2025
aa523a5
Fixed deadlock
finbarrtimbers Sep 17, 2025
70e8778
set logging to debug for vllm
finbarrtimbers Sep 17, 2025
fa96dfe
set logging to debug for vllm_utils3.py
finbarrtimbers Sep 17, 2025
94c8185
Fixed timeout bug
finbarrtimbers Sep 17, 2025
48a8324
an attempted fix
finbarrtimbers Sep 17, 2025
03fe974
Add async engine implementation for vLLM
finbarrtimbers Sep 17, 2025
5c9b277
Merge branch 'async-engine-rebased' into async-engine
finbarrtimbers Sep 17, 2025
7a9fd8c
Attempt at fixing
finbarrtimbers Sep 17, 2025
1ce0b89
Add detailed logging to track vLLM generation hang
finbarrtimbers Sep 17, 2025
442525c
fixed error
finbarrtimbers Sep 17, 2025
f2d18da
Fixed wait stalling
finbarrtimbers Sep 17, 2025
da54862
fix issues
finbarrtimbers Sep 17, 2025
534953e
Add comprehensive logging to debug queue hang issue
finbarrtimbers Sep 17, 2025
8aa3ff9
Add detailed logging to trace request flow through queues
finbarrtimbers Sep 17, 2025
10cce87
Fix process_from_queue hanging issue
finbarrtimbers Sep 18, 2025
a43772b
Add assertion to detect race condition in request ID reuse
finbarrtimbers Sep 23, 2025
bd9306f
Fix premature exit in vllm_utils3 when using tools
finbarrtimbers Sep 23, 2025
9a21126
Removed metadata from logs
finbarrtimbers Sep 23, 2025
338c987
Add detailed logging to _process_request to debug tool use hang
finbarrtimbers Sep 23, 2025
9754f52
Add detailed logging around tokenizer access to debug hang
finbarrtimbers Sep 23, 2025
a52d9b6
Updated endpoint
finbarrtimbers Sep 23, 2025
3763476
Add detailed logging between lines 653-682 to identify exact hang loc…
finbarrtimbers Sep 23, 2025
7828a69
Add detailed logging to debug token_ids access hang
finbarrtimbers Sep 23, 2025
1e3eb39
Fix tuple/list concatenation issue in vllm_utils3.py
finbarrtimbers Sep 23, 2025
9fc1bb9
Fix tuple concatenation issue in tool execution path
finbarrtimbers Sep 23, 2025
e03df7d
Add detailed debugging to track types during token concatenation
finbarrtimbers Sep 23, 2025
bc303d5
Fix attribute access for max_model_len in tool execution path
finbarrtimbers Sep 23, 2025
6106303
updated to use timeouterror
finbarrtimbers Sep 23, 2025
4a05808
fixed syntax error
finbarrtimbers Sep 23, 2025
8692271
More logging
finbarrtimbers Sep 23, 2025
ba528ff
Updated code
finbarrtimbers Sep 23, 2025
765470f
Fixed loop
finbarrtimbers Sep 23, 2025
3833725
Fixed logging
finbarrtimbers Sep 23, 2025
82cf644
Attempted to mirror the synchronous loop.
finbarrtimbers Sep 23, 2025
517f7d3
hold the lock less.
finbarrtimbers Sep 23, 2025
8ec2912
less frequent logging
finbarrtimbers Sep 24, 2025
38ac8e3
Updated behaviour to match
finbarrtimbers Sep 24, 2025
c77145c
Fixed max exceeded calls
finbarrtimbers Sep 24, 2025
e7f8809
Updated tool path behaviour
finbarrtimbers Sep 24, 2025
09abb3d
Fix request ID collision in async engine for tool continuations
finbarrtimbers Sep 24, 2025
41ac860
Removed the lock
finbarrtimbers Sep 24, 2025
a4d78f9
Cleaned up PR.
finbarrtimbers Sep 24, 2025
bf9eb4c
Update async engine (#1043)
finbarrtimbers Oct 1, 2025
6268971
Removed debugging code
finbarrtimbers Oct 1, 2025
fdbf577
Ran linter
finbarrtimbers Oct 1, 2025
c768c5a
Merge branch 'main' into async-engine
finbarrtimbers Oct 1, 2025
46d0ff0
Fixed issue.
finbarrtimbers Oct 1, 2025
bba171a
Minimized differences between new code and old.
finbarrtimbers Oct 1, 2025
bc6bb3b
Merge branch 'main' into async-engine
finbarrtimbers Oct 1, 2025
70de1f5
Fixed cluster warning in large_test_script.sh.
finbarrtimbers Oct 1, 2025
294b5c1
Cleaned up PR.
finbarrtimbers Oct 1, 2025
c33e091
Updated assert threaded actor class.
finbarrtimbers Oct 1, 2025
75dc516
Fixed class.
finbarrtimbers Oct 1, 2025
ac07388
Merge branch 'main' into async-engine
finbarrtimbers Oct 1, 2025
aa7d0b2
Set default values for large_test_script.sh
finbarrtimbers Oct 1, 2025
7400450
set enforce eager
finbarrtimbers Oct 1, 2025
65cc8f2
now, we set inflight updates false
finbarrtimbers Oct 1, 2025
b909b84
Now, we don't set enforce eager.
finbarrtimbers Oct 1, 2025
dec49db
Merge branch 'main' into async-engine
finbarrtimbers Oct 1, 2025
c1cbe4a
Updated large_test_script.sh
finbarrtimbers Oct 1, 2025
b45c123
Fixed env var issue
finbarrtimbers Oct 1, 2025
ec252e3
Now we set inflight updates true
finbarrtimbers Oct 1, 2025
d44d2a7
trying to start/stop background loop
finbarrtimbers Oct 1, 2025
8697831
Merge branch 'main' into async-engine
finbarrtimbers Oct 1, 2025
e78f6fc
Removed start of loop
finbarrtimbers Oct 1, 2025
2a17b18
now, we use sleep/wake_up to make things work.
finbarrtimbers Oct 1, 2025
77f48b8
Set inflight true on single
finbarrtimbers Oct 1, 2025
27494dd
Removed sleep/wakeup code
finbarrtimbers Oct 1, 2025
8d6dad9
Updated code
finbarrtimbers Oct 1, 2025
2642fe3
Fixed typo
finbarrtimbers Oct 1, 2025
4f069a2
Ran linter
finbarrtimbers Oct 1, 2025
3acc95c
Fixed bug
finbarrtimbers Oct 1, 2025
6b6ce66
switched to use the v1 engine.
finbarrtimbers Oct 3, 2025
5d9af9c
updated code
finbarrtimbers Oct 3, 2025
fc678b4
Fixed issue where we were calling v0 APIs.
finbarrtimbers Oct 3, 2025
f9f9d13
Fixed hanging issue
finbarrtimbers Oct 3, 2025
9863c82
Updated code to remove pause_generation calls.
finbarrtimbers Oct 3, 2025
77f3ffa
updated code
finbarrtimbers Oct 3, 2025
b85eb97
Fixed abort issue
finbarrtimbers Oct 3, 2025
36810e6
updated code
finbarrtimbers Oct 3, 2025
d395428
Add diagnostic logging and fix vLLM v1 compatibility
finbarrtimbers Oct 3, 2025
25fa2ce
Set vllm logs to debug
finbarrtimbers Oct 3, 2025
cdfebaf
Updated vllm version to 10.2.
finbarrtimbers Oct 3, 2025
2d5e9a9
Updated flash attn version
finbarrtimbers Oct 3, 2025
0896efa
Ran uv sync
finbarrtimbers Oct 3, 2025
7c14485
Fix AsyncLLMEngine hanging by creating it within running event loop
finbarrtimbers Oct 3, 2025
a38b870
Move _init_engine_async to module-level function
finbarrtimbers Oct 3, 2025
191e6dd
Add comprehensive diagnostic logging for async task tracing
finbarrtimbers Oct 3, 2025
17e647d
Add diagnostic logging to trace process_from_queue exit behavior
finbarrtimbers Oct 3, 2025
ec05c5a
Add diagnostic logging for weight sync stop_requested toggle
finbarrtimbers Oct 3, 2025
55517f6
Add diagnostic logging to trace weight broadcast deadlock
finbarrtimbers Oct 3, 2025
5a729b4
Add event loop diagnostic logging to update_weight
finbarrtimbers Oct 3, 2025
a599b0d
Fix event loop mismatch in async RPC calls
finbarrtimbers Oct 3, 2025
302e4d5
Add assertions to verify event loop consistency
finbarrtimbers Oct 3, 2025
2eb35de
Fix async RPC deadlock by using sync-to-async bridge
finbarrtimbers Oct 3, 2025
8729027
Tried more fixes
finbarrtimbers Oct 5, 2025
0b3511d
Updated to remove generate thread
finbarrtimbers Oct 6, 2025
87ccb6d
Updated code to add processing
finbarrtimbers Oct 6, 2025
294577d
Change architecture
finbarrtimbers Oct 6, 2025
149d6ec
Now we set vllm_insecure
finbarrtimbers Oct 6, 2025
d647af4
Set inflight false
finbarrtimbers Oct 6, 2025
30b531d
removed message serialization
finbarrtimbers Oct 6, 2025
8179d30
removed some logs
finbarrtimbers Oct 6, 2025
d167793
Another attempt to fix hang
finbarrtimbers Oct 6, 2025
2647209
Merge branch 'main' into async-engine
finbarrtimbers Oct 6, 2025
c3f3966
lots of logging changes
finbarrtimbers Oct 6, 2025
2eafb9c
Ran linter.
finbarrtimbers Oct 6, 2025
b542744
Reset scripts.
finbarrtimbers Oct 6, 2025
292a272
Undid changes to mason.py
finbarrtimbers Oct 6, 2025
c9b9c59
Cleaned up PR.
finbarrtimbers Oct 6, 2025
7c6fbef
Cleaned up PR.
finbarrtimbers Oct 6, 2025
d5cd6f7
Cleaned up PR.
finbarrtimbers Oct 6, 2025
e019aa8
Cleaned up PR.
finbarrtimbers Oct 6, 2025
2c80999
Removed timeouterror
finbarrtimbers Oct 6, 2025
e72cfa5
Cleaned up PR
finbarrtimbers Oct 6, 2025
5c4d405
Uses async for
finbarrtimbers Oct 6, 2025
f33190b
Now, we handle tools.
finbarrtimbers Oct 6, 2025
ad6986f
Cleaned assert code.
finbarrtimbers Oct 6, 2025
5d10dc3
Attempt at fixing code.
finbarrtimbers Oct 6, 2025
f690312
Cleaned up assert
finbarrtimbers Oct 6, 2025
c1f83c0
Another attempt at fixing the bug
finbarrtimbers Oct 6, 2025
4165820
Updated code
finbarrtimbers Oct 6, 2025
cc3ba93
Fix tool execution hanging by using dedicated executor instead of asy…
finbarrtimbers Oct 6, 2025
daba023
Fix async event loop issue - use get_running_loop() instead of get_ev…
finbarrtimbers Oct 6, 2025
72eb57e
Add logging to check if executor is None during tool execution
finbarrtimbers Oct 6, 2025
1d42889
Add detailed logging to track tool execution and triggering
finbarrtimbers Oct 6, 2025
591d11f
Fix async event loop hanging by using unique request IDs for each ite…
finbarrtimbers Oct 6, 2025
bf6728a
Add detailed logging to trace async generation hang
finbarrtimbers Oct 6, 2025
d1f0a50
Add detailed logging after tool execution to trace iteration hang
finbarrtimbers Oct 6, 2025
b599b15
Add fine-grained logging to debug model config access hang
finbarrtimbers Oct 6, 2025
08db870
Add detailed logging to debug prompt concatenation hang
finbarrtimbers Oct 7, 2025
a59d9f0
Cache prompt_token_ids to avoid hang when accessing TokensPrompt prop…
finbarrtimbers Oct 7, 2025
3675f06
Fix undefined variable in assert_threaded_actor
finbarrtimbers Oct 7, 2025
9a9a5c0
Updated code
finbarrtimbers Oct 7, 2025
cafc0d2
Set inflight false
finbarrtimbers Oct 7, 2025
49724e8
Fixed duplicate flag
finbarrtimbers Oct 7, 2025
7a49068
Simplified significantly
finbarrtimbers Oct 8, 2025
be451be
Removed logs
finbarrtimbers Oct 8, 2025
2cf5750
Simplified threading model
finbarrtimbers Oct 8, 2025
025fb40
Added handling for inflight_updates
finbarrtimbers Oct 8, 2025
97610cd
Inlined generate_one_completion
finbarrtimbers Oct 8, 2025
10f9e1a
Clean up
finbarrtimbers Oct 8, 2025
f32879f
More clean up
finbarrtimbers Oct 8, 2025
cb10def
Set inflight to true
finbarrtimbers Oct 8, 2025
9708fc4
Cleaned up code.
finbarrtimbers Oct 8, 2025
fff7d87
lots of cleanup
finbarrtimbers Oct 8, 2025
03db661
Major refactor
finbarrtimbers Oct 8, 2025
e0a0960
More PR cleanup
finbarrtimbers Oct 8, 2025
60897fe
Merge branch 'main' into async-engine
finbarrtimbers Oct 8, 2025
2f592b5
Fixed code
finbarrtimbers Oct 8, 2025
d2203de
Cleaned up code.
finbarrtimbers Oct 8, 2025
35b4a1c
Merge branch 'main' into async-engine
finbarrtimbers Oct 9, 2025
e42a953
undid changes
finbarrtimbers Oct 9, 2025
e51ecce
Removed self.logger
finbarrtimbers Oct 9, 2025
6cdf576
A bunch of changes to minimize differences.
finbarrtimbers Oct 9, 2025
6e2b3d0
Merge branch 'main' into async-engine
finbarrtimbers Oct 9, 2025
84ecb7a
fixed error
finbarrtimbers Oct 9, 2025
dcbfdf7
Cleaned up code.
finbarrtimbers Oct 9, 2025
710919c
use mp
finbarrtimbers Oct 9, 2025
8372871
Set multiprocessing
finbarrtimbers Oct 10, 2025
f74cc92
Updated code
finbarrtimbers Oct 10, 2025
0994057
trying more changes
finbarrtimbers Oct 10, 2025
a963332
fixing logprob issue
finbarrtimbers Oct 10, 2025
0b8d273
Update code
finbarrtimbers Oct 10, 2025
9e90229
Old async engine2 (#1075)
finbarrtimbers Oct 10, 2025
9ec9622
Merge branch 'main' into async-engine
finbarrtimbers Oct 10, 2025
6c58ad6
Merge branch 'main' into async-engine
finbarrtimbers Oct 14, 2025
c6eec7a
Merge branch 'main' into async-engine
finbarrtimbers Oct 15, 2025
e2824d5
Ran uv sync
finbarrtimbers Oct 15, 2025
8a81bcf
Simpler pyproject.toml
finbarrtimbers Oct 15, 2025
76bbbf0
Updated uv.lock
finbarrtimbers Oct 15, 2025
d7f3f78
Updated.
finbarrtimbers Oct 16, 2025
eab53dd
Cleaned up code.
finbarrtimbers Oct 16, 2025
8eaeaad
Updated code to correctly calculate kv_cache_info
finbarrtimbers Oct 16, 2025
3e77b18
Updated tests.
finbarrtimbers Oct 16, 2025
599e5b0
Updated code to use public APIs.
finbarrtimbers Oct 16, 2025
de8ad2b
Updated kv cache method
finbarrtimbers Oct 16, 2025
a8645b1
Another attempt at fixing kv cache specs
finbarrtimbers Oct 16, 2025
353561a
Updated code
finbarrtimbers Oct 16, 2025
a7dc5c7
Added health check on loop
finbarrtimbers Oct 16, 2025
ff23618
Attempt at fixing
finbarrtimbers Oct 16, 2025
04cca39
updated kv cache code
finbarrtimbers Oct 16, 2025
7b5a7d4
removed recursive health check
finbarrtimbers Oct 16, 2025
16ee997
Merge branch 'main' into async-engine
finbarrtimbers Oct 16, 2025
f3aaff1
Merge branch 'main' into async-engine
finbarrtimbers Oct 16, 2025
5a625df
Removed constant
finbarrtimbers Oct 16, 2025
9a04eba
fix tool use
hamishivi Oct 17, 2025
c7e403f
prevent concurrency issue
hamishivi Oct 17, 2025
4 changes: 2 additions & 2 deletions open_instruct/grpo_fast.py
@@ -870,7 +870,7 @@ def broadcast_to_vllm(self):
shape = param.shape if self.args.deepspeed_stage != 3 else param.ds_shape
refs = [
engine.update_weight.remote(
- name, dtype=param.dtype, shape=shape, empty_cache=count == num_params
+ name, dtype=str(param.dtype), shape=shape, empty_cache=count == num_params
)
for engine in self.vllm_engines
]
@@ -884,7 +884,7 @@ def broadcast_to_vllm(self):
shape = param.shape if self.args.deepspeed_stage != 3 else param.ds_shape
refs = [
engine.update_weight.remote(
- name, dtype=param.dtype, shape=shape, empty_cache=count == num_params
+ name, dtype=str(param.dtype), shape=shape, empty_cache=count == num_params
)
for engine in self.vllm_engines
]
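The change above passes the dtype to `update_weight` as a string (for example `"torch.bfloat16"`) instead of a `torch.dtype` object. The receiving side is not shown in this diff, so the following is only a minimal sketch of how such a string could be turned back into a dtype. `parse_dtype_string` is a hypothetical helper, and `fake_torch` is a stand-in namespace so the sketch runs without torch installed; real code would pass the actual `torch` module.

```python
from types import SimpleNamespace


def parse_dtype_string(dtype_str, torch_module):
    """Resolve a string such as 'torch.bfloat16' to torch_module.bfloat16.

    Hypothetical helper, not taken from the PR.
    """
    prefix, _, attr = dtype_str.partition(".")
    if prefix != "torch" or not attr:
        raise ValueError(f"unexpected dtype string: {dtype_str!r}")
    return getattr(torch_module, attr)


# Stand-in for the real torch module (assumption, for illustration only).
fake_torch = SimpleNamespace(bfloat16="<bfloat16>", float32="<float32>")

# str(torch.bfloat16) is "torch.bfloat16", which is the form the sender now emits.
resolved = parse_dtype_string("torch.bfloat16", fake_torch)
```

Rejecting strings without the `torch.` prefix keeps a malformed value from silently resolving to an unrelated attribute.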
6 changes: 2 additions & 4 deletions open_instruct/test_vllm_utils3.py
@@ -62,7 +62,7 @@ def create_mock_logprobs(token_ids):
"is_eval": False,
"dataset_index": 43039,
"training_step": 1,
- "prompt_tokens": 10,
+ "prompt_token_ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

Bug: Token ID Mismatch in Mock Request

The prompt_token_ids in mock_request_output is inconsistent with the prompt_token_ids defined in request_metadata across two test cases. This mismatch in token counts could lead to inaccurate test validation.
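One way to guard against the mismatch described above is to compare the two sources of prompt tokens before calling `process_completed_request`. This is a minimal sketch only: `find_prompt_mismatches` is a hypothetical helper, and the `mock_prompt_ids` values are made up for illustration, since the construction of `mock_request_output` is not shown in this diff. The request IDs and the metadata token lists are taken from the two test cases.

```python
def find_prompt_mismatches(request_metadata, prompt_ids_by_request):
    """Return request IDs whose metadata prompt tokens differ from the mock output's.

    Hypothetical helper, not part of the PR.
    """
    mismatched = []
    for request_id, metadata in request_metadata.items():
        expected = list(metadata["prompt_token_ids"])
        actual = list(prompt_ids_by_request.get(request_id, []))
        if expected != actual:
            mismatched.append(request_id)
    return mismatched


# Metadata as defined in the two test cases of this diff.
request_metadata = {
    "train_1_43039": {"prompt_token_ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    "eval_2_200": {"prompt_token_ids": [1, 2, 3, 4, 5]},
}

# Assumed prompt tokens carried by each mock output (illustrative values).
mock_prompt_ids = {
    "train_1_43039": [1, 2, 3],        # deliberately shorter than the metadata
    "eval_2_200": [1, 2, 3, 4, 5],     # matches the metadata
}

mismatches = find_prompt_mismatches(request_metadata, mock_prompt_ids)
```

In a test, asserting that `mismatches` is empty would surface the inconsistency before any result validation runs.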


"start_time": 1000.0,
}
}
@@ -74,7 +74,6 @@ def create_mock_logprobs(token_ids):
result, is_eval = process_completed_request(
request_id="train_1_43039",
outs=[mock_request_output],
- tracking={}, # Not used for this test
current_time=1001.0,
tools=tools,
request_metadata=request_metadata,
@@ -132,7 +131,7 @@ def create_mock_logprobs(token_ids):
"is_eval": True,
"dataset_index": 200,
"training_step": 2,
- "prompt_tokens": 5,
+ "prompt_token_ids": [1, 2, 3, 4, 5],
"start_time": 2000.0,
}
}
@@ -141,7 +140,6 @@ def create_mock_logprobs(token_ids):
result, is_eval = process_completed_request(
request_id="eval_2_200",
outs=[mock_request_output],
- tracking={}, # Not used for this test
current_time=2000.5,
tools=None,
request_metadata=request_metadata,