[Docs] add integrate_with_tunix.md #202

aolemila · 2025-09-19T07:32:23Z

No description provided.

wang2yn84 · 2025-09-22T20:01:15Z

Thank you very much for the proposal! It generally looks good to me. There is one thing I'd like to point out: we have two modes for running Tunix, multi-controller JAX and Pathways. To comply with Pathways' single-process requirement, we need to run the sglang backend in the main process. Can you describe in the proposal what changes are necessary?

wang2yn84

Thank you!

docs/features/integrate_with_tunix.md

jimoosciuc · 2025-09-23T03:48:06Z

@wang2yn84 Currently we have relatively little understanding of Pathways, and we're not quite clear on how it implements single controller + distributed computing, as well as how to use and integrate it. It would be great if there were some more detailed documentation available.

wang2yn84 · 2025-09-24T05:50:33Z

docs/features/integrate_with_tunix.md

+
+        nnx.update(self._model, current_state)
+    else:
+        nnx.update(self._model, updated_weights)


Tunix's implementation of the training model might differ from SGLang's version, so weight-name mapping, and potentially transposing or other operations, might be necessary. It really depends on the model implementation on both sides.

Got it! Thanks for your suggestion. cc @JamesBrianD

@wang2yn84 I found that models (like llama3) implement methods such as to_hf_mappings, lora_to_hf_mappings, etc. These methods are used for update_params. Different inference engines use different weight names; will Tunix follow this approach and implement a mapping for every model and every framework? cc @JamesBrianD

Another question: Not all models implement all methods. For example, llama3 does not implement to_hf_hook_fns, but model.to_hf_hook_fns() will be called when initializing VllmSampler in rl/rollout/vllm_rollout.py. This confuses me. Could you explain it for me?

    self._sampler = vllm_sampler.VllmSampler(
        tokenizer=tokenizer,
        config=vllm_sampler.VllmConfig(
            max_model_len=cache_config_or_size,
            mesh=mesh,
            model_version=model_version,
            hbm_utilization=hbm_utilization,
            init_with_random_weights=init_with_random_weights,
            tpu_backend_type=tpu_backend_type,
            mapping_config=vllm_sampler.MappingConfig(
                to_hf_mappings=model.to_hf_mappings(),
                to_hf_transpose_keys=model.to_hf_transpose_keys(),
                to_hf_hook_fns=model.to_hf_hook_fns(),
                lora_to_hf_mappings=model.lora_to_hf_mappings(),
                lora_config=lora_config,
            ),
        ),
    )

@wang2yn84 Thank you for your suggestion. We will consider adding key_mappings and transpose_keys (following the example of utils.transfer_state_with_mappings) to handle the mapping and transformation between states.
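
To make the key_mappings / transpose_keys idea concrete, here is a minimal sketch of how a trainer-side state could be renamed and transposed for the rollout side. Everything in it is an assumption for illustration: `remap_state`, the weight names, and the use of plain nested lists in place of real arrays are hypothetical, and it does not reproduce the signature or behavior of utils.transfer_state_with_mappings.

```python
from typing import Dict, Iterable, List

Matrix = List[List[float]]


def transpose(matrix: Matrix) -> Matrix:
    """Swap rows and columns of a 2D nested list."""
    return [list(row) for row in zip(*matrix)]


def remap_state(
    src_state: Dict[str, Matrix],
    key_mappings: Dict[str, str],
    transpose_keys: Iterable[str] = (),
) -> Dict[str, Matrix]:
    """Rename every source weight and transpose the ones listed.

    Hypothetical helper: key_mappings maps trainer-side names to
    rollout-side names, transpose_keys lists trainer-side names whose
    layout differs between the two implementations.
    """
    to_transpose = set(transpose_keys)
    dst_state: Dict[str, Matrix] = {}
    for src_key, value in src_state.items():
        if src_key not in key_mappings:
            # Fail loudly so a missing mapping is not silently dropped.
            raise KeyError(f"no mapping for source weight {src_key!r}")
        if src_key in to_transpose:
            value = transpose(value)
        dst_state[key_mappings[src_key]] = value
    return dst_state


# Made-up weight names, for illustration only.
trainer_state = {"layers.0.attn.q_proj.kernel": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}
rollout_state = remap_state(
    trainer_state,
    key_mappings={"layers.0.attn.q_proj.kernel": "model.layers.0.self_attn.q_proj.weight"},
    transpose_keys=["layers.0.attn.q_proj.kernel"],
)
```

Raising on an unmapped key is a design choice of this sketch; a real integration might instead skip or log such weights.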

wang2yn84 · 2025-09-24T05:53:19Z

Yea completely understand! Can you try https://arxiv.org/abs/2203.12533? I believe this paper can give you a high level view of the system. Let me know if it helps.

aolemila · 2025-09-24T08:03:46Z

@wang2yn84

Based on the paper you provided and our internal discussion, here are the changes we need to make.

Change 1: Currently, Engine creates subprocesses after initialization. To meet the single-process requirement in Pathways, we will rewrite the related parts of Engine.

Change 2: We get TPUs through jax.devices() in our code. How can we get devices in Pathways?

Other discussions:

How do we detect a Pathways environment? Based on the Tunix code, we decided to check the environment variable JAX_PLATFORMS=proxy.
How do we get devices, as with jax.devices(), in Pathways?
Once the code is complete, is there an environment that allows us to run an e2e test (Tunix with SglJaxRollout) in Pathways?

Note: Multi-controller JAX: sglang-jax can currently run across different hosts, so we believe we can support running multi-controller Tunix.
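
As a rough illustration of the Engine change described above, the sketch below starts a backend loop either in a thread of the current process or in a child process. It is only one possible shape for the rewrite: `launch_backend`, its `single_process` flag, and the choice of a thread as the in-process mechanism are assumptions, not the actual sglang-jax Engine design.

```python
import multiprocessing
import threading
from typing import Callable, Union

Worker = Union[threading.Thread, multiprocessing.Process]


def launch_backend(target: Callable[[], None], single_process: bool) -> Worker:
    """Start `target` in-process (thread) or out-of-process (child process).

    Hypothetical helper: single_process=True keeps all work inside the
    calling process, which is what a single-process requirement asks for.
    """
    if single_process:
        worker: Worker = threading.Thread(target=target, daemon=True)
    else:
        # With the "spawn" start method, `target` must be picklable.
        worker = multiprocessing.Process(target=target, daemon=True)
    worker.start()
    return worker


# In-process usage: the backend loop shares memory with the caller.
events = []
backend = launch_backend(lambda: events.append("backend started"), single_process=True)
backend.join()
```

In the in-process path the backend can read and write the caller's objects directly, whereas the subprocess path needs explicit inter-process communication.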

cc @jimoosciuc @pathfinder-pf

pathfinder-pf · 2025-09-24T13:22:02Z

Can you explain why a single process is needed, as mentioned in "In order to comply with Pathways single process requirement, we need to run sglang backend in the main process"? As we all know, sglang runs with multiple processes. Can sglang run in a single process? @wang2yn84

aolemila · 2025-09-30T10:30:04Z

Following our conversation in Google Chat, here are the conclusions from the discussions that may influence the integration.

D1: How do we detect a Pathways environment?
C1: Try to import pathwaysutils. If the import succeeds, we are in Pathways; if it fails, we are not.

D2: How do we get devices, as with jax.devices(), in Pathways?
C2: Use jax.devices().

D3: What is the sampling seed used for?
Note: This is still under discussion; it may be needed for deterministic inference. We decided to support deterministic sampling first.
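
The conclusion above about detecting Pathways by importing pathwaysutils could be sketched as follows. The helper names `module_importable` and `in_pathways` are hypothetical; only the module name pathwaysutils comes from the discussion.

```python
import importlib


def module_importable(module_name: str) -> bool:
    """Return True if `module_name` can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


def in_pathways() -> bool:
    """Treat a successful import of pathwaysutils as 'running in Pathways'."""
    return module_importable("pathwaysutils")


if in_pathways():
    print("Pathways detected: keep the backend in the main process")
else:
    print("Pathways not detected")
```

Per the same conclusions, device discovery does not need a separate branch: jax.devices() is used in both cases.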

cc @jimoosciuc

aolemila linked an issue Sep 19, 2025 that may be closed by this pull request

[RFC] integrate with google/tunix #201

Open

wang2yn84 reviewed Sep 22, 2025

jimoosciuc closed this Sep 23, 2025

jimoosciuc reopened this Sep 23, 2025

aolemila commented Sep 24, 2025

wang2yn84 reviewed Sep 24, 2025

aolemila mentioned this pull request Sep 24, 2025

align sampling param ability according to rfc primatrix/sglang-jax#6

Merged

This was referenced Sep 25, 2025

add multinomial_with_seed for sampler and test_sampler.py primatrix/sglang-jax#12

Merged

[feature] integrate with tunix #214

Merged

aolemila and others added 12 commits September 30, 2025 19:13

add part content

26ffa96

update fields comparison for sampling

54637ea

add generate api defination

e40add1

add note

4b8dcb2

update sampling fields table

205d164

remove useless content

d481283

add output spec

300ff6d

add rollout docs

65870d2

update codes

87fa53c

fix

0b93bef

fix

8827ed2

adjust content order

5af1ec7

aolemila force-pushed the docs/integrate-with-tunix branch from f3c6eb2 to 545d112 Compare September 30, 2025 11:13

update rfc with latest conclusions

f167dfb

aolemila force-pushed the docs/integrate-with-tunix branch from 545d112 to f167dfb Compare September 30, 2025 11:18
