Onboard DeepSeek MoE with shared experts #1242

Open · wants to merge 3 commits into base: main

Conversation

RissyRan (Collaborator) commented Feb 6, 2025

Description

Onboard DeepSeek MoE with shared experts (functional support first; referenced from the DeepSeek implementation):

  • Refactor models.py to handle mixed layers, i.e. both dense and MoE layers
  • Add a DeepSeek v3 config and deepseek.py as the decoder layer
  • Add DeepSeekMoeBlock, reusing the dense and MoE blocks, so either token dropping or dropless routing can be used for future tuning (see the toy sketch after this list)
  • Add a compile test
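
A minimal, self-contained sketch of the block's idea (not the PR's actual DeepSeekMoeBlock): a shared expert runs on every token, while routed experts are combined with renormalized, scaled top-k weights. The class name, the sigmoid router, the single-Dense "experts", and all sizes below are illustrative assumptions, and the dense (dropless-style) combine is for clarity only.

import jax
import jax.numpy as jnp
import flax.linen as nn

class ToyDeepSeekMoeBlock(nn.Module):
  """Toy sketch: one shared expert plus top-k routed experts."""
  num_experts: int = 16
  num_experts_per_tok: int = 2
  hidden_dim: int = 64
  routed_scaling_factor: float = 1.0

  @nn.compact
  def __call__(self, x):  # x: [batch, seq, hidden_dim]
    # Shared expert: applied to every token, bypasses routing entirely.
    shared_out = nn.Dense(self.hidden_dim, name="shared_expert")(x)

    # Router: score all experts, keep the top-k per token.
    gate_logits = nn.Dense(self.num_experts, name="router")(x)
    scores = jax.nn.sigmoid(gate_logits.astype(jnp.float32))
    weights, selected = jax.lax.top_k(scores, self.num_experts_per_tok)
    # DeepSeek-style post-processing: renormalize the kept weights, then rescale.
    weights /= weights.sum(-1, keepdims=True)
    weights *= self.routed_scaling_factor

    # Dropless-style combine: evaluate every expert densely for clarity;
    # a real kernel would dispatch only the selected tokens to each expert.
    expert_outs = jnp.stack(
        [nn.Dense(self.hidden_dim, name=f"expert_{i}")(x) for i in range(self.num_experts)],
        axis=-2,
    )  # [batch, seq, num_experts, hidden_dim]
    one_hot = jax.nn.one_hot(selected, self.num_experts)       # [b, s, k, e]
    combine = jnp.einsum("bsk,bske->bse", weights, one_hot)    # [b, s, e]
    routed_out = jnp.einsum("bse,bseh->bsh", combine.astype(x.dtype), expert_outs)

    return shared_out + routed_out

# Example usage on random activations.
block = ToyDeepSeekMoeBlock()
x = jnp.ones((2, 8, 64))
params = block.init(jax.random.PRNGKey(0), x)
y = block.apply(params, x)  # y.shape == (2, 8, 64)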

Tests

Test config: base_num_decoder_layers: 5, num_experts: 16

  • Small config - Functional tests for scan_layers=True: link
  • Small config - Functional tests for scan_layers=False: link
  • One profile: matmul shapes on both dense and MoE layers LGTM

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

gagika (Collaborator) left a comment:

Only a few minor comments, thanks!



class DeepSeekMoELayer(nn.Module):
  """DeepSeek-style MoE layer."""
gagika (Collaborator):
Can you comment on the main differences between DeepSeekMoELayer and the regular MoELayer?

@@ -371,7 +378,11 @@ def permute(self, inputs, gate_logits):
    inputs_shape = inputs.shape
    inputs_2d = jnp.reshape(inputs, (inputs_shape[0] * inputs_shape[1], inputs_shape[2]))
    weights, selected_experts = jax.lax.top_k(gate_logits, self.num_experts_per_tok)
    weights = jax.nn.softmax(weights.astype(jnp.float32), axis=-1).astype(self.dtype)
    if self.config.decoder_block == "deepseek":
gagika (Collaborator):
Can you add a comment on how/why it's different for DeepSeek?

Also, perhaps you could move it to a function, since it's used in two places, e.g.:

def _deepseek_scale_weights(self, weights):
  """Scales weights according to DeepSeek's ... ."""
  weights /= weights.sum(-1, keepdims=True)
  weights *= self.config.routed_scaling_factor
  return weights
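
For example, both call sites could then reduce to the following (a sketch, assuming the helper is added to the same MoE module as permute):

if self.config.decoder_block == "deepseek":
  weights = self._deepseek_scale_weights(weights)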
