Sparse MoE code reading #28
Conversation
from torch.distributions.normal import Normal
from mlp import MLP
import numpy as np

class SparseDispatcher(object):
A helper class for dispatch, which hands an input mini-batch to each expert, and combine, which gathers each expert's outputs back into a single tensor.
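A minimal usage sketch, following the example in the class docstring of the repo; the input, gate values, and expert modules below are made up for illustration:

```python
import torch
import torch.nn as nn

# hypothetical setup for illustration
batch_size, input_size, output_size, num_experts = 4, 8, 8, 3
x = torch.randn(batch_size, input_size)
experts = nn.ModuleList(nn.Linear(input_size, output_size) for _ in range(num_experts))

# gates[b, e] != 0 means batch element b is routed to expert e
gates = torch.tensor([[0.7, 0.3, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 0.2, 0.8]])

dispatcher = SparseDispatcher(num_experts, gates)   # the class defined in this file
expert_inputs = dispatcher.dispatch(x)              # one mini-batch per expert
expert_outputs = [experts[e](expert_inputs[e]) for e in range(num_experts)]
y = dispatcher.combine(expert_outputs)              # back to [batch_size, output_size]
```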
combine - take output Tensors from each expert and form a combined output
Tensor. Outputs from different experts for the same batch element are
summed together, weighted by the provided "gates".
combine takes the weighted sum of each expert's output, weighted by its gate value.
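A toy example of that weighted sum, with made-up expert outputs and gate values:

```python
import torch

# outputs of two experts for the same 2-element batch
out_e0 = torch.tensor([[1.0, 1.0], [2.0, 2.0]])
out_e1 = torch.tensor([[3.0, 3.0], [4.0, 4.0]])
gates  = torch.tensor([[0.25, 0.75],
                       [1.00, 0.00]])

# combine == per-element sum over experts, weighted by the gates
combined = gates[:, 0:1] * out_e0 + gates[:, 1:2] * out_e1
# row 0: 0.25*[1,1] + 0.75*[3,3] = [2.5, 2.5]
# row 1: 1.00*[2,2] + 0.00*[4,4] = [2.0, 2.0]
```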
summed together, weighted by the provided "gates".
The class is initialized with a "gates" Tensor, which specifies which
batch elements go to which experts, and the weights to use when combining
the outputs. Batch element b is sent to expert e iff gates[b, e] != 0.
gates works like a one-hot routing table: gates[b, e] is nonzero if batch element b goes to expert e, and 0 otherwise. (With k > 1, the nonzero entries are the softmax gate weights rather than exactly 1.)
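A small made-up example of how the routing condition reads off the gates tensor:

```python
import torch

gates = torch.tensor([[0.0, 0.9, 0.1],
                      [1.0, 0.0, 0.0]])
# batch element b is sent to expert e iff gates[b, e] != 0
for b in range(gates.shape[0]):
    print(b, torch.nonzero(gates[b]).flatten().tolist())
# 0 [1, 2]   -> element 0 goes to experts 1 and 2 (weights 0.9 and 0.1)
# 1 [0]      -> element 1 goes only to expert 0
```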
self._gates = gates
self._num_experts = num_experts
# sort experts
sorted_experts, index_sorted_experts = torch.nonzero(gates).sort(0)
nonzero -> take the indices of the tensor's nonzero entries -> sort them in ascending order.
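A small demo of the nonzero + sort step, using a made-up k=1 gates tensor (one expert per element, so the column-wise sort result is deterministic):

```python
import torch

gates = torch.tensor([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])
nz = torch.nonzero(gates)   # (batch, expert) pairs: [[0, 1], [1, 0], [2, 2]]
sorted_experts, index_sorted_experts = nz.sort(0)
# sort(0) sorts each column independently and also returns the argsort;
# column 1 of index_sorted_experts reorders the pairs by expert id
print(sorted_experts)         # tensor([[0, 0], [1, 1], [2, 2]])
print(index_sorted_experts)   # tensor([[0, 1], [1, 0], [2, 2]])
```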
# sort experts
sorted_experts, index_sorted_experts = torch.nonzero(gates).sort(0)
# drop indices
_, self._expert_index = sorted_experts.split(1, dim=1)
Split along dim=1 into chunks of size 1, drop the first column, and store the second as self._expert_index.
https://pytorch.org/docs/stable/generated/torch.split.html
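A standalone demo of that split (the sorted_experts values are made up):

```python
import torch

sorted_experts = torch.tensor([[0, 0], [0, 1], [1, 2]])
# split(1, dim=1) cuts the [n, 2] tensor into two [n, 1] columns
_, expert_index = sorted_experts.split(1, dim=1)
print(expert_index)   # tensor([[0], [1], [2]]) -- the expert id per routed example
```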
threshold_positions_if_in = torch.arange(batch).to(self.device) * m + self.k
threshold_if_in = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_in), 1)
is_in = torch.gt(noisy_values, threshold_if_in)
greater than. https://pytorch.org/docs/stable/generated/torch.gt.html
Element-wise: true where noisy_values > threshold_if_in.
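A quick demo of torch.gt with broadcasting, as used here (values made up):

```python
import torch

noisy_values    = torch.tensor([[0.2, 1.5],
                                [0.9, 0.1]])
threshold_if_in = torch.tensor([[1.0],
                                [0.5]])   # one threshold per row, broadcast across columns
is_in = torch.gt(noisy_values, threshold_if_in)   # element-wise '>'
print(is_in)
# tensor([[False,  True],
#         [ True, False]])
```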
noisy_top_values: a `Tensor` of shape [batch, m].
"values" Output of tf.top_k(noisy_top_values, m). m >= k+1
The top m noisy expert values per example (a [batch, m] tensor), with m >= k+1.
and shapes `[expert_batch_size_i]`
"""
# split nonzero gates for each expert
return torch.split(self._nonzero_gates, self._part_sizes, dim=0)
Splits the _nonzero_gates into one chunk per expert, sized by the number of batch elements routed to that expert.
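A sketch of what that split does, with a hand-written gates tensor; the ordering of nonzero_gates (grouped by expert id, as the dispatcher's sort arranges it) is written out by hand here:

```python
import torch

gates = torch.tensor([[0.0, 0.9, 0.1],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.5, 0.5]])
# number of routed examples per expert = nonzero gate count per column
part_sizes = (gates > 0).sum(0).tolist()   # [1, 2, 2]

# nonzero gate values, grouped by expert id (hand-ordered for illustration)
nonzero_gates = torch.tensor([1.0, 0.9, 0.5, 0.1, 0.5])
per_expert = torch.split(nonzero_gates, part_sizes, dim=0)
print([t.tolist() for t in per_expert])    # [[1.0], [0.9, 0.5], [0.1, 0.5]]
```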
"""The squared coefficient of variation of a sample. | ||
Useful as a loss to encourage a positive distribution to be more uniform. |
Computes the (squared) coefficient of variation of the tensor.
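A minimal reimplementation of the formula named in the docstring, Var(x) / Mean(x)^2; the repo's version may differ in details such as the eps value and device handling:

```python
import torch

def cv_squared(x, eps=1e-10):
    """Squared coefficient of variation: Var(x) / Mean(x)^2.

    0 for a perfectly uniform positive vector, larger the more skewed it is,
    so minimizing it as a loss pushes per-expert importance/load toward uniform.
    """
    x = x.float()
    if x.numel() <= 1:            # variance of a single value is undefined
        return torch.zeros((), dtype=x.dtype)
    return x.var() / (x.mean() ** 2 + eps)

print(cv_squared(torch.tensor([1.0, 1.0, 1.0])))   # tensor(0.)  -> uniform
print(cv_squared(torch.tensor([3.0, 0.0, 0.0])))   # tensor(3.)  -> very uneven
```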
def _gates_to_load(self, gates):
"""Compute the true load per expert, given the gates.
The load is the number of examples for which the corresponding gate is >0.
Given the gates, computes the true load per expert ("true" apparently meaning not a noisy estimate).
Load is defined as the number of examples with gate > 0.
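The count itself is a one-liner; a small made-up example:

```python
import torch

gates = torch.tensor([[0.0, 0.9, 0.1],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.5, 0.5]])
# load per expert: how many examples have a nonzero gate for that expert
load = (gates > 0).sum(0)
print(load)   # tensor([1, 2, 2])
```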
loss = self.cv_squared(importance) + self.cv_squared(load)
loss *= loss_coef
Adds the loss on importance and the loss on load.
def noisy_top_k_gating(self, x, train, noise_epsilon=1e-2):
"""Noisy top-k gating.
See paper: https://arxiv.org/abs/1701.06538.
Args:
x: input Tensor with shape [batch_size, input_size]
train: a boolean - we only add noise at training time.
noise_epsilon: a float
Returns:
gates: a Tensor with shape [batch_size, num_experts]
load: a Tensor with shape [num_experts]
"""
The part that makes the top-k gating noisy.
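A condensed sketch of the noise step from the paper (standalone tensors stand in for the module's self.w_gate / self.w_noise parameters; shapes are made up):

```python
import torch
import torch.nn.functional as F

# hypothetical shapes for illustration
batch_size, input_size, num_experts = 4, 8, 3
noise_epsilon = 1e-2
x = torch.randn(batch_size, input_size)
w_gate  = torch.randn(input_size, num_experts)   # stands in for self.w_gate
w_noise = torch.randn(input_size, num_experts)   # stands in for self.w_noise
train = True

clean_logits = x @ w_gate
if train:
    # learned, input-dependent noise scale; softplus keeps it positive
    noise_stddev = F.softplus(x @ w_noise) + noise_epsilon
    logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev
else:
    logits = clean_logits    # no noise at eval time
```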
top_k_gates = self.softmax(top_k_logits)

zeros = torch.zeros_like(logits, requires_grad=True).to(self.device)
gates = zeros.scatter(1, top_k_indices, top_k_gates).to(self.device)
Paper
https://arxiv.org/abs/1701.06538
Paper notes: Notion
Implementation
https://github.com/davidmrau/mixture-of-experts
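As a footnote on the scatter step quoted above, a toy example with made-up top-k indices and gate values:

```python
import torch

top_k_indices = torch.tensor([[1, 3],
                              [0, 2]])
top_k_gates = torch.tensor([[0.6, 0.4],
                            [0.7, 0.3]])   # softmax over the top-k logits
zeros = torch.zeros(2, 4)                  # [batch_size, num_experts]
# scatter writes each top-k gate back into its expert's column,
# leaving the non-selected experts at exactly zero
gates = zeros.scatter(1, top_k_indices, top_k_gates)
print(gates)
# tensor([[0.0000, 0.6000, 0.0000, 0.4000],
#         [0.7000, 0.0000, 0.3000, 0.0000]])
```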