
Commit

add: further explanation + research to back it up
hahuyhoang411 committed Mar 26, 2024
1 parent 62c3328 commit 620de3e
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions blog/2024-03-25-data-is-a-moat.md
@@ -39,6 +39,14 @@ Catastrophic forgetting can be visualized as a ball in a multidimensional landscape

Figure 2. [Gradient descent demonstration](https://en.wikipedia.org/wiki/Gradient_descent)
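
To make the ball-in-a-landscape analogy concrete, below is a toy gradient-descent sketch. It is purely illustrative: the 1-D quadratic losses, their minima, and the learning rate are invented for this example, not taken from the post. Descending only the new task's loss drags the weight away from the old minimum, so the loss on the original task grows, which is catastrophic forgetting in miniature.

```python
# Toy illustration of catastrophic forgetting under plain gradient descent.
# The quadratic "losses" below are invented for demonstration purposes.

def old_task_loss(w):
    return (w - 1.0) ** 2      # pretend the original data is best fit at w = 1

def new_task_loss(w):
    return (w + 2.0) ** 2      # pretend the new data is best fit at w = -2

w = 1.0                        # start at the old-task minimum
lr = 0.1
for _ in range(100):
    grad = 2 * (w + 2.0)       # gradient of the *new* task loss only
    w -= lr * grad             # standard gradient descent update

print(f"weight after fine-tuning on new data only: {w:.3f}")              # ~ -2.0
print(f"loss on the new task: {new_task_loss(w):.3f}")                    # ~ 0.0
print(f"loss on the original task has grown to: {old_task_loss(w):.3f}")  # ~ 9.0
```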

### Smoothing Distribution Shifts

The original dataset ensures a smoother distribution shift when new information is introduced, because it embodies a comprehensive spectrum of prior knowledge. This continuity keeps the model robust against sudden changes, providing a more gradual learning curve in which new information is incrementally integrated with the existing knowledge base. The idea is supported by [EleutherAI's research](https://arxiv.org/abs/2403.08763), which highlights how task sequencing shapes learning and suggests that introducing dissimilar tasks early on can expand the network's capacity for new information.
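
As a concrete illustration of that continuity, here is a minimal data-replay sketch, assuming the original fine-tuning examples and the new domain examples are available as plain Python lists. The function name `build_mixed_batches` and the 25% replay ratio are illustrative choices, not values from the post or the cited paper.

```python
import random

def build_mixed_batches(original_data, new_data, batch_size=8, replay_ratio=0.25):
    """Yield batches in which a fixed fraction is replayed from the original dataset."""
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    shuffled_new = random.sample(new_data, k=len(new_data))  # shuffled copy of the new data
    for start in range(0, len(shuffled_new), n_new):
        batch = shuffled_new[start:start + n_new]
        batch += random.sample(original_data, k=min(n_replay, len(original_data)))
        random.shuffle(batch)
        yield batch  # feed each mixed batch to the usual fine-tuning step
```

Because every update still sees samples drawn from the prior distribution, the shift toward the new data happens gradually rather than all at once.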

### Acting as a Noise Mask

The original data can also serve as a form of noise masking, similar to techniques used in training [early computer vision models](https://arxiv.org/abs/1911.04252). Keeping it in the mix introduces a level of variability, or ["noise"](https://arxiv.org/abs/2310.05914), during training, which helps prevent the model from overfitting to the new dataset. By retaining a blend of old and new data, the model is exposed to a broader range of scenarios, improving its generalization and robustness across tasks (a rough sketch of the cited embedding-noise technique follows below).
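
For the "noise" side of this point, here is a rough sketch of the NEFTune-style embedding noise cited above, written from the paper's description; the default `alpha` and the standalone helper are illustrative assumptions, not an official API.

```python
import torch

def add_neftune_style_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise scaled by alpha / sqrt(seq_len * dim) to token embeddings.

    embeddings: (batch, seq_len, dim); apply only during training, never at inference.
    """
    _, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-scale, scale)  # Uniform(-scale, scale)
    return embeddings + noise
```

During training this perturbation plays a regularizing role similar to the old-data mix described above; at inference time it is simply skipped.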

## Viable Solutions

Overcoming these challenges requires a balanced approach. One method involves inundating the model with extensive, high-quality data, allowing for comprehensive fine-tuning. While effective, this demands significant computational resources and carries the cost of gathering millions of top-rated GPT-4 and human responses. Examples include [OpenChat](https://huggingface.co/openchat/openchat-3.5-0106) and [OpenHermes](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), which demonstrate the trade-offs between data quantity, quality, and computational demands.
@@ -49,5 +57,7 @@ The ownership and strategic use of original data serve as an invisible moat. It

## Reference
- [Catastrophic forgetting](https://arxiv.org/abs/2308.08747)
- [Simple and Scalable Strategies to Continually Pre-train Large Language Models](https://arxiv.org/abs/2403.08763)
- [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)
- [NEFTune: Noisy Embeddings Improve Instruction Finetuning](https://arxiv.org/abs/2310.05914)
- [Self-training with Noisy Student improves ImageNet classification](https://arxiv.org/abs/1911.04252)
