
Commit

add: further explanation + research to back it up
hahuyhoang411 committed Mar 26, 2024
1 parent 62c3328 commit 620de3e
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions blog/2024-03-25-data-is-a-moat.md
@@ -39,6 +39,14 @@ Catastrophic forgetting can be visualized as a ball in a multidimensional landscape

Figure 2. [Gradient descent demonstration](https://en.wikipedia.org/wiki/Gradient_descent)
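
To make the ball-in-a-landscape analogy concrete, below is a toy gradient-descent sketch. It is purely illustrative: the 1-D quadratic losses, their minima, and the learning rate are invented for this example, not taken from the post. Descending only the new task's loss drags the weight away from the old minimum, so the loss on the original task grows, which is catastrophic forgetting in miniature.

```python
# Toy illustration of catastrophic forgetting under plain gradient descent.
# The quadratic "losses" below are invented for demonstration purposes.

def old_task_loss(w):
    return (w - 1.0) ** 2      # pretend the original data is best fit at w = 1

def new_task_loss(w):
    return (w + 2.0) ** 2      # pretend the new data is best fit at w = -2

w = 1.0                        # start at the old-task minimum
lr = 0.1
for _ in range(100):
    grad = 2 * (w + 2.0)       # gradient of the *new* task loss only
    w -= lr * grad             # standard gradient descent update

print(f"weight after fine-tuning on new data only: {w:.3f}")              # ~ -2.0
print(f"loss on the new task: {new_task_loss(w):.3f}")                    # ~ 0.0
print(f"loss on the original task has grown to: {old_task_loss(w):.3f}")  # ~ 9.0
```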

### Smoothing Distribution Shifts

The original dataset ensures a smoother distribution shift when new information is introduced, because it embodies a comprehensive spectrum of prior knowledge. This continuity keeps the model robust against sudden changes, providing a more gradual learning curve in which new information is incrementally integrated with the existing knowledge base. The idea is supported by [EleutherAI's research](https://arxiv.org/abs/2403.08763), which highlights how task sequencing shapes learning and suggests that introducing dissimilar tasks early on can expand the network's capacity for new information.
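
As a concrete illustration of that continuity, here is a minimal data-replay sketch, assuming the original fine-tuning examples and the new domain examples are available as plain Python lists. The function name `build_mixed_batches` and the 25% replay ratio are illustrative choices, not values from the post or the cited paper.

```python
import random

def build_mixed_batches(original_data, new_data, batch_size=8, replay_ratio=0.25):
    """Yield batches in which a fixed fraction is replayed from the original dataset."""
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    shuffled_new = random.sample(new_data, k=len(new_data))  # shuffled copy of the new data
    for start in range(0, len(shuffled_new), n_new):
        batch = shuffled_new[start:start + n_new]
        batch += random.sample(original_data, k=min(n_replay, len(original_data)))
        random.shuffle(batch)
        yield batch  # feed each mixed batch to the usual fine-tuning step
```

Because every update still sees samples drawn from the prior distribution, the shift toward the new data happens gradually rather than all at once.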

### Acting as a Noise Mask

The original data can also serve as a form of noise masking, similar to techniques used in training [early computer vision models](https://arxiv.org/abs/1911.04252). Keeping it in the mix introduces a level of variability, or ["noise"](https://arxiv.org/abs/2310.05914), during training, which helps prevent the model from overfitting to the new dataset. By retaining a blend of old and new data, the model is exposed to a broader range of scenarios, improving its generalization and robustness across tasks (a rough sketch of the cited embedding-noise technique follows below).
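
For the "noise" side of this point, here is a rough sketch of the NEFTune-style embedding noise cited above, written from the paper's description; the default `alpha` and the standalone helper are illustrative assumptions, not an official API.

```python
import torch

def add_neftune_style_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise scaled by alpha / sqrt(seq_len * dim) to token embeddings.

    embeddings: (batch, seq_len, dim); apply only during training, never at inference.
    """
    _, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-scale, scale)  # Uniform(-scale, scale)
    return embeddings + noise
```

During training this perturbation plays a regularizing role similar to the old-data mix described above; at inference time it is simply skipped.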

## Viable Solutions

Overcoming these challenges requires a balanced approach. One method involves inundating the model with extensive, high-quality data, allowing for comprehensive fine-tuning. While effective, this demands significant computational resources and carries the cost of gathering millions of top-rated GPT-4 and human responses. Examples include [OpenChat](https://huggingface.co/openchat/openchat-3.5-0106) and [OpenHermes](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), which demonstrate the trade-offs between data quantity, quality, and computational demands.
@@ -49,5 +57,7 @@ The ownership and strategic use of original data serve as an invisible moat. It

## Reference
- [Catastrophic forgetting](https://arxiv.org/abs/2308.08747)
- [Simple and Scalable Strategies to Continually Pre-train Large Language Models](https://arxiv.org/abs/2403.08763)
- [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)
- [NEFTune: Noisy Embeddings Improve Instruction Finetuning](https://arxiv.org/abs/2310.05914)
- [Self-training with Noisy Student improves ImageNet classification](https://arxiv.org/abs/1911.04252)
