From e8008ab823c0d184c463b53b53dcdc8f50ba2d29 Mon Sep 17 00:00:00 2001
From: djliden <7102904+djliden@users.noreply.github.com>
Date: Fri, 8 Mar 2024 16:56:18 -0600
Subject: [PATCH] update with revised training size

---
 .../4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb b/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb
index fb2ae54..c868ee0 100644
--- a/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb
+++ b/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb
@@ -726,7 +726,7 @@
     "\n",
     "#### Split the dataset into training and validation\n",
     "\n",
-    "Here we also limit to a training subset of 5,000 examples. This is based on the [LIMIT](https://www.databricks.com/blog/limit-less-more-instruction-tuning) paper, which found that a small number of high-quality examples is sufficient for instruction-tuning. Note that, under ideal circumstances, we would choose more *domain-specific* examples with a variety of different formats. Given that we are not tailoring this fine-tuning job for a specific domain, we will just choose 5,000 random examples from the SlimOrca dataset."
+    "Here we also limit to a training subset of 10,000 examples. This is based on the [LIMIT](https://www.databricks.com/blog/limit-less-more-instruction-tuning) paper, which found that a small number of high-quality examples is sufficient for instruction-tuning. Under ideal circumstances, we would choose more *domain-specific* examples with a variety of different formats. Given that we are not tailoring this fine-tuning job for a specific domain, we will just choose 10,000 random examples from the SlimOrca dataset. We could almost certainly get by with fewer examples, especially if those examples were selected for quality and tailored to the specific tasks we want the model to succeed at."
    ]
  },
  {
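
For reference, the subsetting and split described in the revised cell can be done with the Hugging Face `datasets` library roughly as follows. This is a minimal sketch rather than code taken from the notebook: the dataset path ("Open-Orca/SlimOrca"), the seed, and the 90/10 split ratio are assumptions.

# Sketch (not from the notebook): sample 10,000 random SlimOrca examples and
# hold out a validation slice. Dataset path, seed, and split ratio are assumed.
from datasets import load_dataset

dataset = load_dataset("Open-Orca/SlimOrca", split="train")

# Shuffle and keep a 10,000-example subset, per the LIMIT-style rationale above.
subset = dataset.shuffle(seed=42).select(range(10_000))

# Split the subset into training and validation sets.
splits = subset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]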