update with revised training size

djliden · Mar 8, 2024 · e8008ab
1 parent 42d53d2
commit e8008ab
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb b/notebooks/4_olmo_1b_instruction_tune/4_olmo_instruction_tune.ipynb
@@ -726,7 +726,7 @@
     "\n",
     "#### Split the dataset into training and validation\n",
     "\n",
-    "Here we also limit to a training subset of 5,000 examples. This is based on the [LIMIT](https://www.databricks.com/blog/limit-less-more-instruction-tuning) paper, which found that a small number of high-quality examples is sufficient for instruction-tuning. Note that, under ideal circumstances, we would choose more *domain-specific* examples with a variety of different formats. Given that we are not tailoring this fine-tuning job for a specific domain, we will just choose 5,000 random examples from the SlimOrca dataset."
+    "Here we also limit to a training subset of 10,000 examples. This is based on the [LIMIT](https://www.databricks.com/blog/limit-less-more-instruction-tuning) paper, which found that a small number of high-quality examples is sufficient for instruction-tuning. Under ideal circumstances, we would choose more *domain-specific* examples with a variety of different formats. Given that we are not tailoring this fine-tuning job for a specific domain, we will just choose 10,000 random examples from the SlimOrca dataset. We could almost certainly get by with fewer examples, especially if those examples were selected for quality and tailored to the specific tasks we want the model to succeed at."
    ]
   },
   {
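
For reference, the subsetting described in the changed cell could look roughly like the sketch below. The notebook's actual split code is not part of this hunk, so everything here is an assumption for illustration: the helper name `subsample_and_split`, the 10% validation fraction, the seed, and the toy list standing in for SlimOrca are all hypothetical. It uses only the standard library so that it runs without downloading the dataset.

```python
import random


def subsample_and_split(examples, train_size=10_000, val_fraction=0.1, seed=42):
    """Shuffle once, hold out a validation slice, then cap the training set.

    Hypothetical helper: the name, default fraction, and seed are illustrative
    and are not taken from the notebook.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)

    n_val = int(len(indices) * val_fraction)
    val_idx = indices[:n_val]
    # Training examples come from the remainder, so the two sets never overlap.
    train_idx = indices[n_val:n_val + train_size]

    train = [examples[i] for i in train_idx]
    val = [examples[i] for i in val_idx]
    return train, val


# Toy stand-in for the dataset (assumption: any indexable sequence of records).
toy_examples = [{"id": i} for i in range(50_000)]
train, val = subsample_and_split(toy_examples)
print(len(train), len(val))
```

Drawing both sets from a single shuffled index list keeps the random 10,000-example training subset disjoint from the validation slice. Lowering `train_size` is the only change needed to try the smaller subsets the revised text says would likely suffice.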