From 3e2d21cd6126ce1915448b7f10136892121b2a05 Mon Sep 17 00:00:00 2001
From: Fardin <60337534+FardinHash@users.noreply.github.com>
Date: Thu, 29 Sep 2022 02:22:17 +0600
Subject: [PATCH] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index da8f792a0..0b633fdee 100644
--- a/README.md
+++ b/README.md
@@ -286,14 +286,16 @@ We then train a large model (12-layer to 24-layer Transformer) on a large
 corpus (Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long
 time (1M update steps), and that's BERT.
 
-Using BERT has two stages: *Pre-training* and *fine-tuning*.
+Using BERT has two stages: **Pre-training** and **fine-tuning**.
 
+## Pre-training
 **Pre-training** is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a
 one-time procedure for each language (current models are English-only, but
 multilingual models will be released in the near future). We are releasing a
 number of pre-trained models from the paper which were pre-trained at Google.
 Most NLP researchers will never need to pre-train their own model from scratch.
 
+## Fine-tuning
 **Fine-tuning** is inexpensive. All of the results in the paper can be
 replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU,
 starting from the exact same pre-trained model. SQuAD, for example, can be