BitGPT: A 1-bit version of the GPT language model, inspired by Andrej Karpathy's tutorial on building a GPT from scratch.

BitGPT is an attempt to incorporate the best practices of building a language model while providing the user with as much accessibility and flexibility as possible. The 1-bit version is adapted from the paper The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, originally developed for the LLaMA model.
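At the heart of that paper is a drop-in replacement for nn.Linear that quantizes weights to the ternary set {-1, 0, +1} (hence "1.58 bits") and activations to 8 bits. The sketch below is a minimal illustration of that scheme using a straight-through estimator; it is not necessarily BitGPT's exact implementation, and the variable names are mine.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Minimal sketch of the 1.58-bit linear layer described in the paper."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Absmean weight quantization to the ternary set {-1, 0, +1}.
        scale_w = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale_w).round().clamp(-1, 1) * scale_w
        # Absmax activation quantization to the signed 8-bit range, per token.
        scale_x = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
        x_q = (x * 127.0 / scale_x).round().clamp(-128, 127) * scale_x / 127.0
        # Straight-through estimator: quantized values in the forward pass,
        # full-precision gradients in the backward pass.
        w_ste = w + (w_q - w).detach()
        x_ste = x + (x_q - x).detach()
        return F.linear(x_ste, w_ste, self.bias)

A layer like this can then stand in for nn.Linear throughout the decoder blocks, which is what distinguishes the bitgpt variant from the plain gpt one.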

Like Andrej's tutorial, the model currently contains only a minimal decoder-only architecture (rather than an entire transformer); however, I hope to add some missing elements (like RoPE) and other components and techniques that integrate the best-performing parts of current open-source models. The project is divided into organised categories for easy comparison and understanding. Each present (and, hopefully, any future) subdirectory will usually contain a model.py defining the architecture and a train.py script for training the model.
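As a rough illustration (based on the files referenced elsewhere in this README), the layout looks like:

data/
  shakespeare.txt
  astro.txt
gpt/
  model.py
  train.py
  generate.py
bitgpt/
  model.py
  train.py
  generate.py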

The training script will let you train anything from a model with as few as 50k parameters to one with over 1B parameters.
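That range comes almost entirely from the --n-layer, --n-embd and --n-head settings described under Training below. As a back-of-the-envelope estimate (a sketch of the usual decoder-only parameter count, not code from this repository):

def approx_params(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    # Each transformer block carries roughly 4*n_embd^2 attention weights
    # and 8*n_embd^2 MLP weights; embeddings add token and position tables.
    return n_layer * 12 * n_embd ** 2 + (vocab_size + block_size) * n_embd

# The default settings (n_layer=4, n_embd=384, block_size=256) with a
# character-level vocabulary of ~65 symbols give roughly 7M parameters.
print(approx_params(n_layer=4, n_embd=384, vocab_size=65, block_size=256))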

Training

Simply clone the repository

git clone https://github.com/bananya-ml/BitGPT

Install the required dependencies (remember to use a virtual environment!)

pip install -r ./requirements.txt

Remember that this downloads a CUDA-enabled version of PyTorch, which can take a while. If you don't have a CUDA-capable system, or don't wish to use CUDA for whatever reason, simply install PyTorch with

pip install torch

Once the installation is complete, from the root directory, you can run

python ./gpt/train.py

or

python ./bitgpt/train.py

to train and save a model with the default settings. You might want to play around with the hyperparameters listed below to balance training speed against the quality of your trained model.

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| --batch-size | Batch size for training | int | 64 |
| --block-size | Maximum context length for predictions | int | 256 |
| --max-iters | Number of training iterations | int | 500000 |
| --eval-iters | Number of batches used to estimate loss during eval | int | 200 |
| --eval-interval | Interval after which eval is performed | int | 2000 |
| --lr | Learning rate | float | 6e-4 |
| --n-head | Number of attention heads in the transformer architecture | int | 4 |
| --n-layer | Number of layers in the transformer architecture | int | 4 |
| --n-embd | Embedding dimension | int | 384 |
| --dropout, --d | Dropout probability | float | 0.2 |
| --weight-decay | Weight decay | float | 1e-1 |
| --decay-lr | Whether to decay the learning rate | bool | True |
| --warmup-iters | Number of warmup steps for the learning rate | int | 200 |
| --lr-decay-iters | Should be ~= max-iters, per Chinchilla | int | 500000 |
| --min-lr | Minimum learning rate; should be ~lr/10, per Chinchilla | float | 6e-5 |
| --wandb-log | Logging using wandb (you need to log in to wandb first) | bool | False |
| --seed | Random seed | int | 1337 |
| --verbose | 1 = recommended tunable parameters, 2 = all parameters | int | 0 |
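For example, a smaller, faster run might look like

python ./bitgpt/train.py --batch-size 32 --n-embd 256 --max-iters 5000

The --decay-lr, --warmup-iters, --lr-decay-iters and --min-lr flags together describe the usual warmup-then-cosine learning-rate schedule. A minimal sketch of what such a schedule computes (assuming the common nanoGPT-style recipe; not necessarily this train.py's exact code):

import math

def get_lr(it, lr=6e-4, min_lr=6e-5, warmup_iters=200, lr_decay_iters=500_000):
    # Linear warmup for the first warmup_iters steps.
    if it < warmup_iters:
        return lr * it / warmup_iters
    # Hold at the floor once the decay window is over.
    if it > lr_decay_iters:
        return min_lr
    # Cosine decay from lr down to min_lr in between.
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (lr - min_lr)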

Inference

Each directory containing a model.py will also contain a generate.py that can be used as

python ./gpt/generate.py

from the root directory of the project. The following arguments can be used with the generate.py file to tune the output:

| Name | Description | Type | Default |
| --- | --- | --- | --- |
| --prompt | Prompt that the generated text continues from | str | '' |
| --num-samples | Number of samples to generate | int | 2 |
| --max-new-tokens | Maximum number of new tokens to generate | int | 2000 |
| --temperature | 1.0 = no change, < 1.0 = less random, > 1.0 = more random predictions | float | 1.0 |
| --top-k | Retain only the top-k most likely tokens, clamp others to 0 probability | int | 200 |
  • NOTE: According to an FAQ released by Microsoft, BitLinear layers require a low-bit GEMM kernel during inference. The paper provides no particular kernel implementation, so an unofficial implementation of my own is used here. Until the authors of The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits release one, I will assume the kernel does not make a significant difference to the quality of inference.
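For example:

python ./gpt/generate.py --prompt "ROMEO:" --num-samples 1 --temperature 0.8

The --temperature and --top-k options apply the standard transforms to the model's output logits before sampling; a minimal sketch of that step (illustrative, not necessarily BitGPT's exact code):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: (batch, vocab_size) tensor for the next-token distribution.
    logits = logits / temperature  # < 1.0 sharpens, > 1.0 flattens
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        v, _ = torch.topk(logits, k)
        # Clamp everything below the k-th largest logit to -inf,
        # i.e. probability 0 after the softmax.
        logits = logits.masked_fill(logits < v[:, [-1]], -float("inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)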

Data

The data directory contains 2 files: shakespeare.txt, which contains 40,000 lines of William Shakespeare's writing, and astro.txt, which contains about 35,000 lines stripped from research papers on massive stars, machine learning and spectroscopy. Either file can be chosen as the training data, and any other text file can be placed in the directory and used as training material after changing the relevant part of the code to use the custom dataset.
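Since BPE encoding is still on the TODO list, tokenization follows the tutorial-style character-level scheme, so swapping in a custom dataset is mostly a matter of rebuilding the vocabulary from your file. A sketch of what that looks like (illustrative; the actual loading code lives in each train.py, and the variable names here are mine):

with open("./data/astro.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))  # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]             # string -> token ids
decode = lambda ids: "".join(itos[i] for i in ids)  # token ids -> string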

In the future, I will try to add support for more types of datasets (e.g. an instruction dataset) as I add greater functionality for using the trained model, e.g. as a chatbot.

TODO

  • Rotary Positional Embedding (RoPE)
  • BPE encoding for training and inference
  • Chat style inference

License

License: MIT

References

  • Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue and Furu Wei, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, https://doi.org/10.48550/arXiv.2402.17764

  • Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen and Yunfeng Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding, https://doi.org/10.48550/arXiv.2104.09864
