🤗 Bamba on Hugging Face | Bamba Blog
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Besides PyTorch, you will need a few extra dependencies for Mamba models.
We found that some of these dependencies are picky about PyTorch versions when installed with pip, so the safest approach is to build all the Mamba dependencies from source if you hit dependency issues in your environment:
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d && pip install . && cd ..
git clone https://github.com/state-spaces/mamba.git
cd mamba && pip install . && cd ..
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install . && cd ..
To use the HF versions of the model, you will need to install the latest transformers, which includes the newly merged implementation of the Bamba models:
pip install git+https://github.com/huggingface/transformers.git
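Once transformers is installed, you can also run the checkpoint directly from Python. A minimal sketch (the model id and prompt mirror the CLI example below; the dtype and device placement are assumptions, so adjust them for your setup):

```python
# Minimal generation sketch with the HF checkpoint; bf16 and device_map="auto"
# (which requires accelerate) are assumptions, not requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-fms/Bamba-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The largest living mammal on Earth is ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```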
Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings |
---|---|---|---|---|---|---|---|---|
Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | False |
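If you want to cross-check these numbers against a checkpoint programmatically, you can inspect its config. A small sketch, assuming the usual Hugging Face config field names:

```python
# Sanity-check the architecture table against the checkpoint's config.
# The field names below follow common Hugging Face conventions and are assumptions here.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-fms/Bamba-9B")
print(config.num_hidden_layers)     # expected: 32
print(config.hidden_size)           # expected: 4096
print(config.num_attention_heads)   # expected: 32
print(config.num_key_value_heads)   # expected: 8 (GQA)
print(config.tie_word_embeddings)   # expected: False
```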
You can find links to our model checkpoints here: Bamba Models
You can use the following command to perform text generation using one of our checkpoints provided above:
python text_generation.py --model_path ibm-fms/Bamba-9B --tokenizer_path ibm-fms/Bamba-9B --prompt "The largest living mammal on Earth is " --max_new_tokens 128
Details on training can be found here.
| Category | Benchmark | Bamba 9B (2.2T) |
|---|---|---|
| General | MMLU (5-shot) | 60.77 |
| | ARC-C (25-shot) | 63.23 |
| | GSM8K (5-shot) | 36.77 |
| | Hellaswag (10-shot) | 81.8 |
| | OpenbookQA (5-shot) | 47.6 |
| | Piqa (5-shot) | 82.26 |
| | TruthfulQA (0-shot) | 49.21 |
| | Winogrande (5-shot) | 76.87 |
| HF OpenLLM-V2* | MMLU-PRO (5-shot) | 17.53 |
| | BBH (3-shot) | 17.4 |
| | GPQA (0-shot) | 4.14 |
| | IFEval (0-shot) | 15.16 |
| | MATH Lvl 5 (4-shot) | 1.66 |
| | MuSR (0-shot) | 9.59 |
| Safety Tasks | PopQA (5-shot) | 20.5 |
| | Toxigen (5-shot) | 57.4 |
| | BBQ (5-shot) | 44.2 |
| | Crows-pairs english (5-shot) | 70.78 |
*For the v2 leaderboard results, we perform normalization and report the normalized results.
Further details on our evaluation and normalization, along with run and analysis scripts, can be found here.
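For reference, leaderboard-style normalization rescales each raw score so that the task's random-guessing baseline maps to 0 and a perfect score maps to 100; the exact per-task baselines are in the linked scripts. A small illustrative sketch:

```python
def normalize_score(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Map raw_score (in %) so that random_baseline -> 0 and max_score -> 100, clipping at 0."""
    normalized = (raw_score - random_baseline) / (max_score - random_baseline) * 100.0
    return max(0.0, normalized)

# e.g. a 4-choice task has a 25% random-guessing baseline
print(normalize_score(43.75, random_baseline=25.0))  # -> 25.0
```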
This example shows how to fine-tune the Bamba model for a specific task using the SFT Trainer.
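The full recipe is in the linked example; the bare-bones shape of it with TRL's SFTTrainer looks roughly like the sketch below (the dataset, output directory, and default hyperparameters here are placeholders, assuming a recent TRL release):

```python
# Bare-bones SFT sketch with TRL; the dataset and output_dir are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model="ibm-fms/Bamba-9B",
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="Bamba-9B-sft"),
)
trainer.train()
```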
We can create an FP8-quantized model using `fms-model-optimizer`, which makes storage and inference even more efficient.
python -m fms_mo.run_quant \
--model_name_or_path <"path_to_original_model"> \
--quant_method fp8 \
--torch_dtype bfloat16 \
--output_dir <"path_to_save_new_model">
Model size comparison before and after FP8:
| | original | quantized |
|---|---|---|
| memory (total) | 39.12 GB | 10.83 GB |
| memory (break-down) | `torch.float32` 39.12 GB | `torch.bfloat16` 2.10 GB<br>`torch.float8_e4m3fn` 8.73 GB |
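These numbers are consistent with the 9.78B parameter count; a quick sanity check of the arithmetic:

```python
# Sanity check of the memory table against the 9.78B parameter count (decimal GB).
params_total = 9.78e9
print(params_total * 4 / 1e9)        # float32 at 4 bytes/param -> ~39.12 GB

params_bf16 = 2.10e9 / 2             # bfloat16 at 2 bytes/param -> ~1.05B params kept in bf16
params_fp8 = 8.73e9 / 1              # float8_e4m3fn at 1 byte/param -> ~8.73B params
print(params_bf16 + params_fp8)      # ~9.78e9 params, matching the total
print(2.10 + 8.73)                   # ~10.83 GB after FP8 quantization
```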
More details about `fms-model-optimizer` can be found here.
There is preliminary work to enable running Bamba architecture models using llama.cpp. This is work-in-progress, so it should only be used as a guide for the adventurous!
- Currently, inference is only supported on CPUs
- Models quantized with `llama-quantize` exhibit bad performance
To enable Bamba support, you'll need to build from source using Gabe's fork.
git clone --branch BambaArchitecture [email protected]:gabe-l-hart/llama.cpp.git
cd llama.cpp
mkdir build
cd build
# NOTE: To build with debug symbols and extra logging, use CMAKE_BUILD_TYPE=Debug
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
You can use a pre-converted GGUF file from Hugging Face (e.g. bamba-9b.gguf). If one doesn't exist, you can use the convert_hf_to_gguf.py script from Gabe's fork to perform the conversion manually.
# Install the python dependencies
cd /path/to/llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt
# Perform the conversion
./convert_hf_to_gguf.py /path/to/bamba-model --outfile /path/to/bamba-model/bamba-model.gguf
# Run the model with no layers on the GPU (CPU-only)
cd /path/to/llama.cpp
./build/bin/llama-cli -ngl 0 -m /path/to/bamba-model/bamba-model.gguf -p "Tell me a story about a developer and their dog"
You can (optionally) quantize the GGUF model using llama.cpp's built-in quantization tool, `llama-quantize`.
# Run the quantization (see llama-quantize --help for all quant types)
cd /path/to/llama.cpp
./build/bin/llama-quantize /path/to/bamba-model/bamba-model.gguf Q4_K_M
- Data collection and curation: We acknowledge and thank the AllenAI team for making the high-quality open-source dataset Dolma available, as well as the Hugging Face data team for making FineWeb-edu and Cosmopedia available. These are tremendous contributions that enabled us to create the model.
- Data preprocessing: We thank IBM's internal data preprocessing team, specifically Tuan Hoang Trong, Syed Zawad, Jay Gala, and Ryan Gordon for helping tokenize the data at scale. The code for tokenization is available here.
- Model architecture: The model architecture design was jointly done by Princeton, CMU, IBM, and UIUC and involved the following folks: Tri Dao (Princeton), Albert Gu (CMU), Linsong Chu (IBM), Davis Wertheimer (IBM), Minjia Zhang (UIUC), Mudhakar Srivatsa (IBM), and Raghu Ganti (IBM).
- Model training: Model training was performed primarily by the IBM team using the Mamba2 kernels and layer implementation from Tri Dao and Albert Gu. The following folks from IBM were primarily involved: Linsong Chu, Divya Kumari, Davis Wertheimer, Raghu Ganti, and Dakshi Agrawal.
- Model tuning: Tuning of the model was enabled and verified in TRL by the IBM team, involving Sukriti Sharma and Anh Uong.
- Model inference: Model inference in `transformers`, `vLLM`, and `llama.cpp` builds on the kernels written by Princeton and CMU. The IBM team is working with the community to enable it in various ecosystems. The team includes Fabian Lim, Antoni viros i Martin, Adnan Hoque, Jamie Yang, Nelson Nimura Gonzalez, Joshua Rosenkranz, Nick Hill, and Gabe Goodhart.
- Quantization: Quantization is led by the IBM team, specifically Naigang Wang and Charlie Liu.
- Evaluations: Evaluations are led by a team in IBM with long context evaluations being performed by UIUC, involving the following folks: Yotam Perlitz, Ofir Arviv, Michal Shmueli-Scheuer (IBM), Haoechen Shen, and Minjia Zhang (UIUC).
Finally, we would like to thank our leadership for their support in this effort - Priya Nagpurkar, David Cox, Sriram Raghavan, Aya Soffer, Ruchir Puri, and Mukesh Khare.
We would also like to thank the community, in particular Pablo Montalvo-Leroux, Aritra Roy Gosthipaty, and Vaibhav Srivastav from Hugging Face and Stas Bekman from Contextual AI who provided valuable feedback to this blog and the PRs into transformers. Further, we would like to thank Tyler Michael Smith from Neural Magic, who is shepherding the integration with vLLM.
A huge shoutout to the Meta PyTorch, AllenAI, and Hugging Face teams for their contributions to the open initiative: PyTorch FSDP allowed us to train this model smoothly, and the data from Dolma and FineWeb/Cosmopedia made this model what it is today!