SEA-LION v2 Release (#10)
Updating repository content for SEA-LION v2 model release
echoa authored Jul 31, 2024
1 parent c032219 commit 9318b74
Showing 20 changed files with 279 additions and 104 deletions.
README.md: 191 changes (87 additions, 104 deletions)

# <img align="center" src="images/purple_sealion-64x64.png"> A Family of Southeast Asian Language Models

***Updated: 31 July 2024***

SEA-LION is a family of open-source language models developed by AI Singapore that better understands the diverse contexts, languages, and cultures of Southeast Asia (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.

## Truly Open Source

Our first versions of SEA-LION, released in December 2023, were trained from scratch using [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile) (about 1 trillion tokens). Our new version of SEA-LION is instead based on continued pre-training of good open-source models; version 2 is based on Llama 3. We believe this approach, i.e. continued pre-training, may be more sustainable over the longer run.

## Transparent and Open Source

We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:

1. *Pre-Training* data
2. Model *training* code
3. Model *weights*
4. *Fine-Tuning* data
5. Evaluation *benchmarks*

# LATEST MODELS

## Key Features of SEA-LION v2

- Continued pre-trained and fine-tuned from Llama 3 (with more models to follow)
- Instruction-tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
- Trained with up to 50B tokens from SEA languages
- Outperforms base Llama 3 and other models on both general and SEA capabilities
- Our contributions are open source (under the MIT License); data and model licenses are listed on their respective Hugging Face data or model cards

See our [HuggingFace](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-instruct) page for more detailed model and license information.

## How To Download SEA-LION v2

SEA-LION models are available for download on HuggingFace at:

### SEA-LION v2
**Base Models**
* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-base)

**Instruction-Tuned Models**
* [Llama3-8B-CPT-SEA-LION-V2-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-instruct)

**Quantized Models**
* [Llama3-8B-CPT-SEA-LION-V2-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct-gguf)
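
Since all SEA-LION artifacts are hosted on Hugging Face, any standard Hugging Face download workflow applies. As a minimal sketch (assuming the `huggingface_hub` package is installed), the instruct model can also be fetched programmatically:

```python
# A minimal download sketch; assumes `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

# Download all files of the instruct model into the local HF cache
# and return the local directory path.
local_dir = snapshot_download(repo_id="aisingapore/llama3-8b-cpt-sealionv2-instruct")
print(local_dir)
```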

## Getting Started

To use SEA-LION v2:

```python
# Please use transformers==4.43.2
import transformers
import torch

model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"

# Load the instruction-tuned model in bfloat16 with a chat-style
# text-generation pipeline, placing layers automatically across devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Example prompt in Bahasa Indonesia:
# "What is the sentiment of the following sentence?
#  Sentence: This book is very boring.
#  Answer:"
messages = [
    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
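
For chat-style inputs, the pipeline returns the whole conversation in `generated_text` as a list of messages, so the final element printed above is the model's reply.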

## Performance and Benchmarks

SEA-LION v2 achieves better or competitive performance on tasks in regional languages, while retaining the general performance of Llama 3.

Our [leaderboard is here](https://leaderboard.sea-lion.ai).

We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also meticulously handcrafted linguistic and cultural diagnostic tests tailored to Southeast Asia.

The benchmark is introduced in [BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models](https://arxiv.org/abs/2309.06085v2), with code on [GitHub](https://github.com/aisingapore/bhasa).

## Deployment Framework

### Text Generation Inference (TGI)

Please refer to [serving the SEA-LION model with TGI](https://github.com/aisingapore/sealion-tgi).
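
Once a TGI server is up (the repository above covers the setup), any TGI-compatible client can query it. A minimal client-side sketch, assuming the server is listening on `localhost:8080` (the host and port here are illustrative assumptions):

```python
# A client-side sketch; assumes a TGI server for SEA-LION v2 is already
# running at localhost:8080 (illustrative address, not a fixed value).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# TGI performs raw text generation, so for best results the Llama 3 chat
# template should be applied to the prompt when using the instruct model.
response = client.text_generation(
    "Apa ibu kota Indonesia?",  # "What is the capital of Indonesia?"
    max_new_tokens=64,
)
print(response)
```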

### vLLM

Please refer to [serving the SEA-LION model with vLLM](https://github.com/aisingapore/sealion-vllm).
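
The repository above covers the supported serving setup. As an illustrative sketch of offline batch inference with vLLM's Python API (assuming `vllm` is installed and the model fits in GPU memory):

```python
# An offline-inference sketch using vLLM's Python API;
# assumes `pip install vllm` and a sufficiently large GPU.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"

# Apply the model's chat template so the instruct model receives the
# same prompt format it was fine-tuned with.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Apa ibu kota Indonesia?"}],  # "What is the capital of Indonesia?"
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_id, dtype="bfloat16")
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
```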

### Ollama

To run SEA-LION locally with Ollama via the command line:
1. [Download and install Ollama](https://ollama.com)
2. Run and chat with SEA-LION with the following command:
```bash
ollama run aisingapore/llama3-8b-cpt-sea-lionv2-instruct
```

Alternatively, [explore SEA-LION with Chainlit and Ollama here](https://github.com/aisingapore/sealion-chainlit-ollama).
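
Ollama also exposes a local REST API (on port 11434 by default) that any language can call once the model has been pulled; a minimal sketch in Python:

```python
# A sketch against Ollama's local REST API; assumes the model was pulled
# with `ollama run` above and the Ollama server is on its default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "aisingapore/llama3-8b-cpt-sea-lionv2-instruct",
        "messages": [{"role": "user", "content": "Apa ibu kota Indonesia?"}],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])
```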

## Contributing

Some ways to contribute:
- Add more model evaluation tasks and metrics
- Train versions of the model in more SEA languages


## To Cite SEA-LION

If you use SEA-LION in your work, please cite it as:

```bibtex
@misc{sea_lion_2024,
  title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia},
  author={AI Singapore},
  year={2024},
  howpublished={\url{https://github.com/aisingapore/sealion}}
}
```

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore, or the National University of Singapore.

## Contact

For questions, comments, or issues, please open a GitHub issue or contact us via
```bibtex
@misc{leong2023bhasa,
  title={BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
  author={Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
  year={2023},
  eprint={2309.06085},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
# OTHER MODELS

## SEA-LION v1

- 3 to 7 billion parameters
- Instruction-tuned in English and Bahasa Indonesia
- Trained with 980B tokens of text data from 11 languages spoken across SEA
- Specialized vocabulary and tokenization for optimal performance on SEA languages
- Excels on tasks in regional languages
- Open source under the MIT License for community contribution and adoption


**Base Models**
* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b)
* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b)

**Instruction-Tuned Models**
* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research)
* [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct)

**Model Details**
Please see model cards on Hugging Face.
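
For reference, the quick-start from the earlier README still applies to the v1 models (they use a custom MPT-based architecture, hence `trust_remote_code=True` and the pinned transformers version):

```python
# please use transformers 4.34.1 for the v1 models
from transformers import AutoTokenizer, AutoModelForCausalLM

# v1 models ship custom modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)

tokens = tokenizer("Sea lion in the sea", return_tensors="pt")
output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```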

Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/SEALIONV1_README.md).