From 9318b7404b25d30fbd481871a80d507efc9ce244 Mon Sep 17 00:00:00 2001
From: echoa
Date: Wed, 31 Jul 2024 17:39:24 +0800
Subject: [PATCH] SEA-LION v2 Release (#10)

Updating repository content for SEA-LION v2 model release
---
README.md | 191 ++++++++---------
sea-lion-v1/SEALIONV1_README.md | 192 ++++++++++++++++++
{docs => sea-lion-v1/docs}/_config.yml | 0
{docs => sea-lion-v1/docs}/index.md | 0
{docs => sea-lion-v1/docs}/promptguide.md | 0
{docs => sea-lion-v1/docs}/sealion_demo.mp4 | Bin
.../examples}/fine-tuning/README.md | 0
.../examples}/fine-tuning/data_functions.py | 0
.../fine-tuning/qlora_fine_tuning.py | 0
.../examples}/fine-tuning/requirements.txt | 0
.../examples}/fine-tuning/train_config.yaml | 0
.../examples}/inference/inference.py | 0
.../examples}/requirements.txt | 0
.../pre-training}/3B/launch.sh | 0
.../pre-training}/3B/launch.slurm | 0
.../pre-training}/3B/mpt-3b.yaml | 0
.../pre-training}/7B/launch.sh | 0
.../pre-training}/7B/launch.slurm | 0
.../pre-training}/7B/mpt-7b.yaml | 0
.../pre-training}/README-PRE-TRAINING.md | 0
20 files changed, 279 insertions(+), 104 deletions(-)
create mode 100644 sea-lion-v1/SEALIONV1_README.md
rename {docs => sea-lion-v1/docs}/_config.yml (100%)
rename {docs => sea-lion-v1/docs}/index.md (100%)
rename {docs => sea-lion-v1/docs}/promptguide.md (100%)
rename {docs => sea-lion-v1/docs}/sealion_demo.mp4 (100%)
rename {examples => sea-lion-v1/examples}/fine-tuning/README.md (100%)
rename {examples => sea-lion-v1/examples}/fine-tuning/data_functions.py (100%)
rename {examples => sea-lion-v1/examples}/fine-tuning/qlora_fine_tuning.py (100%)
rename {examples => sea-lion-v1/examples}/fine-tuning/requirements.txt (100%)
rename {examples => sea-lion-v1/examples}/fine-tuning/train_config.yaml (100%)
rename {examples => sea-lion-v1/examples}/inference/inference.py (100%)
rename {examples => sea-lion-v1/examples}/requirements.txt (100%)
rename {pre-training => sea-lion-v1/pre-training}/3B/launch.sh (100%)
rename {pre-training => sea-lion-v1/pre-training}/3B/launch.slurm (100%)
rename {pre-training => sea-lion-v1/pre-training}/3B/mpt-3b.yaml (100%)
rename {pre-training => sea-lion-v1/pre-training}/7B/launch.sh (100%)
rename {pre-training => sea-lion-v1/pre-training}/7B/launch.slurm (100%)
rename {pre-training => sea-lion-v1/pre-training}/7B/mpt-7b.yaml (100%)
rename {pre-training => sea-lion-v1/pre-training}/README-PRE-TRAINING.md (100%)

diff --git a/README.md b/README.md
index 303c79b..2d7d621 100644
--- a/README.md
+++ b/README.md
@@ -2,13 +2,15 @@
# A Family of Southeast Asian Language Models
-***Updated: 12 March 2024***
+***Updated: 31 July 2024***
SEA-LION is a family of open-source language models developed by AI Singapore that better understands Southeast Asia's diverse contexts, languages, and cultures (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.
-## Truly Open Source
+Our first versions of SEA-LION, released in December 2023, were trained from scratch using [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile) (about 1 trillion tokens). Our new version of SEA-LION is based on continued pre-training of good open-source models. Version 2 is based on Llama 3. We believe that this approach, i.e. continued pre-training, may be more sustainable over the longer run.
-We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. We therefore make the following (open-source compliant) contributions:
+## Transparent and Open Source
+
+We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:
1. *Pre-Training* data
2. Model *training* code
@@ -16,130 +18,92 @@ We have benefited greatly from the open-source community and believe that effort
4. *Fine-Tuning* data
5. Evaluation *benchmarks*
-## Key Features
+# LATEST MODELS
-- 3 to 7 billion parameters (larger models to be released through 2024)
-- Instruction-tuned in English and Bahasa Indonesia, with more to follow
-- Trained on 980B tokens of text data from 11 languages spoken across SEA
-- Specialized vocabulary and tokenization for optimal performance on SEA languages
-- Excels on tasks in regional languages
-- Open source under the MIT License for community contribution and adoption
+## Key Features of SEA-LION v2
-## Getting Started
-
-To use SEA-LION:
+- Continued pre-training and fine-tuning of Llama 3 (with more models to follow)
+- Instruction-tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
+- Trained with up to 50B tokens from SEA languages
+- Outperforms base Llama 3 and other models in both general and SEA capabilities
+- Our contributions are open source (under MIT license); data and model licenses are listed on their respective Hugging Face data or model cards
-```python
-# please use transformers 4.34.1
-from transformers import AutoTokenizer, AutoModelForCausalLM
+See our [Hugging Face model card](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-instruct) for more detailed model and license information.
-tokenizer = AutoTokenizer.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)
-
-tokens = tokenizer("Sea lion in the sea", return_tensors="pt")
-output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
-print(tokenizer.decode(output[0], skip_special_tokens=True))
-```
-
-### How To Download SEA-LION
+## How To Download SEA-LION v2
SEA-LION models are available for download on HuggingFace at:
+### SEA-LION v2
**Base Models**
-* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b)
-* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b)
+* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-base)
-**Instruction-Tuned**
-* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research)
-* **LATEST** [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct)
+**Instruction-Tuned Models**
+* [Llama3-8B-CPT-SEA-LION-V2-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-instruct)
-## Model Details
+**Quantized Models**
+* [Llama3-8B-CPT-SEA-LION-V2-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct-gguf)
-SEA-LION is based on the MPT architecture with 32 layers and comes in two sizes:
-
-- [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b) : 3 billion parameters
-- [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b) : 7 billion parameters
-- [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research): 7 billion parameters, instruction-tuned in Bahasa Indonesia
-- **LATEST** [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct): 7 billion parameters, instruction-tuned in English and Bahasa Indonesia
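+
+If you want to fetch the model weights ahead of time (for example, for offline use or self-hosted serving), one option is the `huggingface_hub` client, as in the minimal sketch below. The repository ID is the instruct model listed above; swap in the base or GGUF repository as needed. The download location simply follows the library's default cache.
+
+```python
+# Minimal sketch (illustrative): pre-download a SEA-LION v2 checkpoint with huggingface_hub.
+# The repo_id is the instruct model listed above; the return value is the local snapshot path.
+from huggingface_hub import snapshot_download
+
+local_dir = snapshot_download(repo_id="aisingapore/llama3-8b-cpt-sealionv2-instruct")
+print(f"Model files downloaded to: {local_dir}")
+```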
-
-SEA-LION has been trained on a diverse dataset of 980B tokens spanning 11 natural languages:
-
-- English
-- Chinese
-- Indonesian
-- Malay
-- Thai
-- Vietnamese
-- Filipino
-- Tamil
-- Burmese
-- Khmer
-- Lao
-
-The dataset is available here [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile).
-
-The models use a vocabulary of 256,000 tokens and a context length of 2048 tokens. For tokenization, the model employs a custom SEA byte-pair encoding (BPE) tokenizer which is specially tailored for SEA languages, ensuring optimal model performance.
-
-## Benchmark
-
-We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also linguistic and cultural diagnostic tests which are meticulously handcrafted. These are tailored to Southeast Asia.
+## Getting Started
-The benchmark was introduced here [BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models](https://arxiv.org/abs/2309.06085v2) and [GitHub](https://github.com/aisingapore/bhasa).
+To use SEA-LION v2:
-## Performance
+```python
+# Please use transformers==4.43.2
-SEA-LION achieves better or competitive performances on tasks in regional languages:
+import transformers
+import torch
-| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
-|--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------|
-| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 |
-| SEA-LION-7B-Instruct | 68.41 | 91.45 | 17.98 | 57.48 | 58.04 | 17.54 | 53.10 | 60.80 |
-| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 |
-| SeaLLM 7B v2 | 44.40 | 80.13 | 55.24 | 64.01 | 63.28 | 17.31 | 43.60 | 82.00 |
-| Sailor-7B | 65.43 | 59.48 | 20.48 | 64.27 | 60.68 | 8.69 | 15.10 | 38.40 |
-| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 |
-| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 |
-| GPT-4 | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |
+model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"
-SEA-LION has an average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard):
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=model_id,
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device_map="auto",
+)
+messages = [
+    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
+]
-| Model | ARC | HellaSwag | MMLU | TruthfulQA | Average |
-|-------------|:-----:|:---------:|:-----:|:----------:|:-------:|
-| SEA-LION-7B | 39.93 | 68.51 | 26.87 | 35.09 | 42.60 |
+outputs = pipeline(
+    messages,
+    max_new_tokens=256,
+)
+print(outputs[0]["generated_text"][-1])
-For full details on the datasets, metrics, and results, please see the model cards:
+```
-* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b)
-* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b)
-* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research)
-* **LATEST** [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct)
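+
+If you prefer to manage tokenization and generation directly rather than going through `pipeline`, a minimal sketch along the following lines should also work. The model ID and example prompt are the same as above; the dtype and `max_new_tokens` value are illustrative choices rather than recommended settings.
+
+```python
+# Minimal sketch (illustrative): direct use of the tokenizer, chat template, and generate().
+# bfloat16 and max_new_tokens=256 are assumptions to adapt to your hardware and task.
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+messages = [
+    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
+]
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
+
+output = model.generate(input_ids, max_new_tokens=256)
+print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+```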
+## Performance and Benchmarks
-## SEA-LION Demo
+SEA-LION achieves better or competitive performances on tasks in regional languages, while retaining the general performance of Llama 3.
-A video demo of SEA-LION is available [here](https://aisingapore.github.io/sealion/).
+Our [leaderboard is here](https://leaderboard.sea-lion.ai).
-## Prompting Guide
-A basic prompting guide is provided [here](docs/promptguide.md)
+We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also linguistic and cultural diagnostic tests which are meticulously handcrafted. These are tailored to Southeast Asia.
-## Pre-Training Config and Guide
+The benchmark was introduced here [BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models](https://arxiv.org/abs/2309.06085v2) and [GitHub](https://github.com/aisingapore/bhasa).
-SEA-LION 3B and 7B models are trained on 32 nodes of A100 40GB on AWS EC2.
-The configuration used for pre-training and an overview guide is provided [here](pre-training/README-PRE-TRAINING.md).
## Deployment Framework
-## QLoRA Fine-Tuning Guide
+### Text Generation Inference (TGI)
-The SEA-LION models can be fine-tuned using the HuggingFace TRL library.
-An overview guide and sample configurations are provided [here](examples/fine-tuning/README.md).
+Please refer to [serving the SEA-LION model with TGI](https://github.com/aisingapore/sealion-tgi).
-## Deployment Framework
+### vLLM
-### Text-Generation-Inference (TGI)
+Please refer to [serving the SEA-LION model with vLLM](https://github.com/aisingapore/sealion-vllm).
-SEA-LION is natively supported in TGI from [v1.4.0](https://github.com/huggingface/text-generation-inference/releases/tag/v1.4.0).
### Ollama
-### vLLM
+To run SEA-LION locally with Ollama via command line:
+1. [Download and install Ollama](https://ollama.com)
+2. Run and chat with SEA-LION with the following command:
+   ```bash
+   ollama run aisingapore/llama3-8b-cpt-sea-lionv2-instruct
+   ```
-For SEA-LION vLLM intergration, please refer to this [guide for instructions](https://github.com/aisingapore/sealion/tree/vllm/vllm).
+Alternatively, [explore SEA-LION with Chainlit and Ollama here](https://github.com/aisingapore/sealion-chainlit-ollama).
## Contributing
@@ -152,27 +116,22 @@ Some ways to contribute:
- Add more model evaluation tasks and metrics
- Train versions of the model in more SEA languages
-## License
-
-SEA-LION is licensed under the [MIT License](LICENSE).
-
## To Cite SEA-LION
If you use SEA-LION in your work, please cite it as:
```bibtex
-@misc{sea_lion_2023,
+@misc{sea_lion_2024,
title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia},
author={AI Singapore},
- year={2023},
+ year={2024},
howpublished={\url{https://github.com/aisingapore/sealion}}
}
```
## Acknowledgements
-AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
-Any opinion, finding, conclusion or recommendation expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore, or the National University of Singapore.
+AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinion, finding, conclusion or recommendation expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore, or the National University of Singapore. ## Contact @@ -190,3 +149,27 @@ For questions, comments, or issues, please open a GitHub issue or contact us via primaryClass={cs.CL} } ``` +# OTHER MODELS + +## SEA-LION v1 + +- 3 to 7 billion parameters +- Instruction tuned in English and Bahasa Indonesia +- Trained with 980B tokens of text data from 11 languages spoken across SEA +- Specialized vocabulary and tokenization for optimal performance on SEA languages +- Excels on tasks in regional languages +- Open source under the MIT License for community contribution and adoption + + +**Base Models** +* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b) +* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b) + +**Instruction-Tuned Models** +* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research) +* [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct) + +**Model Details** +Please see model cards on Hugging Face. + +Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/SEALIONV1_README.md) diff --git a/sea-lion-v1/SEALIONV1_README.md b/sea-lion-v1/SEALIONV1_README.md new file mode 100644 index 0000000..c7b501f --- /dev/null +++ b/sea-lion-v1/SEALIONV1_README.md @@ -0,0 +1,192 @@ +# SEA-LION (Southeast Asian Languages In One Network) + +# A Family of Southeast Asian Language Models + +***Updated: 12 March 2024*** + +SEA-LION is a family of open-source language models developed by AI Singapore that better understands Southeast Asia's diverse contexts, languages, and cultures (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages. + +## Truly Open Source + +We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. We therefore make the following (open-source compliant) contributions: + +1. *Pre-Training* data +2. Model *training* code +3. Model *weights* +4. *Fine-Tuning* data +5. 
Evaluation *benchmarks* + +## SEA-LION v1 Key Features + +- 3 to 7 billion parameters (larger models to be released through 2024) +- Instruction-tuned in English and Bahasa Indonesia, with more to follow +- Trained on 980B tokens of text data from 11 languages spoken across SEA +- Specialized vocabulary and tokenization for optimal performance on SEA languages +- Excels on tasks in regional languages +- Open source under the MIT License for community contribution and adoption + +## Getting Started + +To use SEA-LION v1 models: + +```python +# please use transformers 4.34.1 +from transformers import AutoTokenizer, AutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True) + +tokens = tokenizer("Sea lion in the sea", return_tensors="pt") +output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id) +print(tokenizer.decode(output[0], skip_special_tokens=True)) +``` + +### How To Download SEA-LION v1 Models + +SEA-LION v1 models are available for download on HuggingFace at: + +**Base Models** +* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b) +* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b) + +**Instruction-Tuned** +* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research) +* [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct) + +## SEA-LION v1 Model Details + +SEA-LION v1 is based on the MPT architecture with 32 layers and comes in two sizes: + +- [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b) : 3 billion parameters +- [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b) : 7 billion parameters +- [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research): 7 billion parameters, instruction-tuned in Bahasa Indonesia +- [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct): 7 billion parameters, instruction-tuned in English and Bahasa Indonesia + +SEA-LION v1 has been trained on a diverse dataset of 980B tokens spanning 11 natural languages: + +- English +- Chinese +- Indonesian +- Malay +- Thai +- Vietnamese +- Filipino +- Tamil +- Burmese +- Khmer +- Lao + +The dataset is available here [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile). + +The models use a vocabulary of 256,000 tokens and a context length of 2048 tokens. For tokenization, the model employs a custom SEA byte-pair encoding (BPE) tokenizer which is specially tailored for SEA languages, ensuring optimal model performance. + +## Benchmark + +We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also linguistic and cultural diagnostic tests which are meticulously handcrafted. These are tailored to Southeast Asia. + +The benchmark was introduced here [BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models](https://arxiv.org/abs/2309.06085v2) and [GitHub](https://github.com/aisingapore/bhasa). 
+ +## Performance + +SEA-LION v1 achieves better or competitive performances on tasks in regional languages: + +| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) | +|--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------| +| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 | +| SEA-LION-7B-Instruct | 68.41 | 91.45 | 17.98 | 57.48 | 58.04 | 17.54 | 53.10 | 60.80 | +| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 | +| SeaLLM 7B v2 | 44.40 | 80.13 | 55.24 | 64.01 | 63.28 | 17.31 | 43.60 | 82.00 | +| Sailor-7B | 65.43 | 59.48 | 20.48 | 64.27 | 60.68 | 8.69 | 15.10 | 38.40 | +| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 | +| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 | +| GPT-4 | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 | + +SEA-LION v1 has an average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard): + +| Model | ARC | HellaSwag | MMLU | TruthfulQA | Average | +|-------------|:-----:|:---------:|:-----:|:----------:|:-------:| +| SEA-LION-7B | 39.93 | 68.51 | 26.87 | 35.09 | 42.60 | + +For full details on the datasets, metrics, and results, please see the model cards: + +* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b) +* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b) +* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research) +* [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct) + +## SEA-LION v1 Demo + +A video demo of SEA-LION v1 is available [here](https://aisingapore.github.io/sealion/). + +## SEA-LION v1 Prompting Guide +A basic prompting guide for the SEALION v1 models is provided [here](docs/promptguide.md) + +## SEA-LION v1 Pre-Training Config and Guide + +SEA-LION 3B and 7B v1 models are trained on 32 nodes of A100 40GB on AWS EC2. +The configuration used for pre-training and an overview guide is provided [here](pre-training/README-PRE-TRAINING.md). + +## SEA-LION v1 QLoRA Fine-Tuning Guide + +The SEA-LION v1 models can be fine-tuned using the HuggingFace TRL library. +An overview guide and sample configurations are provided [here](examples/fine-tuning/README.md). + +## SEA-LION v1 Deployment Framework + +### Text-Generation-Inference (TGI) + +SEA-LION is natively supported in TGI from [v1.4.0](https://github.com/huggingface/text-generation-inference/releases/tag/v1.4.0). + +### vLLM + +For SEA-LION vLLM intergration, please refer to this [guide for instructions](https://github.com/aisingapore/sealion/tree/vllm/vllm). + +## Contributing + +We welcome contributions to SEA-LION! Check out the [contributing guide](../CONTRIBUTING.md) to get started. 
+ +Some ways to contribute: + +- Report bugs and issues +- Enhance the documentation +- Add more model evaluation tasks and metrics +- Train versions of the model in more SEA languages + +## SEA-LION v1 Model License + +See Hugging Face for model license details + +## To Cite SEA-LION + +If you use SEA-LION in your work, please cite it as: + +```bibtex +@misc{sea_lion_2023, + title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia}, + author={AI Singapore}, + year={2023}, + howpublished={\url{https://github.com/aisingapore/sealion}} +} +``` + +## Acknowledgements + +AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. +Any opinion, finding, conclusion or recommendation expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore, or the National University of Singapore. + +## Contact + +For questions, comments, or issues, please open a GitHub issue or contact us via this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6). + +## References + +```bibtex +@misc{lowphansirikul2021wangchanberta, + title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, + author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong}, + year={2021}, + eprint={2101.09635}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +``` diff --git a/docs/_config.yml b/sea-lion-v1/docs/_config.yml similarity index 100% rename from docs/_config.yml rename to sea-lion-v1/docs/_config.yml diff --git a/docs/index.md b/sea-lion-v1/docs/index.md similarity index 100% rename from docs/index.md rename to sea-lion-v1/docs/index.md diff --git a/docs/promptguide.md b/sea-lion-v1/docs/promptguide.md similarity index 100% rename from docs/promptguide.md rename to sea-lion-v1/docs/promptguide.md diff --git a/docs/sealion_demo.mp4 b/sea-lion-v1/docs/sealion_demo.mp4 similarity index 100% rename from docs/sealion_demo.mp4 rename to sea-lion-v1/docs/sealion_demo.mp4 diff --git a/examples/fine-tuning/README.md b/sea-lion-v1/examples/fine-tuning/README.md similarity index 100% rename from examples/fine-tuning/README.md rename to sea-lion-v1/examples/fine-tuning/README.md diff --git a/examples/fine-tuning/data_functions.py b/sea-lion-v1/examples/fine-tuning/data_functions.py similarity index 100% rename from examples/fine-tuning/data_functions.py rename to sea-lion-v1/examples/fine-tuning/data_functions.py diff --git a/examples/fine-tuning/qlora_fine_tuning.py b/sea-lion-v1/examples/fine-tuning/qlora_fine_tuning.py similarity index 100% rename from examples/fine-tuning/qlora_fine_tuning.py rename to sea-lion-v1/examples/fine-tuning/qlora_fine_tuning.py diff --git a/examples/fine-tuning/requirements.txt b/sea-lion-v1/examples/fine-tuning/requirements.txt similarity index 100% rename from examples/fine-tuning/requirements.txt rename to sea-lion-v1/examples/fine-tuning/requirements.txt diff --git a/examples/fine-tuning/train_config.yaml b/sea-lion-v1/examples/fine-tuning/train_config.yaml similarity index 100% rename from examples/fine-tuning/train_config.yaml rename to sea-lion-v1/examples/fine-tuning/train_config.yaml diff --git a/examples/inference/inference.py b/sea-lion-v1/examples/inference/inference.py similarity index 100% rename from examples/inference/inference.py rename to sea-lion-v1/examples/inference/inference.py diff --git a/examples/requirements.txt 
b/sea-lion-v1/examples/requirements.txt similarity index 100% rename from examples/requirements.txt rename to sea-lion-v1/examples/requirements.txt diff --git a/pre-training/3B/launch.sh b/sea-lion-v1/pre-training/3B/launch.sh similarity index 100% rename from pre-training/3B/launch.sh rename to sea-lion-v1/pre-training/3B/launch.sh diff --git a/pre-training/3B/launch.slurm b/sea-lion-v1/pre-training/3B/launch.slurm similarity index 100% rename from pre-training/3B/launch.slurm rename to sea-lion-v1/pre-training/3B/launch.slurm diff --git a/pre-training/3B/mpt-3b.yaml b/sea-lion-v1/pre-training/3B/mpt-3b.yaml similarity index 100% rename from pre-training/3B/mpt-3b.yaml rename to sea-lion-v1/pre-training/3B/mpt-3b.yaml diff --git a/pre-training/7B/launch.sh b/sea-lion-v1/pre-training/7B/launch.sh similarity index 100% rename from pre-training/7B/launch.sh rename to sea-lion-v1/pre-training/7B/launch.sh diff --git a/pre-training/7B/launch.slurm b/sea-lion-v1/pre-training/7B/launch.slurm similarity index 100% rename from pre-training/7B/launch.slurm rename to sea-lion-v1/pre-training/7B/launch.slurm diff --git a/pre-training/7B/mpt-7b.yaml b/sea-lion-v1/pre-training/7B/mpt-7b.yaml similarity index 100% rename from pre-training/7B/mpt-7b.yaml rename to sea-lion-v1/pre-training/7B/mpt-7b.yaml diff --git a/pre-training/README-PRE-TRAINING.md b/sea-lion-v1/pre-training/README-PRE-TRAINING.md similarity index 100% rename from pre-training/README-PRE-TRAINING.md rename to sea-lion-v1/pre-training/README-PRE-TRAINING.md