Cabra

Cabra: A Portuguese fine-tuned instruction Open-LLaMA

This repository is intended to share all the steps and resources that we used to fine-tune our version of Open-LLaMA.

This model may be used for commercial purposes: it is released under the cc-by-sa-3.0 license and was fine-tuned only on commercially licensed sources.

The project was strongly inspired by Cabrita, which followed a similar approach but did not use commercially licensed sources and fine-tuned a smaller 7B model instead of a 13B one.

The resulting models can be downloaded at Hugging Face:

You can test the inference online at Kaggle here.

Benchmarks

We have introduced a novel benchmark to assess the performance of models in English-to-Portuguese translation tasks. Named HIPPO (High-level Interlingual Performance Proximity Optimized), this automated benchmark evaluates not only the absence of grammatical errors but also the semantic closeness of the translation to a reference, using embeddings. You can access the documentation for the HIPPO Benchmark here.
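
To make the scoring concrete, below is a minimal, hypothetical sketch of how such a metric could be computed with off-the-shelf tools (sentence-transformers for embedding similarity and language-tool-python for grammar checking). This is not the actual HIPPO implementation; the embedding model and the example strings are illustrative.

```python
# Hypothetical sketch, not the actual HIPPO implementation.
from sentence_transformers import SentenceTransformer, util
import language_tool_python

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
grammar = language_tool_python.LanguageTool("pt-BR")

def grammar_pass(candidate: str) -> bool:
    # A translation "passes" grammar if the checker reports no issues.
    return len(grammar.check(candidate)) == 0

def paraphrase_score(candidate: str, reference: str) -> float:
    # Cosine similarity between the embeddings of the candidate and the reference.
    emb = embedder.encode([candidate, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

candidate = "O gato está sentado no tapete."
reference = "O gato está sentado sobre o tapete."
print(grammar_pass(candidate), round(paraphrase_score(candidate, reference), 3))
```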

For these models, we utilized the parameters listed below, which we believe can still be optimized:

Results Overview:

| Model | Fine-tuned? | Original Model | Allows Commercial Use? | HIPPO-Opus100-Grammar | HIPPO-Opus100-Paraphrase | HIPPO-Opus100-Combined |
|---|---|---|---|---|---|---|
| Llama-7B | No | N/A | No | 6.11% | 8.91% | 2.10% |
| Alpaca-7B | Yes | Llama-7B | No | 32.60% | 46.30% | 15.25% |
| Cabrita-7B | Yes | Llama-7B | No | 34.00% | 33.85% | 10.60% |
| OpenLlama-Instruct-13B | Yes | OpenLlama-13B | Yes | 26.35% | 32.50% | 8.75% |
| Cabra-13B (ours) | Yes | OpenLlama-Instruct-13B | Yes | 35.75% | 37.75% | 12.40% |
| LibreTranslate (reference) | Not-LLM | Not-LLM | Yes | 44.55% | 66.40% | 27.90% |

  • Fine-tuning models specifically for the target language significantly improves their capability.
  • Alpaca-7B showcases remarkable efficiency even without Portuguese-specific fine-tuning. This could be attributed to the superiority of the Alpaca Dataset compared to Dolly or possibly because Llama's foundational training surpasses that of OpenLlama.
  • Cabrita made improvements but lags behind Alpaca-7B. This might be due to the quick implementation of LoRA fine-tuning.
  • Cabra, despite its gains, still offers considerable room for enhancement, especially given its higher parameter count.
  • LibreTranslate, taken as a reference, faces challenges with this benchmark as well, indicating the complexities and nuances of certain translation tasks.

References

If I have seen further it is by standing on the sholders [sic] of Giants. -- Isaac Newton

We start this section with this quotation because everything we accomplished was only possible thanks to the strong community and the work that other people and groups have done. For our work, we relied mainly on the projects developed by Cabrita, OpenLlama, Hugging Face, Open-Instruct, VMware, Dolly, and LibreTranslate. So, thank you all for the great work and contribution to the open-source community!

Data

We translated databricks-dolly-15k.jsonl into Portuguese using LibreTranslate. Although the translation is imperfect, the trade-off between speed, cost, and quality is good: we were able to complete it in a few hours on a consumer-grade laptop. The translation script is available in this repository.
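
As a rough illustration of that step, here is a hedged sketch of how each record could be pushed through LibreTranslate's HTTP API, assuming a self-hosted instance at localhost:5000. The file names and field handling are simplified; the actual script in this repository remains the reference.

```python
# Hypothetical sketch of the translation pass, assuming a local LibreTranslate server.
import json
import requests

LT_URL = "http://localhost:5000/translate"  # assumed self-hosted instance

def translate(text: str) -> str:
    if not text.strip():
        return text
    resp = requests.post(
        LT_URL,
        json={"q": text, "source": "en", "target": "pt", "format": "text"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["translatedText"]

with open("databricks-dolly-15k.jsonl", encoding="utf-8") as fin, \
     open("dolly-15k-pt.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        row = json.loads(line)
        for field in ("instruction", "context", "response"):
            row[field] = translate(row.get(field, ""))
        fout.write(json.dumps(row, ensure_ascii=False) + "\n")
```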

If you want to know more about how the dataset was built, see Dolly.

Unlike the original Cabrita project, we refrained from using GPT for translation due to licensing issues. Also, the recently released Llama 2 model is currently not licensed for use in Portuguese.

Fine-tuning

To fine-tune the OpenLLaMA model, we based our work on the code available in Cabrita, which fine-tunes the LLaMA model using PEFT from Hugging Face; we only adapted it to OpenLlama-Instruct. With this, we could run our fine-tuning step on a single A4000 at Paperspace on top of OpenLLaMA-OpenInstruct-13B. We trained for about 7 hours and found the results quite impressive for that amount of time. The notebook we used is available here.
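
For reference, a minimal sketch of the LoRA setup with PEFT is shown below. It follows the general shape of the Cabrita/Alpaca-LoRA code, but the hyperparameters and target modules here are illustrative, not the exact values from our notebook.

```python
# Illustrative LoRA setup with PEFT; values are not the exact ones from our notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "VMware/open-llama-13b-open-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in Alpaca-LoRA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```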

Example outputs

open-llama-13b-open-instruct:

Instruction: Invente uma desculpa criativa pra dizer que não preciso ir à festa.
Response: Sentimento de culpa. Eu não quero ir à festa, pois esse tipo de evento é um local em que as pessoas se reúnem para ter aulas de dança e canto, e também para se divertir e socializar. Mas eu não quero ir, pois estou comum ao público e não quero expor a minha alma ao risco de ser criticada ou ridicularizada.

Cabra:

Instrução: 
Invente uma desculpa criativa pra dizer que não preciso ir à festa.

Resposta: 
Tenho problemas financeiros.

open-llama-13b-open-instruct:

Instruction: Como posso começar a carreira como cientista de dados? Escreva na forma de lista.
Response:
1. Obtenha um bacharelado em ciência da computação ou em engenharia da computação.
2. Obtenha um mestrado em ciência da computação ou em engenharia da computação com ênfase em inteligência artificial.
3. Comece a trabalhar em um campo de interesse, como a mensagem, o negócio ou a ciência.
4. Participação em eventos de ciência da computação, como conferências, workshops e congressos.
5. Aplique para bolsas de pesquisa e desenvolvimento.
6. Aplique para o Programa de Pós-Doutoramento em Ciência da Computação da Microsoft.
7. Aplique para o Programa de Pós-Doutoramento em Ciência da Computação da Facebook.
8. Aplique para o Programa de Pós-Doutoramento em Ciência da Computação da Google.
9. Aplique para o Programa de Pós-Doutoramento em Ciência da Computação da Amazon.

Cabra:

Instrução: 
Como posso começar a carreira como cientista de dados? Escreva na forma de lista.

Resposta: 
- Estudar matemática, ciência da computação e engenharia. 
- Aprender a usar um framework de desenvolvimento de software que você goste. 
- Aprendendo sobre os conceitos de análise de dados e análises estatísticas. 
- Aprender a fazer visualizações de dados e apresentações. 
- Fazer programas para realizar processamentos de dados. 
- Selecionando uma área de interesse e trabalhando nela. 
- Desenvolver habilidades de comunicação e organização que são importantes para trabalho de dados. 
- Trabalhando em projetos em grupo ou sozinho.

You can download the eval notebook to test here.
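
If you prefer to script a quick test instead of using the notebook, the following hedged sketch loads the base model with a LoRA adapter and generates from an Alpaca-style prompt. The adapter path is a placeholder for the checkpoint on our Hugging Face page, and the prompt template is illustrative rather than the exact one used in training.

```python
# Hedged inference sketch; adapter path and prompt template are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "VMware/open-llama-13b-open-instruct"
adapter = "path/to/cabra-lora-adapter"  # placeholder: use the checkpoint from Hugging Face

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)

prompt = (
    "Abaixo está uma instrução que descreve uma tarefa. "
    "Escreva uma resposta que complete o pedido de forma adequada.\n\n"
    "### Instrução:\nInvente uma desculpa criativa pra dizer que não preciso ir à festa.\n\n"
    "### Resposta:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```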

Recent Advancements

  • 2023-08-07: Introduced the HIPPO-Opus100 benchmark to assess the quality of each model's fine-tuning.

Upcoming Objectives

  • Language Impact on Prompt Structure: Investigate the differential impacts of using the Alpaca prompt structure in English compared to Portuguese.
  • Efficiency Analysis: Examine the effectiveness of emotional stimuli in model performance and responses.
  • Dataset Enhancement: We aim to improve the Portuguese dataset using advanced translation tools. Our current strategy involves leveraging the open-instruct-v1-oasst-dolly-hhrlhf dataset, combined with automatic translations from MarianMT and LibreTranslate. The resulting dataset will then undergo HIPPO filtering to remove rows containing grammatical errors or with poor semantic proximity (a rough sketch of this filtering pass follows this list).
  • Model Fine-Tuning: As new versions of OpenLlama are launched, we plan to fine-tune our larger models accordingly.
  • Alternative Model Study: We are researching alternatives to OpenLlama that are not only efficient but also permit commercial usage, such as the MPT.
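
The planned filtering pass mentioned above could look roughly like the sketch below, reusing the same grammar check and embedding similarity as in the HIPPO sketch; the field names and the similarity threshold are hypothetical.

```python
# Hypothetical HIPPO-style filtering pass; field names and threshold are illustrative.
import json
import language_tool_python
from sentence_transformers import SentenceTransformer, util

grammar = language_tool_python.LanguageTool("pt-BR")
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def keep(row: dict, threshold: float = 0.8) -> bool:
    pt, en = row["response_pt"], row["response_en"]  # hypothetical field names
    if grammar.check(pt):                 # any grammar issue -> drop the row
        return False
    emb = embedder.encode([pt, en], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

with open("translated.jsonl", encoding="utf-8") as fin, \
     open("filtered.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        if keep(json.loads(line)):
            fout.write(line)
```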

Authors
