This project aims to fine-tune an LLM so that it can understand the Amharic language and generate an advertisement in Amharic from brand information and a product brief. It uses messages exported from 25 publicly available channels to extend the model's pre-training and, later, to fine-tune the model to generate ads.
To get started, you'll need the raw channel message data in a directory named data/raw. Then follow these steps to clean the data and prepare it for the model:
pip install -r requirements.txt
- Inside parse_and_save.ipynb, run the function process_raw_data to extract only the necessary fields from the raw data: id, text, and date.
- Inside cleaning.ipynb, run the function clean_parsed_data to get the cleaned data, with emojis, symbols, newlines, and extra spaces removed.
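The two notebook steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the notebooks: the helper names select_fields and clean_text are hypothetical, each raw message is assumed to be a dictionary, and "symbols" is assumed to mean Unicode symbol characters (which also covers most emojis). Amharic letters and Ethiopic punctuation such as ። are left untouched.

```python
import re
import unicodedata

# Field names taken from the README ("id, text, date"); the raw schema is an assumption.
KEEP_FIELDS = ("id", "text", "date")


def select_fields(message: dict) -> dict:
    """Keep only id, text, and date from one raw message (missing keys become None)."""
    return {field: message.get(field) for field in KEEP_FIELDS}


def clean_text(text: str) -> str:
    """Drop emojis and symbols, then collapse newlines and repeated spaces."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # "S*" categories are symbols (most emojis are "So");
        # "Cf" covers format characters such as zero-width joiners.
        if category.startswith("S") or category == "Cf":
            continue
        # Variation selectors that sometimes trail an emoji.
        if "\ufe00" <= ch <= "\ufe0f":
            continue
        kept.append(ch)
    # Newlines, tabs, and runs of spaces all become a single space.
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

For example, clean_text(select_fields(message)["text"]) would give the cleaned text of a single message under these assumptions. Note that treating every symbol character as noise also removes characters such as + and $, which may or may not be what the notebooks do.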
To test inference with the model being used, follow these steps:
1. Accept the Llama 2 license on Hugging Face and download the model like this:
   - git lfs install
   - git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
2. Download the Amharic fine-tune from Hugging Face like this:
   - git lfs install
   - git clone https://huggingface.co/iocuydi/llama-2-amharic-3784m
3. Clone this GitHub repository.
4. Then, inside inference/run_inf.py:
   - change MAIN_PATH to the path of the folder you downloaded in step 1
   - change peft_model to the path of the folder you cloned in step 2
5. Go to your Llama 2 folder (from step 1) and replace the tokenizer-related files with the ones from step 2.
6. Set quantization=True inside the main function, before the load_model function call.
7. Finally, run the inference/run_inf.py file.
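The tokenizer replacement step above can be scripted instead of done by hand. The sketch below is only an illustration: the function name replace_tokenizer_files is hypothetical, and the file-name patterns are an assumption about what counts as "tokenizer related", so check them against the files actually present in the fine-tune folder.

```python
import shutil
from pathlib import Path
from typing import List

# Assumed file-name patterns; adjust to match what the fine-tune folder actually ships.
TOKENIZER_PATTERNS = ("tokenizer*", "special_tokens_map.json")


def replace_tokenizer_files(finetune_dir: str, llama_dir: str) -> List[str]:
    """Copy tokenizer-related files from finetune_dir into llama_dir, overwriting."""
    source, target = Path(finetune_dir), Path(llama_dir)
    copied = []
    for pattern in TOKENIZER_PATTERNS:
        for path in sorted(source.glob(pattern)):
            if path.is_file():
                # Overwrites any file with the same name in the Llama 2 folder.
                shutil.copy2(path, target / path.name)
                copied.append(path.name)
    return copied
```

With the default clone directory names, a call would look like replace_tokenizer_files("llama-2-amharic-3784m", "Llama-2-7b-hf"). Since the copy overwrites files in place, it may be worth backing up the original tokenizer files first.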