small changes

istvan-fodor · Oct 18, 2024 · ddd81a3
1 parent e47c71c
commit ddd81a3
Showing 6 changed files with 78 additions and 9 deletions.
diff --git a/LICENSE.md b/LICENSE.md
@@ -0,0 +1,25 @@
+The MIT License (MIT)
+=====================
+
+Copyright © `2024` `Istvan Fodor`
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the “Software”), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,6 +1,13 @@
-Whisper Finetuning for Robot Commands
-=====================================
+Speech-to-Text Finetuning for Short Robot Commands
+==================================================
+
+This repo has two programs:
+
+1. [record](/record) records text and audio commands into Parquet files.
+
+2. [finetune](/finetune) finetunes a Whisper model on the recorded data.
+
+See the README in each folder for how to use them in conjunction to finetune the Whisper speech-to-text model for your robotics use case.
 
-This repo has two programs. Once is used for 
 
 
diff --git a/finetune/README.md b/finetune/README.md
@@ -0,0 +1,36 @@
+# Whisper Model Fine-Tuning for Speech Recognition
+
+This project fine-tunes OpenAI's Whisper model on the recorded dataset. The dataset is processed and tokenized, and the model is trained, with Hugging Face's `transformers` library; the audio data is preprocessed with `pydub`.
+
+Data is loaded from `../audio/*.parquet`, so record some files there first with the recording component of this project.
+
+## Requirements
+
+To run this project, install the PyTorch dependencies first. Use the [Get Started with Torch](https://pytorch.org/get-started/locally/) guide to pick the right channel for your setup (CUDA, ROCm, CPU, macOS vs. Linux, etc.):
+
+```bash
+# I used an AMD card, so I installed the ROCm build:
+pip install -r torch-requirements.txt --index-url https://download.pytorch.org/whl/rocm6.2
+```
+
+After this step, install the rest of the dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+## Settings
+
+This project finetunes the small Whisper model. The training parameters are chosen for small datasets and limited memory. If you have more capable hardware, experiment with the parameters.
+
+## How it Works
+
+```bash
+python whisper_finetune.py
+```
+
+Once the program finishes, it will store a checkpoint in the [whisper_finetuned](/whisper_finetuned) folder at the root of the repo. Other applications can load Whisper from this folder.
+
+
+## License
+
+This project is licensed under the MIT License.
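
The README above says other applications can load Whisper from `whisper_finetuned`, but not which subfolder to point at. The following is a minimal sketch, assuming the Hugging Face `Trainer` default of writing checkpoints into `checkpoint-<step>` subfolders of `output_dir`. The helper name `latest_checkpoint` is hypothetical and not part of this repo.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    # Assumption: checkpoints follow the Trainer naming scheme checkpoint-<step>
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    # Compare by numeric step so checkpoint-1000 ranks above checkpoint-500
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path could then be passed to `WhisperForConditionalGeneration.from_pretrained`. With `save_steps=500`, a short run on a small dataset may never reach a save step, in which case the helper returns `None`.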
diff --git a/finetune/whisper_finetune.py b/finetune/whisper_finetune.py
@@ -40,7 +40,6 @@ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) ->
 
 
 dataset = load_dataset('parquet', data_files = '../audio/*.parquet', streaming = False)
-print(dataset)
 dataset = dataset['train']
 
 # Function to process the audio data
@@ -91,7 +90,6 @@ def prepare_dataset(batch):
 # Set the dataset format for PyTorch
 dataset.set_format(type='torch', columns=['input_features', 'labels'])
 
-print(dataset)
 train_test = dataset.train_test_split(test_size=0.2)
 train_dataset = train_test['train']
 eval_dataset = train_test['test']
@@ -104,19 +102,21 @@ def prepare_dataset(batch):
 # Configure the model for English transcription
 model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
 
+use_fp16 = torch.cuda.is_available()
+
 # Define the training arguments
 training_args = Seq2SeqTrainingArguments(
     output_dir="../whisper_finetuned",
-    per_device_train_batch_size=8,
-    gradient_accumulation_steps=2,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=8,
     learning_rate=1e-5,
     num_train_epochs=3,
     logging_steps=10,
     save_steps=500,
     eval_steps=500,
     eval_strategy="steps",
     save_total_limit=2,
-    # fp16=True,
+    fp16=use_fp16,
     predict_with_generate=True,
 )
 

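The hunk above trades per-device batch size for gradient accumulation. As a rough sketch of the arithmetic, the number of samples contributing to each optimizer step is the product of the two settings and the device count. The function name `effective_batch_size` and the single-device default are assumptions for illustration.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values before and after this commit, assuming a single device
before = effective_batch_size(8, 2)
after = effective_batch_size(1, 8)
print(before, after)  # 16 8
```

On one device, the change halves the effective batch from 16 to 8. Each forward and backward pass now processes a single sample, which is what lowers peak memory.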
diff --git a/record/README.md b/record/README.md
@@ -23,7 +23,7 @@ The app uses a `system_prompt.txt` file to define the type of instructions gener
 
 ### 2. Recording Commands
 
-The app will display the command, and you can record your voice command based on the provided instructions. The recorded data (both text and audio) is stored in a Parquet file in the `audio` folder in the root of the project when you click the **Write to file** button.
+The app will display the command, and you can record your voice command based on the provided instructions. The recorded data (both text and audio) is stored in a Parquet file in the [audio](/audio) folder in the root of the project when you click the **Write to file** button.
 
 ### 3. Saving Data
 Once you're done recording a batch of commands, press the **Write to file** button. This will save the current set of commands and their corresponding audio files into a Parquet file named `whisper_training_data_N.parquet`, where `N` is the collection number to avoid overwriting previous files.
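
The `N` suffix described above can be chosen by scanning the files already in the folder. This is a minimal sketch of one way to do that, not necessarily how the app does it. The helper name `next_parquet_path` is hypothetical, and starting the count at 1 is an assumption.

```python
import re
from pathlib import Path


def next_parquet_path(audio_dir: str) -> Path:
    """Return a path whose N is one past the highest N already present."""
    folder = Path(audio_dir)
    pattern = re.compile(r"^whisper_training_data_(\d+)\.parquet$")
    used = [
        int(m.group(1))
        for p in folder.glob("whisper_training_data_*.parquet")
        if (m := pattern.match(p.name))
    ]
    # Assumption: numbering starts at 1 and continues past the highest N seen
    next_n = max(used, default=0) + 1
    return folder / f"whisper_training_data_{next_n}.parquet"
```

Taking one past the highest existing number, rather than filling gaps, keeps file names in recording order even if an earlier file was deleted.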

diff --git a/whisper_finetuned/.gitignore b/whisper_finetuned/.gitignore
@@ -0,0 +1 @@
+*/*