small changes
istvan-fodor committed Oct 18, 2024
1 parent e47c71c commit ddd81a3
Showing 6 changed files with 78 additions and 9 deletions.
25 changes: 25 additions & 0 deletions LICENSE.md
@@ -0,0 +1,25 @@
The MIT License (MIT)
=====================

Copyright © `2024` `Istvan Fodor`

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
13 changes: 10 additions & 3 deletions README.md
@@ -1,6 +1,13 @@
-Whisper Finetuning for Robot Commands
-=====================================
+Speech-to-Text Finetuning for Short Robot Commands
+==================================================
+
+This repo has two programs.
+
+1. [record](/record) records text and audio commands into Parquet files.
+
+2. [finetune](/finetune) finetunes a Whisper model on the recorded data.
+
+See the README in each folder for how to use the two together to finetune the Whisper speech-to-text model for your robotics use case.

-This repo has two programs. Once is used for


36 changes: 36 additions & 0 deletions finetune/README.md
@@ -0,0 +1,36 @@
# Whisper Model Fine-Tuning for Speech Recognition

This project fine-tunes OpenAI's Whisper model on the recorded dataset. The data is processed and tokenized, and the model is trained, with Hugging Face's `transformers` library; the audio is preprocessed with `pydub`.

Data is loaded from `../audio/*.parquet`, so record files there first with the recording component of this project.

## Requirements

To run this project, install the PyTorch dependencies first. Use the [Get Started with Torch](https://pytorch.org/get-started/locally/) guide to pick the right channel for your setup (CUDA, ROCm, CPU, OSX vs Linux, etc.):

```bash
# I used an AMD card, so I installed the ROCm build:
pip install -r torch-requirements.txt --index-url https://download.pytorch.org/whl/rocm6.2
```

After this step, install the rest of the dependencies.
```bash
pip install -r requirements.txt
```

## Settings

This project finetunes the small Whisper model. The training parameters are chosen for small datasets and limited memory; if you have more capable hardware, experiment with them.

## How it Works

```bash
python whisper_finetune.py
```

When the program finishes, it stores a checkpoint in the [whisper_finetuned](/whisper_finetuned) folder at the repository root. Other applications can load the finetuned Whisper model from this folder.


## License

This project is licensed under the MIT License.
10 changes: 5 additions & 5 deletions finetune/whisper_finetune.py
@@ -40,7 +40,6 @@ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) ->


dataset = load_dataset('parquet', data_files = '../audio/*.parquet', streaming = False)
-print(dataset)
dataset = dataset['train']

# Function to process the audio data
@@ -91,7 +90,6 @@ def prepare_dataset(batch):
# Set the dataset format for PyTorch
dataset.set_format(type='torch', columns=['input_features', 'labels'])

-print(dataset)
train_test = dataset.train_test_split(test_size=0.2)
train_dataset = train_test['train']
eval_dataset = train_test['test']
@@ -104,19 +102,21 @@ def prepare_dataset(batch):
# Configure the model for English transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

+use_fp16 = torch.cuda.is_available()
+
# Define the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="../whisper_finetuned",
-    per_device_train_batch_size=8,
-    gradient_accumulation_steps=2,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    eval_strategy="steps",
    save_total_limit=2,
-    # fp16=True,
+    fp16=use_fp16,
    predict_with_generate=True,
)
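
The batch-size change in this hunk trades memory per step for more accumulation steps. The arithmetic can be sketched as follows, assuming a single device; `effective_batch_size` and its `num_devices` parameter are introduced here for illustration only.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices


old = effective_batch_size(per_device=8, accumulation_steps=2)
new = effective_batch_size(per_device=1, accumulation_steps=8)
print(old, new)  # 16 8
```

Under these assumptions, the new settings process one sample per forward pass, and the effective batch per optimizer step drops from 16 to 8.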

2 changes: 1 addition & 1 deletion record/README.md
@@ -23,7 +23,7 @@ The app uses a `system_prompt.txt` file to define the type of instructions gener

### 2. Recording Commands

-The app will display the command, and you can record your voice command based on the provided instructions. The recorded data (both text and audio) is stored in a Parquet file in the `audio` folder in the root of the project when you click the **Write to file** button.
+The app will display the command, and you can record your voice command based on the provided instructions. The recorded data (both text and audio) is stored in a Parquet file in the [audio](/audio) folder in the root of the project when you click the **Write to file** button.

### 3. Saving Data
Once you're done recording a batch of commands, press the **Write to file** button. This will save the current set of commands and their corresponding audio files into a Parquet file named `whisper_training_data_N.parquet`, where `N` is the collection number to avoid overwriting previous files.
1 change: 1 addition & 0 deletions whisper_finetuned/.gitignore
@@ -0,0 +1 @@
*/*
