A real-time speech transcription system using Parakeet MLX for Apple Silicon. This tool captures audio from your microphone and provides real-time transcription with word-level timestamps, continuous chunking, and multi-format export capabilities.
- Real-time transcription of microphone input
- Word-level timestamps to know exactly when each word was spoken
- Continuous chunking for longer recordings with overlapping context
- Multi-format export (TXT, SRT subtitles, and JSON)
- Colorized output for better visualization
- Device selection for systems with multiple microphones
- Python 3.9 or higher
- macOS with Apple Silicon (M1, M2, M3, etc.)
- A working microphone
uv is a fast Python package installer and resolver. To install the required dependencies with uv:
# Install uv if you haven't already
curl -fsSL https://astral.sh/uv/install.sh | bash
# Create and activate a new environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install mlx parakeet-mlx sounddevice numpy
If you prefer using pip instead:
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install mlx parakeet-mlx sounddevice numpy
To start transcription with default settings:
python enhanced_transcription.py
To see all available audio input devices on your system:
python enhanced_transcription.py --list-devices
To use a specific audio input device:
python enhanced_transcription.py --device "Device Name"
# Or use the device number
python enhanced_transcription.py --device 1
To adjust how audio is processed in chunks:
# Disable chunking (process in small segments only)
python enhanced_transcription.py --no-chunking
# Customize chunk duration and overlap
python enhanced_transcription.py --chunk-duration 15 --overlap-duration 3
Choose which output formats to save:
# Save only as SRT subtitle file
python enhanced_transcription.py --output-format srt
# Save in multiple formats
python enhanced_transcription.py --output-format txt,json
# Save in all formats (default)
python enhanced_transcription.py --output-format all
| Option | Description | Default |
|---|---|---|
| `--device` | Audio input device name or index | System default |
| `--list-devices` | List all available audio devices and exit | - |
| `--model` | Parakeet model to use | `mlx-community/parakeet-tdt-0.6b-v2` |
| `--no-chunking` | Disable chunking for continuous transcription | False |
| `--chunk-duration` | Duration of each chunk in seconds | 20.0 |
| `--overlap-duration` | Overlap between chunks in seconds | 4.0 |
| `--output-dir` | Directory to save transcriptions | `transcriptions` |
| `--output-format` | Output format (txt/srt/json/all) | all |
The enhanced transcription system works through several coordinated components:
The script captures audio from your selected microphone using the `sounddevice` library. Audio is captured in a non-blocking way and stored in a thread-safe queue for processing.
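A minimal sketch of that capture loop, assuming 16 kHz mono float32 input (the sample rate, capture length, and variable names are illustrative, not the script's exact code):

```python
# Hedged sketch: non-blocking capture with sounddevice feeding a thread-safe queue.
import queue
import numpy as np
import sounddevice as sd

audio_queue: "queue.Queue[np.ndarray]" = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    """Runs on sounddevice's audio thread; just hand the samples to the queue."""
    if status:
        print(status)
    audio_queue.put(indata.copy())

with sd.InputStream(samplerate=16_000, channels=1, dtype="float32",
                    callback=audio_callback):
    sd.sleep(2000)  # capture for ~2 seconds; a processing thread would drain the queue

blocks = []
while not audio_queue.empty():
    blocks.append(audio_queue.get())
captured = np.concatenate(blocks, axis=0)
print(f"captured {len(captured) / 16_000:.2f} s of audio")
```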
The captured audio goes through a processing pipeline:
- Preprocessing: Audio is normalized and converted to the format expected by the model
- Feature Extraction: Audio is converted to log-mel spectrograms using Parakeet's preprocessing
- Transcription: The Parakeet model generates transcription with timestamps
- Post-processing: Results are formatted and displayed/saved
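For reference, the same pipeline can be exercised offline through parakeet-mlx's high-level API. The sketch below is based on the library's published examples; the file name is hypothetical and attribute names (`result.text`, `sentence.start`/`sentence.end`) may differ between versions:

```python
# Hedged offline sketch of the pipeline above via parakeet-mlx's high-level API.
# The real script feeds microphone chunks instead of a file; attribute names are assumptions.
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("sample.wav")   # preprocessing, log-mel features, decoding
print(result.text)                        # full transcription
for sentence in result.sentences:         # aligned segments with timestamps
    print(f"[{sentence.start:6.2f}s - {sentence.end:6.2f}s] {sentence.text}")
```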
For longer recordings, the script uses a chunking strategy:
- Audio is processed in overlapping chunks (default: 20 seconds with 4 second overlap)
- This allows for continuous transcription while maintaining context between chunks
- The overlap helps prevent words from being cut off at chunk boundaries
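The chunk schedule itself is simple arithmetic. A sketch, assuming 16 kHz audio and hypothetical variable names rather than the script's actual code:

```python
# Hedged sketch of the overlapping-chunk schedule; 16 kHz and the names are assumptions.
import numpy as np

SAMPLE_RATE = 16_000
chunk_duration = 20.0     # --chunk-duration
overlap_duration = 4.0    # --overlap-duration

chunk_samples = int(chunk_duration * SAMPLE_RATE)
step_samples = int((chunk_duration - overlap_duration) * SAMPLE_RATE)

def iter_chunks(audio: np.ndarray):
    """Yield (start_sample, chunk) pairs sharing `overlap_duration` seconds of audio."""
    start = 0
    while start < len(audio):
        yield start, audio[start:start + chunk_samples]
        if start + chunk_samples >= len(audio):
            break          # last (possibly shorter) chunk reached the end
        start += step_samples

# Example: 60 s of audio -> chunks starting at 0 s, 16 s, 32 s, and 48 s (last one shorter).
for start, chunk in iter_chunks(np.zeros(60 * SAMPLE_RATE, dtype=np.float32)):
    print(f"chunk at {start / SAMPLE_RATE:.0f} s, {len(chunk) / SAMPLE_RATE:.0f} s long")
```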
The script provides:
- Real-time visualization of transcriptions with word-level timestamps
- Progress indicators showing the Real-Time Factor (RTF), i.e. processing time divided by audio duration (an RTF of 0.5 means 10 seconds of audio are transcribed in about 5 seconds)
- Export capabilities in multiple formats
Tracks the current state of the transcription process, including:
- Latest transcribed text
- Current audio chunk being processed
- Statistics about chunks processed
- Recording start time and duration
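A dataclass along these lines would capture that state; the field names here are hypothetical, not the script's actual attributes:

```python
# Hypothetical shape of the transcription state described above; field names are assumptions.
import time
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class TranscriptionState:
    latest_text: str = ""                          # latest transcribed text
    current_chunk: Optional[np.ndarray] = None     # audio chunk currently being processed
    chunks_processed: int = 0                      # statistics about chunks processed
    start_time: float = field(default_factory=time.time)

    @property
    def elapsed(self) -> float:
        """Recording duration in seconds since the session started."""
        return time.time() - self.start_time
```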
- `audio_callback`: Captures audio from the microphone
- `process_audio`: Main function that processes audio chunks and generates transcriptions
- `get_logmel`: Converts audio to log-mel spectrograms for the model
- `colored`: Applies terminal colors for better visualization
- `display_result`: Formats and displays transcription results
- `get_timestamp_display`: Formats timestamps in human-readable format
- `save_transcriptions`: Saves transcriptions in various formats
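For illustration, two of these helpers could look roughly like this (plausible implementations, not the script's exact code):

```python
# Plausible implementations of two helpers above; not the script's exact code.
def get_timestamp_display(seconds: float) -> str:
    """Format a time offset as HH:MM:SS.mmm."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def colored(text: str, color_code: int) -> str:
    """Wrap text in an ANSI escape sequence (e.g. 32 = green, 36 = cyan)."""
    return f"\033[{color_code}m{text}\033[0m"

print(colored(f"[{get_timestamp_display(83.417)}] hello", 32))  # [00:01:23.417] hello
```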
If the script doesn't detect any audio:
- Check your microphone is working and properly connected
- Use `--list-devices` to verify your audio device is detected
- Try selecting a specific device with `--device`
If transcription quality is poor:
- Ensure you're in a quiet environment
- Speak clearly and at a normal pace
- Try adjusting chunk size parameters for your speaking style
If the script crashes or hangs:
- Make sure you have sufficient memory available
- Try running with shorter chunk durations (`--chunk-duration 10`)
- Update to the latest versions of the dependencies
The `enhanced_transcription.py` script can be integrated with other tools:
- Video subtitling: Use the SRT output with video editing software
- Speech analysis: Use the JSON output for analyzing speech patterns
- Automated documentation: Pipe the TXT output to a documentation generator
The repository includes a small utility (`overlay_gif.py`) that demonstrates how to place an animated GIF on top of an MP4 file. The script can trim the input video to a specific range and control when the GIF appears and disappears. It relies on the `moviepy` library.
If you haven't installed `moviepy`, you can do so with:
pip install moviepy
Basic usage:
python overlay_gif.py --video input.mp4 --gif anim.gif \
--gif-start 5 --position center --output output.mp4
This inserts `anim.gif` in the center of `input.mp4`, starting five seconds into the video, and saves the result as `output.mp4`.
You can also trim the source video and specify when the GIF disappears:
python overlay_gif.py --video input.mp4 --gif anim.gif \
--clip-start 10 --clip-end 20 \
--gif-start 2 --gif-end 8 --position "100,200" \
--output clipped.mp4
This cuts `input.mp4` to the 10–20 second range and overlays `anim.gif` at coordinates (100, 200) from two seconds into the clip until the eight-second mark.
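Internally, this kind of composition takes only a few lines of moviepy. The sketch below mirrors the example above rather than reproducing `overlay_gif.py` verbatim, and is written against the moviepy 1.x API (`moviepy.editor`, `set_start`/`set_position`):

```python
# Hedged sketch of the clipped-overlay example above, using the moviepy 1.x API.
from moviepy.editor import VideoFileClip, CompositeVideoClip

video = VideoFileClip("input.mp4").subclip(10, 20)   # --clip-start 10 --clip-end 20
gif = (VideoFileClip("anim.gif")
       .set_start(2)                                  # --gif-start 2 (relative to the trimmed clip)
       .set_position((100, 200)))                     # --position "100,200"
gif = gif.set_end(min(8, video.duration))             # --gif-end 8

CompositeVideoClip([video, gif]).write_videofile("clipped.mp4")
```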
The script `overlay_framesvg.py` demonstrates how to take an animated GIF from a URL, convert it to an SVG using the FrameSVG library, and then composite that animation over a video. The SVG is rendered frame by frame so it can be placed with `moviepy` just like a normal image sequence.
Required dependencies can be installed with:
pip install moviepy framesvg cairosvg
Example usage:
python overlay_framesvg.py --video input.mp4 \
--gif-url https://example.com/anim.gif \
--gif-start 1 --position center --output output.mp4
This downloads the GIF, converts it to an SVG and overlays the animation in the center of the video starting one second into the clip.
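The compositing half of that idea can be sketched as follows. The FrameSVG conversion step is omitted; the single hard-coded SVG frame stands in for the converted animation, and the file names, fps, and timings are illustrative rather than taken from `overlay_framesvg.py`:

```python
# Hedged sketch: rasterize SVG frames with cairosvg and overlay them with moviepy (1.x API).
# The GIF-to-SVG conversion done by FrameSVG is not shown; the frame below is a placeholder.
import cairosvg
from moviepy.editor import CompositeVideoClip, ImageSequenceClip, VideoFileClip

svg_frames = [
    '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
    '<circle cx="32" cy="32" r="16" fill="red"/></svg>',
]

png_paths = []
for i, svg in enumerate(svg_frames):
    path = f"frame_{i:04d}.png"
    cairosvg.svg2png(bytestring=svg.encode(), write_to=path)   # render one SVG frame to PNG
    png_paths.append(path)

video = VideoFileClip("input.mp4")
overlay = (ImageSequenceClip(png_paths, fps=12)
           .set_start(1)              # --gif-start 1
           .set_position("center"))   # --position center
CompositeVideoClip([video, overlay]).write_videofile("output.mp4")
```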
Another utility in this repository, `overlay_text_bubble.py`, uses the `drawsvg` library to render simple speech bubbles and place them over an MP4 file with `moviepy`.
If you haven’t installed these dependencies, do so with:
pip install moviepy drawsvg
Basic usage:
python overlay_text_bubble.py --video input.mp4 --text "Hello!" \
--start 3 --end 8 --position center --output output.mp4
This shows a speech bubble containing Hello! between the third and eighth second of the video.
You can also customise the bubble size and position:
python overlay_text_bubble.py --video input.mp4 --text "Look" \
--bubble-width 400 --bubble-height 120 --position "50,200" \
--output custom.mp4
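For reference, a minimal bubble of this kind could be drawn with drawsvg roughly as follows (drawsvg 2.x API; the actual script's drawing code and dimensions may differ):

```python
# Hedged sketch of a speech bubble with drawsvg 2.x; not overlay_text_bubble.py's exact code.
import drawsvg as dw

def make_bubble(text: str, width: int = 400, height: int = 120) -> dw.Drawing:
    d = dw.Drawing(width, height)
    # Rounded rectangle body of the bubble.
    d.append(dw.Rectangle(5, 5, width - 10, height - 40, rx=20, ry=20,
                          fill="white", stroke="black", stroke_width=3))
    # Triangular tail pointing down toward the speaker.
    d.append(dw.Lines(60, height - 36, 40, height - 5, 100, height - 36,
                      close=True, fill="white", stroke="black", stroke_width=3))
    d.append(dw.Text(text, 28, width / 2, (height - 30) / 2,
                     center=True, fill="black"))
    return d

make_bubble("Hello!").save_svg("bubble.svg")  # rasterize or hand to moviepy afterwards
```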
You can use different Parakeet models with the `--model` parameter:
# Use a different model from Hugging Face
python enhanced_transcription.py --model "mlx-community/parakeet-rnnt-1.1b"
This project is available under the MIT License. See the LICENSE file for more details.
- Parakeet MLX for the excellent speech recognition model
- MLX for the machine learning framework optimized for Apple Silicon
- Sounddevice for audio capture functionality