A real-time speech transcription system using Parakeet MLX for Apple Silicon. This tool captures audio from your microphone and provides real-time transcription with word-level timestamps, continuous chunking, and multi-format export capabilities.
- Real-time transcription of microphone input
- Word-level timestamps to know exactly when each word was spoken
- Continuous chunking for longer recordings with overlapping context
- Multi-format export (TXT, SRT subtitles, and JSON)
- Colorized output for better visualization
- Device selection for systems with multiple microphones
- Python 3.9 or higher
- macOS with Apple Silicon (M1, M2, M3, etc.)
- A working microphone
uv is a fast Python package installer and resolver. To install the required dependencies with uv:
# Install uv if you haven't already
curl -fsSL https://astral.sh/uv/install.sh | bash
# Create and activate a new environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install mlx parakeet-mlx sounddevice numpy
If you prefer using pip instead:
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install mlx parakeet-mlx sounddevice numpy
To start transcription with default settings:
python enhanced_transcription.py
To see all available audio input devices on your system:
python enhanced_transcription.py --list-devices
To use a specific audio input device:
python enhanced_transcription.py --device "Device Name"
# Or use the device number
python enhanced_transcription.py --device 1
To adjust how audio is processed in chunks:
# Disable chunking (process in small segments only)
python enhanced_transcription.py --no-chunking
# Customize chunk duration and overlap
python enhanced_transcription.py --chunk-duration 15 --overlap-duration 3
Choose which output formats to save:
# Save only as SRT subtitle file
python enhanced_transcription.py --output-format srt
# Save in multiple formats
python enhanced_transcription.py --output-format txt,json
# Save in all formats (default)
python enhanced_transcription.py --output-format all
| Option | Description | Default |
|---|---|---|
| `--device` | Audio input device name or index | System default |
| `--list-devices` | List all available audio devices and exit | - |
| `--model` | Parakeet model to use | `mlx-community/parakeet-tdt-0.6b-v2` |
| `--no-chunking` | Disable chunking for continuous transcription | False |
| `--chunk-duration` | Duration of each chunk in seconds | 20.0 |
| `--overlap-duration` | Overlap between chunks in seconds | 4.0 |
| `--output-dir` | Directory to save transcriptions | `transcriptions` |
| `--output-format` | Output format (txt/srt/json/all) | all |
The enhanced transcription system works through several coordinated components:
The script captures audio from your selected microphone using the `sounddevice` library. Audio is captured in a non-blocking way and stored in a thread-safe queue for processing.
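A minimal sketch of that capture loop, assuming 16 kHz mono float32 input (the sample rate, capture length, and variable names are illustrative, not the script's exact code):

```python
# Hedged sketch: non-blocking capture with sounddevice feeding a thread-safe queue.
import queue
import numpy as np
import sounddevice as sd

audio_queue: "queue.Queue[np.ndarray]" = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    """Runs on sounddevice's audio thread; just hand the samples to the queue."""
    if status:
        print(status)
    audio_queue.put(indata.copy())

with sd.InputStream(samplerate=16_000, channels=1, dtype="float32",
                    callback=audio_callback):
    sd.sleep(2000)  # capture for ~2 seconds; a processing thread would drain the queue

blocks = []
while not audio_queue.empty():
    blocks.append(audio_queue.get())
captured = np.concatenate(blocks, axis=0)
print(f"captured {len(captured) / 16_000:.2f} s of audio")
```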
The captured audio goes through a processing pipeline:
- Preprocessing: Audio is normalized and converted to the format expected by the model
- Feature Extraction: Audio is converted to log-mel spectrograms using Parakeet's preprocessing
- Transcription: The Parakeet model generates transcription with timestamps
- Post-processing: Results are formatted and displayed/saved
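For reference, the same pipeline can be exercised offline through parakeet-mlx's high-level API. The sketch below is based on the library's published examples; the file name is hypothetical and attribute names (`result.text`, `sentence.start`/`sentence.end`) may differ between versions:

```python
# Hedged offline sketch of the pipeline above via parakeet-mlx's high-level API.
# The real script feeds microphone chunks instead of a file; attribute names are assumptions.
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("sample.wav")   # preprocessing, log-mel features, decoding
print(result.text)                        # full transcription
for sentence in result.sentences:         # aligned segments with timestamps
    print(f"[{sentence.start:6.2f}s - {sentence.end:6.2f}s] {sentence.text}")
```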
For longer recordings, the script uses a chunking strategy:
- Audio is processed in overlapping chunks (default: 20 seconds with 4 second overlap)
- This allows for continuous transcription while maintaining context between chunks
- The overlap helps prevent words from being cut off at chunk boundaries
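The chunk schedule itself is simple arithmetic. A sketch, assuming 16 kHz audio and hypothetical variable names rather than the script's actual code:

```python
# Hedged sketch of the overlapping-chunk schedule; 16 kHz and the names are assumptions.
import numpy as np

SAMPLE_RATE = 16_000
chunk_duration = 20.0     # --chunk-duration
overlap_duration = 4.0    # --overlap-duration

chunk_samples = int(chunk_duration * SAMPLE_RATE)
step_samples = int((chunk_duration - overlap_duration) * SAMPLE_RATE)

def iter_chunks(audio: np.ndarray):
    """Yield (start_sample, chunk) pairs sharing `overlap_duration` seconds of audio."""
    start = 0
    while start < len(audio):
        yield start, audio[start:start + chunk_samples]
        if start + chunk_samples >= len(audio):
            break          # last (possibly shorter) chunk reached the end
        start += step_samples

# Example: 60 s of audio -> chunks starting at 0 s, 16 s, 32 s, and 48 s (last one shorter).
for start, chunk in iter_chunks(np.zeros(60 * SAMPLE_RATE, dtype=np.float32)):
    print(f"chunk at {start / SAMPLE_RATE:.0f} s, {len(chunk) / SAMPLE_RATE:.0f} s long")
```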
The script provides:
- Real-time visualization of transcriptions with word-level timestamps
- Progress indicators showing the Real-Time Factor (RTF), i.e. processing time divided by audio duration (an RTF of 0.5 means 10 seconds of audio are transcribed in about 5 seconds)
- Export capabilities in multiple formats
Tracks the current state of the transcription process, including:
- Latest transcribed text
- Current audio chunk being processed
- Statistics about chunks processed
- Recording start time and duration
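A dataclass along these lines would capture that state; the field names here are hypothetical, not the script's actual attributes:

```python
# Hypothetical shape of the transcription state described above; field names are assumptions.
import time
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class TranscriptionState:
    latest_text: str = ""                          # latest transcribed text
    current_chunk: Optional[np.ndarray] = None     # audio chunk currently being processed
    chunks_processed: int = 0                      # statistics about chunks processed
    start_time: float = field(default_factory=time.time)

    @property
    def elapsed(self) -> float:
        """Recording duration in seconds since the session started."""
        return time.time() - self.start_time
```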
- `audio_callback`: Captures audio from the microphone
- `process_audio`: Main function that processes audio chunks and generates transcriptions
- `get_logmel`: Converts audio to log-mel spectrograms for the model
- `colored`: Applies terminal colors for better visualization
- `display_result`: Formats and displays transcription results
- `get_timestamp_display`: Formats timestamps in human-readable format
- `save_transcriptions`: Saves transcriptions in various formats
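For illustration, two of these helpers could look roughly like this (plausible implementations, not the script's exact code):

```python
# Plausible implementations of two helpers above; not the script's exact code.
def get_timestamp_display(seconds: float) -> str:
    """Format a time offset as HH:MM:SS.mmm."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def colored(text: str, color_code: int) -> str:
    """Wrap text in an ANSI escape sequence (e.g. 32 = green, 36 = cyan)."""
    return f"\033[{color_code}m{text}\033[0m"

print(colored(f"[{get_timestamp_display(83.417)}] hello", 32))  # [00:01:23.417] hello
```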
If the script doesn't detect any audio:
- Check your microphone is working and properly connected
- Use `--list-devices` to verify your audio device is detected
- Try selecting a specific device with `--device`
If transcription quality is poor:
- Ensure you're in a quiet environment
- Speak clearly and at a normal pace
- Try adjusting chunk size parameters for your speaking style
If the script crashes or hangs:
- Make sure you have sufficient memory available
- Try running with shorter chunk durations (`--chunk-duration 10`)
- Update to the latest versions of the dependencies
The `enhanced_transcription.py` script can be integrated with other tools:
- Video subtitling: Use the SRT output with video editing software
- Speech analysis: Use the JSON output for analyzing speech patterns
- Automated documentation: Pipe the TXT output to a documentation generator
The repository includes a small utility (`overlay_gif.py`) that demonstrates how to place an animated GIF on top of an MP4 file. The script can trim the input video to a specific range and control when the GIF appears and disappears. It relies on the `moviepy` library.
If you haven't installed `moviepy`, you can do so with:
pip install moviepy
Basic usage:
python overlay_gif.py --video input.mp4 --gif anim.gif \
--gif-start 5 --position center --output output.mp4
This inserts `anim.gif` in the center of `input.mp4`, starting five seconds into the video, and saves the result as `output.mp4`.
You can also trim the source video and specify when the GIF disappears:
python overlay_gif.py --video input.mp4 --gif anim.gif \
--clip-start 10 --clip-end 20 \
--gif-start 2 --gif-end 8 --position "100,200" \
--output clipped.mp4
This cuts `input.mp4` to the 10–20 second range and overlays `anim.gif` at coordinates (100, 200) from two seconds into the clip until the eight-second mark.
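Internally, this kind of composition takes only a few lines of moviepy. The sketch below mirrors the example above rather than reproducing `overlay_gif.py` verbatim, and is written against the moviepy 1.x API (`moviepy.editor`, `set_start`/`set_position`):

```python
# Hedged sketch of the clipped-overlay example above, using the moviepy 1.x API.
from moviepy.editor import VideoFileClip, CompositeVideoClip

video = VideoFileClip("input.mp4").subclip(10, 20)   # --clip-start 10 --clip-end 20
gif = (VideoFileClip("anim.gif")
       .set_start(2)                                  # --gif-start 2 (relative to the trimmed clip)
       .set_position((100, 200)))                     # --position "100,200"
gif = gif.set_end(min(8, video.duration))             # --gif-end 8

CompositeVideoClip([video, gif]).write_videofile("clipped.mp4")
```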
The script `overlay_framesvg.py` demonstrates how to take an animated GIF from a URL, convert it to an SVG using the FrameSVG library, and then composite that animation over a video. The SVG is rendered frame by frame so it can be placed with `moviepy` just like a normal image sequence.
Required dependencies can be installed with:
pip install moviepy framesvg cairosvg
Example usage:
python overlay_framesvg.py --video input.mp4 \
--gif-url https://example.com/anim.gif \
--gif-start 1 --position center --output output.mp4
This downloads the GIF, converts it to an SVG and overlays the animation in the center of the video starting one second into the clip.
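The compositing half of that idea can be sketched as follows. The FrameSVG conversion step is omitted; the single hard-coded SVG frame stands in for the converted animation, and the file names, fps, and timings are illustrative rather than taken from `overlay_framesvg.py`:

```python
# Hedged sketch: rasterize SVG frames with cairosvg and overlay them with moviepy (1.x API).
# The GIF-to-SVG conversion done by FrameSVG is not shown; the frame below is a placeholder.
import cairosvg
from moviepy.editor import CompositeVideoClip, ImageSequenceClip, VideoFileClip

svg_frames = [
    '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
    '<circle cx="32" cy="32" r="16" fill="red"/></svg>',
]

png_paths = []
for i, svg in enumerate(svg_frames):
    path = f"frame_{i:04d}.png"
    cairosvg.svg2png(bytestring=svg.encode(), write_to=path)   # render one SVG frame to PNG
    png_paths.append(path)

video = VideoFileClip("input.mp4")
overlay = (ImageSequenceClip(png_paths, fps=12)
           .set_start(1)              # --gif-start 1
           .set_position("center"))   # --position center
CompositeVideoClip([video, overlay]).write_videofile("output.mp4")
```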
Another utility in this repository, `overlay_text_bubble.py`, uses the `drawsvg` library to render simple speech bubbles and place them over an MP4 file with `moviepy`.
If you haven’t installed these dependencies, do so with:
pip install moviepy drawsvg
Basic usage:
python overlay_text_bubble.py --video input.mp4 --text "Hello!" \
--start 3 --end 8 --position center --output output.mp4
This shows a speech bubble containing Hello! between the third and eighth second of the video.
You can also customise the bubble size and position:
python overlay_text_bubble.py --video input.mp4 --text "Look" \
--bubble-width 400 --bubble-height 120 --position "50,200" \
--output custom.mp4
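For reference, a minimal bubble of this kind could be drawn with drawsvg roughly as follows (drawsvg 2.x API; the actual script's drawing code and dimensions may differ):

```python
# Hedged sketch of a speech bubble with drawsvg 2.x; not overlay_text_bubble.py's exact code.
import drawsvg as dw

def make_bubble(text: str, width: int = 400, height: int = 120) -> dw.Drawing:
    d = dw.Drawing(width, height)
    # Rounded rectangle body of the bubble.
    d.append(dw.Rectangle(5, 5, width - 10, height - 40, rx=20, ry=20,
                          fill="white", stroke="black", stroke_width=3))
    # Triangular tail pointing down toward the speaker.
    d.append(dw.Lines(60, height - 36, 40, height - 5, 100, height - 36,
                      close=True, fill="white", stroke="black", stroke_width=3))
    d.append(dw.Text(text, 28, width / 2, (height - 30) / 2,
                     center=True, fill="black"))
    return d

make_bubble("Hello!").save_svg("bubble.svg")  # rasterize or hand to moviepy afterwards
```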
You can use different Parakeet models with the `--model` parameter:
# Use a different model from Hugging Face
python enhanced_transcription.py --model "mlx-community/parakeet-rnnt-1.1b"
This project is available under the MIT License. See the LICENSE file for more details.
- Parakeet MLX for the excellent speech recognition model
- MLX for the machine learning framework optimized for Apple Silicon
- Sounddevice for audio capture functionality