The official Python SDK for the Hathora AI API. Easily integrate speech-to-text (STT), text-to-speech (TTS), and large language models (LLM) into your Python applications.
- Simple, intuitive API - Clean, Pythonic interface
- Multiple AI models:
  - TTS: Kokoro-82M and ResembleAI Chatterbox
  - STT: Parakeet multilingual transcription
  - LLM: Qwen3-30B for chat completions
- Model-specific parameters - Each model has its own unique parameters with validation
- Voice cloning with ResembleAI's audio prompt feature
- Flexible audio handling - Works with file paths, file objects, or raw bytes
- Chat completions with message history and temperature control
- Type hints for better IDE support
- Comprehensive error handling
| Model | Parameters | Description |
|---|---|---|
| Parakeet | `file` | Audio file to transcribe (required, positional) |
| | `start_time` | Start time in seconds for transcription window (optional) |
| | `end_time` | End time in seconds for transcription window (optional) |

Example:

# Basic usage
client.speech_to_text.convert("parakeet", "audio.wav")

# With time window
client.speech_to_text.convert("parakeet", "audio.wav", start_time=3.0, end_time=9.0)

| Model | Parameters | Description |
|---|---|---|
| Kokoro | `voice` | Voice ID (default: "af_bella") |
| | `speed` | Speech speed multiplier: 0.5-2.0 (default: 1.0) |
| ResembleAI | `audio_prompt` | Reference audio file for voice cloning (optional) |
| | `exaggeration` | Emotional intensity: 0.0-1.0 (default: 0.5) |
| | `cfg_weight` | Adherence to reference voice: 0.0-1.0 (default: 0.5) |
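For example, both TTS models are called through the same `convert()` interface with their model-specific parameters. This sketch assumes a `client` created with `hathora.Hathora(api_key=...)` as shown later in this README; filenames are illustrative:

```python
# Kokoro: choose a voice and speaking speed
response = client.text_to_speech.convert("kokoro", "Hello world!", voice="af_bella", speed=1.2)
response.save("kokoro.wav")

# ResembleAI: tune expressiveness and adherence to a reference voice
response = client.text_to_speech.convert("resemble", "Hello world!", exaggeration=0.7, cfg_weight=0.5)
response.save("resemble.wav")
```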
Install from PyPI:
pip install hathora

Or install from source:
git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .

import hathora
# Initialize the client
client = hathora.Hathora(api_key="your-api-key")
# Transcribe audio to text
transcription = client.speech_to_text.convert("parakeet", "audio.wav")
print(transcription.text)
# Generate speech from text
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")

You can provide your API key in two ways:
Pass it directly:

client = hathora.Hathora(api_key="your-api-key")

Or set the HATHORA_API_KEY environment variable:

export HATHORA_API_KEY="your-api-key"

client = hathora.Hathora()  # Will use HATHORA_API_KEY from environment

The SDK uses the Parakeet multilingual STT model for transcription.
import hathora
client = hathora.Hathora(api_key="your-api-key")
# Transcribe an entire audio file using Parakeet
response = client.speech_to_text.convert("parakeet", "audio.wav")
print(response.text)

# Transcribe only a specific time range
response = client.speech_to_text.convert(
"parakeet", # Model (positional)
"audio.wav", # File (positional)
start_time=3.0, # Start at 3 seconds
end_time=9.0 # End at 9 seconds
)
print(response.text)

The SDK automatically handles various audio formats:
# From file path (string)
response = client.speech_to_text.convert("parakeet", "audio.wav")
# From pathlib.Path
from pathlib import Path
response = client.speech_to_text.convert("parakeet", Path("audio.mp3"))
# From file object
with open("audio.wav", "rb") as f:
    response = client.speech_to_text.convert("parakeet", f)
# From bytes
audio_bytes = open("audio.wav", "rb").read()
response = client.speech_to_text.convert("parakeet", audio_bytes)

Kokoro parameters: voice, speed
import hathora
client = hathora.Hathora(api_key="your-api-key")
# Simple synthesis (uses defaults)
response = client.text_to_speech.convert(
"kokoro", # Model first
"Hello world!"
)
response.save("output.wav")
# With custom voice and speed
response = client.text_to_speech.convert(
"kokoro", # Model first
"The quick brown fox jumps over the lazy dog.",
voice="af_bella", # Kokoro parameter
speed=1.2 # Kokoro parameter - 20% faster
)
response.save("output_fast.wav")
# Or use the kokoro() method directly
response = client.text_to_speech.kokoro(
text="Direct method call",
voice="af_bella",
speed=0.8 # 20% slower
)
response.save("output_slow.wav")

ResembleAI parameters: audio_prompt, exaggeration, cfg_weight
# Simple generation
response = client.text_to_speech.convert(
"resemble", # Model first
"Hello world!",
exaggeration=0.5, # Emotional intensity (0.0 - 1.0)
cfg_weight=0.5 # Adherence to reference voice (0.0 - 1.0)
)
response.save("output.wav")
# Voice cloning with audio prompt
response = client.text_to_speech.convert(
"resemble", # Model first
"This should sound like the reference voice.",
audio_prompt="reference_voice.wav", # Reference audio for cloning
cfg_weight=0.9 # High adherence to reference
)
response.save("cloned_voice.wav")
# Highly expressive speech
response = client.text_to_speech.convert(
"resemble", # Model first
"Wow! This is amazing!",
exaggeration=0.9, # High emotional intensity
cfg_weight=0.5
)
response.save("expressive.wav")
# Or use the resemble() method directly
response = client.text_to_speech.resemble(
text="Direct method call",
audio_prompt="reference.wav",
exaggeration=0.7,
cfg_weight=0.8
)
response.save("output.wav")

The SDK provides methods to discover what parameters are available for each TTS model:
# Parakeet (STT) parameters
# Model: parakeet
# Parameters:
# - file (required): Audio file to transcribe
# - start_time (optional): Start time in seconds
# - end_time (optional): End time in seconds
client.speech_to_text.convert("parakeet", "audio.wav", start_time=0, end_time=10)
# List all available TTS models
models = client.text_to_speech.list_models()
print(models) # ['kokoro', 'resemble']
# Print help for a specific TTS model
client.text_to_speech.print_model_help("kokoro")
# Output:
# Model: kokoro
# Parameters:
# - voice (str, default='af_bella'): Voice to use for synthesis
# - speed (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)
client.text_to_speech.print_model_help("resemble")
# Output:
# Model: resemble
# Parameters:
# - audio_prompt (AudioFile, default=None): Reference audio file for voice cloning (optional)
# - exaggeration (float, default=0.5): Emotional intensity, range 0.0-1.0
# - cfg_weight (float, default=0.5): Adherence to reference voice, range 0.0-1.0
# Get parameter specifications programmatically
params = client.text_to_speech.get_model_parameters("kokoro")
for param_name, param_info in params.items():
    print(f"{param_name}: {param_info['description']}")

The SDK validates that you're using the correct parameters for each model:
# This works - correct Kokoro parameters
response = client.text_to_speech.convert(
"kokoro", "Hello", voice="af_bella", speed=1.2
)
# This raises ValidationError with helpful message
try:
    response = client.text_to_speech.convert(
        "resemble", "Hello", speed=1.2  # ERROR!
    )
except ValidationError as e:
    print(e)
    # Output: Unknown parameters for ResembleAI model: speed.
    # Valid parameters: audio_prompt, exaggeration, cfg_weight
    # Use client.text_to_speech.print_model_help('resemble') for more details.
# This also raises ValidationError
response = client.text_to_speech.convert(
"kokoro", "Hello", exaggeration=0.5 # ERROR!
)

# Save to file
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")
# Or use stream_to_file (alias for save)
response.stream_to_file("output.wav")
# Get raw bytes
audio_bytes = response.content
print(f"Generated {len(audio_bytes)} bytes")
# Check content type
print(response.content_type)  # e.g., "audio/wav"

The SDK supports chat completions with Qwen and other LLMs.
import hathora
client = hathora.Hathora(api_key="your-api-key")
# Configure your LLM endpoint
client.llm.set_endpoint("https://your-app.app.hathora.dev")

# Simple question
response = client.llm.chat("qwen", "What is Python?")
print(response.content)

# Conversation with context
messages = [
{"role": "user", "content": "Hello! Can you help me with programming?"},
{"role": "assistant", "content": "Of course! I'd be happy to help."},
{"role": "user", "content": "What's the difference between a list and tuple?"}
]
response = client.llm.chat(
"qwen",
messages,
max_tokens=500,
temperature=0.7
)
print(response.content)

# Creative output (higher temperature)
response = client.llm.chat(
"qwen",
"Write a poem about AI",
temperature=0.9,
max_tokens=200
)
# Precise output (lower temperature)
response = client.llm.chat(
"qwen",
"Calculate 15 * 23",
temperature=0.1,
max_tokens=50
)

from hathora.resources.llm import ChatMessage
conversation = [
ChatMessage("system", "You are a helpful coding assistant."),
ChatMessage("user", "How do I read a file in Python?")
]
response = client.llm.chat("qwen", conversation)
print(response.content)

response = client.llm.chat("qwen", "Explain machine learning")
# Get the response text
print(response.content)
# Get the full message object
print(response.message)
# Get token usage info
print(response.usage)
# Get the model used
print(response.model)
# Get raw response data
print(response.raw)

# List all LLM models
models = client.llm.list_models()
print(models) # ['qwen']
# Get model info
info = client.llm.get_model_info("qwen")
print(info)
# Print model help
client.llm.print_model_help("qwen")

Main client class for the Hathora API.
Parameters:
- `api_key` (str, optional): Your Hathora API key
- `timeout` (int, default=30): Request timeout in seconds
Properties:
- `speech_to_text`: Speech-to-text (STT) resource for audio transcription
- `text_to_speech`: Text-to-speech (TTS) resource for audio synthesis
- `llm`: Large language model (LLM) resource for chat completions
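For instance, a client with a longer request timeout (a minimal sketch using only the parameters listed above; 60 is an arbitrary value):

```python
import hathora

# Raise the default 30-second timeout for long-running audio requests
client = hathora.Hathora(api_key="your-api-key", timeout=60)
```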
Transcribe audio to text using the Parakeet STT model.
Parameters:
- `model` (str): STT model to use (currently: "parakeet") - positional, required
- `file` (str | Path | BinaryIO | bytes): Audio file to transcribe - positional, required
- `start_time` (float, optional): Start time in seconds for transcription window
- `end_time` (float, optional): End time in seconds for transcription window
- `**kwargs`: Additional model-specific parameters (reserved for future use)
Example:
# Both model and file are positional
response = client.speech_to_text.convert("parakeet", "audio.wav")

Available Models:

- `"parakeet"` - nvidia/parakeet-tdt-0.6b-v3 - Multilingual ASR with word-level timestamps
Returns: TranscriptionResponse
- `.text`: The transcribed text
- `.metadata`: Additional metadata from the API (may include word-level timestamps)
Supported audio formats: WAV, MP3, MP4, M4A, OGG, FLAC, PCM
Generate speech from text. This is a unified interface that routes to the appropriate model.
Parameters:
- `model` (str): Model to use ("kokoro" or "resemble") - required, first parameter
- `text` (str): Text to convert to speech
- `**kwargs`: Model-specific parameters (see below)
Model-Specific Parameters:
For Kokoro model:
- `voice` (str, default="af_bella"): Voice to use for synthesis
- `speed` (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)
For ResembleAI model:
- `audio_prompt` (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
- `exaggeration` (float, default=0.5): Emotional intensity, range 0.0-1.0
- `cfg_weight` (float, default=0.5): Adherence to reference voice, range 0.0-1.0
Returns: AudioResponse
Examples:
# Kokoro - model comes first!
response = client.text_to_speech.convert(
"kokoro", "Hello", voice="af_bella", speed=1.2
)
# ResembleAI - model comes first!
response = client.text_to_speech.convert(
"resemble", "Hello", exaggeration=0.7, cfg_weight=0.6
)

See also: Use `print_model_help()` to discover parameters.
List all available TTS models.
Returns: list - List of model names
Example:
models = client.text_to_speech.list_models()
print(models)  # ['kokoro', 'resemble']

Get parameter specifications for a specific model.
Parameters:
model(str): Model name
Returns: dict - Parameter specifications with types, defaults, and descriptions
Example:
params = client.text_to_speech.get_model_parameters("kokoro")
for name, info in params.items():
    print(f"{name}: {info['description']}")

Print helpful information about a model's parameters to the console.
Parameters:
model(str): Model name
Example:
client.text_to_speech.print_model_help("kokoro")
# Prints:
# Model: kokoro
# Parameters:
# - voice (str, default='af_bella'): Voice to use for synthesis
# - speed (float, default=1.0): Speech speed multiplier...

Generate speech using the Kokoro-82M model.
Parameters:
- `text` (str): Text to convert to speech
- `voice` (str, default="af_bella"): Voice to use
- `speed` (float, default=1.0): Speech speed multiplier
Returns: AudioResponse
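A short example call using only the parameters listed above (assumes a configured `client`; the output filename is arbitrary):

```python
response = client.text_to_speech.kokoro(
    text="Hello from Kokoro",
    voice="af_bella",
    speed=1.1,
)
response.save("kokoro_direct.wav")
```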
Generate speech using ResembleAI Chatterbox with voice cloning.
Parameters:
- `text` (str): Text to convert to speech
- `audio_prompt` (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
- `exaggeration` (float, default=0.5): Emotional intensity (0.0 - 1.0)
- `cfg_weight` (float, default=0.5): Adherence to reference voice (0.0 - 1.0)
Returns: AudioResponse
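A short example with an optional reference clip for voice cloning (assumes a configured `client` and a local reference.wav; filenames are illustrative):

```python
response = client.text_to_speech.resemble(
    text="Read this in the reference speaker's voice",
    audio_prompt="reference.wav",  # optional; omit to use the default voice
    exaggeration=0.5,
    cfg_weight=0.7,
)
response.save("resemble_direct.wav")
```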
Response object containing generated audio.
Properties:
- `content`: Raw audio bytes
- `content_type`: MIME type of the audio
Methods:
- `save(file_path)`: Save audio to file
- `stream_to_file(file_path)`: Alias for `save()`
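A small usage sketch combining these properties and methods (assumes a configured `client`; filenames are illustrative):

```python
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("hello.wav")            # equivalent to response.stream_to_file("hello.wav")
print(response.content_type)          # e.g., "audio/wav"
print(len(response.content), "bytes") # raw audio bytes are available directly
```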
Response object containing transcribed text.
Properties:
- `text`: The transcribed text
- `metadata`: Additional metadata
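Typical usage (the exact contents of `metadata` depend on the model and API response):

```python
response = client.speech_to_text.convert("parakeet", "audio.wav")
print(response.text)      # transcribed text
print(response.metadata)  # extra info, e.g., word-level timestamps when provided
```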
import hathora
# Initialize client
client = hathora.Hathora(api_key="your-api-key")
# 1. Transcribe audio
transcription = client.speech_to_text.convert(
"parakeet", # Model (positional)
"original.wav", # File (positional)
start_time=0,
end_time=10
)
print(f"Original: {transcription.text}")
# 2. Modify the text
modified_text = transcription.text.upper()
# 3. Generate new speech with Kokoro
response = client.text_to_speech.convert(
"kokoro", modified_text, voice="af_bella", speed=1.0
)
response.save("output_kokoro.wav")
# 4. Clone voice from original audio
cloned = client.text_to_speech.convert(
"resemble", "New text in the original voice",
audio_prompt="original.wav", cfg_weight=0.9
)
cloned.save("cloned_voice.wav")

The SDK provides specific exception types for different error scenarios:
from hathora import HathoraError, APIError, AuthenticationError, ValidationError

try:
    response = client.text_to_speech.convert("kokoro", "Hello world!")
    response.save("output.wav")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except ValidationError as e:
    print(f"Invalid parameters: {e}")
except APIError as e:
    print(f"API error (status {e.status_code}): {e.message}")
except HathoraError as e:
    print(f"Hathora SDK error: {e}")

Supported input audio formats:

- WAV (.wav)
- MP3 (.mp3)
- MP4 Audio (.mp4, .m4a)
- OGG (.ogg)
- FLAC (.flac)
- PCM (.pcm)
- WAV (default output format)
cd examples
python discover_parameters.py # Learn about model parameters
python transcribe_audio.py # Speech-to-text examples
python synthesize_speech.py # Text-to-speech examples
python voice_cloning.py # Voice cloning with ResembleAI
python model_parameters.py # Model-specific parameter examples
python full_workflow.py # Complete workflow

git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .

- Add streaming support for real-time TTS
- Support for additional TTS models
- Async client support
- Audio format conversion utilities
- Batch processing capabilities
- WebSocket support for real-time conversations
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.
For issues and questions:
- GitHub Issues: https://github.com/hathora/yapp-sdk/issues
- Documentation: https://docs.hathora.com
- Email: [email protected]