feat(openai): Add OpenAI Transcriptions support with comprehensive testing #8361

Open · wants to merge 5 commits into base: main

Conversation

christian-bromann (Contributor)

This PR adds complete OpenAI Transcriptions functionality to LangChain.js, with support for OpenAI's Whisper and GPT transcription models and advanced audio processing capabilities.

🎤 Key Features

Core transcription class (OpenAITranscriptions):

  • Support for Whisper (whisper-1) and GPT transcription models (gpt-4o-mini-transcribe, gpt-4o-transcribe)
  • Multiple audio format detection (MP3, WAV, FLAC, OGG, AAC, MP4, WEBM)
  • Automatic filename inference from audio signatures
  • ID3 tag stripping for MP3 files to ensure proper processing
  • Model-specific response format constraints with TypeScript enforcement
  • Request-level option overrides for fine-grained control
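As an illustration of the ID3-stripping step above, here is a minimal sketch (not the PR's actual implementation) that skips an ID3v2 tag by reading its synchsafe size field:

```typescript
// Minimal sketch, assuming the input is a raw MP3 byte array.
// An ID3v2 tag starts with the ASCII bytes "ID3"; bytes 6-9 encode the
// tag body size as a 28-bit "synchsafe" integer (7 data bits per byte).
function stripId3v2(bytes: Uint8Array): Uint8Array {
  const hasTag =
    bytes.length >= 10 &&
    bytes[0] === 0x49 && bytes[1] === 0x44 && bytes[2] === 0x33; // "ID3"
  if (!hasTag) return bytes;
  const tagSize =
    ((bytes[6] & 0x7f) << 21) |
    ((bytes[7] & 0x7f) << 14) |
    ((bytes[8] & 0x7f) << 7) |
    (bytes[9] & 0x7f);
  // Skip the 10-byte header plus the tag body.
  return bytes.subarray(10 + tagSize);
}
```

Stripping the tag up front matters because leading metadata can confuse decoders that expect the stream to begin at the first MP3 frame sync.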

Audio format detection:

  • Automatic MIME type detection from byte signatures
  • Support for Buffer, File, Uint8Array, and Blob inputs
  • Robust error handling for unknown formats
  • Smart filename inference when not provided
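Byte-signature detection can be sketched as follows (signatures shown are standard magic numbers, but the function name and error behavior are illustrative assumptions, not the PR's actual code):

```typescript
// Minimal sketch: map well-known magic bytes to MIME types.
function detectAudioMime(bytes: Uint8Array): string {
  const at = (offset: number, sig: number[]) =>
    sig.every((b, i) => bytes[offset + i] === b);
  if (at(0, [0x49, 0x44, 0x33])) return "audio/mpeg"; // "ID3" (tagged MP3)
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0)
    return "audio/mpeg"; // bare MP3 frame sync
  if (at(0, [0x52, 0x49, 0x46, 0x46])) return "audio/wav"; // "RIFF"
  if (at(0, [0x66, 0x4c, 0x61, 0x43])) return "audio/flac"; // "fLaC"
  if (at(0, [0x4f, 0x67, 0x67, 0x53])) return "audio/ogg"; // "OggS"
  if (at(4, [0x66, 0x74, 0x79, 0x70])) return "audio/mp4"; // "ftyp" box (MP4/M4A)
  throw new Error("Unrecognized audio format");
}
```

Throwing on unknown input is one reasonable choice for the "robust error handling" bullet; the detected MIME type can also drive the filename inference (e.g. `audio.mp3` for `audio/mpeg`).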

Type safety:

  • Smart TypeScript response types based on model and format selection
  • Model-specific configuration constraints (e.g., GPT models only support "text" and "json" formats)
  • Comprehensive type definitions for all supported formats
  • Critical type tests using Vitest's assertType to ensure compile-time type safety
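The "smart response types" idea can be sketched with a conditional type keyed on the response format (all names here are illustrative, not the package's actual exports):

```typescript
// Illustrative sketch: the return type narrows based on response_format.
type VerboseJson = { text: string; words: unknown[]; segments: unknown[] };
type ResponseFor<F extends "json" | "text" | "verbose_json"> =
  F extends "verbose_json" ? VerboseJson
  : F extends "json" ? { text: string }
  : string;

// Dummy implementation so the typing can be exercised; a real client
// would call the OpenAI API here. The double cast is needed because TS
// cannot resolve the conditional type inside the generic function body.
function fakeTranscribe<F extends "json" | "text" | "verbose_json">(
  format: F
): ResponseFor<F> {
  const byFormat = {
    verbose_json: { text: "", words: [], segments: [] },
    json: { text: "" },
    text: "",
  };
  return byFormat[format] as unknown as ResponseFor<F>;
}

// TypeScript knows `.words` and `.segments` exist only for "verbose_json".
const verbose = fakeTranscribe("verbose_json");
const plainText = fakeTranscribe("text");
```

Vitest's `assertType` can then verify at compile time that, say, `fakeTranscribe("text")` is a `string` and not an object.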

Comprehensive test suite:

  • Integration tests for format detection and error handling
  • Input validation and options testing
  • Model-specific behavior verification
  • Type-level tests to validate TypeScript constraints

Package integration:

  • Export transcription functionality from main package
  • Updated build configuration for new module structure
  • Complete documentation with practical usage examples

🔧 Testing Infrastructure Changes

Migrated from Jest to Vitest to properly mock the OpenAI library and support advanced type testing:

  • Enhanced mocking capabilities: Vitest provides superior mocking for the OpenAI SDK, allowing proper isolation of external dependencies during testing
  • Type testing support: Added .test-d.ts files using Vitest's assertType utility for compile-time type checking
  • Critical for this feature: the transcription functionality relies heavily on TypeScript's type system to enforce model-specific constraints (e.g., response format limitations), making type tests essential for ensuring correctness
  • Better async handling: Vitest's modern async/await support provides more reliable testing for the audio processing pipeline

📚 Implementation Details

The implementation follows LangChain patterns with proper serialization, secret management, and async calling support. Includes detailed JSDoc documentation with practical usage examples.

Example Usage:

import { OpenAITranscriptions } from "@langchain/openai";

// Basic transcription
const transcriber = new OpenAITranscriptions({
  model: "whisper-1",
  response_format: "verbose_json"
});

const result = await transcriber.transcribe({
  audio: audioBuffer,
  options: {
    language: "en",
    temperature: 0.2,
    timestamp_granularities: ["word", "segment"]
  }
});

// TypeScript knows the exact response type
console.log(result.text, result.words, result.segments);

The type system ensures that incompatible combinations (like using verbose_json with GPT models) are caught at compile time, preventing runtime errors and improving developer experience.
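One way to get that compile-time behavior, sketched here with illustrative names (the PR's real types may be structured differently), is a discriminated union of valid model/format pairs:

```typescript
// Illustrative sketch: invalid model/format pairs fail to type-check.
type WhisperFormat = "json" | "text" | "srt" | "verbose_json" | "vtt";
type GptFormat = "json" | "text";

type TranscriptionConfig =
  | { model: "whisper-1"; response_format?: WhisperFormat }
  | {
      model: "gpt-4o-transcribe" | "gpt-4o-mini-transcribe";
      response_format?: GptFormat;
    };

const ok: TranscriptionConfig = {
  model: "whisper-1",
  response_format: "verbose_json",
};

// @ts-expect-error -- verbose_json is not valid for the GPT transcription models
const bad: TranscriptionConfig = {
  model: "gpt-4o-transcribe",
  response_format: "verbose_json",
};
```

Whether the PR encodes the constraint this way or via lookup types is an implementation detail; the observable effect is the compile-time error on the invalid combination.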

…sting

Add complete OpenAI Transcriptions functionality including:

- **Core transcription class** (`OpenAITranscriptions`):
  - Support for Whisper and GPT transcription models
  - Multiple audio format detection (MP3, WAV, FLAC, OGG, AAC, MP4)
  - Automatic filename inference from audio signatures
  - ID3 tag stripping for MP3 files
  - Model-specific response format constraints
  - Request-level option overrides

- **Audio format detection**:
  - Automatic MIME type detection from byte signatures
  - Support for Buffer, File, Uint8Array, and Blob inputs
  - Robust error handling for unknown formats

- **Type safety**:
  - Smart TypeScript response types based on model and format
  - Model-specific configuration constraints
  - Comprehensive type definitions for all supported formats

- **Comprehensive test suite**:
  - Integration tests for format detection and error handling
  - Input validation and options testing
  - Model-specific behavior verification

- **Package integration**:
  - Export transcription functionality from main package
  - Updated build configuration for new module structure

The implementation follows LangChain patterns with proper serialization,
secret management, and async calling support. Includes detailed JSDoc
documentation with practical usage examples.

vercel bot commented Jun 13, 2025

The latest updates on your projects:

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| langchainjs-docs | ❌ Failed | Jun 13, 2025 6:55pm |
| langchainjs-api-refs | ⬜️ Ignored (deployment skipped) | Jun 13, 2025 6:55pm |

dosubot added labels on Jun 13, 2025: auto:enhancement (a large net-new component, integration, or chain; use sparingly) and size:XL (this PR changes 500-999 lines, ignoring generated files), replacing the initial size:L label.
@hntrl (Contributor) left a comment

Thanks for this @christian-bromann!

Supporting transcription in this way is an evergreen question, I'm thinking, and it might need a more involved evaluation to see if it's worth it to us. Part of the value sell with LangChain is optionality between models, so where we don't have a common abstraction for speech-to-text (like we do with chat models) we'd need to see if this is something we want to support long term. It would be one thing if we could surface traces from this class, but we typically do that in the lower abstraction layers (like where ChatOpenAI extends a base class). The "langchain specific" parameters that you have here (lc_secrets, lc_aliases) are specifically for tracing, but those values get utilized further up the inheritance chain, so here they are kind of acting as filler.

I wonder if an alternate form factor for this would be to surface Whisper as structured tools instead of its own model class, but even then it may make more sense to offload that to a community package rather than bringing it under the umbrella of this repo (up for debate, I imagine).

As for vitest -- this is something we want to use in langchain (we already use it in langgraph!)
