
Conversation


@rishisurana-labelbox rishisurana-labelbox commented Sep 8, 2025

Description

This PR introduces Audio Temporal Annotations - a new feature that enables precise time-based annotations for audio files in the Labelbox SDK. This includes support for temporal classification annotations with millisecond-level timing precision.

Motivation: Audio annotation workflows require precise timing control for applications like:

  • Podcast transcription with speaker identification
  • Call center quality analysis with word-level annotations
  • Music analysis with temporal classifications
  • Sound event detection with precise timestamps

Context: This feature extends the existing audio annotation infrastructure to support temporal annotations, using a millisecond-based timing system that provides the precision needed for audio applications while maintaining compatibility with the existing NDJSON serialization format.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Document change (fix typo or modifying any markdown files, code comments or anything in the examples folder only)

All Submissions

  • Have you followed the guidelines in our Contributing document?
  • Have you provided a description?
  • Are your changes properly formatted?

New Feature Submissions

  • Does your submission pass tests?
  • Have you added thorough tests for your new feature?
  • Have you commented your code, particularly in hard-to-understand areas?
  • Have you added a Docstring?

Changes to Core Features

  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?
  • Have you updated any code comments, as applicable?

Summary of Changes

New Audio Temporal Annotation Types

  • AudioClassificationAnnotation: Time-based classifications (radio, checklist, text) for audio segments
  • Millisecond-based timing: Direct millisecond input for precise timing control
  • INDEX scope support: Temporal classifications use INDEX scope for frame-based annotations

Core Infrastructure Updates

  • Temporal processor: Added support for audio temporal annotations in NDJSON serialization
  • Frame-based organization: Audio annotations are organized by millisecond frames for efficient processing (a sketch follows this list)
  • MAL compatibility: Audio temporal annotations work with Model-Assisted Labeling pipeline
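
For illustration, here is a minimal sketch of the frame-based grouping idea, assuming only the AudioClassificationAnnotation fields shown in the examples below (frame, end_frame, name, and a Text value); the helper and variable names are ours, not the SDK's internal implementation.

def group_by_frame(annotations):
    """Illustrative only: bucket audio temporal annotations by start frame (ms)."""
    frames_data = []    # list of {"start": ms, "end": ms} ranges
    frame_mapping = {}  # start frame (as str) -> answer text
    for ann in annotations:
        start = ann.frame
        end = getattr(ann, "end_frame", None) or ann.frame  # single-point fallback
        frames_data.append({"start": start, "end": end})
        frame_mapping[str(start)] = ann.value.answer
    return frames_data, frame_mapping

# e.g. group_by_frame(temporal_annotations) -> per-frame ranges plus their answers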

Testing

  • Updated test cases: Enhanced test coverage for audio temporal annotation functionality
  • Integration tests: Audio temporal annotations work with existing import/export pipelines
  • Edge case testing: Precision testing for millisecond timing and mixed annotation types (an illustrative check follows this list)
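
As a flavor of the precision checks involved, here is a minimal pytest-style sketch that uses only the fields shown elsewhere in this PR; it is illustrative, not one of the PR's actual test cases.

import labelbox.types as lb_types

def test_millisecond_timing_is_preserved():
    # Hypothetical check: frame values are plain millisecond integers
    ann = lb_types.AudioClassificationAnnotation(
        frame=2500,      # 2.5 seconds
        end_frame=4100,  # 4.1 seconds
        name="speaker_id",
        value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name="john")),
    )
    assert ann.frame == 2500
    assert ann.end_frame == 4100
    assert ann.end_frame > ann.frame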

Documentation & Examples

  • Updated example notebook: Enhanced audio.ipynb with temporal annotation examples
  • Demo script: Added demo_audio_token_temporal.py showing per-token temporal annotations
  • Use case examples: Word-level speaker identification and temporal classifications
  • Best practices: Guidelines for ontology setup with INDEX scope

Serialization & Import Support

  • NDJSON format: Audio temporal annotations serialize to standard NDJSON format
  • Import pipeline: Full support for audio temporal annotation imports via MAL and Label Import
  • Frame metadata: Millisecond timing preserved in serialized format
  • Backward compatibility: Existing audio annotation workflows unchanged

Key Features

Precise Timing Control

# Millisecond-based timing for precise audio annotation
speaker_annotation = lb_types.AudioClassificationAnnotation(
    frame=2500,  # 2.5 seconds
    end_frame=4100,  # 4.1 seconds
    name="speaker_id",
    value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name="john"))
)
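
Because frame values are plain milliseconds, a tiny helper makes the conversion from seconds explicit; the helper below is illustrative and not part of the SDK.

def sec_to_ms(seconds: float) -> int:
    """Convert seconds to the millisecond frame values used above."""
    return int(round(seconds * 1000))

speaker_annotation = lb_types.AudioClassificationAnnotation(
    frame=sec_to_ms(2.5),      # 2500 ms
    end_frame=sec_to_ms(4.1),  # 4100 ms
    name="speaker_id",
    value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name="john"))
)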

Per-Token Temporal Annotations

# Word-level temporal annotations
tokens_data = [
    ("Hello", 586, 770),    # Hello: frames 586-770
    ("GPT", 771, 955),      # GPT: frames 771-955  
    ("what", 956, 1140),    # what: frames 956-1140
]

temporal_annotations = []
for token, start_frame, end_frame in tokens_data:
    token_annotation = lb_types.AudioClassificationAnnotation(
        frame=start_frame,
        end_frame=end_frame,
        name="User Speaker",
        value=lb_types.Text(answer=token)
    )
    temporal_annotations.append(token_annotation)
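
If per-word timestamps are not available from an ASR system, contiguous ranges like the ones above can be generated from an utterance's overall start and end; this helper is purely illustrative and not part of the PR.

def split_into_token_ranges(tokens, start_ms, end_ms):
    """Evenly split [start_ms, end_ms] into one contiguous range per token."""
    step = (end_ms - start_ms + 1) // len(tokens)
    ranges = []
    for i, token in enumerate(tokens):
        s = start_ms + i * step
        e = end_ms if i == len(tokens) - 1 else s + step - 1
        ranges.append((token, s, e))
    return ranges

tokens_data = split_into_token_ranges(["Hello", "GPT", "what"], 586, 1140)
# -> [("Hello", 586, 770), ("GPT", 771, 955), ("what", 956, 1140)]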

Ontology Setup for Temporal Annotations

# INDEX scope required for temporal classifications
ontology_builder = lb.OntologyBuilder(classifications=[
    lb.Classification(
        class_type=lb.Classification.Type.TEXT,
        name="User Speaker",
        scope=lb.Classification.Scope.INDEX,  # INDEX scope for temporal
    ),
])
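
For contrast, classifications that apply to the whole audio file keep the GLOBAL scope, mirroring the ontology in the demo script under "How to test".

# GLOBAL scope for file-level (non-temporal) classifications
lb.Classification(
    class_type=lb.Classification.Type.RADIO,
    name="overall_quality",
    scope=lb.Classification.Scope.GLOBAL,
    options=[lb.Option(value="excellent"), lb.Option(value="good")],
)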

Label Integration

# Temporal annotations work seamlessly with existing Label infrastructure
label = lb_types.Label(
    data={"global_key": "audio_file.mp3"},
    annotations=[text_annotation, checklist_annotation, radio_annotation] + temporal_annotations
)

# Upload via MAL
upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id=project.uid,
    name=f"temporal_mal_job-{str(uuid.uuid4())}",
    predictions=[label],
)
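
After submitting, the import job can be polled and inspected for errors, as the demo script under "How to test" does:

upload_job.wait_till_done()
print(f"Upload state: {upload_job.state}")
if upload_job.errors:
    print(f"Errors: {upload_job.errors}")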

This feature enables the Labelbox SDK to support precise temporal audio annotation workflows while maintaining the same high-quality developer experience as existing audio annotation features.

How to test

  • Ensure you have this branch of python-monorepo (prediction import and worker) checked out and running: rishi/ptdt-3807/temporal-audio-prelabel
  • Run the script below - you will need to set up a venv for this separately
Python script and requirements.txt
#!/usr/bin/env python3

"""
Demo Audio Token Temporal Annotations
Creates temporal annotations for individual spoken tokens with speaker classification.
This demonstrates word-level temporal annotation for a conversation demo.
"""

import labelbox as lb
import uuid
import labelbox.types as lb_types

# Configuration - Update these for your local environment
api_key = "<REDACTED>"
endpoint = "<REDACTED>"  # GraphQL endpoint for your local environment (used by lb.Client below)
rest_endpoint = "http://localhost:3000/api/v1"

print("🎵 Demo Audio Token Temporal Annotations")
print("=" * 50)

# Initialize client
client = lb.Client(
    api_key=api_key,
    endpoint=endpoint,
    rest_endpoint=rest_endpoint
)

# Step 1: Create dataset and upload audio asset
print("\n📁 Step 1: Creating dataset and uploading audio asset...")
global_key = f"audio-token-demo-{str(uuid.uuid4())}"
asset = {
    "row_data": "https://storage.googleapis.com/lb-artifacts-testing-public/audio/gpt_how_can_you_help_me.wav",
    "global_key": global_key,
    "media_type": "AUDIO",
}

dataset = client.create_dataset(
    name=f"audio_token_demo_dataset_{str(uuid.uuid4())[:8]}",
    iam_integration=None
)

task = dataset.create_data_rows([asset])
task.wait_till_done()
print(f"✅ Dataset created and audio uploaded")
print(f"   - Global key: {global_key}")
print(f"   - Failed data rows: {task.failed_data_rows}")

# Step 2: Create ontology with separate speaker classifications
print("\n🛠️ Step 2: Creating ontology with separate User/Assistant speaker classifications...")
ontology_builder = lb.OntologyBuilder(
    tools=[
        # No tools needed for audio temporal - classifications handle everything
    ],
    classifications=[
        # User Speaker classification (INDEX scope for temporal, TEXT input for tokens)
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="User Speaker",
            scope=lb.Classification.Scope.INDEX,  # KEY: INDEX scope for temporal
        ),
        
        # Assistant Speaker classification (INDEX scope for temporal, TEXT input for tokens) 
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="Assistant Speaker", 
            scope=lb.Classification.Scope.INDEX,  # KEY: INDEX scope for temporal
        ),
        
        # Global classifications (non-temporal)
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="overall_quality",
            scope=lb.Classification.Scope.GLOBAL,  # Global scope
            options=[
                lb.Option(value="excellent"),
                lb.Option(value="good"),
                lb.Option(value="fair"),
                lb.Option(value="poor"),
            ],
        ),
        
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="notes",
            scope=lb.Classification.Scope.GLOBAL,  # Global scope
        ),
    ],
)

ontology = client.create_ontology(
    f"Audio Token Demo Ontology {str(uuid.uuid4())[:8]}",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Audio,
)
print(f"✅ Ontology created: {ontology.name}")

# Step 3: Create project and connect ontology
print("\n📋 Step 3: Creating project...")
project = client.create_project(
    name=f"Audio Token Demo {str(uuid.uuid4())[:8]}",
    media_type=lb.MediaType.Audio
)

project.connect_ontology(ontology)
print(f"✅ Project created: {project.name}")
print(f"🆔 Project ID: {project.uid}")

# Step 4: Get data row ID and create batch
print("\n📦 Step 4: Creating batch...")

# Get the actual data row object to ensure strong association
data_row = None
for dr in dataset.data_rows():
    if dr.global_key == global_key:
        data_row = dr
        break

if data_row is None:
    print("❌ Could not find data row!")
    exit(1)

print(f"   - Found data row ID: {data_row.uid}")

# Create batch via the global key (data row ID resolved above for reference)
batch = project.create_batch(
    f"audio-token-batch-{str(uuid.uuid4())[:8]}",
    global_keys=[global_key],
    priority=5,
)
print(f"✅ Batch created: {batch.name}")
print(f"   - Data row ID: {data_row.uid}")
print(f"   - Global key: {global_key}")

# Step 5: Create separate speaker temporal annotations
print("\n🎨 Step 5: Creating separate User Speaker temporal annotations...")
print("   Creating per-token temporal annotations for: 'Hello GPT what can you do for me today'")
print("   Using separate User Speaker classifications for each token")
print("   Using sequential time ranges (connected word segments)")

# Define the tokens with non-overlapping time ranges (each range = 1 token)
# Ranges are contiguous: each start is the previous end + 1, so tokens never overlap
tokens_data = [
    ("Hello", 586, 770),    # Hello: frames 586-770
    ("GPT", 771, 955),      # GPT: frames 771-955  
    ("what", 956, 1140),    # what: frames 956-1140
    ("can", 1141, 1325),    # can: frames 1141-1325
    ("you", 1326, 1510),    # you: frames 1326-1510
    ("do", 1511, 1695),     # do: frames 1511-1695
    ("for", 1696, 1880),    # for: frames 1696-1880
    ("me", 1881, 2066),     # me: frames 1881-2066 (end of audio)
]

# Create temporal annotations
annotations = []

# Create separate User Speaker annotations for each token with time ranges
for token, start_frame, end_frame in tokens_data:
    user_speaker_annotation = lb_types.AudioClassificationAnnotation(
        frame=start_frame,      # Start frame for this token
        end_frame=end_frame,    # End frame for this token
        name="User Speaker",
        value=lb_types.Text(answer=token)
    )
    annotations.append(user_speaker_annotation)

# Note: No Assistant Speaker annotations in this demo (ontology includes it but no data)

# Global annotations
global_annotations = [
    lb_types.ClassificationAnnotation(
        name="overall_quality",
        value=lb_types.Radio(answer=lb_types.ClassificationAnswer(name="excellent"))
    ),
    lb_types.ClassificationAnnotation(
        name="notes",
        value=lb_types.Text(answer="Demo conversation showing word-level temporal annotation with speaker identification.")
    ),
]

# Combine all annotations
all_annotations = annotations + global_annotations

print(f"✅ Created {len(all_annotations)} total annotations")
print(f"   - User Speaker annotations: {len(tokens_data)} (one per token with time ranges)")
print(f"   - Assistant Speaker annotations: 0 (ontology includes it but no data)")
print(f"   - Global annotations: {len(global_annotations)}")

# Step 6: Create label and upload via MAL
print("\n📤 Step 6: Uploading annotations via MAL...")

label = lb_types.Label(
    data={"global_key": global_key},
    annotations=all_annotations
)

# Upload MAL predictions
try:
    upload_job = lb.MALPredictionImport.create_from_objects(
        client=client,
        project_id=project.uid,
        name=f"audio_token_mal_{str(uuid.uuid4())[:8]}",
        predictions=[label],
    )
    
    print("⏳ Waiting for MAL upload to complete...")
    upload_job.wait_till_done()
    
    print(f"✅ MAL Upload State: {upload_job.state}")
    print(f"📊 Upload Statuses: {upload_job.statuses}")
    if upload_job.errors:
        print(f"❌ Errors: {upload_job.errors}")
    else:
        print(f"ℹ️  No errors reported in upload_job.errors")
    
    # Print more detailed status info
    for status in upload_job.statuses:
        if status.get('status') == 'FAILURE':
            print(f"🔍 FAILURE Details: {status}")
            if 'errors' in status:
                print(f"🔍 Specific Errors: {status['errors']}")
    
    # Check if upload was successful
    upload_successful = upload_job.state in ["COMPLETE", "FINISHED"] or "FINISHED" in str(upload_job.state)
    
except Exception as e:
    print(f"❌ MAL upload failed: {e}")

# Step 7: Verification
print("\n🔍 Step 7: Verification...")
try:
    # Verify batch exists and has correct size
    print(f"   - Batch: {batch.name}")
    print(f"   - Batch size: {batch.size}")
    print(f"   - Project: {project.name}")
    print(f"   - Project ID: {project.uid}")
    
    # Don't rely on broken overview - just confirm core operations worked
    if batch.size > 0:
        print("✅ SUCCESS: Batch created with data row")
    
    if upload_successful:
        print("✅ SUCCESS: MAL upload completed successfully")
        print("✅ SUCCESS: Audio token temporal annotations are working!")
    
except Exception as e:
    print(f"⚠️  Verification had issues: {e}")

print(f"\n🎉 Audio Token Temporal Annotations Demo Complete!")
print("=" * 50)
print(f"📋 Summary:")
print(f"   - Audio asset uploaded successfully")
print(f"   - Ontology created with speaker and token classifications") 
print(f"   - Project and batch created successfully")
print(f"   - {len(all_annotations)} annotations created with precise timing")
print(f"   - MAL upload successful")

print(f"\n🎯 Demo Details:")
print(f"   - Structure: Two separate speaker classifications (User Speaker, Assistant Speaker)")
print(f"   - Data: User Speaker tokens only (8 words with connected time ranges)")
for token, start_frame, end_frame in tokens_data:
    print(f"     • User Speaker: '{token}' frames {start_frame}-{end_frame}")

print(f"\n🌐 View Project in Browser:")
print(f"   - Project ID: {project.uid}")
print(f"   - URL: http://localhost:3000/projects/{project.uid}/overview")

print(f"\n✨ Key Features Demonstrated:")
print(f"   - Separate speaker classifications (no shared root)")
print(f"   - Word-level temporal annotation")
print(f"   - Sequential time range positioning")
print(f"   - Conversation analysis with speaker separation")
print(f"   - Ontology includes both speakers but data only for User Speaker")

requirements.txt:

# Requirements for exec folder scripts
# Install the local labelbox package in development mode
-e ../libs/labelbox

# Additional dependencies that might be needed
requests>=2.25.0
# uuid is part of the Python standard library; no separate install needed

# Dependencies for labelbox.types (audio annotations) - from labelbox[data]
shapely>=2.0.3
numpy>=1.25.0
pillow>=10.2.0
typeguard>=4.1.5
imagesize>=1.4.1
pyproj>=3.5.0
pygeotile>=1.0.6
typing-extensions>=4.10.0
opencv-python-headless>=4.9.0.80



all changes made by GH bot to this file..

start, end = ann.frame, getattr(ann, 'end_frame', None) or ann.frame
frames_data.append({"start": start, "end": end})
frame_mapping[str(start)] = ann.value.answer


Bug: Annotation Value Handling Fails for Complex Types

The _has_changing_values and _create_multi_value_annotation methods incorrectly assume annotation.value.answer is a simple string. This leads to comparison failures and improper JSON serialization for Radio and Checklist classifications, where value.answer is an object or a list.
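
One possible direction, as a hedged sketch only: normalize value.answer to a comparable representation before comparing or serializing, rather than assuming it is a string. The helper name below is hypothetical.

def _answer_key(value):
    """Hypothetical: normalize Text/Radio/Checklist answers for comparison."""
    answer = value.answer
    if isinstance(answer, str):    # Text -> plain string
        return answer
    if isinstance(answer, list):   # Checklist -> list of answer objects
        return tuple(getattr(a, "name", a) for a in answer)
    return getattr(answer, "name", answer)  # Radio -> single answer object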

