A word-level timestamps on whisper generation pipeline is mismatched to total duration #36228

dobby-seo · 2025-02-17T09:42:02Z

Reproduction

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample, return_timestamps="word")
print(result["text"])

The total duration of the example audio is approximately 62 s, but the aligned word timestamps extend past it: the last words are all stamped at 87.22 s.

{'text': ' Quilter', 'timestamp': (0.72, 1.04)}, {'text': ' is', 'timestamp': (1.04, 1.3)}, {'text': ' the', 'timestamp': (1.3, 1.44)}, {'text': ' apostle', 'timestamp': (1.44, 1.78)}, {'text': ' of', 'timestamp': (1.78, 2.18)}, {'text': ' the', 'timestamp': (2.18, 2.3)}, {'text': ' middle', 'timestamp': (2.3, 2.52)}, {'text': ' classes,', 'timestamp': (2.52, 3.0)}, {'text': ' and', 'timestamp': (3.0, 3.36)}, {'text': ' we', 'timestamp': (3.36, 3.5)}, {'text': ' are', 'timestamp': (3.5, 3.6)}, {'text': ' glad', 'timestamp': (3.6, 3.86)}, {'text': ' to', 'timestamp': (3.86, 4.1)}, {'text': ' welcome', 'timestamp': (4.1, 4.4)}, {'text': ' his', 'timestamp': (4.4, 4.7)}, {'text': ' gospel.', 'timestamp': (4.7, 5.38)}, {'text': ' Nor', 'timestamp': (6.42, 6.46)}, {'text': ' is', 'timestamp': (6.46, 6.78)}, {'text': ' Mr.', 'timestamp': (6.78, 7.22)}, {'text': " Quilter's", 'timestamp': (7.22, 7.62)}, {'text': ' manner', 'timestamp': (7.62, 7.9)}, {'text': ' less', 'timestamp': (7.9, 8.26)}, {'text': ' interesting', 'timestamp': (8.26, 8.82)}, {'text': ' than', 'timestamp': (8.82, 9.24)}, {'text': ' his', 'timestamp': (9.24, 9.52)}, {'text': ' matter.', 'timestamp': (9.52, 10.18)}, {'text': ' He', 'timestamp': (11.02, 11.22)}, {'text': ' tells', 'timestamp': (11.22, 11.5)}, {'text': ' us', 'timestamp': (11.5, 11.82)}, {'text': ' that', 'timestamp': (11.82, 12.12)}, {'text': ' at', 'timestamp': (12.12, 12.4)}, {'text': ' this', 'timestamp': (12.4, 12.56)}, {'text': ' festive', 'timestamp': (12.56, 13.0)}, {'text': ' season', 'timestamp': (13.0, 13.4)}, {'text': ' of', 'timestamp': (13.4, 13.8)}, {'text': ' the', 'timestamp': (13.8, 13.88)}, {'text': ' year,', 'timestamp': (13.88, 14.34)}, {'text': ' with', 'timestamp': (14.34, 14.94)}, {'text': ' Christmas', 'timestamp': (14.94, 15.38)}, {'text': ' and', 'timestamp': (15.38, 15.72)}, {'text': ' roast', 'timestamp': (15.72, 15.98)}, {'text': ' beef', 'timestamp': (15.98, 16.34)}, {'text': ' looming', 'timestamp': (16.34, 
16.8)}, {'text': ' before', 'timestamp': (16.8, 17.06)}, {'text': ' us,', 'timestamp': (17.06, 18.26)}, {'text': ' similes', 'timestamp': (18.4, 18.7)}, {'text': ' drawn', 'timestamp': (18.7, 19.02)}, {'text': ' from', 'timestamp': (19.02, 19.34)}, {'text': ' eating', 'timestamp': (19.34, 19.68)}, {'text': ' and', 'timestamp': (19.68, 19.98)}, {'text': ' its', 'timestamp': (19.98, 20.16)}, {'text': ' results', 'timestamp': (20.16, 20.54)}, {'text': ' occur', 'timestamp': (20.54, 20.94)}, {'text': ' most', 'timestamp': (20.94, 21.34)}, {'text': ' readily', 'timestamp': (21.34, 21.74)}, {'text': ' to', 'timestamp': (21.74, 22.04)}, {'text': ' the', 'timestamp': (22.04, 22.14)}, {'text': ' mind.', 'timestamp': (22.14, 22.64)}, {'text': ' He', 'timestamp': (23.34, 23.72)}, {'text': ' has', 'timestamp': (23.72, 23.92)}, {'text': ' grave', 'timestamp': (23.92, 24.22)}, {'text': ' doubts', 'timestamp': (24.22, 24.54)}, {'text': ' whether', 'timestamp': (24.54, 25.02)}, {'text': ' Sir', 'timestamp': (25.02, 25.42)}, {'text': ' Frederick', 'timestamp': (25.42, 25.78)}, {'text': " Leighton's", 'timestamp': (25.78, 26.32)}, {'text': ' work', 'timestamp': (26.32, 26.6)}, {'text': ' is', 'timestamp': (26.6, 26.84)}, {'text': ' really', 'timestamp': (26.84, 27.18)}, {'text': ' Greek', 'timestamp': (27.18, 27.76)}, {'text': ' after', 'timestamp': (27.76, 28.12)}, {'text': ' all,', 'timestamp': (28.12, 28.62)}, {'text': ' and', 'timestamp': (28.68, 28.9)}, {'text': ' can', 'timestamp': (28.9, 29.48)}, {'text': ' discover', 'timestamp': (29.48, 29.9)}, {'text': ' in', 'timestamp': (29.9, 30.16)}, {'text': ' it', 'timestamp': (30.16, 30.36)}, {'text': ' but', 'timestamp': (30.36, 30.56)}, {'text': ' little', 'timestamp': (30.56, 30.9)}, {'text': ' of', 'timestamp': (30.9, 31.2)}, {'text': ' rocky', 'timestamp': (31.2, 31.72)}, {'text': ' Ithaca.', 'timestamp': (31.72, 32.58)}, {'text': " Linnell's", 'timestamp': (33.6, 34.06)}, {'text': ' pictures', 'timestamp': (34.06, 34.34)}, 
{'text': ' are', 'timestamp': (34.34, 34.94)}, {'text': ' a', 'timestamp': (34.94, 35.1)}, {'text': ' sort', 'timestamp': (35.1, 35.34)}, {'text': ' of', 'timestamp': (35.34, 35.52)}, {'text': ' Upguards', 'timestamp': (35.52, 36.58)}, {'text': ' and', 'timestamp': (36.58, 36.68)}, {'text': ' Adam', 'timestamp': (36.68, 36.96)}, {'text': ' paintings,', 'timestamp': (36.96, 37.76)}, {'text': ' and', 'timestamp': (38.42, 38.48)}, {'text': " Mason's", 'timestamp': (38.48, 38.98)}, {'text': ' exquisite', 'timestamp': (38.98, 39.64)}, {'text': ' idylls', 'timestamp': (39.64, 40.22)}, {'text': ' are', 'timestamp': (40.22, 40.36)}, {'text': ' as', 'timestamp': (40.36, 40.7)}, {'text': ' national', 'timestamp': (40.7, 41.34)}, {'text': ' as', 'timestamp': (41.34, 41.68)}, {'text': ' a', 'timestamp': (41.68, 41.84)}, {'text': ' jingo', 'timestamp': (41.84, 42.1)}, {'text': ' poem.', 'timestamp': (42.1, 42.9)}, {'text': ' Mr.', 'timestamp': (44.36, 44.76)}, {'text': ' Burkett', 'timestamp': (44.76, 45.04)}, {'text': " Foster's", 'timestamp': (45.04, 45.64)}, {'text': ' landscapes', 'timestamp': (45.64, 46.1)}, {'text': ' smile', 'timestamp': (46.1, 46.88)}, {'text': ' at', 'timestamp': (46.88, 47.22)}, {'text': ' one', 'timestamp': (47.22, 47.44)}, {'text': ' much', 'timestamp': (47.44, 47.86)}, {'text': ' in', 'timestamp': (47.86, 48.02)}, {'text': ' the', 'timestamp': (48.02, 48.12)}, {'text': ' same', 'timestamp': (48.12, 48.36)}, {'text': ' way', 'timestamp': (48.36, 48.64)}, {'text': ' that', 'timestamp': (48.64, 48.88)}, {'text': ' Mr.', 'timestamp': (48.88, 49.24)}, {'text': ' Carker', 'timestamp': (49.24, 49.64)}, {'text': ' used', 'timestamp': (49.64, 50.02)}, {'text': ' to', 'timestamp': (50.02, 50.44)}, {'text': ' flash', 'timestamp': (50.44, 50.72)}, {'text': ' his', 'timestamp': (50.72, 50.96)}, {'text': ' teeth.', 'timestamp': (50.96, 51.66)}, {'text': ' And', 'timestamp': (52.36, 52.8)}, {'text': ' Mr.', 'timestamp': (52.8, 53.22)}, {'text': ' John', 
'timestamp': (53.22, 53.36)}, {'text': ' Collier', 'timestamp': (53.36, 53.86)}, {'text': ' gives', 'timestamp': (53.86, 54.5)}, {'text': ' his', 'timestamp': (54.5, 54.76)}, {'text': ' sitter', 'timestamp': (54.76, 55.0)}, {'text': ' a', 'timestamp': (55.0, 55.26)}, {'text': ' cheerful', 'timestamp': (55.26, 55.74)}, {'text': ' slap', 'timestamp': (55.74, 56.3)}, {'text': ' on', 'timestamp': (56.3, 56.6)}, {'text': ' the', 'timestamp': (56.6, 56.72)}, {'text': ' back', 'timestamp': (56.72, 57.04)}, {'text': ' before', 'timestamp': (87.22, 87.22)}, {'text': ' he', 'timestamp': (87.22, 87.22)}, {'text': ' says,', 'timestamp': (87.22, 87.22)}, {'text': ' like', 'timestamp': (87.22, 87.22)}, {'text': ' a', 'timestamp': (87.22, 87.22)}, {'text': ' shampooer', 'timestamp': (87.22, 87.22)}, {'text': ' in', 'timestamp': (87.22, 87.22)}, {'text': ' a', 'timestamp': (87.22, 87.22)}, {'text': ' Turkish', 'timestamp': (87.22, 87.22)}, {'text': ' bath,', 'timestamp': (87.22, 87.22)}, {'text': ' Next', 'timestamp': (87.22, 87.22)}, {'text': ' man!', 'timestamp': (87.22, 87.22)}]}

However, whisper-timestamped works fine on the same audio. What is wrong with the implementation of the generation utils for Whisper? I tested with the same model.
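To make the mismatch easy to spot, here is a minimal sketch of a sanity check over the word chunks. The helper name `find_suspect_words` and its reason labels are hypothetical and not part of transformers; it only assumes each chunk is a dict with a `text` string and a `timestamp` tuple, as in the output above. The sample values are copied from that output.

```python
def find_suspect_words(chunks, duration_s, tolerance_s=0.0):
    """Return (text, start, end, reason) for chunks whose timestamps look invalid.

    Hypothetical helper for illustration only.
    """
    suspects = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None:
            # A timestamp could not be determined for this word.
            suspects.append((chunk["text"], start, end, "missing"))
        elif end > duration_s + tolerance_s:
            # The word ends after the audio itself ends.
            suspects.append((chunk["text"], start, end, "past end of audio"))
        elif end <= start:
            # The word has no duration at all.
            suspects.append((chunk["text"], start, end, "zero or negative length"))
    return suspects


# A few chunks copied from the pipeline output above.
chunks = [
    {"text": " back", "timestamp": (56.72, 57.04)},
    {"text": " before", "timestamp": (87.22, 87.22)},
    {"text": " man!", "timestamp": (87.22, 87.22)},
]

suspects = find_suspect_words(chunks, duration_s=62.0)
for text, start, end, reason in suspects:
    print(f"{text!r}: ({start}, {end}) -> {reason}")
```

With the reproduction script, the same check could presumably be run on `result["chunks"]`, with the duration computed as `len(sample["array"]) / sample["sampling_rate"]`, assuming the audio sample is a dict exposing those two keys.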

Expected Behaviors

The output should be like that of whisper-timestamped, where every word timestamp stays within the audio duration:

{'text': " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Latins' work is really Greek after all, and can discover in it but little of rocky Ithaca. Lennils, pictures, are a sort of upguards and atom paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampoo or a turkish bath. Next man.", 'segments': [{'id': 0, 'seek': 0, 'start': np.float64(0.54), 'end': np.float64(5.36), 'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.', 'tokens': [50364, 2221, 13, 2326, 388, 391, 307, 264, 50244, 295, 264, 2808, 5359, 11, 293, 321, 366, 5404, 281, 2928, 702, 14943, 13, 50692], 'temperature': 0.0, 'avg_logprob': -0.26007109706841625, 'compression_ratio': 1.5741444866920151, 'no_speech_prob': 0.034886039793491364, 'confidence': 0.93, 'words': [{'text': 'Mr.', 'start': np.float64(0.54), 'end': np.float64(0.78), 'confidence': 0.776}, {'text': 'Quilter', 'start': np.float64(0.94), 'end': np.float64(1.26), 'confidence': 0.915}, {'text': 'is', 'start': np.float64(1.26), 'end': np.float64(1.44), 'confidence': 0.966}, {'text': 'the', 'start': np.float64(1.44), 'end': np.float64(1.62), 'confidence': 0.993}, {'text': 'apostle', 'start': np.float64(1.62), 'end': np.float64(2.06), 'confidence': 0.932}, {'text': 'of', 'start': np.float64(2.06), 'end': np.float64(2.32), 'confidence': 0.997}, {'text': 'the', 'start': np.float64(2.32), 'end': np.float64(2.44), 'confidence': 0.995}, {'text': 'middle', 'start': 
np.float64(2.44), 'end': np.float64(2.68), 'confidence': 0.827}, {'text': 'classes,', 'start': np.float64(2.68), 'end': np.float64(3.2), 'confidence': 0.9}, {'text': 'and', 'start': np.float64(3.48), 'end': np.float64(3.54), 'confidence': 0.922}, {'text': 'we', 'start': np.float64(3.54), 'end': np.float64(3.64), 'confidence': 0.992}, {'text': 'are', 'start': np.float64(3.64), 'end': np.float64(3.8), 'confidence': 0.954}, {'text': 'glad', 'start': np.float64(3.8), 'end': np.float64(4.08), 'confidence': 0.996}, {'text': 'to', 'start': np.float64(4.08), 'end': np.float64(4.3), 'confidence': 0.99}, {'text': 'welcome', 'start': np.float64(4.3), 'end': np.float64(4.58), 'confidence': 0.993}, {'text': 'his', 'start': np.float64(4.58), 'end': np.float64(4.88), 'confidence': 0.854}, {'text': 'gospel.', 'start': np.float64(4.88), 'end': np.float64(5.36), 'confidence': 0.877}]}, {'id': 1, 'seek': 0, 'start': np.float64(6.46), 'end': np.float64(10.3), 'text': " Nor is Mr. Quilter's manner less interesting than his matter.", 'tokens': [50692, 6966, 307, 2221, 13, 2326, 388, 391, 311, 9060, 1570, 1880, 813, 702, 1871, 13, 50926], 'temperature': 0.0, 'avg_logprob': -0.26007109706841625, 'compression_ratio': 1.5741444866920151, 'no_speech_prob': 0.034886039793491364, 'confidence': 0.933, 'words': [{'text': 'Nor', 'start': np.float64(6.46), 'end': np.float64(6.7), 'confidence': 0.955}, {'text': 'is', 'start': np.float64(6.7), 'end': np.float64(7.0), 'confidence': 0.855}, {'text': 'Mr.', 'start': np.float64(7.0), 'end': np.float64(7.24), 'confidence': 0.966}, {'text': "Quilter's", 'start': np.float64(7.38), 'end': np.float64(7.8), 'confidence': 0.99}, {'text': 'manner', 'start': np.float64(7.8), 'end': np.float64(8.12), 'confidence': 0.794}, {'text': 'less', 'start': np.float64(8.12), 'end': np.float64(8.46), 'confidence': 0.774}, {'text': 'interesting', 'start': np.float64(8.46), 'end': np.float64(9.12), 'confidence': 0.989}, {'text': 'than', 'start': np.float64(9.12), 'end': 
np.float64(9.42), 'confidence': 0.984}, {'text': 'his', 'start': np.float64(9.42), 'end': np.float64(9.68), 'confidence': 0.993}, {'text': 'matter.', 'start': np.float64(9.68), 'end': np.float64(10.3), 'confidence': 0.901}]}, {'id': 2, 'seek': 0, 'start': np.float64(11.14), 'end': np.float64(16.81), 'text': ' He tells us that at this festive season of the year, with Christmas and roast beef looming', 'tokens': [50926, 634, 5112, 505, 300, 412, 341, 42729, 3196, 295, 264, 1064, 11, 365, 5272, 293, 12904, 9256, 450, 10539, 51208], 'temperature': 0.0, 'avg_logprob': -0.26007109706841625, 'compression_ratio': 1.5741444866920151, 'no_speech_prob': 0.034886039793491364, 'confidence': 0.89, 'words': [{'text': 'He', 'start': np.float64(11.14), 'end': np.float64(11.42), 'confidence': 0.996}, {'text': 'tells', 'start': np.float64(11.42), 'end': np.float64(11.66), 'confidence': 0.997}, {'text': 'us', 'start': np.float64(11.66), 'end': np.float64(12.06), 'confidence': 0.998}, {'text': 'that', 'start': np.float64(12.06), 'end': np.float64(12.38), 'confidence': 0.993}, {'text': 'at', 'start': np.float64(12.38), 'end': np.float64(12.52), 'confidence': 0.79}, {'text': 'this', 'start': np.float64(12.52), 'end': np.float64(12.8), 'confidence': 0.994}, {'text': 'festive', 'start': np.float64(12.8), 'end': np.float64(13.18), 'confidence': 0.993}, {'text': 'season', 'start': np.float64(13.18), 'end': np.float64(13.7), 'confidence': 0.999}, {'text': 'of', 'start': np.float64(13.7), 'end': np.float64(13.92), 'confidence': 0.998}, {'text': 'the', 'start': np.float64(13.92), 'end': np.float64(14.22), 'confidence': 0.995}, {'text': 'year,', 'start': np.float64(14.22), 'end': np.float64(14.66), 'confidence': 0.977}, {'text': 'with', 'start': np.float64(14.94), 'end': np.float64(15.14), 'confidence': 0.977}, {'text': 'Christmas', 'start': np.float64(15.14), 'end': np.float64(15.62), 'confidence': 0.9}, {'text': 'and', 'start': np.float64(15.62), 'end': np.float64(15.94), 'confidence': 0.967}, 
{'text': 'roast', 'start': np.float64(15.94), 'end': np.float64(16.2), 'confidence': 0.552}, {'text': 'beef', 'start': np.float64(16.2), 'end': np.float64(16.54), 'confidence': 0.908}, {'text': 'looming', 'start': np.float64(16.54), 'end': np.float64(16.81), 'confidence': 0.623}]}, {'id': 3, 'seek': 0, 'start': np.float64(16.81), 'end': np.float64(22.76), 'text': ' before us, similarly drawn from eating and its results occur most readily to the mind.', 'tokens': [51208, 949, 505, 11, 14138, 10117, 490, 3936, 293, 1080, 3542, 5160, 881, 26336, 281, 264, 1575, 13, 51552], 'temperature': 0.0, 'avg_logprob': -0.26007109706841625, 'compression_ratio': 1.5741444866920151, 'no_speech_prob': 0.034886039793491364, 'confidence': 0.829, 'words': [{'text': 'before', 'start': np.float64(16.81), 'end': np.float64(17.32), 'confidence': 0.998}, {'text': 'us,', 'start': np.float64(17.32), 'end': np.float64(18.16), 'confidence': 0.997}, {'text': 'similarly', 'start': np.float64(18.56), 'end': np.float64(18.9), 'confidence': 0.357}, {'text': 'drawn', 'start': np.float64(18.9), 'end': np.float64(19.28), 'confidence': 0.411}, {'text': 'from', 'start': np.float64(19.28), 'end': np.float64(19.52), 'confidence': 0.996}, {'text': 'eating', 'start': np.float64(19.52), 'end': np.float64(19.9), 'confidence': 0.986}, {'text': 'and', 'start': np.float64(19.9), 'end': np.float64(20.14), 'confidence': 0.508}, {'text': 'its', 'start': np.float64(20.14), 'end': np.float64(20.3), 'confidence': 0.947}, {'text': 'results', 'start': np.float64(20.3), 'end': np.float64(20.74), 'confidence': 0.992}, {'text': 'occur', 'start': np.float64(20.74), 'end': np.float64(21.18), 'confidence': 0.967}, {'text': 'most', 'start': np.float64(21.18), 'end': np.float64(21.54), 'confidence': 0.963}, {'text': 'readily', 'start': np.float64(21.54), 'end': np.float64(21.92), 'confidence': 0.994}, {'text': 'to', 'start': np.float64(21.92), 'end': np.float64(22.2), 'confidence': 0.991}, {'text': 'the', 'start': 
np.float64(22.2), 'end': np.float64(22.4), 'confidence': 0.998}, {'text': 'mind.', 'start': np.float64(22.4), 'end': np.float64(22.76), 'confidence': 0.964}]}, {'id': 4, 'seek': 0, 'start': np.float64(23.72), 'end': np.float64(29.46), 'text': " He has grave doubts whether Sir Frederick Latins' work is really Greek after all, and", 'tokens': [51552, 634, 575, 12525, 22618, 1968, 6144, 35617, 7354, 1292, 6, 589, 307, 534, 10281, 934, 439, 11, 293, 51836], 'temperature': 0.0, 'avg_logprob': -0.26007109706841625, 'compression_ratio': 1.5741444866920151, 'no_speech_prob': 0.034886039793491364, 'confidence': 0.68, 'words': [{'text': 'He', 'start': np.float64(23.72), 'end': np.float64(23.9), 'confidence': 0.796}, {'text': 'has', 'start': np.float64(23.9), 'end': np.float64(24.18), 'confidence': 0.995}, {'text': 'grave', 'start': np.float64(24.18), 'end': np.float64(24.44), 'confidence': 0.256}, {'text': 'doubts', 'start': np.float64(24.44), 'end': np.float64(24.84), 'confidence': 0.845}, {'text': 'whether', 'start': np.float64(24.84), 'end': np.float64(25.26), 'confidence': 0.887}, {'text': 'Sir', 'start': np.float64(25.26), 'end': np.float64(25.64), 'confidence': 0.565}, {'text': 'Frederick', 'start': np.float64(25.64), 'end': np.float64(25.92), 'confidence': 0.966}, {'text': "Latins'", 'start': np.float64(25.92), 'end': np.float64(26.54), 'confidence': 0.381}, {'text': 'work', 'start': np.float64(26.54), 'end': np.float64(26.76), 'confidence': 0.997}, {'text': 'is', 'start': np.float64(26.76), 'end': np.float64(27.12), 'confidence': 0.991}, {'text': 'really', 'start': np.float64(27.12), 'end': np.float64(27.38), 'confidence': 0.976}, {'text': 'Greek', 'start': np.float64(27.38), 'end': np.float64(28.02), 'confidence': 0.429}, {'text': 'after', 'start': np.float64(28.02), 'end': np.float64(28.42), 'confidence': 0.801}, {'text': 'all,', 'start': np.float64(28.42), 'end': np.float64(28.88), 'confidence': 0.991}, {'text': 'and', 'start': np.float64(29.3), 'end': 
np.float64(29.46), 'confidence': 0.937}]}, {'id': 5, 'seek': 2944, 'start': np.float64(29.46), 'end': np.float64(32.6), 'text': ' can discover in it but little of rocky Ithaca.', 'tokens': [50364, 393, 4411, 294, 309, 457, 707, 295, 33301, 286, 392, 6628, 13, 50582], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.74, 'words': [{'text': 'can', 'start': np.float64(29.46), 'end': np.float64(29.68), 'confidence': 0.57}, {'text': 'discover', 'start': np.float64(29.68), 'end': np.float64(30.1), 'confidence': 0.986}, {'text': 'in', 'start': np.float64(30.1), 'end': np.float64(30.34), 'confidence': 0.894}, {'text': 'it', 'start': np.float64(30.34), 'end': np.float64(30.52), 'confidence': 0.987}, {'text': 'but', 'start': np.float64(30.52), 'end': np.float64(30.82), 'confidence': 0.818}, {'text': 'little', 'start': np.float64(30.82), 'end': np.float64(31.08), 'confidence': 0.971}, {'text': 'of', 'start': np.float64(31.08), 'end': np.float64(31.32), 'confidence': 0.974}, {'text': 'rocky', 'start': np.float64(31.32), 'end': np.float64(31.94), 'confidence': 0.549}, {'text': 'Ithaca.', 'start': np.float64(31.94), 'end': np.float64(32.6), 'confidence': 0.556}]}, {'id': 6, 'seek': 2944, 'start': np.float64(33.66), 'end': np.float64(40.26), 'text': " Lennils, pictures, are a sort of upguards and atom paintings, and Mason's exquisite idles", 'tokens': [50582, 441, 1857, 4174, 11, 5242, 11, 366, 257, 1333, 295, 493, 2794, 2287, 293, 12018, 14880, 11, 293, 25730, 311, 454, 34152, 4496, 904, 50908], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.638, 'words': [{'text': 'Lennils,', 'start': np.float64(33.66), 'end': np.float64(34.2), 'confidence': 0.338}, {'text': 'pictures,', 'start': np.float64(34.28), 'end': np.float64(34.7), 'confidence': 0.952}, {'text': 'are', 'start': np.float64(35.0), 'end': 
np.float64(35.2), 'confidence': 0.837}, {'text': 'a', 'start': np.float64(35.2), 'end': np.float64(35.36), 'confidence': 0.565}, {'text': 'sort', 'start': np.float64(35.36), 'end': np.float64(35.46), 'confidence': 0.944}, {'text': 'of', 'start': np.float64(35.46), 'end': np.float64(36.1), 'confidence': 0.996}, {'text': 'upguards', 'start': np.float64(36.1), 'end': np.float64(36.7), 'confidence': 0.801}, {'text': 'and', 'start': np.float64(36.7), 'end': np.float64(36.94), 'confidence': 0.912}, {'text': 'atom', 'start': np.float64(36.94), 'end': np.float64(37.3), 'confidence': 0.432}, {'text': 'paintings,', 'start': np.float64(37.3), 'end': np.float64(37.96), 'confidence': 0.961}, {'text': 'and', 'start': np.float64(38.54), 'end': np.float64(38.7), 'confidence': 0.978}, {'text': "Mason's", 'start': np.float64(38.7), 'end': np.float64(39.24), 'confidence': 0.724}, {'text': 'exquisite', 'start': np.float64(39.24), 'end': np.float64(39.8), 'confidence': 0.931}, {'text': 'idles', 'start': np.float64(39.8), 'end': np.float64(40.26), 'confidence': 0.238}]}, {'id': 7, 'seek': 2944, 'start': np.float64(40.28), 'end': np.float64(42.96), 'text': ' are as national as a jingo poem.', 'tokens': [50908, 366, 382, 4048, 382, 257, 361, 18459, 13065, 13, 51132], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.85, 'words': [{'text': 'are', 'start': np.float64(40.28), 'end': np.float64(40.66), 'confidence': 0.965}, {'text': 'as', 'start': np.float64(40.66), 'end': np.float64(40.82), 'confidence': 0.93}, {'text': 'national', 'start': np.float64(40.82), 'end': np.float64(41.54), 'confidence': 0.93}, {'text': 'as', 'start': np.float64(41.54), 'end': np.float64(41.9), 'confidence': 0.979}, {'text': 'a', 'start': np.float64(41.9), 'end': np.float64(42.02), 'confidence': 0.958}, {'text': 'jingo', 'start': np.float64(42.02), 'end': np.float64(42.42), 'confidence': 0.59}, {'text': 'poem.', 'start': 
np.float64(42.42), 'end': np.float64(42.96), 'confidence': 0.996}]}, {'id': 8, 'seek': 2944, 'start': np.float64(44.58), 'end': np.float64(50.38), 'text': " Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used", 'tokens': [51132, 2221, 13, 5637, 74, 3093, 38756, 311, 29822, 7563, 412, 472, 709, 294, 264, 912, 636, 300, 2221, 13, 2741, 5767, 1143, 51412], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.762, 'words': [{'text': 'Mr.', 'start': np.float64(44.58), 'end': np.float64(44.86), 'confidence': 0.99}, {'text': 'Berkett', 'start': np.float64(44.96), 'end': np.float64(45.3), 'confidence': 0.517}, {'text': "Foster's", 'start': np.float64(45.3), 'end': np.float64(45.88), 'confidence': 0.743}, {'text': 'landscapes', 'start': np.float64(45.88), 'end': np.float64(46.32), 'confidence': 0.912}, {'text': 'smile', 'start': np.float64(46.32), 'end': np.float64(47.16), 'confidence': 0.774}, {'text': 'at', 'start': np.float64(47.16), 'end': np.float64(47.46), 'confidence': 0.989}, {'text': 'one', 'start': np.float64(47.46), 'end': np.float64(47.76), 'confidence': 0.976}, {'text': 'much', 'start': np.float64(47.76), 'end': np.float64(48.08), 'confidence': 0.99}, {'text': 'in', 'start': np.float64(48.08), 'end': np.float64(48.26), 'confidence': 0.899}, {'text': 'the', 'start': np.float64(48.26), 'end': np.float64(48.36), 'confidence': 0.97}, {'text': 'same', 'start': np.float64(48.36), 'end': np.float64(48.6), 'confidence': 0.994}, {'text': 'way', 'start': np.float64(48.6), 'end': np.float64(48.84), 'confidence': 0.994}, {'text': 'that', 'start': np.float64(48.84), 'end': np.float64(49.02), 'confidence': 0.959}, {'text': 'Mr.', 'start': np.float64(49.02), 'end': np.float64(49.26), 'confidence': 0.985}, {'text': 'Carker', 'start': np.float64(49.42), 'end': np.float64(50.04), 'confidence': 0.396}, {'text': 'used', 'start': np.float64(50.04), 'end': 
np.float64(50.38), 'confidence': 0.677}]}, {'id': 9, 'seek': 2944, 'start': np.float64(50.38), 'end': np.float64(51.68), 'text': ' to flash his teeth.', 'tokens': [51412, 281, 7319, 702, 7798, 13, 51540], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.909, 'words': [{'text': 'to', 'start': np.float64(50.38), 'end': np.float64(50.62), 'confidence': 0.98}, {'text': 'flash', 'start': np.float64(50.62), 'end': np.float64(50.9), 'confidence': 0.719}, {'text': 'his', 'start': np.float64(50.9), 'end': np.float64(51.22), 'confidence': 0.974}, {'text': 'teeth.', 'start': np.float64(51.22), 'end': np.float64(51.68), 'confidence': 0.995}]}, {'id': 10, 'seek': 2944, 'start': np.float64(52.8), 'end': np.float64(58.77), 'text': ' And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like', 'tokens': [51540, 400, 2221, 13, 2619, 4586, 811, 2709, 702, 47335, 257, 36942, 21075, 322, 264, 646, 949, 415, 1619, 11, 411, 51826], 'temperature': 0.0, 'avg_logprob': -0.3753047799164394, 'compression_ratio': 1.5, 'no_speech_prob': 0.006951496005058289, 'confidence': 0.814, 'words': [{'text': 'And', 'start': np.float64(52.8), 'end': np.float64(53.02), 'confidence': 0.699}, {'text': 'Mr.', 'start': np.float64(53.02), 'end': np.float64(53.28), 'confidence': 0.987}, {'text': 'John', 'start': np.float64(53.38), 'end': np.float64(53.58), 'confidence': 0.873}, {'text': 'Collier', 'start': np.float64(53.58), 'end': np.float64(54.2), 'confidence': 0.586}, {'text': 'gives', 'start': np.float64(54.2), 'end': np.float64(54.7), 'confidence': 0.927}, {'text': 'his', 'start': np.float64(54.7), 'end': np.float64(54.94), 'confidence': 0.919}, {'text': 'sitter', 'start': np.float64(54.94), 'end': np.float64(55.22), 'confidence': 0.601}, {'text': 'a', 'start': np.float64(55.22), 'end': np.float64(55.6), 'confidence': 0.991}, {'text': 'cheerful', 'start': np.float64(55.6), 'end': 
np.float64(55.98), 'confidence': 0.89}, {'text': 'slap', 'start': np.float64(55.98), 'end': np.float64(56.56), 'confidence': 0.953}, {'text': 'on', 'start': np.float64(56.56), 'end': np.float64(56.8), 'confidence': 0.794}, {'text': 'the', 'start': np.float64(56.8), 'end': np.float64(56.96), 'confidence': 0.971}, {'text': 'back', 'start': np.float64(56.96), 'end': np.float64(57.38), 'confidence': 0.99}, {'text': 'before', 'start': np.float64(57.38), 'end': np.float64(57.92), 'confidence': 0.474}, {'text': 'he', 'start': np.float64(57.92), 'end': np.float64(58.18), 'confidence': 0.986}, {'text': 'says,', 'start': np.float64(58.18), 'end': np.float64(58.52), 'confidence': 0.941}, {'text': 'like', 'start': np.float64(58.58), 'end': np.float64(58.77), 'confidence': 0.822}]}, {'id': 11, 'seek': 5868, 'start': np.float64(58.77), 'end': np.float64(60.82), 'text': ' a shampoo or a turkish bath.', 'tokens': [50364, 257, 27484, 420, 257, 3243, 74, 742, 6079, 13, 50502], 'temperature': 0.0, 'avg_logprob': -0.8035992454080021, 'compression_ratio': 0.8260869565217391, 'no_speech_prob': 0.06087257340550423, 'confidence': 0.538, 'words': [{'text': 'a', 'start': np.float64(58.77), 'end': np.float64(59.14), 'confidence': 0.878}, {'text': 'shampoo', 'start': np.float64(59.14), 'end': np.float64(59.56), 'confidence': 0.251}, {'text': 'or', 'start': np.float64(59.56), 'end': np.float64(59.84), 'confidence': 0.216}, {'text': 'a', 'start': np.float64(59.84), 'end': np.float64(60.0), 'confidence': 0.732}, {'text': 'turkish', 'start': np.float64(60.0), 'end': np.float64(60.44), 'confidence': 0.599}, {'text': 'bath.', 'start': np.float64(60.44), 'end': np.float64(60.82), 'confidence': 0.941}]}, {'id': 12, 'seek': 5868, 'start': np.float64(61.34), 'end': np.float64(62.0), 'text': ' Next man.', 'tokens': [50502, 3087, 587, 13, 50528], 'temperature': 0.0, 'avg_logprob': -0.8035992454080021, 'compression_ratio': 0.8260869565217391, 'no_speech_prob': 0.06087257340550423, 'confidence': 0.858, 
'words': [{'text': 'Next', 'start': np.float64(61.34), 'end': np.float64(61.66), 'confidence': 0.908}, {'text': 'man.', 'start': np.float64(61.66), 'end': np.float64(62.0), 'confidence': 0.81}]}], 'language': 'en'}

@ylacombe @eustlb
