
Add GigaSpeech 2 recipe #1365

Open · wants to merge 5 commits into base: master
Conversation

@yfyeung (Contributor) commented on Jun 28, 2024

This PR adds a recipe for GigaSpeech 2.
GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refined consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese. The GigaSpeech 2 test sets reflect realistic speech recognition scenarios and mirror the real-world performance of an ASR system on these low-resource languages.

For more details, please visit:
Dataset: https://huggingface.co/datasets/speechcolab/gigaspeech2
Preprint paper: https://arxiv.org/pdf/2406.11546
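
For a sense of how the recipe would be used once merged, here is a minimal usage sketch. The import path, function name, and parameters below are assumptions based on lhotse's usual prepare_* recipe conventions; they are not confirmed by this PR, so check the merged code for the actual signature:

# A minimal usage sketch. `prepare_gigaspeech2` and its parameters are
# assumed from lhotse's usual recipe conventions, not taken from this PR.
from lhotse.recipes import prepare_gigaspeech2  # hypothetical import

manifests = prepare_gigaspeech2(
    corpus_dir="/path/to/gigaspeech2",  # local copy of the HF dataset
    output_dir="data/manifests",
    num_jobs=4,
)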

@pzelasko (Collaborator) left a comment

Thanks!! The recipe looks good to me, although I have one suggestion. If you could re-use the streaming manifest-writing mechanism from the GigaSpeech 1 recipe, it would allow users to prepare this dataset with minimal memory usage. As-is, it will take a lot of CPU memory to hold the entire manifest in memory before writing it to disk. See:

# Imports needed by this excerpt (as in lhotse's GigaSpeech recipe):
from itertools import repeat

from tqdm import tqdm

from lhotse import CutSet, RecordingSet, SupervisionSet
from lhotse.parallel import parallel_map
from lhotse.qa import fix_manifests, validate_recordings_and_supervisions

with RecordingSet.open_writer(
    output_dir / f"gigaspeech_recordings_{part}.jsonl.gz"
) as rec_writer, SupervisionSet.open_writer(
    output_dir / f"gigaspeech_supervisions_{part}.jsonl.gz"
) as sup_writer, CutSet.open_writer(
    output_dir / f"gigaspeech_cuts_{part}.jsonl.gz"
) as cut_writer:
    for recording, segments in tqdm(
        parallel_map(
            parse_utterance,
            gigaspeech.audios("{" + part + "}"),
            repeat(gigaspeech.gigaspeech_dataset_dir),
            num_jobs=num_jobs,
        ),
        desc="Processing GigaSpeech JSON entries",
    ):
        # Fix and validate the recording + supervisions
        recordings, segments = fix_manifests(
            recordings=RecordingSet.from_recordings([recording]),
            supervisions=SupervisionSet.from_segments(segments),
        )
        validate_recordings_and_supervisions(
            recordings=recordings, supervisions=segments
        )
        # Create the cut since most users will need it anyway.
        # There will be exactly one cut since there's exactly one recording.
        cuts = CutSet.from_manifests(
            recordings=recordings, supervisions=segments
        )
        # Write the manifests
        rec_writer.write(recordings[0])
        for s in segments:
            sup_writer.write(s)
        cut_writer.write(cuts[0])
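
A note on the design: because the writers above emit one JSON line per manifest item into a .jsonl.gz file, downstream code can read the result back lazily and keep the same flat memory profile. A minimal read-back sketch using lhotse's load_manifest_lazy; the manifest file name is a placeholder, not a path defined by this PR:

# Minimal read-back sketch; load_manifest_lazy iterates the jsonl.gz
# line by line instead of materializing the whole manifest in memory.
# The path below is a placeholder for whichever part was prepared.
from lhotse import load_manifest_lazy

cuts = load_manifest_lazy("data/manifests/gigaspeech_cuts_train.jsonl.gz")
for cut in cuts:
    pass  # process one cut at a time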

@pzelasko added this to the v1.25.0 milestone on Jul 3, 2024
@yfyeung (Contributor, Author) commented on Jul 3, 2024

> Thanks!! The recipe looks good to me, although I have one suggestion. If you could re-use the streaming manifest-writing mechanism from the GigaSpeech 1 recipe, it would allow users to prepare this dataset with minimal memory usage. [...]

Sure, I will implement this later.
