Add .narrowband() effect (mulaw, lpc10 codecs) #1348

Merged
merged 4 commits into from
Jul 18, 2024


rouseabout
Contributor

This patch adds an audio codec transformation.

I have found that when applying K2 ASR to speech compressed with mulaw, it is advantageous to augment the training data with these codecs. The transformation resamples the input audio to 8 kHz, encodes and then decodes it using the specified codec, then restores the original sample rate (e.g. 16 kHz).

Open issues:

  • The transformation is called phone(). But maybe a better name is needed?
  • Since it significantly alters the audio, depending on codec, I am wondering how best to test the transformation?

Example use:

cs2 = CutSet.from_manifests(...).phone(codec="mulaw")
cs3 = CutSet.from_manifests(...).phone(codec="lpc10")

libspandsp is required to use the lpc10 codec. Use apt-get install libspandsp-dev on Debian/Ubuntu.
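For readers unfamiliar with the round trip described above, here is a minimal, standard-library-only sketch of the mulaw case. It is not the code from this patch: the function names (mulaw_encode, mulaw_decode, resample_linear, narrowband_mulaw) are hypothetical, the companding uses the continuous mu-law formula with 8-bit quantization rather than the segmented G.711 tables, and the resampler is a naive linear interpolator without an anti-aliasing filter.

```python
import math

MU = 255  # mu-law parameter commonly used for 8-bit telephony companding


def mulaw_encode(x: float) -> int:
    """Compress one sample in [-1, 1] to an 8-bit code in 0..255."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * MU))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1, 1]."""
    y = 2.0 * code / MU - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler (illustration only)."""
    if orig_sr == target_sr or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def narrowband_mulaw(samples, sampling_rate):
    """Resample to 8 kHz, encode then decode, resample back."""
    narrow = resample_linear(samples, sampling_rate, 8000)
    decoded = [mulaw_decode(mulaw_encode(s)) for s in narrow]
    return resample_linear(decoded, 8000, sampling_rate)
```

The output has the same length as the input only when the two resampling steps round-trip exactly (as they do for 16 kHz input), which is one reason the real transform needs care with sample counts.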

Collaborator

@pzelasko pzelasko left a comment


Looks pretty good! I left a few comments. Could you also add unit tests for this transform?



@dataclass
class Phone(AudioTransform):
Collaborator


I suggest calling it Narrowband and renaming the methods to narrowband. Also, upsampling back to the original sampling rate should be optional (restore_orig_sr=True).

Resample input audio to 8000 Hz, apply codec (encode then immediately decode), then resample back to the original sampling rate.
"""

source_sampling_rate: int
Collaborator


We shouldn't need this option at all. You can call get_or_create_resampler directly in __call__ using the input example's actual sampling rate. This way this transform can work with datasets of mixed sampling rates.

@rouseabout
Contributor Author

I have addressed everything except for restore_orig_sr=True. I am not sure how to achieve that!

@pzelasko
Collaborator

pzelasko commented Jun 9, 2024

I have addressed everything except for restore_orig_sr=True. I am not sure how to achieve that!

You are very close! Add a parameter restore_orig_sr=True in def narrowband(self, ...) for cut and recording, and pass the provided argument to the Narrowband constructor. Then you can extend the condition for the second resampling to: if self.restore_orig_sr and sampling_rate != 8000.
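The control flow suggested here can be sketched roughly as follows. This is a hypothetical stand-in, not the Lhotse class: NarrowbandSketch and resample_to_length are invented names, the codec is passed in as a plain callable, and the nearest-neighbour resampler exists only to keep the example self-contained. It also takes the sampling rate from the call rather than from a constructor field, as requested earlier in the review, so inputs with mixed sampling rates can share one transform.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def resample_to_length(samples: List[float], n_out: int) -> List[float]:
    """Nearest-neighbour resampling to exactly n_out samples (illustration only)."""
    n_in = len(samples)
    if n_in == 0 or n_out == n_in:
        return list(samples)
    return [samples[min(n_in - 1, i * n_in // n_out)] for i in range(n_out)]


@dataclass
class NarrowbandSketch:
    # Hypothetical: any function mapping 8 kHz samples to 8 kHz samples.
    codec: Callable[[List[float]], List[float]]
    restore_orig_sr: bool = True

    def __call__(
        self, samples: List[float], sampling_rate: int
    ) -> Tuple[List[float], int]:
        orig_len = len(samples)
        # First resampling: only needed when the input is not already 8 kHz.
        if sampling_rate != 8000:
            n_narrow = round(orig_len * 8000 / sampling_rate)
            samples = resample_to_length(samples, n_narrow)
        samples = self.codec(samples)
        # Second resampling: the extended condition from the comment above.
        if self.restore_orig_sr and sampling_rate != 8000:
            return resample_to_length(samples, orig_len), sampling_rate
        return samples, 8000
```

Returning the output sampling rate alongside the samples makes it explicit that, with restore_orig_sr=False, the caller receives 8 kHz audio and any metadata describing the recording has to change accordingly.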

@rouseabout
Contributor Author

Done, but something extra is needed, because when I apply the transformation with restore_orig_sr=False the following exception occurs:

AudioLoadingError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0, duration=19.22419501133787). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=693887, audio.shape=(1, 153900), recording=Recording(id='0_nb_lpc10', sources=[AudioSource(type='file', channels=[0], source='/home/user/workspace/rtvalid/0.wav')], sampling_rate=44100, num_samples=847787, duration=19.22419501133787, channel_ids=[0], transforms=[{'name': 'Narrowband', 'kwargs': {'codec': 'lpc10', 'restore_orig_sr': False}}])

@pzelasko
Collaborator

If you don't restore the original sampling rate, you'll have to update both the sampling_rate and num_samples properties on the Recording object.
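To illustrate the arithmetic behind the exception above, here is a small sketch using the numbers from the error message. The helper names are hypothetical, and the rounding choice (ceiling) is an assumption; the resampler actually used may round differently. The padding step is an inference rather than something stated in this thread: 153900 is exactly 855 frames of 180 samples, which matches the frame size of LPC-10 at 8 kHz.

```python
def narrowband_metadata(num_samples: int, sampling_rate: int, target_sr: int = 8000):
    """Return (sampling_rate, num_samples, duration) after resampling to target_sr.

    Uses integer ceiling division; this rounding is an assumption.
    """
    new_num_samples = -(-num_samples * target_sr // sampling_rate)
    return target_sr, new_num_samples, new_num_samples / target_sr


def pad_to_frames(num_samples: int, frame_size: int = 180) -> int:
    """Round a sample count up to a whole number of codec frames."""
    return -(-num_samples // frame_size) * frame_size


# Values taken from the error message above.
declared = 847787                  # num_samples declared at 44100 Hz
sr, n, duration = narrowband_metadata(declared, 44100)
loaded = pad_to_frames(n)          # samples actually returned by the transform
diff = declared - loaded           # mismatch reported in the exception
```

With these inputs the sketch gives 153794 samples after resampling, 153900 after padding to whole frames, and a difference of 693887, which agrees with diff=693887 and audio.shape=(1, 153900) in the message. The declared num_samples was still the 44100 Hz count, hence the mismatch.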

@pzelasko pzelasko added this to the v1.25.0 milestone Jul 18, 2024
@pzelasko
Collaborator

Thanks for the contribution, merging!

@pzelasko pzelasko merged commit 18436e9 into lhotse-speech:master Jul 18, 2024
9 of 11 checks passed
@pzelasko pzelasko changed the title augmentation/torchaudio: add Phone effect (mulaw, lpc10 codecs) Add .narrowband() effect (mulaw, lpc10 codecs) Jul 18, 2024