# Project on Classification of Audio-Based Datasets

## Overview

Welcome to the Audio Classifier Project! This repository showcases the development of two distinct audio classifiers trained on separate datasets: one model classifies Capuchin bird vocalizations, while the other classifies a variety of urban sounds from the UrbanSound8K dataset. Each model addresses a different aspect of audio classification, and together they form a broader exploration of audio recognition.

## Capuchin Bird Audio Classification

In the first part of the project, we focused on recognizing and classifying the distinctive vocalizations of Capuchin birds. We assembled a specialized Capuchin bird audio dataset containing recordings of different species of these birds. Using Mel spectrograms and a Convolutional Neural Network (CNN) architecture, we built a model capable of distinguishing between different Capuchin bird calls. Such a model can aid researchers studying these birds and their vocalizations in different contexts.
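As a point of reference, here is a minimal PyTorch sketch of a CNN that accepts single-channel Mel-spectrogram inputs. The layer sizes, input dimensions, and number of classes are placeholder assumptions for illustration, not the exact architecture used in this repository.

```python
import torch
import torch.nn as nn

class MelSpectrogramCNN(nn.Module):
    """Small CNN over (batch, 1, n_mels, n_frames) Mel-spectrogram tensors."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # keeps the head independent of clip length
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 8 spectrograms with 64 Mel bins and 128 time frames.
logits = MelSpectrogramCNN(n_classes=2)(torch.randn(8, 1, 64, 128))
print(logits.shape)  # torch.Size([8, 2])
```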

## UrbanSound8K Audio Classification

The second phase of the project involved the UrbanSound8K dataset, which encompasses a wide range of urban environmental sounds. Our objective was to design a classifier that can differentiate between various urban sounds, including sirens, car horns, and drilling noises. Leveraging Mel spectrograms and a similar CNN architecture, we developed a second model that can identify and categorize urban sound events accurately. This model has applications in urban soundscape analysis, noise pollution assessment, and city planning.
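The sketch below shows one way such a dataset could be indexed for training, assuming the standard UrbanSound8K layout (a `metadata/UrbanSound8K.csv` file and audio clips organized into `audio/fold1` through `audio/fold10`); it is illustrative rather than the exact data pipeline used here.

```python
import os
import pandas as pd

def load_fold(dataset_root: str, fold: int):
    """Return (audio path, class id) pairs for one UrbanSound8K fold."""
    meta = pd.read_csv(os.path.join(dataset_root, "metadata", "UrbanSound8K.csv"))
    rows = meta[meta["fold"] == fold]
    paths = [
        os.path.join(dataset_root, "audio", f"fold{fold}", name)
        for name in rows["slice_file_name"]
    ]
    # 10 classes in total, e.g. siren, car_horn, drilling
    return list(zip(paths, rows["classID"].tolist()))

# Example (path is hypothetical):
# samples = load_fold("/path/to/UrbanSound8K", fold=1)
```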

## Mel Spectrograms

In our project, we employed Mel spectrograms as a vital preprocessing step to transform raw audio data into a format suitable for our Convolutional Neural Network (CNN) classifiers. Mel spectrograms offer a powerful way to represent audio signals in a visual form that captures both their frequency and temporal characteristics.

### What is a Mel Spectrogram?

A Mel spectrogram is a 2D representation that shows how the intensity of different frequencies in a sound signal evolves over time. It is generated by breaking the audio signal into small overlapping segments and computing the energy distribution across a set of Mel-scaled frequency bands for each segment. The result is a matrix in which each element represents the magnitude of a particular frequency band at a specific time interval.
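To make the framing step concrete, here is a small PyTorch sketch of it: the waveform is cut into short overlapping windows and each window is transformed to the frequency domain, producing a frequency-by-time matrix. Applying a bank of Mel-scaled filters to that matrix (which is what `torchaudio.transforms.MelSpectrogram` does internally) then yields the Mel spectrogram. The sample rate and window settings below are illustrative assumptions.

```python
import torch

waveform = torch.randn(16000)          # 1 second of placeholder audio at 16 kHz
n_fft, hop_length = 1024, 512          # ~64 ms windows with 50% overlap at 16 kHz

# Short-time Fourier transform: one spectrum per overlapping window.
spec = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
).abs()                                # magnitude per (frequency bin, time frame)

print(spec.shape)                      # (n_fft // 2 + 1, n_frames)
```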

### Why Mel Spectrograms?

Using Mel spectrograms has several advantages:

- **Frequency Scaling:** The human auditory system does not perceive frequency linearly across the spectrum. Mel spectrograms apply a roughly logarithmic frequency scaling that better matches human perception, making them well suited to tasks such as speech and audio recognition (see the mapping after this list).

- **Feature Extraction:** By converting raw audio into image-like data, Mel spectrograms enable the application of image processing techniques and neural networks, such as CNNs, which excel at analyzing visual patterns.
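For reference, a commonly used mapping from frequency $f$ in Hz to the Mel scale (the HTK convention; other variants exist) is:

$$
m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
$$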


### Mel Spectrogram Example

Below is an example of a Mel spectrogram generated using torchaudio from an audio recording of a Capuchin bird call:

*(Figure: Mel spectrogram of a Capuchin bird call.)*

In the spectrogram, the x-axis represents time, the y-axis represents different frequency bins (converted to Mel scale), and the color intensity represents the magnitude of the corresponding frequency component at a given time.
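Below is a minimal sketch of how such a plot could be produced with torchaudio and matplotlib; the audio file name and the spectrogram settings are placeholder assumptions.

```python
import torchaudio
import matplotlib.pyplot as plt

# Load a clip (hypothetical file name) and compute a log-scaled Mel spectrogram.
waveform, sample_rate = torchaudio.load("capuchin_call.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64
)(waveform)
mel_db = torchaudio.transforms.AmplitudeToDB()(mel)  # compress magnitudes to decibels

# Time runs along the x-axis, Mel frequency bins along the y-axis.
plt.imshow(mel_db[0].numpy(), origin="lower", aspect="auto", cmap="magma")
plt.xlabel("Time frames")
plt.ylabel("Mel frequency bins")
plt.colorbar(label="Magnitude (dB)")
plt.show()
```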

By utilizing Mel spectrograms, we bridge the gap between audio and visual analysis, enabling our models to recognize intricate audio patterns and make accurate predictions.