
Speed up feature extraction #100

Closed
frankenjoe opened this issue Feb 8, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments

@frankenjoe (Collaborator) commented Feb 8, 2023

When extracting features with Feature we currently rely on Process under the hood, which returns a pd.Series with feature vectors. We then convert these to a list and afterwards call pd.concat(list) to combine them into a single matrix. The last step can take quite long (sometimes as long as, or longer than, the feature extraction itself). We could speed this up by pre-allocating a matrix beforehand and directly assigning the values. At least when not processing with a sliding window this should be possible; see the sketch below.
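
To make the bottleneck concrete, here is a minimal sketch (purely illustrative, not the actual audinterface internals) contrasting the two patterns:

import time

import numpy as np
import pandas as pd

n_rows, n_features = 5000, 2

# current pattern: one small DataFrame per file,
# combined with pd.concat() at the end
t = time.time()
frames = [pd.DataFrame(np.zeros((1, n_features))) for _ in range(n_rows)]
df_concat = pd.concat(frames)
print('concat:', time.time() - t)

# proposed pattern: pre-allocate the final matrix
# and assign one row per file
t = time.time()
data = np.empty((n_rows, n_features), dtype=np.float32)
for idx in range(n_rows):
    data[idx, :] = np.zeros(n_features)
df_prealloc = pd.DataFrame(data)
print('prealloc:', time.time() - t)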

@frankenjoe (Collaborator, Author)

To demonstrate that there's quite some room for improvement:

import time

import numpy as np
import pandas as pd

import audb
import audinterface
import audiofile


db = audb.load(
    'emodb',
    version='1.3.0',
    format='wav',
    sampling_rate=16000,
    mixdown=True,
)
files = db.files

def process_func(x, sr):
    # return one feature vector per signal: [mean, std]
    return [x.mean(), x.std()]

# slow: audinterface.Feature combines the per-file results with pd.concat()

feature = audinterface.Feature(
    ['mean', 'std'],
    process_func=process_func,
)

t = time.time()
df = feature.process_files(files)
print(time.time() - t)

# fast: pre-allocate the result matrix and assign each row directly

t = time.time()
data = np.empty(
    (len(files), 2),
    dtype=np.float32,
)

for idx, file in enumerate(files):
    signal, sampling_rate = audiofile.read(file)
    data[idx, :] = process_func(
        signal,
        sampling_rate,
    )

df_fast = pd.DataFrame(
    data,
    index=df.index,
    columns=df.columns,
)
print(time.time() - t)

pd.testing.assert_frame_equal(df, df_fast)

Output:

5.972992181777954
0.17418813705444336

@hagenw (Member) commented Feb 9, 2023

> We then convert these to a list

I guess the idea for a solution is to avoid this step?

@frankenjoe (Collaborator, Author)

Yes, especially the concatenation of the DataFrames seems awfully slow. So the idea would be to create a matrix of the expected size (samples × features) and directly assign the extracted features. This is of course only possible if no sliding window is selected, as otherwise we cannot know the shape of the final matrix. A rough sketch follows below.
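
For illustration, a rough sketch of that branch (the function and its arguments are made up for this example and are not part of the actual audinterface API):

import numpy as np
import pandas as pd

import audiofile

def extract(files, process_func, num_features, win_dur=None):
    if win_dur is not None:
        # sliding window: the number of output rows is unknown upfront,
        # so we would have to keep collecting and concatenating frames
        raise NotImplementedError('fall back to the pd.concat() path')
    # no sliding window: exactly one row per file,
    # so the (samples x features) matrix can be pre-allocated
    data = np.empty((len(files), num_features), dtype=np.float32)
    for idx, file in enumerate(files):
        signal, sampling_rate = audiofile.read(file)
        data[idx, :] = process_func(signal, sampling_rate)
    return pd.DataFrame(data, index=pd.Index(files, name='file'))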

@hagenw (Member) commented Mar 13, 2023

After #102, #103, and #104, the above test now returns the following for me:

0.23550820350646973
0.17041683197021484

Can we close here, or is there further room for improvement?

@frankenjoe (Collaborator, Author)

I guess not; the comparison is also not 100% fair, as in the second case we rely on the index created by Feature. What is still missing is a speed-up of Segment. So we either expand this issue or create a new one.

@hagenw (Member) commented Mar 13, 2023

I created #106 to track Segment and will close here.

hagenw closed this as completed Mar 13, 2023