
Speed up feature extraction #100

Closed
frankenjoe opened this issue Feb 8, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments

@frankenjoe (Collaborator) commented Feb 8, 2023

When extracting features with Feature we currently rely on Process under the hood, which returns a pd.Series with feature vectors. We then convert these to a list and afterwards call pd.concat(list) to combine them into a single matrix. The last step can take quite long (sometimes as long as, or longer than, the feature extraction itself). We could speed this up by pre-allocating a matrix beforehand and directly assigning the values. At least when not processing with a sliding window this should be possible; see the sketch below.
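
To make the bottleneck concrete, here is a minimal sketch (purely illustrative, not the actual audinterface internals) contrasting the two patterns:

import time

import numpy as np
import pandas as pd

n_rows, n_features = 5000, 2

# current pattern: one small DataFrame per file,
# combined with pd.concat() at the end
t = time.time()
frames = [pd.DataFrame(np.zeros((1, n_features))) for _ in range(n_rows)]
df_concat = pd.concat(frames)
print('concat:', time.time() - t)

# proposed pattern: pre-allocate the final matrix
# and assign one row per file
t = time.time()
data = np.empty((n_rows, n_features), dtype=np.float32)
for idx in range(n_rows):
    data[idx, :] = np.zeros(n_features)
df_prealloc = pd.DataFrame(data)
print('prealloc:', time.time() - t)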

@frankenjoe (Collaborator, Author)

To demonstrate that there's quite some room for improvement:

import time

import numpy as np
import pandas as pd

import audb
import audinterface
import audiofile


db = audb.load(
    'emodb',
    version='1.3.0',
    format='wav',
    sampling_rate=16000,
    mixdown=True,
)
files = db.files

def process_func(x, sr):
    # return one feature vector per signal: [mean, std]
    return [x.mean(), x.std()]

# slow: audinterface.Feature combines the per-file results with pd.concat()

feature = audinterface.Feature(
    ['mean', 'std'],
    process_func=process_func,
)

t = time.time()
df = feature.process_files(files)
print(time.time() - t)

# fast: pre-allocate the result matrix and assign each row directly

t = time.time()
data = np.empty(
    (len(files), 2),
    dtype=np.float32,
)

for idx, file in enumerate(files):
    signal, sampling_rate = audiofile.read(file)
    data[idx, :] = process_func(
        signal,
        sampling_rate,
    )

df_fast = pd.DataFrame(
    data,
    index=df.index,
    columns=df.columns,
)
print(time.time() - t)

pd.testing.assert_frame_equal(df, df_fast)

Output:

5.972992181777954
0.17418813705444336

@hagenw (Member) commented Feb 9, 2023

> We then convert these to a list

I guess the idea for a solution is to avoid this step?

@frankenjoe (Collaborator, Author)

Yes, especially the concatenation of the DataFrames seems awfully slow. So the idea would be to create a matrix of the expected size (samples × features) and directly assign the extracted features. This is of course only possible if no sliding window is selected, as otherwise we cannot know the shape of the final matrix. A rough sketch follows below.
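
For illustration, a rough sketch of that branch (the function and its arguments are made up for this example and are not part of the actual audinterface API):

import numpy as np
import pandas as pd

import audiofile

def extract(files, process_func, num_features, win_dur=None):
    if win_dur is not None:
        # sliding window: the number of output rows is unknown upfront,
        # so we would have to keep collecting and concatenating frames
        raise NotImplementedError('fall back to the pd.concat() path')
    # no sliding window: exactly one row per file,
    # so the (samples x features) matrix can be pre-allocated
    data = np.empty((len(files), num_features), dtype=np.float32)
    for idx, file in enumerate(files):
        signal, sampling_rate = audiofile.read(file)
        data[idx, :] = process_func(signal, sampling_rate)
    return pd.DataFrame(data, index=pd.Index(files, name='file'))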

@hagenw (Member) commented Mar 13, 2023

After #102, #103, and #104, the above test now returns the following for me:

0.23550820350646973
0.17041683197021484

Can we close here, or is there further room for improvement?

@frankenjoe (Collaborator, Author)

I guess not; the comparison is also not 100% fair, as in the second case we rely on the index created by Feature. What is still missing is a speed-up of Segment. So we either expand this issue or create a new one.

@hagenw (Member) commented Mar 13, 2023

I created #106 to track Segment and will close here.

hagenw closed this as completed Mar 13, 2023