Speed up Feature.process_*() #103

Merged: 2 commits into main from speed-up-feature, Mar 13, 2023

Conversation

@frankenjoe (Collaborator) commented Mar 10, 2023

Relates to #100

Speeds up Feature.process_*() by avoiding calls to pd.concat() on many small pd.DataFrame objects. Instead, we now concatenate plain lists / numpy arrays and create the final pd.DataFrame from those in a single step.
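
For illustration, a minimal sketch of the idea (a simplification, not the actual feature.py code; the names before/after are made up):

import numpy as np
import pandas as pd

def before(rows, index):
    # old pattern: one tiny pd.DataFrame per file, concatenated repeatedly
    frames = [pd.DataFrame([row], index=[idx]) for row, idx in zip(rows, index)]
    return pd.concat(frames)

def after(rows, index):
    # new pattern: stack the plain lists / arrays first
    # and create the pd.DataFrame only once at the end
    return pd.DataFrame(np.vstack(rows), index=index)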

Benchmark

import time
import audb
import audinterface


db = audb.load(  # load the emodb database as 16 kHz mono WAV files
    'emodb',
    version='1.3.0',
    format='wav',
    sampling_rate=16000,
    mixdown=True,
)

def process_func(x, sr):
    # two features per signal: mean and standard deviation
    return [x.mean(), x.std()]

feature = audinterface.Feature(
    ['mean', 'std'],
    process_func=process_func,
)

t = time.time()
feature.process_files(db.files)
print(time.time() - t)  # elapsed time in seconds

  • main branch: ~8.5 s
  • this branch: ~4.5 s

@frankenjoe (Collaborator, Author) commented Mar 10, 2023

Together with #102 we should be down to ~0.35 s, which is close to the target set in #100.

@codecov bot commented Mar 10, 2023

Codecov Report

Merging #103 (c11ca68) into main (7ad15b7) will not change coverage.
The diff coverage is 100.0%.

Impacted Files                 Coverage Δ
audinterface/core/feature.py   100.0% <100.0%> (ø)

@frankenjoe requested a review from hagenw on March 10, 2023, 08:06
@hagenw (Member) commented Mar 13, 2023

You are not only concatenating lists, but also numpy arrays. But I guess the problem before was the concatenation with pandas, which seems to be much slower than concatenating with numpy (most likely due to index-related overhead).
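
A rough micro-benchmark of that difference (a hypothetical sketch; the numbers depend on the machine):

import time
import numpy as np
import pandas as pd

chunks = [np.random.rand(1, 2) for _ in range(10_000)]

t = time.time()
pd.concat([pd.DataFrame(c) for c in chunks])  # many small pandas objects
print('pd.concat:     ', time.time() - t)

t = time.time()
pd.DataFrame(np.concatenate(chunks))  # one numpy concatenation
print('np.concatenate:', time.time() - t)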

@frankenjoe (Collaborator, Author) commented Mar 13, 2023

You are not only concatenating lists, but also numpy arrays

Actually, in the win_dur=None case we even pre-allocate the numpy array and assign directly to it. When a sliding window is used this is not possible, because we cannot know the matrix size in advance, so there we rely on np.concatenate(), which is still much faster than pd.concat().
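
A hypothetical sketch of the two paths (variable names made up, not the actual audinterface internals):

import numpy as np

num_files, num_features = 100, 2

# win_dur=None: one row per file, so the output shape is known up front
# and we can pre-allocate once and assign into the array
features = np.empty((num_files, num_features))
for idx in range(num_files):
    features[idx, :] = [0.0, 0.0]  # stand-in for the process_func() result

# sliding window: the number of frames per file is unknown in advance,
# so we collect variable-size blocks and concatenate them once at the end
blocks = [
    np.zeros((np.random.randint(1, 10), num_features))
    for _ in range(num_files)
]
features = np.concatenate(blocks)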

@hagenw (Member) commented Mar 13, 2023

OK, thanks for the clarification.

@hagenw merged commit f4687e1 into main on Mar 13, 2023
@hagenw deleted the speed-up-feature branch on March 13, 2023, 09:58
@hagenw (Member) commented Mar 13, 2023

With the incorporation of #102 the benchmark is down to ~0.26 s for me.
