
Performance: readData utterly slow for files with many lines of data #57

Open
FObersteiner opened this issue Feb 24, 2022 · 2 comments

FObersteiner (Contributor) commented Feb 24, 2022

Description

Loading data from small files completes in a reasonable amount of time, but with many lines of data (10k+), readData becomes a real bottleneck.

What I Did

Reading 4.3k lines of data, FFI 1001:

```
%timeit myfile.readData()
67.9 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Reading 86.6k lines of data, FFI 1001:

```
%timeit myfile.readData()
51.5 s ± 2.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

That's nearly a minute per file! If I wanted to load many such files, I'd have to drink a lot of coffee in the meantime ☕👾
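For reference, a minimal way to reproduce this outside of IPython (the file path is a placeholder; nappy.openNAFile is the usual entry point for NASA Ames files):

```python
import time

import nappy  # https://github.com/cedadev/nappy

# Placeholder path: substitute any FFI 1001 file with many data lines.
myfile = nappy.openNAFile("large_ffi1001_file.na")

t0 = time.perf_counter()
myfile.readData()  # parses the whole data section
print(f"readData took {time.perf_counter() - t0:.2f} s")
```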


Tracing the execution of the call to readData, I find …
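For anyone who wants to trace this themselves, here is a quick way to locate the hot spot with the standard-library profiler (a sketch only; the path is a placeholder, and the context-manager form of cProfile.Profile needs Python 3.8+):

```python
import cProfile
import pstats

import nappy

# Placeholder path: any large FFI 1001 file.
myfile = nappy.openNAFile("large_ffi1001_file.na")

with cProfile.Profile() as profiler:
    myfile.readData()

# Show the ten most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```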

agstephens (Member) commented
@FObersteiner, I agree that we should look at this. Do you have publicly downloadable large example files that we could use in unit/integration testing?

FObersteiner (Contributor, Author) commented Mar 10, 2022

@agstephens yup, I was about to create some public sample data from our ozone instruments anyway ;-) You can find it here: https://git.scc.kit.edu/FObersteiner/pyFairoproc/-/tree/master/samples.

The one that's problematic in this context (nappy reading data) is the cl_photometer file (~86k lines of data, just one variable).
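Once those samples are wired into the test suite, a simple regression check could look like this (the local file name and the 5 s time budget are assumptions on my part):

```python
import time

import nappy

# Placeholder file name: the cl_photometer sample from
# https://git.scc.kit.edu/FObersteiner/pyFairoproc/-/tree/master/samples
SAMPLE = "samples/cl_photometer.ames"

def test_readdata_performance():
    f = nappy.openNAFile(SAMPLE)
    t0 = time.perf_counter()
    f.readData()
    elapsed = time.perf_counter() - t0
    # ~86k data lines; fail if parsing regresses badly (budget is arbitrary).
    assert elapsed < 5.0, f"readData took {elapsed:.1f} s"
```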
