Consider using object instead of string in index #418

hagenw · 2024-02-14T08:14:42Z

It turns out that even in pandas 2.2.0 the new string dtype is still slower than object dtype for some tasks, and unfortunately one of them is index lookup:

import pandas as pd

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    # Build index and frame with the same dtype so only the dtype varies
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    # %timeit is an IPython magic, so run this in IPython or Jupyter
    %timeit df.loc['index-2000']

which returns

object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
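For anyone who wants to reproduce this outside IPython, here is a minimal sketch of the same measurement using the standard `timeit` module. The helper name `time_lookup` and its parameters are made up for illustration, and it reports the best of several runs rather than the mean that `%timeit` prints, so the numbers will not match the ones above exactly.

```python
import timeit

import pandas as pd


def time_lookup(dtype, points=1_000_000, key="index-2000", number=10_000):
    """Return the best observed time in seconds for one df.loc[key] lookup.

    Hypothetical helper, not part of pandas or audformat.
    """
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    data = [f"data-{n}" for n in range(points)]
    df = pd.DataFrame(data, index=index, dtype=dtype)
    df.loc[key]  # warm-up lookup so one-time index setup is not timed
    runs = timeit.repeat(lambda: df.loc[key], number=number, repeat=5)
    return min(runs) / number


if __name__ == "__main__":
    for dtype in ["object", "string", "string[pyarrow]"]:
        try:
            seconds = time_lookup(dtype)
        except ImportError:
            # "string[pyarrow]" needs the optional pyarrow package
            print(f"{dtype}: skipped (pyarrow not installed)")
            continue
        print(f"{dtype}: {seconds * 1e6:.2f} µs per lookup")
```

Absolute timings depend on the machine and the pandas version, so only the relative difference between the dtypes is meaningful.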

So we might consider switching back to storing the file index as object dtype, as we now do for the dependencies in audb (audeering/audb#371). The only problem is that in audb the change is hidden from the user, whereas here it would be a breaking change.
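As a rough sketch of what such a switch means on the pandas side, the index of an existing frame can be cast to object dtype with `Index.astype`. The helper `index_to_object` below is hypothetical and only covers a single-level index; it is not how audformat or audb actually implement the change.

```python
import pandas as pd


def index_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose index uses object dtype.

    Hypothetical helper for illustration, assuming a single-level index.
    """
    out = df.copy()
    out.index = out.index.astype("object")
    return out


index = pd.Index([f"index-{n}" for n in range(5)], dtype="string")
df = pd.DataFrame({"value": range(5)}, index=index)
df_obj = index_to_object(df)

# Lookups by label behave the same; only the index dtype differs
print(df.index.dtype, "->", df_obj.index.dtype)
```

Code that checks for the string dtype on the index would see a different dtype after such a cast, which is where the breaking change for users would come from.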

hagenw added the labels enhancement (New feature or request) and question (Further information is requested) on Feb 14, 2024
