When plotting line charts with many columns or rows, DataFrame.plot() currently adds one Line2D object per column. This incurs significant overhead in large datasets.
Replacing this with a single LineCollection (from matplotlib.collections) can yield substantial speedups. In my benchmarks, plotting via LineCollection was ~2.5× faster on large DataFrames with many columns.
Minimal example:
# Imports and data generation
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection

num_rows = 500
num_cols = 2000
test_df = pd.DataFrame(np.random.randn(num_rows, num_cols).cumsum(axis=0))

# Simply using DataFrame.plot (5.6 secs)
test_df.plot(legend=False, figsize=(12, 8))
plt.show()

# Optimized version using LineCollection (2.2 secs)
x = np.arange(len(test_df.index))
lines = [np.column_stack([x, test_df[col].values]) for col in test_df.columns]
default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = list(itertools.islice(itertools.cycle(default_colors), len(lines)))
line_collection = LineCollection(lines, colors=color_cycle)
fig, ax = plt.subplots(figsize=(12, 8))
ax.add_collection(line_collection)
ax.margins(0.05)
plt.show()
Note: the ~2.5x speedup applies to DataFrames with an integer index. For DataFrames with a DatetimeIndex, the speedup is ~27x when combined with the workaround in #61398.
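For the DatetimeIndex case, one possible way to feed dates into a LineCollection is to convert them to Matplotlib's float date numbers first. The sketch below only illustrates that idea. It is not the workaround from #61398, the helper name `plot_lines_collection` is hypothetical, and the small DataFrame is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection


def plot_lines_collection(df, ax):
    """Hypothetical helper: draw every column of `df` as one LineCollection."""
    if isinstance(df.index, pd.DatetimeIndex):
        # Convert timestamps to Matplotlib date numbers (floats).
        x = mdates.date2num(df.index.to_pydatetime())
        ax.xaxis_date()  # format the x ticks as dates
    else:
        x = np.arange(len(df.index))
    values = df.to_numpy()
    # One (n_rows, 2) array of (x, y) points per column.
    segments = [np.column_stack([x, values[:, i]]) for i in range(values.shape[1])]
    collection = LineCollection(segments)
    ax.add_collection(collection)
    ax.autoscale_view()  # fit the axis limits to the added collection
    return collection


# Small made-up dataset with a daily DatetimeIndex.
index = pd.date_range("2024-01-01", periods=50, freq="D")
dt_df = pd.DataFrame(np.random.randn(50, 20).cumsum(axis=0), index=index)

fig, ax = plt.subplots()
collection = plot_lines_collection(dt_df, ax)
fig.canvas.draw()
```

Colors are left at the LineCollection default here; the color cycling from the minimal example could be passed in through `colors=` in the same way.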
Thank you for considering this suggestion!
Confirmed on main and in my own testing as well. I am aware that this is closely related to #61398, but I think it should be kept open as a separate issue since the two tackle different performance bottlenecks.
I was reviewing your idea and realized it might be a bit too much for me to handle alone. I thought I could at least come up with some performance benchmarks that track performance issues with larger datasets. I'll free up the assignment so someone else can take a crack at it, and I'm open to any feedback on the PR.
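A benchmark of the kind mentioned above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function names and the dataset size are assumptions, it uses plain `time.perf_counter` instead of a benchmarking framework, and it renders with the Agg backend so that drawing time is included without opening a window.

```python
import time

import matplotlib
matplotlib.use("Agg")  # render off-screen so timings include drawing

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection


def time_dataframe_plot(df):
    """Seconds taken by DataFrame.plot plus one full render."""
    start = time.perf_counter()
    ax = df.plot(legend=False)
    ax.figure.canvas.draw()
    elapsed = time.perf_counter() - start
    plt.close(ax.figure)
    return elapsed


def time_line_collection(df):
    """Seconds taken by a single LineCollection plus one full render."""
    start = time.perf_counter()
    x = np.arange(len(df.index))
    values = df.to_numpy()
    segments = [np.column_stack([x, values[:, i]]) for i in range(values.shape[1])]
    fig, ax = plt.subplots()
    ax.add_collection(LineCollection(segments))
    ax.autoscale_view()
    fig.canvas.draw()
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


# Deliberately smaller than the 500 x 2000 example so the sketch runs quickly.
bench_df = pd.DataFrame(np.random.randn(100, 200).cumsum(axis=0))
timings = {
    "DataFrame.plot": time_dataframe_plot(bench_df),
    "LineCollection": time_line_collection(bench_df),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Single-run timings like these are noisy, so repeating each measurement and taking the minimum or median would give more stable numbers.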