
Questions for a conference talk #1390

Closed
ianozsvald opened this issue Apr 23, 2020 · 8 comments
Labels
question ❓ Questions about Modin

Comments

@ianozsvald

Hello, I'm giving a conference talk on Saturday, I'm spending a few slides on Modin. I have little practical experience, I've dabbled in recent months and my current benchmarks are confusing. I've read "Towards Scalable Dataframe Systems" (2020), "Scaling Interactive Data Science Transparently with Modin" (2018) and your site.

I've also queried colleagues, and of the few that have used Modin their advice has been quite mixed (some had speed-ups, some had slow-downs, some had crashes). I'd like to tell a sensible story, so I have a couple of questions; if you have any input it'd be very much appreciated (and apologies for the short timeframe).

Am I right thinking that if I use the current version of Modin + Ray, if I take a DataFrame (in RAM with say 100_000_000 rows) and instantiate a Modin Pandas DataFrame, then a Ray copy will be built? Using my ipython_memory_profiler to track RAM usage this seems to be the case.

import os
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import modin.pandas as pd_md # setup modin to use Ray
import pandas as pd
import numpy as np

NBR = 100_000_000 # total RAM usage circa 200MB
df = pd.DataFrame({'a': np.random.uniform(0, 100, NBR), 'b': np.random.uniform(0, 100, NBR)})
df['c'] = df['a'] < 50
# costs 1.6GB, total RAM usage circa 1.8GB

df_md = pd_md.DataFrame(df)
# costs a further 2.6GB, total RAM usage circa 4.4GB
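One way to sanity-check the Python-side allocations outside a notebook is the stdlib `tracemalloc`, which traces numpy's buffer allocations. This is a minimal sketch on a deliberately smaller array (it does not see Ray's object store; for that, a tool like the one above or OS-level RSS is needed):

```python
import tracemalloc
import numpy as np

N = 1_000_000  # smaller than the 100M rows above, for a quick check
tracemalloc.start()
a = np.random.uniform(0, 100, N)  # float64: 8 bytes per element
b = np.random.uniform(0, 100, N)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

mb = current / 1e6
print(f"traced allocations: {mb:.0f} MB")  # two float64 columns: ~16 MB
```

Scaling the same arithmetic up, 100M rows of two float64 columns is ~1.6GB, which matches the pandas figure measured above.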

Does the above meet your expectation? I'm on an Intel 64 bit laptop, 32GB RAM, 4 i7 cores. Why would the Modin df cost 1.5x the RAM of the Pandas df?

With the above dataframe and some simple benchmarks I'm seeing Modin under-perform Pandas. These benchmarks are performed using %timeit in a Notebook, on multiple restarts of the kernel, without running out of RAM. I note for Modin in htop lots of "high red cpu", which indicates heavy kernel API calling and might relate to high memory traffic (but I've not diagnosed beyond that).

df['a'] + df['b'], Pandas circa 240ms, Modin circa 14s 
df['a'].median(), Pandas 2.9s, Modin 5s (Modin has wide variance on this, %timeit can report 3s-7s as its result!)
df['a'].isna(), Pandas 100ms, Modin 80ms (Modin consistently a bit faster)
df['a'].max(), Pandas 600ms, Modin 400ms (Modin consistently a bit faster)
df['a'].mean(), Pandas 611ms, Modin 4.5s
df.groupby('c').mean(), Pandas 2.4s, Modin 12s
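For reference, a standalone equivalent of these `%timeit` microbenchmarks can be sketched with the stdlib `timeit` module (numpy-only here so it runs anywhere; swap in the pandas or Modin Series to reproduce the numbers above):

```python
import timeit
import numpy as np

a = np.random.uniform(0, 100, 1_000_000)
b = np.random.uniform(0, 100, 1_000_000)

# %timeit reports the best of several repeats; timeit.repeat mirrors that
best = min(timeit.repeat(lambda: a + b, number=10, repeat=5)) / 10
print(f"a + b: {best * 1e3:.3f} ms per loop")
```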

The above benchmarks are just microbenchmarks on 100M rows of 3 columns of randomly generated data, they're certainly not clever but they weren't giving me the results I was expecting.

Can you give some guidance as to whether I'm seeing results you wouldn't expect? Perhaps 100M rows is too few and you'd suggest more?

@ianozsvald ianozsvald added the question ❓ Questions about Modin label Apr 23, 2020
@devin-petersohn
Collaborator

Hi @ianozsvald thanks for the question!

A couple of notes on the Modin architecture at a high level then I can answer more specifics:

  • Modin is a column store, and can support billions+ of columns. What this means in practice is that more columns=more parallelism. When a particular operation is done on a single column, there is a limit to the parallelism possible based on this architecture.
  • Not everything is optimized yet, but we are working on it. Joins (`df['a'] + df['b']`), `mean`, and `median` will all be improved with better optimization in the near future. They go through similar codepaths (except for joins, of course). The same is true for `groupby.mean`. There are certainly lots of improvements in the works that I am excited about. Unfortunately, engineering time has been the biggest bottleneck for these improvements, not limitations in the architecture.
  • Modin is optimized for ingesting data from file sources, the performance going from an in-memory numpy array or DataFrame can sometimes be worse.
  • Modin relies on Ray (in this case) for both execution and storage, so Ray's bottlenecks are our own. Modin stresses Ray harder than any other application, which may be why crashes were observed. I too have observed some crashes and report them to the Ray team.

Does the above meet your expectation? I'm on an Intel 64 bit laptop, 32GB RAM, 4 i7 cores. Why would the Modin df cost 1.5x the RAM of the Pandas df?

Ray is making a copy of the data into its object store, and then freeing the memory from the task. What you are seeing is probably those Python workers in the process of freeing all the memory. In general, memory is something we're working to optimize, and soon (by July) we will be decoupling the execution from the storage so we can optimize this better.
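The copy-into-the-object-store behaviour can be illustrated without Ray itself. This is a rough stdlib analogy using `multiprocessing.shared_memory`, not Ray's actual implementation: the data is copied into a shared segment that workers can read, and resident memory is briefly doubled until the original is reclaimed.

```python
import numpy as np
from multiprocessing import shared_memory

arr = np.random.uniform(0, 100, 1_000_000)

# Copying into a shared-memory segment briefly doubles resident memory,
# roughly what happens when Ray ingests a DataFrame into its object store;
# the original is only reclaimed once the workers release it.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr  # the explicit copy

copied_ok = bool(np.array_equal(view, arr))
del view       # release the buffer view before closing the segment
shm.close()
shm.unlink()
print("copy matches original:", copied_ok)
```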

The above benchmarks are just microbenchmarks on 100M rows of 3 columns of randomly generated data, they're certainly not clever but they weren't giving me the results I was expecting.

A couple of observations:

  • All operations were done on a single Series. We have limited optimizations in place for operations on a single Series; most of the time users are not bottlenecked by these operations. We focused on optimizing the expensive parts of DataFrame computation: e.g. `read_csv`, expensive `apply` functions, or functions on larger DataFrames. The center of our library is making users more productive, so we focused on the pain points.
  • I can reproduce your numbers, but if I rewrite the benchmark slightly to be centered around the DataFrame, Modin is often faster (also on a 4-core i7, on macOS):
# Wrote data to file to ingest with `read_csv`
df = pd.read_csv("100mx3.csv") # Pandas 35.4s, Modin 12.1s
df + df # Pandas 2s, Modin 3.55s
df.count() # Pandas 1.05s, Modin 57ms
df.isna() # Pandas 163ms, Modin 14.7ms
df.groupby("c").mean() # Pandas 4.2s, Modin 8.9s
df.groupby("c").count() # Pandas 3.65s, Modin 1.73s
pd.concat([df] * 5) # Pandas 8.38s, Modin 2.13s
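The `100mx3.csv` input above can be recreated along these lines. The file name and row count here are illustrative (a smaller, hypothetical `100kx3.csv` written to a temp directory), but the three-column layout matches the original benchmark:

```python
import os
import tempfile
import numpy as np
import pandas as pd

N = 100_000  # scaled down from the 100M rows used above
df = pd.DataFrame({'a': np.random.uniform(0, 100, N),
                   'b': np.random.uniform(0, 100, N)})
df['c'] = df['a'] < 50

path = os.path.join(tempfile.mkdtemp(), "100kx3.csv")
df.to_csv(path, index=False)

roundtrip = pd.read_csv(path)  # the read path Modin parallelizes
rows = len(roundtrip)
print(rows)
```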

`count` and `mean` or `median` (with or without groupby) will have similar communication patterns once we optimize mean/median. Since Modin is such a new library, it will be possible to find places where we haven't had a chance to optimize yet, Series especially.

Sometimes it helps to consider the aggregate time in seconds as well: in the example I showed, Modin saved the user 26 seconds overall even though 2 operations in that list were slower. Users often think in real time, whereas benchmarks often worry about some factor (n times faster/slower, etc.). Since I have made (primarily interactive) users the focus of this library, I aim to lower the aggregate time rather than shave milliseconds off for better performance plots. Does 10x really matter to users if they only save 90 milliseconds? I argue no; users will not even notice this speedup.
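Summing the wall-clock numbers from the benchmark list above confirms the aggregate saving:

```python
# Benchmark timings from the list above, in seconds
pandas_s = [35.4, 2.0, 1.05, 0.163, 4.2, 3.65, 8.38]
modin_s = [12.1, 3.55, 0.057, 0.0147, 8.9, 1.73, 2.13]

saved = sum(pandas_s) - sum(modin_s)
print(f"aggregate time saved by Modin: {saved:.1f}s")  # ~26.4s
```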

Thanks for the question, let me know if I can help at all. If you would like some slides from previous presentations, always happy to provide. Email [email protected] and I will send any slides.

@ianozsvald
Author

ianozsvald commented Apr 24, 2020 via email

@ianozsvald
Author

Looking at the "Towards Scalable Dataframe Systems" paper I'm confused in "4.3 Preliminary Case Study". This section talks of "up to 1.6 billions rows" but doesn't mention how many columns are involved (but you state above that it is the columns that are parallelised, less so for the rows). Do you know how many columns you had?
In this section the "map" gives a reasonable speed-up (and presumably is using more than 1 column?).
The "groupby(n)" example gives a better speed-up but the text states "group by the non-null passenger_count column and count the number of rows in each group" - so surely that's just using 1 column for the groupby? How come it is so much faster than "map" (and I'm assuming you've got many columns for map, maybe I'm wrong?).
The "groupby(1)" result is not as great as "groupby(n)" but it sounds like it is checking all items on a row, so I'd have expected that to be faster (but that'll depend on the number of columns you have). Maybe it does more work due to more columns but has only an aggregation to perform (I see the note in the paper about 1 or many groups and shuffling)?
Apologies for the possibly naive questions; maybe I've missed some detail in the paper?

@ianozsvald
Author

Also - thanks for providing and supporting the project! My questions are simply due to inquisitiveness (and by working alongside Dask I'm about to raise some questions over there too). Cheers!

@devin-petersohn
Collaborator

...so surely that's just using 1 column for the groupby? How come it is so much faster than "map" (and I'm assuming you've got many columns for map, maybe I'm wrong?).

In the paper the dataset we used had <30 columns. My comment was more along the lines of what we can support. Many operations require a transpose to be efficient (or to work). Dask cannot do transpose operations because their architecture does not permit it. Vaex also cannot support it. You can see that even in the 3 column example you provided, Modin outperforms. It is not about needing many columns to speed up, it is about operations on more than 1 column (right now).

For the groupby question specifically, the dataframe had 23 columns, and the groupby.count by pandas semantics is applied to each column individually. Since Modin can operate on columns in-parallel the 23 columns can be grouped independently. Grouping by 1 column does not mean computation along all columns cannot be done in-parallel.
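The per-column semantics are visible in plain pandas: `count` is computed independently for each non-key column, which is what makes the columns independently parallelizable. A toy example:

```python
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'],
                   'a': [1.0, None, 3.0],
                   'b': [4, 5, 6]})

# count() is applied to each non-key column independently, so the
# per-column results can be computed in parallel even though the
# grouping key is a single column.
counts = df.groupby('g').count()
print(counts)
#    a  b
# g
# x  1  2
# y  1  1
```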

I'm guessing single machine, data smaller than RAM, lots of RAM (your recent paper had 100s GBs), and possibly "lots of columns" (but I'd love to have a concrete example)?

Modin supports out of core (experimentally) so data need not fit in memory. Pandas cannot support out of core so that comparison was not made in the paper. The paper is not a Modin paper, it is a dataframe vision paper, so we focus less on the technical achievements of Modin (thus it is 1 page of 13+).

If you want a technical comparison against Dask DataFrame, I have an answer here: #515. Modin also supports Dask as a compute engine.

Vaex requires files to be in HDF5 format prior to being able to do any querying. It appears that the HDF5 files must have gone through pandas, which limits the size of the file to what pandas can handle and can take an hour or more for larger files. They do not support the full pandas API, Index objects (row labels), and other more quirky pandas features and semantics. They are solving a more niche problem.

Both of these systems have deviated from the data model of the dataframe to achieve better performance. This is why Modin is not really competing with either of them: they solve niche use cases by removing dataframe semantics to gain performance and scale, while the problem Modin aims to solve is much more general. The comparison is apples to oranges because neither system supports true pandas semantics (each is always referred to as pandas-like). Modin is not aiming to solve the same problems; it is solving a much more difficult problem for a much broader set of users.

It is human nature to want to compare, especially when everything these days gets branded as a dataframe, but we generally do not compare ourselves against those systems because we do not aim to target their user bases for our system, we want to solve pandas problems specifically.

Here's my root question for my talk - if I'm to suggest "if you're using Pandas to do X and X is too slow, investigate Modin first before Dask or Vaex", what would X look like?

Use Modin if you don't want to change your code. This resonates with data scientists and domain experts more than anything else.
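In practice the "don't change your code" pitch is a one-line import swap. A sketch (with a fallback so the snippet runs even where Modin is not installed):

```python
import os
os.environ.setdefault("MODIN_ENGINE", "ray")

try:
    import modin.pandas as pd  # the only line that changes
except ImportError:
    import pandas as pd        # fallback when Modin is not installed

# Everything below is ordinary pandas code, unchanged
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
total = int(df["a"].sum())
print(total)
```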

@ianozsvald
Author

Thanks for the link and the additional notes, most useful.

@ianozsvald
Author

For completeness - my slides are here, the first event (Remote Pizza Python) is where I had a few slides on Modin: https://ianozsvald.com/2020/04/27/flying-pandas-and-making-pandas-fly-virtual-talks-this-weekend-on-faster-data-processing-with-pandas-modin-dask-and-vaex/
Thanks again for your help, I told attendees to come read this bug report if they wanted more insight into when they might use Modin.

@devin-petersohn
Collaborator

Thanks @ianozsvald for the update, I am very happy to hear that your talk went well.
