Questions for a conference talk #1390
Hello, I'm giving a conference talk on Saturday and I'm spending a few slides on Modin. I have little practical experience, I've dabbled in recent months, and my current benchmarks are confusing. I've read "Towards Scalable Dataframe Systems" (2020), "Scaling Interactive Data Science Transparently with Modin" (2018) and your site.

I've also queried colleagues, and of the few that have used Modin their advice has been quite mixed (some had speed-ups, some had slow-downs, some had crashes). I'd like to tell a sensible story, so I have a couple of questions; if you have any input it'd be very much appreciated (and apologies for the short timeframe).

Am I right in thinking that with the current version of Modin + Ray, if I take a DataFrame (in RAM, with say 100_000_000 rows) and instantiate a Modin Pandas DataFrame, then a Ray copy will be built? Using my `ipython_memory_profiler` to track RAM usage, this seems to be the case. Does the above meet your expectation? I'm on an Intel 64-bit laptop, 32GB RAM, 4 i7 cores. Why would the Modin df cost 1.5x the RAM of the Pandas df?

With the above dataframe and some simple benchmarks I'm seeing Modin under-perform Pandas. These benchmarks are performed using `%timeit` in a Notebook, on multiple restarts of the kernel, without running out of RAM. I note for Modin in `htop` lots of "high red cpu", which indicates heavy kernel API calls and might relate to high memory traffic (but I've not diagnosed beyond that). These are just microbenchmarks on 100M rows of 3 columns of randomly generated data; they're certainly not clever, but they weren't giving me the results I was expecting.

Can you give some guidance as to whether I'm seeing results you wouldn't expect? Perhaps 100M rows is too few and you'd suggest more?
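For concreteness, a minimal sketch of the kind of setup described above; the column names, dtypes, and the exact operations timed are assumptions, not the actual benchmark script from the thread:

```python
# Sketch of the microbenchmark setup: 100M rows x 3 columns of random
# data, with a Modin frame instantiated from an in-memory pandas frame.
import numpy as np
import pandas
import modin.pandas as pd  # drop-in replacement; Ray starts on first use

rows = 100_000_000
pdf = pandas.DataFrame({
    "a": np.random.rand(rows),
    "b": np.random.rand(rows),
    "c": np.random.randint(0, 100, rows),
})

# Building a Modin frame from an in-memory pandas frame: Ray copies
# the data into its object store, which is the extra RAM cost asked about.
mdf = pd.DataFrame(pdf)

# In a notebook, each side would then be timed, e.g.:
#   %timeit pdf["a"].mean()
#   %timeit mdf["a"].mean()
```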
Hi @ianozsvald, thanks for the question! A couple of notes on the Modin architecture at a high level, then I can answer more specifics:

- Modin is a column store and can support billions of columns. In practice this means more columns = more parallelism. When a particular operation is done on a single column, there is a limit to the parallelism possible with this architecture.
- Not everything is optimized yet, but we are working on it. Joins (`df['a'] + df['b']`), `mean`, and `median` will all be improved with better optimization in the near future; they go through similar codepaths (except for joins, of course). The same is true for `groupby.mean`. There are certainly lots of improvements in the works that I am excited about. Unfortunately, engineering time has been the biggest bottleneck for these improvements, not limitations in the architecture.
- Modin is optimized for ingesting data from file sources; performance going from an in-memory numpy array or DataFrame can sometimes be worse.
- Modin relies on Ray (in this case) for both execution and storage, so Ray's bottlenecks are our own. Modin stresses Ray harder than any other application, which may be why crashes were observed. I too have observed some crashes and report them to the Ray team.

> Does the above meet your expectation? I'm on an Intel 64-bit laptop, 32GB RAM, 4 i7 cores. Why would the Modin df cost 1.5x the RAM of the Pandas df?

Ray is making a copy of the data into its object store and then freeing the memory from the task. What you are seeing is probably those Python workers in the process of freeing all the memory. In general, memory is something we're working to optimize, and soon (by July) we will be decoupling the execution from the storage so we can optimize this better.

> The above benchmarks are just microbenchmarks on 100M rows of 3 columns of randomly generated data; they're certainly not clever, but they weren't giving me the results I was expecting.

A couple of observations:

- All operations were done on a single Series. We have limited optimizations in place for single-Series operations; most of the time users are not bottlenecked by these. We focused on optimizing the expensive parts of DataFrame computation, e.g. `read_csv`, expensive `apply` functions, or functions on larger DataFrames. The center of our library is making users more productive, and we focused on the pain points.
- I can reproduce your numbers, but if I rewrite the benchmark slightly to be centered around the DataFrame, Modin is often faster (on a 4-core i7 on MacOS as well):

```python
# Wrote data to file to ingest with `read_csv`
df = pd.read_csv("100mx3.csv")  # Pandas 35.4s,  Modin 12.1s
df + df                         # Pandas 2s,     Modin 3.55s
df.count()                      # Pandas 1.05s,  Modin 57ms
df.isna()                       # Pandas 163ms,  Modin 14.7ms
df.groupby("c").mean()          # Pandas 4.2s,   Modin 8.9s
df.groupby("c").count()         # Pandas 3.65s,  Modin 1.73s
pd.concat([df] * 5)             # Pandas 8.38s,  Modin 2.13s
```

`count` and `mean` or `median` (with or without `groupby`) will have similar communication patterns once we optimize `mean`/`median`. Since Modin is such a new library, it will be possible to find places where we haven't had a chance to optimize yet, Series especially.

Sometimes it helps to consider the aggregate time in seconds as well: in the example I showed, Modin saved the user 26 seconds overall even though 2 operations in that list were slower. Users often think in real time, whereas benchmarks tend to worry about some factor (n times faster/slower, etc.). Since I have made (primarily interactive) users the focus of this library, I aim to lower the aggregate time rather than worry about shaving milliseconds off for better performance plots. Does 10x really matter to users if they only save 90 milliseconds? I argue no; users will not even notice that speedup.

Thanks for the question, let me know if I can help at all. If you would like some slides from previous presentations, I'm always happy to provide them. Email [email protected] and I will send any slides.
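To make the ingest point in the bullets above concrete, a minimal sketch of the two paths into Modin; the file name reuses the benchmark file from the comment and both frames are hypothetical:

```python
# Two ways data reaches Modin, with different performance profiles.
import pandas
import modin.pandas as mpd

# From-memory: an existing pandas frame must first be copied into
# Ray's object store before Modin workers can operate on it.
pdf = pandas.read_csv("100mx3.csv")
mdf_from_memory = mpd.DataFrame(pdf)

# From-file: Modin's own read_csv ingests the file in parallel,
# which is the optimized path the comment recommends benchmarking.
mdf_from_file = mpd.read_csv("100mx3.csv")
```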
Can you give an example of where using many columns would yield a good speedup that regular Pandas users would recognise? General Pandas use (in my experience) is for long dataframes with column-based operations (which Dask might parallelize by row groups); billions of columns (or even 1000s) is less common, from what I've seen. How do operations like groupby or apply run faster with many columns? Maybe you're referring to running these operations row-wise (axis=1) rather than down a column?

Here's my root question for my talk: if I'm to suggest "if you're using Pandas to do X and X is too slow, investigate Modin first before Dask or Vaex", what would X look like? I'm guessing single machine, data smaller than RAM, lots of RAM (your recent paper had 100s of GBs), and possibly "lots of columns" (but I'd love to have a concrete example)?

Many thanks if an answer is possible, Ian
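For concreteness, the two access patterns contrasted in the question might look like the following; a minimal sketch only, reusing the earlier benchmark file, with operations chosen for illustration rather than taken from the thread:

```python
# Column-wise vs row-wise access on the same frame.
import modin.pandas as pd

df = pd.read_csv("100mx3.csv")

# Column-wise: a single-Series reduction, where parallelism is capped
# by that one column's partitions (the limit Devin describes above).
col_mean = df["a"].mean()

# Row-wise (axis=1): every column participates, so Modin's column
# partitioning has more independent pieces to work across.
row_sums = df.sum(axis=1)
```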
Looking at the "Towards Scalable Dataframe Systems" paper, I'm confused by "4.3 Preliminary Case Study". This section talks of "up to 1.6 billion rows" but doesn't mention how many columns are involved (and you state above that it is the columns that are parallelised, less so the rows). Do you know how many columns you had?
Also, thanks for providing and supporting the project! My questions are simply due to inquisitiveness (and from working alongside Dask, I'm about to raise some questions over there too). Cheers!
In the paper, the dataset we used had <30 columns. My comment was more along the lines of what we can support. Many operations require a transpose to be efficient (or to work at all). Dask cannot do transpose operations because their architecture does not permit it; Vaex also cannot support it. You can see that even in the 3-column example you provided, Modin outperforms. It is not about needing many columns to get a speedup, it is about operations on more than 1 column (right now).
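To make the transpose point concrete, a minimal sketch; the file name reuses the earlier benchmark file, and the operations are illustrative rather than from the thread:

```python
# Operations that need Modin's flexible partitioning: a full transpose
# and a per-row reduction across all columns.
import modin.pandas as pd

df = pd.read_csv("100mx3.csv")

wide = df.T               # 100M rows become 100M columns; row-partitioned
                          # systems generally cannot express this
per_row = df.sum(axis=1)  # multi-column operation on each row
```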
Modin supports out-of-core execution (experimentally), so data need not fit in memory. Pandas cannot support out-of-core, so that comparison was not made in the paper. The paper is not a Modin paper, it is a dataframe vision paper, so we focus less on the technical achievements of Modin (thus it is 1 page of 13+). If you want a technical comparison against Dask DataFrame, I have an answer here: #515. Modin also supports the Dask compute engine.

Vaex requires files to be in HDF5 format before it can do any querying. It appears that the HDF5 files must have gone through pandas, which limits the size of the file to what pandas can handle and can take an hour or more for larger files. They do not support the full pandas API, Index objects (row labels), or other more quirky pandas features and semantics. They are solving a more niche problem.

Both of these systems have deviated from the data model of the dataframe to achieve better performance. This is why Modin is not really competing with either of them: they solve niche use cases by removing dataframe semantics to gain performance and scale, while the problem Modin aims to solve is much more general. The comparison is apples to oranges because neither system supports true pandas semantics (it is always referred to as "pandas-like"). Modin is not aiming to solve the same problems; it is solving a much more difficult problem for a much broader set of users. It is human nature to want to compare, especially when everything these days gets branded as a dataframe, but we generally do not compare ourselves against those systems because we do not target their user bases; we want to solve pandas problems specifically.
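On the Dask compute engine point, a minimal sketch of switching engines, assuming the `MODIN_ENGINE` environment variable, which is how Modin selects its backend; it must be set before `modin.pandas` is first imported:

```python
# Run the same Modin code on Dask instead of Ray.
import os
os.environ["MODIN_ENGINE"] = "dask"  # set before importing modin.pandas

import modin.pandas as pd

df = pd.read_csv("100mx3.csv")  # now executed on the Dask engine
```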
Use Modin if you don't want to change your code. This resonates with data scientists and domain experts more than anything else.
Thanks for the link and the additional notes, most useful.
For completeness, my slides are here; the first event (Remote Pizza Python) is where I had a few slides on Modin: https://ianozsvald.com/2020/04/27/flying-pandas-and-making-pandas-fly-virtual-talks-this-weekend-on-faster-data-processing-with-pandas-modin-dask-and-vaex/
Thanks @ianozsvald for the update, I am very happy to hear that your talk went well. |