running two instances of spark-perf ? #91

Open
msifalakis opened this issue Nov 27, 2015 · 3 comments

@msifalakis

I have been running one instance of the spark-perf benchmark, using the core tests and the MLlib tests, on a small-ish cluster (only 4 nodes, but quite powerful ones -- 64 GB RAM, 8 cores each), with scale-factor=1.
The benchmark occupies just 4 executors (1 core each).

Now I've tried to launch a second, scaled-down (0.1) configuration of the benchmark suite at the same time, using only the core tests. Although there are both memory and executors/cores available, the benchmark fails to start! Or, more precisely, it fails to engage workers, giving me the following message:
"Spark is still running on some slaves ... sleeping for 10 seconds". That is even though I have set USE_CLUSTER_SPARK = True and RESTART_SPARK_CLUSTER = False -- so I guess it tries to use my existing cluster.

On the other hand, if I start a spark-shell or launch another application, it seems to get admitted just fine!

Any ideas what this means? Given the very spartan information about what the benchmark does/uses, it is rather difficult to know in which direction to start looking.

TIA

Manolis.

@JoshRosen
Contributor

Hi @msifalakis,

My hunch is that this is a longstanding bug.

It wouldn't surprise me if nobody has tried running two instances of spark-perf at the same time on the same set of machines or on machines which run other non-spark-perf clusters. Whenever I've run this benchmark, I've done it on a dedicated set of EC2 machines which only run the spark-perf cluster.

If you'd like to try to fix this yourself, here are a few starting points.

The log message that you saw comes from ensure_spark_stopped_on_slaves, which seems to search for any executor backend, not just ones that belong to the spark-perf cluster. In your case, this message probably came from https://github.com/databricks/spark-perf/blame/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/testsuites.py#L79. I believe that the intent behind that call was to ensure that executors from one test driver were cleaned up before beginning a new test, hopefully ensuring that subsequent tests are able to obtain the full number of executors.

The right fix is probably to figure out how to monitor only the shutdown of executors associated with the previous test run, but this could be tricky to do.
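To make that concrete, here is a rough sketch of the shape of that check -- an illustration of the pattern, not the actual spark-perf code; the hostnames and the ssh/ps approach are assumptions:

```python
import subprocess
import time

# Illustration only -- not the actual spark-perf implementation.
SLAVES = ["node1", "node2", "node3", "node4"]  # placeholder hostnames

def executors_running(host):
    """Return True if any Spark executor JVM appears to be running on `host`."""
    ps_output = subprocess.run(
        ["ssh", host, "ps -ef"], capture_output=True, text=True
    ).stdout
    # Matches *any* executor backend, even ones belonging to another
    # application -- which is why a concurrent run blocks forever here.
    return "CoarseGrainedExecutorBackend" in ps_output

def ensure_spark_stopped_on_slaves(slaves):
    while any(executors_running(h) for h in slaves):
        print("Spark is still running on some slaves ... sleeping for 10 seconds")
        time.sleep(10)
```

A fix along those lines would narrow the match, e.g. by also checking each executor's command line for the previous run's application ID, so that executors belonging to other applications (like your second spark-perf instance, or a spark-shell) are ignored.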

@msifalakis
Author

Hello Josh

Thanks for the pointers! Most helpful. I will have a look at them.

Meanwhile, there is another related question I have, to which you may be able to provide some quick pointers. Having looked at different parts of the benchmark, I have mostly seen reported completion times (well, not exclusively, but mostly metrics related to algorithm completion/accuracy). Are there any options, or places one can dig into, to find/enable system metrics such as memory I/O, disk I/O, network bandwidth utilisation, or CPU occupancy for individual tests/applications?

Thanks again for the pointers and any further suggestions.

Manolis.


@JoshRosen
Contributor

spark-perf itself does not contain support for collection of compute resource utilization metrics (memory, CPU, I/O). spark-ec2 clusters are launched with Ganglia installed, so it should be possible to pull metrics from there.
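For example, gmond serves an XML dump of all known metrics over TCP (port 8649 by default), so a minimal sketch of polling it might look like this -- the host name and metric names are placeholders:

```python
import socket
import xml.etree.ElementTree as ET

def ganglia_metrics(host="ganglia-master", port=8649):
    # gmond dumps its full metric state as XML to anyone who connects.
    with socket.create_connection((host, port)) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    # Structure: <GANGLIA_XML><CLUSTER><HOST ...><METRIC NAME=.. VAL=.. UNITS=..>
    for host_el in root.iter("HOST"):
        for metric in host_el.iter("METRIC"):
            if metric.get("NAME") in ("cpu_idle", "mem_free", "bytes_in", "bytes_out"):
                print(host_el.get("NAME"), metric.get("NAME"),
                      metric.get("VAL"), metric.get("UNITS"))
```

You'd then correlate those samples with each test's start/end timestamps to get per-test resource profiles.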
