running two instances of spark-perf ? #91

Open
msifalakis opened this issue Nov 27, 2015 · 3 comments

@msifalakis

I have been running one instance of the spark-perf benchmark, using the core tests and the MLlib tests, on a small-ish cluster (only 4 nodes, but quite powerful ones -- 64 GB RAM, 8 cores each), with scale-factor=1.
The benchmark occupies just 4 executors (1 core each).

Now I've tried to launch a second, scaled-down (0.1) configuration of the benchmark suite at the same time, using only the core tests. Although there are both memory and executors/cores available, the benchmark fails to start! Or, more precisely, it fails to engage workers, giving me the following message:
"Spark is still running on some slaves ... sleeping for 10 seconds". That is even though I have set USE_CLUSTER_SPARK = True and RESTART_SPARK_CLUSTER = False -- so I guess it tries to use my existing cluster.

On the other hand, if I start a spark-shell or launch another application, it seems to get admitted just fine!

Any ideas what this means? Given the very spartan information about what the benchmark does/uses, it is rather difficult to know in which direction to start looking.

TIA

Manolis.

@JoshRosen
Contributor

Hi @msifalakis,

My hunch is that this is a longstanding bug.

It wouldn't surprise me if nobody has tried running two instances of spark-perf at the same time on the same set of machines or on machines which run other non-spark-perf clusters. Whenever I've run this benchmark, I've done it on a dedicated set of EC2 machines which only run the spark-perf cluster.

If you'd like to try to fix this yourself, here are a few starting points.

The log message that you saw comes from ensure_spark_stopped_on_slaves, which seems to search for any executor backend, not just ones that belong to the spark-perf cluster. In your case, this message probably came from https://github.com/databricks/spark-perf/blame/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/testsuites.py#L79. I believe that the intent behind that call was to ensure that executors from one test driver were cleaned up before beginning a new test, hopefully ensuring that subsequent tests are able to obtain the full number of executors.

The right fix is probably to figure out how to monitor only the shutdown of executors associated with the previous test run, but this could be tricky to do.
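To make that concrete, here is a rough sketch of the shape of that check -- an illustration of the pattern, not the actual spark-perf code; the hostnames and the ssh/ps approach are assumptions:

```python
import subprocess
import time

# Illustration only -- not the actual spark-perf implementation.
SLAVES = ["node1", "node2", "node3", "node4"]  # placeholder hostnames

def executors_running(host):
    """Return True if any Spark executor JVM appears to be running on `host`."""
    ps_output = subprocess.run(
        ["ssh", host, "ps -ef"], capture_output=True, text=True
    ).stdout
    # Matches *any* executor backend, even ones belonging to another
    # application -- which is why a concurrent run blocks forever here.
    return "CoarseGrainedExecutorBackend" in ps_output

def ensure_spark_stopped_on_slaves(slaves):
    while any(executors_running(h) for h in slaves):
        print("Spark is still running on some slaves ... sleeping for 10 seconds")
        time.sleep(10)
```

A fix along those lines would narrow the match, e.g. by also checking each executor's command line for the previous run's application ID, so that executors belonging to other applications (like your second spark-perf instance, or a spark-shell) are ignored.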

@msifalakis
Author

Hello Josh

Thanks for the pointers! Most helpful. I will have a look at them.

Meanwhile, there is another related question I have, to which you may be able to provide some quick pointers. Having looked at different parts of the benchmark, I have mostly seen reported completion times (well, not exclusively, but mostly metrics related to algorithm completion/accuracy). Are there any options, or places one can dig into, to find/enable system metrics such as memory I/O, disk I/O, network bandwidth utilisation, or CPU occupancy for individual tests/applications?

Thanks again for the pointers and any further suggestions.

Manolis.


@JoshRosen
Contributor

spark-perf itself does not contain support for collection of compute resource utilization metrics (memory, CPU, I/O). spark-ec2 clusters are launched with Ganglia installed, so it should be possible to pull metrics from there.
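For example, gmond serves an XML dump of all known metrics over TCP (port 8649 by default), so a minimal sketch of polling it might look like this -- the host name and metric names are placeholders:

```python
import socket
import xml.etree.ElementTree as ET

def ganglia_metrics(host="ganglia-master", port=8649):
    # gmond dumps its full metric state as XML to anyone who connects.
    with socket.create_connection((host, port)) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    # Structure: <GANGLIA_XML><CLUSTER><HOST ...><METRIC NAME=.. VAL=.. UNITS=..>
    for host_el in root.iter("HOST"):
        for metric in host_el.iter("METRIC"):
            if metric.get("NAME") in ("cpu_idle", "mem_free", "bytes_in", "bytes_out"):
                print(host_el.get("NAME"), metric.get("NAME"),
                      metric.get("VAL"), metric.get("UNITS"))
```

You'd then correlate those samples with each test's start/end timestamps to get per-test resource profiles.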
