unintentional synchronization points? #41
FWIW: attaching gdb and grabbing a
|
Interesting. How long does such a stall take? I'm quite often running with -n 64 on a machine and it seems fine.
What I do is starting
Anyway, I think it's a stall in the Pebble library and it would be worth creating a reproducer for it and reporting it upstream. |
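A starting point for such a reproducer might be a standalone script that mimics cvise's schedule-cancel-teardown cycle. The sketch below is only a guess at the pattern being exercised; the worker count, task duration, and round count are made-up parameters, not a confirmed trigger for the stall.

import time
from pebble import ProcessPool

def task(i):
    # Stand-in for an interestingness test: cheap, short-lived work.
    time.sleep(0.05)
    return i

def one_round(n_workers, n_tasks):
    # Mimic cvise's pattern: schedule a batch, wait for the first result,
    # cancel the rest, then tear the pool down.
    pool = ProcessPool(max_workers=n_workers)
    futures = [pool.schedule(task, args=[i]) for i in range(n_tasks)]
    futures[0].result()
    for future in futures[1:]:
        future.cancel()
    pool.stop()
    pool.join()

if __name__ == '__main__':
    start = time.time()
    for round_no in range(100):
        one_round(64, 64)
        print(f'round {round_no} finished at {time.time() - start:.1f}s', flush=True)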
I'm not sure if this is related, but for me the step "UnIfDefPass" does sometimes stall and not progress, even though that step has nothing to do. After re-running the script several times, it will eventually skip that step. The timeout does not seem to interrupt it. I'm calling cvise like this:
I'm not on the newest version right now (cvise 2.2.0 (4c7f4cd)), so I'll first update and reevaluate the situation. |
It's likely something different. Can you please paste the output of cvise with
That's likely not a change since |
And please open a new issue for it. |
Right, the call to noxdafox/pebble#50 looks related, though it sounds like that was fixed in 4.5.1.
pip3 list | grep Pebble
Pebble 4.6.1
Otherwise noxdafox/pebble#62 and noxdafox/pebble#33 look potentially related. One thing I found interesting, grepping for the parent pid:
$ ps -ef | grep 1052950
ndesaul+ 1052950 3884808 2 14:44 pts/4 00:02:06 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i
ndesaul+ 1088933 1052950 2 14:46 pts/4 00:02:00 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i
ndesaul+ 1368039 1052950 0 15:55 pts/4 00:00:00 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i
makes it look like the one child has been running for 2 minutes? and the rest for 0? |
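As an aside, that per-child CPU-time check can be scripted instead of eyeballed; a small sketch using psutil (a third-party package, assumed to be installed, and not something cvise itself depends on):

import sys
import psutil  # third-party; pip install psutil

def report_children(parent_pid):
    # Print cumulative CPU time for each child of the given cvise process,
    # to spot one worker accumulating time while the rest sit idle.
    parent = psutil.Process(parent_pid)
    for child in parent.children(recursive=True):
        times = child.cpu_times()
        print(f'{child.pid}: user={times.user:.1f}s system={times.system:.1f}s')

if __name__ == '__main__':
    report_children(int(sys.argv[1]))  # pass the cvise.py pid seen in ps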
If I drop |
moving from 1a9f47f to 21c4e26 seems to have helped. no
(previously, reducing the same input with the older version of cvise took over 1hr10min)
here's the python trace of the long running child (
and an idle child
Am I understanding correctly that pebble is blocking for 60s if it fails to acquire a lock? |
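For readers following the hypothesis: the suspected pattern is roughly the sketch below. This is illustrative only, not Pebble's actual code; the point is that each failed acquisition of a shared channel lock can cost up to the full LOCK_TIMEOUT before the caller gets another chance.

import threading

LOCK_TIMEOUT = 60  # illustrative value matching the 60s being discussed

channel_lock = threading.Lock()

def send_result(pipe, payload):
    # Illustrative only: if the lock is contended, the caller blocks for up to
    # LOCK_TIMEOUT seconds, gives up, and retries, so a busy sibling worker can
    # stall everyone else in LOCK_TIMEOUT-sized steps.
    while True:
        if channel_lock.acquire(timeout=LOCK_TIMEOUT):
            try:
                pipe.send(payload)
            finally:
                channel_lock.release()
            return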
so no |
^ so 36 parallel tests
I should mention this is a dual-socket system; I wonder if synchronization overhead is much, much worse due to NUMA? Perhaps setting affinity to one socket would help? Perhaps if I play with numactl...
$ numactl --cpunodebind=0 /android1/cvise/build/cvise.py ./repro.sh extable.i
00:15:20 INFO ===================== done ====================
so that is slightly faster, though I guess by nature these are probably non-deterministic results. I didn't observe any pauses during that run. If I double the number of threads
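If numactl is awkward to wire into a workflow, the same pinning can be approximated from inside Python on Linux; a sketch, assuming node 0 owns CPUs 0-35 (check lscpu for the real mapping):

import os

# Hypothetical: restrict this process (and the workers it forks) to the CPUs
# of one NUMA node to avoid cross-socket lock traffic. The CPU range is an
# assumption about the machine's topology.
NODE0_CPUS = set(range(0, 36))
os.sched_setaffinity(0, NODE0_CPUS)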
I definitely think pebble's 60-second timeout is related. If I hack up
diff --git a/cvise/utils/testing.py b/cvise/utils/testing.py
index 5deb23e7d60b..47edd950206c 100644
--- a/cvise/utils/testing.py
+++ b/cvise/utils/testing.py
@@ -427,6 +427,7 @@ class TestManager:
def run_parallel_tests(self):
assert not self.futures
assert not self.temporary_folders
+ pebble.process.channel.LOCK_TIMEOUT = 0.1
with pebble.ProcessPool(max_workers=self.parallel_tests) as pool:
order = 1
self.timeout_count = 0
actually, this is probably about the fastest I can run:
diff --git a/cvise/utils/testing.py b/cvise/utils/testing.py
index 5deb23e7d60b..e697bd54b0bf 100644
--- a/cvise/utils/testing.py
+++ b/cvise/utils/testing.py
@@ -421,8 +421,11 @@ class TestManager:
@classmethod
def terminate_all(cls, pool):
+ temp = pebble.process.channel.LOCK_TIMEOUT
+ pebble.process.channel.LOCK_TIMEOUT = 0.001
pool.stop()
pool.join()
+ pebble.process.channel.LOCK_TIMEOUT = temp
def run_parallel_tests(self):
assert not self.futures
$ numactl --cpunodebind=0 /android1/cvise/build/cvise.py ./repro.sh extable.i
00:15:26 INFO ===================== done ====================
so there's probably 2 things we can do immediately:
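Whatever those two things turn out to be, the override in terminate_all above could be made exception-safe with a small context manager. A sketch only; it reuses the same pebble.process.channel attribute path as the diff, which may differ between Pebble versions.

import contextlib
import pebble

@contextlib.contextmanager
def short_lock_timeout(seconds=0.001):
    # Temporarily lower Pebble's channel lock timeout while the pool is being
    # shut down, and restore it even if stop()/join() raise.
    channel = pebble.process.channel  # same path as the diff; version-dependent
    saved = channel.LOCK_TIMEOUT
    channel.LOCK_TIMEOUT = seconds
    try:
        yield
    finally:
        channel.LOCK_TIMEOUT = saved

# intended use:
# with short_lock_timeout():
#     pool.stop()
#     pool.join()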
I'm curious. It seems like cvise starts N workers, then terminates them all once the first has found whether its mutated input is interesting? So it seems like we get killed by synchronization overhead for that and don't make much forward progress the larger N gets. It also seems like it probably throws out a lot of forward progress each thread may have made individually. I wonder if you could instead have each thread generate a diff when it found something interesting, then during synchronization apply all of the diffs the workers may have found in a round?
I'm also curious if it makes more sense to dynamically scale the number of workers based on how many are finding inputs interesting over time. If most mutations are generally interesting, then having more than 1 worker is probably overkill. If most mutations are generally uninteresting, having as many workers as possible working on solutions is probably profitable.
I'm also curious if we're hitting a priority inversion between the parent and child threads? |
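The dynamic-scaling idea could be prototyped with a heuristic as simple as the sketch below; the function name, thresholds, and bounds are all made up for illustration, not anything cvise implements.

def adapt_worker_count(current, recent_outcomes, minimum=1, maximum=64):
    # recent_outcomes: booleans, True where a candidate was interesting.
    # If most candidates succeed, extra workers mostly produce work that gets
    # thrown away; if most fail, more speculative workers pay off.
    if not recent_outcomes:
        return current
    success_ratio = sum(recent_outcomes) / len(recent_outcomes)
    if success_ratio > 0.5:
        return max(minimum, current // 2)
    if success_ratio < 0.1:
        return min(maximum, current * 2)
    return current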
Thank you very much for the nice investigation!
I do welcome the patch; please make a separate pull request for it. Moreover, we can likely notify
Support that as well. |
Once the first one finds something interesting, we wait for all jobs that started before and kill all after it.
Yes, that can possibly happen.
I like the idea, which is basically to combine all results and run the interestingness test on the combination.
Yep, we can dynamically calculate the success ratio and adapt the number of workers. Or we can at least add '+', '-' keyboard shortcuts one can use to double the number of workers?
I don't understand it, can you please explain it? Anyway, great observations you made. |
I need to play with my patch more. I thought I had it solved last night, but if I crank up |
Perhaps the Python bindings to libuv might help. |
heh, running cvise on my dual-core hyper-threaded laptop is faster than my 72-core Xeon workstation... same reproducer, same inputs, same compiler under test... workstation:
laptop:
|
Do you mean using it in the |
Or swap Pebble for libuv. I wonder if we can speed things up by sending SIGTERM to processes we want to terminate. |
But does it provide a process pool? I can't find one. About SIGTERM: can you please investigate which signal Pebble uses when a future is canceled? |
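One way to answer the signal question empirically is a self-contained experiment like the sketch below: the worker installs handlers for the common catchable signals and records whichever one arrives when its future is cancelled. The temp-file path is a placeholder, and this says nothing definitive about Pebble's internals; if the worker is killed with SIGKILL, nothing gets recorded.

import signal
import sys
import time
from pebble import ProcessPool

RESULT = '/tmp/cancel_signal.txt'  # placeholder scratch path

def victim():
    # Record whichever catchable signal arrives, then exit.
    def handler(signum, frame):
        with open(RESULT, 'w') as f:
            f.write(signal.Signals(signum).name)
        sys.exit(0)
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, handler)
    time.sleep(60)

if __name__ == '__main__':
    with ProcessPool(max_workers=1) as pool:
        future = pool.schedule(victim)
        time.sleep(1)    # let the worker start
        future.cancel()  # pebble terminates the running task
        time.sleep(1)
    try:
        with open(RESULT) as f:
            print('worker received:', f.read())
    except FileNotFoundError:
        print('no catchable signal observed; the worker may have been SIGKILLed')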
Hi. Note there's a recent Pebble release. Does the problem still persist? If so, please provide a reproducer I can play with. |
I've recently moved to a newer workstation, one that's not dual-socket, so I might no longer be able to reproduce. I'll report back once I've had another opportunity to run cvise. Besides, I suspect a reproducer would need to be run on a NUMA machine... |
All right, thanks! |
When I run cvise, sometimes it seems stalled on making progress. When I check htop, it doesn't look like my cores are doing any work related to cvise. It looks like
00:00:03 INFO ===< LinesPass::0 >===
is what's running. I'm running cvise with --n 71. Are there synchronization points where all threads need to wait on one to make progress? Bisection will eventually converge, but it seems to take quite some time to make forward progress.
Playing with ps -ef, it looks like there are 71 instances of the python interpreter running, with 0 instances of clang, clang_delta, or my reproducer shell script.