syncRoot barrier does not block for unfulfilled pledges #97

mratsim · 2020-01-12T21:30:54Z

syncRoot does not take into account unfulfilled pledges.

This is related to workers being able to send termination even though they have a task delayed by a pledge.
This can be avoided by workers keeping a local count of the delayed tasks they have pending.

See: https://dev.azure.com/numforge/Weave/_build/results?buildId=288&view=logs&j=6b0d97ed-2246-525a-6e2b-9532bade852d&t=90276eb0-5b44-500a-95e7-4e793047d715

mratsim · 2020-01-12T21:36:52Z

In particular in this finite state automaton, a new event + transition or changing the MyTreeIsIdle event to also check for delayed tasks should be sufficient:

weave/weave/stealing_fsm.nim

Lines 107 to 136 in 3bfb641

    
           # Decline your own steal request 
        
           # ------------------------------------------- 
        
           onEntry(declineReqFSA, DS_TreeIdle): 
        
             # debug: 
        
             #   log("Worker %2d: received own request (req.state: %s, left (%d): %s, right (%d): %s)\n", 
        
             #     myID(), $req.state, 
        
             #     myWorker().left, 
        
             #     if myWorker().leftIsWaiting: "waiting" else: "not waiting", 
        
             #     myWorker().right, 
        
             #     if myWorker().rightIsWaiting: "waiting" else: "not waiting") 
        
             discard 
        
           implEvent(declineReqFSA, DSE_MyTreeIsIdle): 
        
             req.state == Stealing and myWorker().leftIsWaiting and myWorker().rightIsWaiting 
        
           behavior(declineReqFSA): 
        
             ini: DS_MyOwnReq 
        
             event: DSE_MyTreeIsIdle 
        
             transition: discard 
        
             fin: DS_TreeIdle 
        
           behavior(declineReqFSA): 
        
             ini: DS_MyOwnReq 
        
             transition: 
        
               # Our own request but we still have work, so we reset it and recirculate. 
        
               ascertain: req.victims.capacity.int32 == workforce() 
        
               req.retry = 0 
        
               req.victims.refill() 
        
             fin: DS_FindVictimAndSteal

* Try to isolate the recurrent nil pledge bug on Travis MacOS * Remove some weird delaying logic * Try to sense if Travis bug is linked to #97 * more verbosity * nextDep wasn't properly zero-initialized * remove a repr * missed another "nil" * Cleanup investigative changes

… spawnDelayed #97 #115)

mratsim · 2020-04-26T14:00:02Z

Actually this is not so simple.

If a task is delayed, its owner is the Pledge handle and no threads should access it. It means that its lifetime is not tied to a thread and so a thread cannot count the delayed tasks reliable.
Once a pledge is fulfilled, all linked delayed tasks are scheduled immediately but there is no simple way for the thread that fulfilled the pledge to notify the creator thread to decrease its count.

That said:

if a pledge is never fulfilled that is an user error. The runtime will just continue forward and never process the tasks associated and never reclaim the memory as well
if there is no such error, that means that there is a thread making progress on the pledges, so the main thread should be blocked at sync_root, i.e. a worker can send that it is idle after delaying a task, but there is at least one worker in the runtime that will not.

So where is the bug?

A sanity check would be for each worker to count the delayed tasks created and processed. On idle, they send that count to their parent. We can then assert in the syncRootBarrier that when the runtime is quiescent the delayed tasks created/process counts are the same.

* model checking - 1st try to fix MPSC queue (the model checker crashes with not enough memory :/) * Give the thread the opportunity to not deadlock on sleep on Mac/with Clang * whoopsie * Add impl of Weave MPSC channel in C++ for CDSChecker model checking + comment out fences * Comment out GEMM tests for syncRoot + Pledges: #97 * don't use sleep, it's can deadlock in the CI ... * Try get epoch time to avoid mac bugs * use `getTime` and hope that it's properly implemented on Mac * State-machine, return to CheckTask to avoid leaving task spawning multitasks in queue (followup #121) * Don't spinlock for testing, deadlocks ARM and OSX * Could it be non-mono clock jitter? * Add some log for MacOS debug * Race condition between spawning the thread and entering the spinlock in the `isReady` test * a,d obviously I mess up the function call

mratsim added the bug 🪲 Something isn't working label Jan 12, 2020

mratsim mentioned this issue Jan 12, 2020

Reduce mpsc padding #96

Merged

mratsim added the good first issue 🔧 Good for newcomers label Jan 12, 2020

mratsim added a commit that referenced this issue Apr 4, 2020

Try to sense if Travis bug is linked to #97

2cbd224

This was referenced Apr 25, 2020

Scoped barriers #111

Merged

sync bug in matmul test #115

Open

mratsim added a commit that referenced this issue Apr 26, 2020

deactivate test with the non-nestable gemm (syncRoot doesn't lock for…

f8c8071

… spawnDelayed #97 #115)

mratsim mentioned this issue Apr 26, 2020

Fuzzing #120

Open

mratsim removed the good first issue 🔧 Good for newcomers label Apr 26, 2020

mratsim added a commit that referenced this issue May 9, 2020

Comment out GEMM tests for syncRoot + Pledges: #97

da5ae0b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

syncRoot barrier does not block for unfulfilled pledges #97

syncRoot barrier does not block for unfulfilled pledges #97

mratsim commented Jan 12, 2020

mratsim commented Jan 12, 2020

mratsim commented Apr 26, 2020

syncRoot barrier does not block for unfulfilled pledges #97

syncRoot barrier does not block for unfulfilled pledges #97

Comments

mratsim commented Jan 12, 2020

mratsim commented Jan 12, 2020

mratsim commented Apr 26, 2020