Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syncRoot barrier does not block for unfulfilled pledges #97

Open
mratsim opened this issue Jan 12, 2020 · 2 comments
Open

syncRoot barrier does not block for unfulfilled pledges #97

mratsim opened this issue Jan 12, 2020 · 2 comments
Labels
bug 🪲 Something isn't working

Comments

@mratsim
Copy link
Owner

mratsim commented Jan 12, 2020

syncRoot does not take into account unfulfilled pledges.

This is related to workers being able to send termination even though they have a task delayed by a pledge.
This can be avoided by workers keeping a local count of the delayed tasks they have pending.

See: https://dev.azure.com/numforge/Weave/_build/results?buildId=288&view=logs&j=6b0d97ed-2246-525a-6e2b-9532bade852d&t=90276eb0-5b44-500a-95e7-4e793047d715

@mratsim mratsim added the bug 🪲 Something isn't working label Jan 12, 2020
@mratsim
Copy link
Owner Author

mratsim commented Jan 12, 2020

In particular in this finite state automaton, a new event + transition or changing the MyTreeIsIdle event to also check for delayed tasks should be sufficient:

# Decline your own steal request
# -------------------------------------------
onEntry(declineReqFSA, DS_TreeIdle):
# debug:
# log("Worker %2d: received own request (req.state: %s, left (%d): %s, right (%d): %s)\n",
# myID(), $req.state,
# myWorker().left,
# if myWorker().leftIsWaiting: "waiting" else: "not waiting",
# myWorker().right,
# if myWorker().rightIsWaiting: "waiting" else: "not waiting")
discard
implEvent(declineReqFSA, DSE_MyTreeIsIdle):
req.state == Stealing and myWorker().leftIsWaiting and myWorker().rightIsWaiting
behavior(declineReqFSA):
ini: DS_MyOwnReq
event: DSE_MyTreeIsIdle
transition: discard
fin: DS_TreeIdle
behavior(declineReqFSA):
ini: DS_MyOwnReq
transition:
# Our own request but we still have work, so we reset it and recirculate.
ascertain: req.victims.capacity.int32 == workforce()
req.retry = 0
req.victims.refill()
fin: DS_FindVictimAndSteal

@mratsim mratsim added the good first issue 🔧 Good for newcomers label Jan 12, 2020
mratsim added a commit that referenced this issue Apr 4, 2020
* Try to isolate the recurrent nil pledge bug on Travis MacOS

* Remove some weird delaying logic

* Try to sense if Travis bug is linked to #97

* more verbosity

* nextDep wasn't properly zero-initialized

* remove a repr

* missed another "nil"

* Cleanup investigative changes
This was referenced Apr 25, 2020
@mratsim mratsim mentioned this issue Apr 26, 2020
@mratsim mratsim removed the good first issue 🔧 Good for newcomers label Apr 26, 2020
@mratsim
Copy link
Owner Author

mratsim commented Apr 26, 2020

Actually this is not so simple.

If a task is delayed, its owner is the Pledge handle and no threads should access it. It means that its lifetime is not tied to a thread and so a thread cannot count the delayed tasks reliable.
Once a pledge is fulfilled, all linked delayed tasks are scheduled immediately but there is no simple way for the thread that fulfilled the pledge to notify the creator thread to decrease its count.

That said:

  • if a pledge is never fulfilled that is an user error. The runtime will just continue forward and never process the tasks associated and never reclaim the memory as well
  • if there is no such error, that means that there is a thread making progress on the pledges, so the main thread should be blocked at sync_root, i.e. a worker can send that it is idle after delaying a task, but there is at least one worker in the runtime that will not.

So where is the bug?

A sanity check would be for each worker to count the delayed tasks created and processed. On idle, they send that count to their parent. We can then assert in the syncRootBarrier that when the runtime is quiescent the delayed tasks created/process counts are the same.

mratsim added a commit that referenced this issue May 9, 2020
* model checking - 1st try to fix MPSC queue (the model checker crashes with not enough memory :/)

* Give the thread the opportunity to not deadlock on sleep on Mac/with Clang

* whoopsie

* Add impl of Weave MPSC channel in C++ for CDSChecker model checking + comment out fences

* Comment out GEMM tests for syncRoot + Pledges: #97

* don't use sleep, it's can deadlock in the CI ...

* Try get epoch time to avoid mac bugs

* use `getTime` and hope that it's properly implemented on Mac

* State-machine, return to CheckTask to avoid leaving task spawning multitasks in queue (followup #121)

* Don't spinlock for testing, deadlocks ARM and OSX

* Could it be non-mono clock jitter?

* Add some log for MacOS debug

* Race condition between spawning the thread and entering the spinlock in the `isReady` test

* a,d obviously I mess up the function call
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🪲 Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant