The message-box/bt class will now restart its thread if it was aborted because of an error. #101
Conversation
I'm wondering a bit. The condition handling in Actor-cell was meant to prevent the message-box thread from dying, at the level of the actor. It is intended behavior that nothing happens to the message-box thread. An alternative behavior is supervision, as Erlang and other actor frameworks do it. The thread handling you added in message-box goes in that direction. But honestly I'm not ready to change this behavior yet. Supervision is in the tickets as a TODO, but there is a lot more to it, so this needs to be thought through.
Saying all this, I was wondering: what is your exact use-case? Maybe those changes are not needed at all? Effectively the message is handled by a pretty flat call-stack; there is effectively just the thread, and I'm not sure whether condition signalling will work as in non-event-driven Common Lisp code, because there is nowhere you could handle the signal except in the …
At least during debugging, when you insert a …
The sad side of this problem is that the thread dies silently: the actor continues to accept messages, its queue grows, but these messages aren't processed.
Yeah, so the thread shouldn't die. With the …
I'm worrying a little that my current solution could slow down the performance of the library, because it checks whether the thread is alive on each message push. I'll see if it is possible to only do such a check when the message handler unwinds the stack unexpectedly.
I'll make a minimal test to check it.
Yeah, the message-box loop is performance critical. From experience I can say that debugging (using the debugger) in a message-driven context is difficult, because you stop just one thread while the rest keeps running normally and processing messages. Best is to use debug logging.
I'd prefer to use a debugger, especially one as powerful as Common Lisp's. It would be a shame if we gave up and fell back to "debugging by prints" :)))
Please see my latest commit. I've added a test. The problem is still there, because handler-case does not catch non-local exits made by INVOKE-RESTART.
By the way, do you have any performance benchmark? I'd like to use it to see whether there is some degradation because of my changes.
Another problem I've just found is the tests. They aren't running on CI because of this error:
but the PR "check" is still green. This issue should probably be fixed too. I mean not just running the tests, but checking whether they actually ran and, if not, failing the CI job.
Well, debuggers in IntelliJ or Eclipse in the Java world are equally powerful, if not more so. And yet, when working with massively asynchronous environments, debuggers are mostly not useful.
Yeah, definitely. No good if the tests don't actually run.
Maybe I don't understand. When do you do …
Imagine that instead of HANDLER-BIND there is a funcall to some function I gave to the actor to process its messages. In my current project the actor processes messages from the Telegram bot API. Now, there is a lot of code in this Telegram message handling, and I want to debug some place (in my handler, not in Sento itself), so I put a … The debugger in Emacs shows me a bunch of restarts:
If I hit one of these restarts, the message-box thread dies. That is the case I want to fix.
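To make the failure mode concrete: invoking a restart is a plain non-local exit, not a signalled condition, so a HANDLER-CASE sitting between the restart and the code that invokes it never runs its clauses. A tiny self-contained example (not Sento code):

(restart-case
    (handler-case
        ;; imagine this restart being chosen interactively in the debugger
        (invoke-restart 'give-up)
      (error (c)
        (format t "caught ~a~%" c)))
  (give-up () :aborted))
;; => :ABORTED -- the ERROR clause never fires, the stack simply unwinds past it

In the message-box thread the same kind of unwind leaves the processing loop, which is why the thread ends up dead.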
Compare: 9cc4683 to 3df36a7
(bt2:condition-wait withreply-cvar withreply-lock)))
(cond
  (time-out
   (bt2:with-lock-held (withreply-lock)
Please try to remove the duplicated code.
I could move these three lines into a local function:
(log:trace "~a: pushing item to queue: ~a" (name msgbox) push-item)
(queue:pushq queue push-item)
(ensure-thread-is-running msgbox)
is it ok for you?
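For illustration, the factored-out local function could look roughly like this (the helper name is made up, not from the PR):

(flet ((push-and-ensure ()
         (log:trace "~a: pushing item to queue: ~a" (name msgbox) push-item)
         (queue:pushq queue push-item)
         (ensure-thread-is-running msgbox)))
  ;; both the time-out and the no-time-out branch would then just call:
  (push-and-ensure))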
A local function must still be redefined at runtime for each call of submit, and a function call in itself must save and restore registers and so on.
We're just talking about a handful of lines of code. I would say it was OK as before, with the factored-out common code, and just leaving the differences. Effectively we just have the additional (ensure-thread-is-running msgbox) call, no?
I've refactored this code and moved these two branches outside of bt2:with-lock-held not because of (ensure-thread-is-running msgbox); it is a fix for another problem I encountered while trying to write a test.
The problem is that this call:
(submit box "The Message" t 1
        (list (lambda (msg)
                (reverse msg))))
when you specify both withreply-p = t and timeout != nil, then this submit call hangs. Why does it hang? Because of this old version of the code in submit/reply:
(bt2:with-lock-held (withreply-lock)
  (log:trace "~a: pushing item to queue: ~a" (name msgbox) push-item)
  (queue:pushq queue push-item)
  (if time-out
      (wait-and-probe-for-msg-handler-result msgbox push-item)
      (bt2:condition-wait withreply-cvar withreply-lock)))
If time-out is specified, then withreply-lock is held during (wait-and-probe-for-msg-handler-result msgbox push-item). But the process-queue-item function, which is called in the box's thread, also tries to acquire withreply-lock (here). It can't, because the lock is already held by the other thread. Because of this, the box's thread can't execute the handler and return a value, which leads to a situation where (wait-and-probe-for-msg-handler-result) runs for the given timeout seconds and then fails because it didn't receive any result.
I'd call this situation an "almost dead-lock" :(
That is why I've made these two branches separate: in the branch where time-out is given, we release the lock before the wait-and-probe-for-msg-handler-result call, whereas in the branch where timeout is NIL, the lock is released by the call to bt2:condition-wait.
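Roughly, the branching described above has this shape (a simplified sketch, not the literal PR diff):

(if time-out
    ;; with a timeout: hold the lock only while pushing, then release it so the
    ;; message-box thread can acquire it and process the item, and only then probe
    (progn
      (bt2:with-lock-held (withreply-lock)
        (queue:pushq queue push-item)
        (ensure-thread-is-running msgbox))
      (wait-and-probe-for-msg-handler-result msgbox push-item))
    ;; without a timeout: bt2:condition-wait releases the lock itself while waiting
    (bt2:with-lock-held (withreply-lock)
      (queue:pushq queue push-item)
      (ensure-thread-is-running msgbox)
      (bt2:condition-wait withreply-cvar withreply-lock)))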
By the way, I've just found that bt2:condition-wait has its own timeout argument. Why didn't you use it instead of the call to wait-and-probe-for-msg-handler-result? This code could probably be simpler if we handled the timeout from bt2:condition-wait.
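If that API were used, the reply path could look roughly like this (assuming bt2:condition-wait accepts a :timeout keyword and returns NIL when the wait expires; worth double-checking against the bordeaux-threads v2 documentation):

(bt2:with-lock-held (withreply-lock)
  (queue:pushq queue push-item)
  (unless (bt2:condition-wait withreply-cvar withreply-lock :timeout time-out)
    ;; NIL would mean the handler did not reply within time-out seconds
    (log:warn "~a: timed out waiting for the handler result" (name msgbox))))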
Yeah, indeed there is a problem: when the timeout is big, the message is only processed after the wait time is over.
I need to wrap my head around this code. I'm wondering right now why we don't just remove the lock altogether, and why it is needed at all.
The lock is needed to use a condition variable; its API requires a lock. Without the condition variable you would have to use assert-cond with an unlimited timeout.
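The usual pattern (a generic example, not Sento code) is that the waiting thread holds the lock, checks its predicate, and lets condition-wait atomically release the lock while it sleeps:

(defvar *lock* (bt2:make-lock))
(defvar *cvar* (bt2:make-condition-variable))
(defvar *done* nil)

;; waiting thread
(bt2:with-lock-held (*lock*)
  (loop until *done*   ; re-check the predicate to guard against spurious wake-ups
        do (bt2:condition-wait *cvar* *lock*)))

;; notifying thread
(bt2:with-lock-held (*lock*)
  (setf *done* t)
  (bt2:condition-notify *cvar*))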
OK, got it I think.
So I think this code is OK. We can keep that bit of duplication on each branch.
(in-suite message-box-tests)
(defun wait-while-thread-will-die (msgbox &key (timeout 10)) |
I think you can use one of the assert-cond or await-cond utils in the miscutils package.
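For example, something along these lines (assuming await-cond takes a maximum wait time in seconds followed by a condition form that is polled until true; how the thread is read from the msgbox is a guess):

(miscutils:await-cond 10
  (not (bt2:thread-alive-p (slot-value msgbox 'queue-thread))))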
I used bench.lisp as benchmark.
@@ -179,6 +198,23 @@ This function sets the result as `handler-result' in `item'. The return of this
      (bt2:condition-notify withreply-cvar)))
    (handler-fun)))))
(declaim (ftype (function (message-box/bt) |
What's the purpose of the declaim?
Recently I've discovered that such type information sometimes helps to find issues during compilation. So I started to add declarations to the code I'm touching in my own projects, and decided it would not hurt to use them here as well.
But if you wish, I'll remove it.
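For reference, a complete declamation of that shape would look something like this (the function name and return type are illustrative; only the argument type (message-box/bt) is visible in the diff above). With such a declaration SBCL can warn at compile time when the function is called with an argument of the wrong type:

(declaim (ftype (function (message-box/bt) t)
                ensure-thread-is-running))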
(bt2:with-lock-held (withreply-lock)
  (log:trace "~a: pushing item to queue: ~a" (name msgbox) push-item)
  (queue:pushq queue push-item)
  (ensure-thread-is-running msgbox)
We also need to add this to submit/no-reply. And then it's probably better to put it into submit, to catch both ways of submitting.
In that case we then indeed need the additional lock.
msgbox
  (bt2:with-lock-held (thread-lock)
    (unless (bt2:thread-alive-p queue-thread)
      (log:trace "Restarting thread ~A"
I'd like to have this as warn-level logging, to know exactly when it happens. It should be an exceptional case.
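A rough sketch of how that check could look with warn-level logging (slot names follow the diff fragments above; the thread-starting helper at the end is a placeholder, not an actual Sento function):

(defun ensure-thread-is-running (msgbox)
  (with-slots (queue-thread thread-lock) msgbox
    (bt2:with-lock-held (thread-lock)
      (unless (bt2:thread-alive-p queue-thread)
        (log:warn "Restarting dead message-box thread: ~a" (name msgbox))
        ;; start-processing-thread stands in for whatever (re)creates the loop thread
        (setf queue-thread (start-processing-thread msgbox))))))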
I used bench.lisp as benchmark.
From what I measured, it's roughly a difference of 4-5% using …
The current benchmark code uses …
OK, but keep in mind that I use this bench for various Lisp impls, and on some …
OK, thanks for the explanation.
Having said this, I'm mostly looking at what …
But this way it is hard to compare performance between runs, or to compare the current performance with some historical value. What I want to change: …
This way it will be more convenient to see whether there is some performance degradation because of my changes.
By the way, did you experience a heap exhaustion problem when running the benchmark? If I run it for more than 15-20 seconds, SBCL dies, because for some reason its garbage collection does not delete all garbage. I have a 4G limit for the process, and even when I run the benchmark for just 15 seconds I'm wondering why the GC doesn't free more memory during the run!?
I have also seen this. That's why there is a conditionally reduced load for SBCL in bench.lisp.
Yeah, OK. For me it was sufficient to calculate the msg/sec from the data …
The biggest impact on time that I recognized is the queue implementation. I've experimented with a few. The current one in use is an implementation from the book "Programming Algorithms in Lisp" (see queue-locked.lisp). It requires a bit more memory than the cons-queue used by lparallel, which is equally fast.
After some research, I developed a hypothesis as to why the GC does not clean up memory. The point is that in the benchmark, N threads generate messages for a single actor. If the actor cannot process the messages fast enough, they accumulate in the queue. The test ends when all the messages in the queue are processed. When there are many messages in the queue and the GC is triggered, it sees that there are references to these messages and cannot clean them up, so it moves these objects into an older generation. The longer the queue keeps being processed while objects are still being generated, the more such objects end up in the older generations of the garbage collector. When the test ends, there are no longer any references to the messages, but because the GC placed them in the older generations, it does not clean them up during regular runs, and they remain in memory. However, …
How did I figure this out? I'd like to say: "Very easily!", but no :( Initially, I decided to investigate the nature of the objects that remain in memory after the benchmark, and I wrote the following:
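Roughly, such a helper can be built on SBCL internals like this (sb-vm::map-allocated-objects is an internal whose package and signature may vary between SBCL versions; the function below is an illustration of the idea, not the exact code used):

(defun random-object-pointer (&key (space :dynamic))
  "Pick a pseudo-random heap object and return a weak pointer to it,
so that holding the result does not itself keep the object alive."
  (let ((seen 0)
        (chosen nil))
    ;; reservoir sampling over every allocated object in SPACE
    (sb-vm::map-allocated-objects
     (lambda (obj widetag size)
       (declare (ignore widetag size))
       (incf seen)
       (when (zerop (random seen))
         (setf chosen obj)))
     space)
    (and chosen (sb-ext:make-weak-pointer chosen))))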
This function retrieves a random object from memory and returns a weak pointer to it. Why a weak pointer? To avoid creating an additional reference to the object. It turned out that a significant portion of the objects were messages from the actor's queue.
Next, I tried to figure out whether any references to these objects were being held. For this, SBCL has a function for searching roots:
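For example (sb-ext:search-roots exists in recent SBCL versions and reports reference paths from GC roots to the objects behind the given weak pointers; the exact keyword arguments may vary between versions):

(let ((wp (random-object-pointer)))   ; e.g. a weak pointer from the helper sketched above
  (sb-ext:search-roots wp))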
This is just an example. But for the objects created as a result of the benchmark, search-roots did not return anything, which indicated that these objects were "hanging in the air," and the GC could have removed them. Then I additionally tested my hypothesis that memory is not being freed because the queue is overloaded with too many objects. To do this, I modified the code that sends messages to actors so that every 10,000–20,000 messages, a (sleep 0.1) would occur. And this helped—the GC started cleaning up messages in a timely manner, and they stopped accumulating in the older generations of the GC. But what was even more surprising was that slowing down message generation led to an increase in the actor's throughput. Without the sleep, it processed about 777,000 messages per second, but with the sleep, it managed to process 821,000. This acceleration is likely due to the fact that with slower garbage generation, it does not accumulate in memory, and the GC spends less time collecting it. Without sleep:
With delayed message generation:
From this data, it is clear that although the non-GC time increased by a few seconds, the GC time decreased by an order of magnitude. Sometimes, slowing down leads to speeding up. That's how it is!
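The throttled sender loop had roughly this shape (illustrative only; the actual loop in bench.lisp and the names used here differ):

(loop for i from 1 to +messages-per-thread+
      do (act:tell actor msg)
      when (zerop (mod i 10000))
        ;; give the actor (and the GC) a chance to catch up with the queue
        do (sleep 0.1))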
I think the benchmark should be modified so that it does not fill the actor's queue with thousands of messages. What do you think?
Nice findings. I think in regard to the robustness of the test itself, adding a small delay will be fine. The test case is pretty artificial anyway; it's just there to give a baseline for comparison and a rough idea of the throughput.
No. I tried to run the benchmark under SBCL compiled with the new parallel GC, and it has the same problem (while being 15% slower). Under LispWorks it is the same issue: memory is not released after the benchmark until the GC is invoked manually in full mode.
OK, maybe. When I tried, I didn't have those issues with any of the other CL implementations (and I tried a few). SBCL had the most issues with garbage collection.
ABCL I thought was the most robust in terms of GC. Slow but robust. The JVM GC is not bad at all.
Anyway. Would be great to improve it.
@mdbergmann I've made a new version of the fix in a separate PR: #103