Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

got a corrupted external term #68

Open
zwizwa opened this issue Apr 12, 2020 · 25 comments · May be fixed by #69
Open

got a corrupted external term #68

zwizwa opened this issue Apr 12, 2020 · 25 comments · May be fixed by #69

Comments

@zwizwa
Copy link

zwizwa commented Apr 12, 2020

Hi,

I keep running into this error and I'm a bit lost regarding how to start debugging it. I'm not quite sure how it happens, but in very vague terms it seems to happen when I send too much logging data from Erlang to Emacs. Once this error triggers, the distel communication gets into a bad state, and the only way I've found to fix it is to restart emacs.

=WARNING REPORT==== 12-Apr-2020::10:37:00 ===
'[email protected]' got a corrupted external term from 'distel_27211@panda' on distribution channel 28280
<<...,131,104,4,97,6,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,100,0,0,100,0,3,114,101,120,131,104,2,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,188,139,0,0,0,0,0,104,5,100,0,4,99,97,108,108,100,0,4,99,111,100,101,100,0,13,101,110,115,117,114,101,95,108,111,97,100,101,100,108,0,0,0,1,100,0,6,100,105,115,116,101,108,106,103,100,0,18,100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97,0,0,0,1,0,0,0,0,0>>
ATOM_CACHE_REF translations: none

@zwizwa
Copy link
Author

zwizwa commented Apr 18, 2020

Taking that apart gives a term

Term = <<
131, %% ETF
104,4, %% SMALL_TUPLE_EXT
97,6,%% SMALL_INTEGER_EXT
103, %% PID_EXT
100, %% ATOM_EXT
0,18,
100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda"
0,0,188,139, %% ID
0,0,0,0, %% Serial
0, %% Creation
100, %% ATOM_EXT
0,0,
100, %% ATOM_EXT
0,3,
114,101,120>>.

followed by another term
131, ....

binary_to_term(Term). gives exception error: bad argument, as does binary_to_term(Pid) below, so it seems to be the PID_EXT.

Pid = <<
131,
103, %% PID_EXT
100, %% ATOM_EXT
0,18,
100,105,115,116,101,108,95,50,55,50,49,49,64,112,97,110,100,97, %% Node: "distel_27211@panda"
0,0,188,139, %% ID
0,0,0,0, %% Serial
0>>. %% Creation

The format seems to be ok, so what causes the term parse error?

@leoliu
Copy link
Contributor

leoliu commented Apr 19, 2020

Based on #68 (comment):

the ID of the PID is 48267 which is 1011110010001011. ETF says ID "Only 15 bits are significant; the rest are to be 0." Looks like the 16th bit may be causing trouble.

@zwizwa
Copy link
Author

zwizwa commented Apr 19, 2020

If I change that Term to something that doesn't have bit 15 set then indeed it does parse with binary_to_term/1. Am I spawning too many processes? I do have a lot of transient processes in the logger I mentioned.

@leoliu
Copy link
Contributor

leoliu commented Apr 20, 2020

It’s not your fault. I will see if I can send in a fix today. There is likely a bug in distel that does not handle ID wraparound correctly.

Could you try the PR and see if it has any ill effect. Thanks.

@zwizwa
Copy link
Author

zwizwa commented Apr 22, 2020

Thanks. I've switched to f9c5067 as my daily driver. I'll let you know how it goes. I can confirm that in the buggy version the problems indeed started to happen when erl-pid-counter went over 32k.

Just FYI: Maybe I have github misconfigured, but it seems to not send notifications for message edits. I just saw your fix when I checked back to see if I could do some debugging myself... It does seem to send notifications for new messages.

@leoliu
Copy link
Contributor

leoliu commented Apr 23, 2020

My bad. I was not fully aware of how gh notifications work. I might amend the PR with some doc fix as follows.

--- a/elisp/erl.el
+++ b/elisp/erl.el
@@ -58,7 +58,7 @@ (defun make-erl-local-pid ()
     pid))
 
 (defun erl-get-counter (counter)
-  "Get and then advance counter."
+  "Get and then advance COUNTER."
   (cl-macrolet ((modf (place y) `(cl-callf mod ,place ,y))
                 (roundf (place) `(cl-callf round ,place)))
     (cl-labels ((increment (cells)
@@ -83,7 +83,7 @@ (defun erl-determine-hostname ()
 
 ;; (NUM SERIAL)
 (defvar erl-pid-counter '((0 . #x8000) (0 . #x2000))
-  "Counters for PIDs.")
+  "Counter for PIDs.")
 
 (defvar erl-node-name
   (intern (format "distel_%S@%s" (emacs-pid) (erl-determine-hostname)))

@zwizwa
Copy link
Author

zwizwa commented Apr 23, 2020

I'll keep an eye on the counter :)

Currently at:
erl-pid-counter
((8917 . 32768) (0 . 8192))

@zwizwa
Copy link
Author

zwizwa commented May 1, 2020

Things started to misbehave once it wrapped. Currently at
((9 . 32768) (1 . 8192))

My code is a horrible rube goldberg machine so I'm not really clear what exactly happens. I'll look into this some more to see if I can get you a decent error log or a fix.

@zwizwa
Copy link
Author

zwizwa commented May 1, 2020

I ran a test until the next wrap around and something went wrong again.

In the minibuffer, I get "Buffer "derl [email protected]" has a running process; kill it? (yes or no) "

The beam says:
=ERROR REPORT==== 1-May-2020::18:38:55 ===
** Node distel_13251@panda not responding **
** Removing (timedout) connection **

I get an error message in the emacs Messages log:
erl-pid->buffer: No buffer for pid [TYPE erl-pid distel_13251@panda 34 1 0]

And the erl-pid-counter is at
((35 . 32768) (2 . 8192))

So it seems that the 34 1 0 in the error message is supposed to be 34 2 0.

@zwizwa
Copy link
Author

zwizwa commented May 1, 2020

I think I found the issue in:

(defun erl-pid->buffer (pid)
"Get PID's buffer."
(or (cdr (assoc (erl-pid-id pid) erl-process-buffer-alist))
(error "No buffer for pid %S" pid)))

That only uses erl-pid-id and not erl-pid-serial.

My code fails because right after wraparound 1->2 this no longer finds a registered process due to erl-pid->buffer failing for [TYPE erl-pid distel_13251@panda 34 1 0]

@zwizwa
Copy link
Author

zwizwa commented May 1, 2020

Also erl-unenroll-process removes processes if just the id matches. Seems like the fix would be to change the keying for erl-process-buffer-alist.

@leoliu
Copy link
Contributor

leoliu commented May 2, 2020

@zwizwa thanks for catching this. I have added another commit to the PR which hopefully should fix this issue.

@zwizwa
Copy link
Author

zwizwa commented May 4, 2020

Thanks I've switched to baaa46b for daily use.

@zwizwa
Copy link
Author

zwizwa commented May 10, 2020

It looks ok now. I've not run into any of these issues again, and currently erl-pid-counter is at ((14075 . 32768) (4 . 8192))
Thanks again.

@leoliu
Copy link
Contributor

leoliu commented May 10, 2020

@zwizwa Excellent news. I added another commit just now to make sure make-erl-local-pid never overrides existing pids. This bug is much harder to trigger but is possible. Are you able to give it a brief test to make sure everything is still working? Thanks.

@zwizwa
Copy link
Author

zwizwa commented May 11, 2020

I switched to df6d495

@zwizwa
Copy link
Author

zwizwa commented May 11, 2020

I think this broke something. Starting to run into errors like this:
EXIT: <0.16.0> [emacs-error "(error "erl-pid-node accessing a non-erl-pid")"]
I'll see if I can trace it down.

@leoliu
Copy link
Contributor

leoliu commented May 11, 2020

It may be a fault of erl-null-pid-p. I have amended last commit to make it behave the same as before. Could you check again and see if the problem persists? Thanks a lot.

@leoliu
Copy link
Contributor

leoliu commented May 19, 2020

@zwizwa are you able to try 3b04b66? If there are no issues we can suggest to the maintainers to merge the PR. Thanks.

@zwizwa
Copy link
Author

zwizwa commented May 19, 2020

Sorry, busy week. I'll give it a try tomorrow after work.

@zwizwa
Copy link
Author

zwizwa commented May 20, 2020

Tried 3b04b66 but it broke something. Reverted to previous but still broken so this is likely my bug. I'll look again tomorrow with a fresh eye.

@leoliu
Copy link
Contributor

leoliu commented May 20, 2020

@zwizwa Thanks a lot for testing. BTW there is no hurry we can take as much time as needed.

@zwizwa
Copy link
Author

zwizwa commented May 20, 2020

You're welcome. I really appreciate your maintenance effort. Distel is an integral part of my setup. What would I do without it :)

The bug was indeed mine. I'll switch back to the new head after work today.

@zwizwa
Copy link
Author

zwizwa commented May 27, 2020

It's been running without issues the whole week.

@leoliu
Copy link
Contributor

leoliu commented May 27, 2020

@zwizwa excellent news!

@massemanet PR #69 is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants