Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propolis server panicked after hitting some ???????? rust_panic call stack #755

Open
askfongjojo opened this issue Sep 1, 2024 · 2 comments
Milestone

Comments

@askfongjojo
Copy link

@augustuswm found a VM that had a propolis server panic. Here are the propolis log lines at the time:

18:41:28.222Z INFO propolis-server (vcpu_tasks): vCPU released from hold
    vcpu = 0
18:41:28.222Z INFO propolis-server (vm_state_worker): publishing new instance state
    gen = 9
    migration = InstanceMigrateStatusResponse { migration_in: None, migration_out: None }
    state = Running
18:41:28.222Z INFO propolis-server (vm_state_worker): State worker handled event
    outcome = Continue
18:41:28.222Z INFO propolis-server (vcpu_tasks): vCPU released from hold
    vcpu = 1
18:41:28.269Z INFO propolis-server: request completed
    latency_us = 55
    local_addr = [fd00:1122:3344:116::1:a6]:12400
    method = GET
    remote_addr = [fd00:1122:3344:116::1]:52233
    req_id = 8c5001b9-e06d-4606-acae-3fe5ab9cfdb0
    response_code = 200
    uri = /instance/state-monitor
18:41:28.637Z INFO propolis-server (ramfb): ramfb change
    config = Config { addr: 3198525440, fourcc: 875713112, flags: 0, width: 800, height: 600, stride: 3200 }
    state = valid
18:41:28.637Z INFO propolis-server (ramfb): notifying
18:41:28.637Z INFO propolis-server (vnc-server): pixel format set to fourcc=0x34325258
thread 'vcpu-0' panicked at bin/propolis-server/src/lib/vm/mod.rs:386:9:
vCPU 0: Unhandled VM exit: InstEmul(InstEmul { inst_data: [175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175, 175], len: 15 })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 12 18:41:34 Stopping because all processes in service exited. ]
[ Aug 12 18:41:34 Executing stop method (:kill). ]

Both the core file and propolis log can be found in /staff/rack3/BRM42220077/2024-08-29 (system-illumos-propolis-server:default.log.1723488297 and core.oxz_propolis-server_cdcb2537-a5ee-4e9d-97d8-d8a86b57d2dd.propolis-server.23389.1723488093.

@leftwo got the stack from the core file:

 fffff5ffef15b6ea _lwp_kill () + a                                                                                   
 fffff5ffef0ec6b2 raise (6) + 22                                                                                     
 fffff5ffef0c5dc8 abort () + 58                                                                                      
 0000000002592cd9 ???????? ()                                                                                        
 0000000002592cc9 ???????? ()                                                                                        
 000000000257b97d rust_panic () + d                                                                                  
 000000000257b636 std::panicking::rust_panic_with_hook::h0c682abf0664e68a () + 296                                   
 000000000257b362 std::panicking::begin_panic_handler::{{closure}}::h0d97730e7ba5f708 () + a2                        
 0000000002578db9 ???????? ()                             
 000000000257b0a6 ???????? ()                                                                                        
 00000000025bf522 ???????? ()                                                                                        
 00000000011957c8 ???????? ()                                                                                        
 000000000110f5ea ???????? ()                                                                                        
 000000000111412e core::ops::function::FnOnce::call_once{{vtable.shim}}::h19c41ab30f52a8dc () + 8e                   
 0000000002583dcb std::sys::pal::unix::thread::Thread::new::thread_start::h9ef84bfafda8297d () + 1b                  
 fffff5ffef154307 _thrp_setup (fffff5ffeef35240) + 77                                                                
 fffff5ffef154650 _lwp_start ()
@askfongjojo
Copy link
Author

Also took a look at the crucible downstairs around that time and didn't see anything out of the ordinary.

root@oxz_switch1:~# omdb db disks list | grep 8a713b03-e6a2-4659-90f1-d3fce8a071ef
master-027138b2-8f92-4d70-8b31-59a-e9de02                    e45db0db-f9d8-47e6-9be1-3aef84da229f 10 GiB  attached                   8a713b03-e6a2-4659-90f1-d3fce8a071ef 

root@oxz_switch1:~# omdb db disks info e45db0db-f9d8-47e6-9be1-3aef84da229f
HOST_SERIAL DISK_NAME                                 INSTANCE_NAME PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE 
BRM42220077 master-027138b2-8f92-4d70-8b31-59a-e9de02 master        oxz_propolis-server_cdcb2537-a5ee-4e9d-97d8-d8a86b57d2dd 10fac653-2a79-415b-bdf6-51703b502ac9 attached   
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK                        
BRM42220035 b5b90b8d-53b7-410a-81f2-f99bc9c41c2b oxz_crucible_3e598962-ef8c-4cb6-bdfe-ec8563939d6a 44af7b26-cb4f-4a38-bbbf-700159e4252a 
BRM42220013 ee2c2eda-9fc7-4e25-83f9-773fd33d478b oxz_crucible_475140fa-a5dc-4ec1-876d-751c48adfc37 d5345689-fdc8-4009-bef8-978d2492f302 
BRM42220089 60bb712a-798b-43e2-87e2-dae6a68f94fa oxz_crucible_de376149-aa45-4660-9ae6-15e8ba4a4233 517b2733-1261-4a3c-813b-f92a7b5eec66 

VCR from volume ID 10fac653-2a79-415b-bdf6-51703b502ac9
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT 
e45db0db-f9d8-47e6-9be1-3aef84da229f 512 1           false            

SUB VOLUME 0
    ID                                   BS  BPE    EC  GEN READ_ONLY 
    e45db0db-f9d8-47e6-9be1-3aef84da229f 512 131072 160 3   false     
    [fd00:1122:3344:10c::5]:19000
    [fd00:1122:3344:111::a]:19011
    [fd00:1122:3344:108::4]:19006
BRM42220089 # cat /pool/ext/dbfdc981-1b81-4d7d-9449-9530890b199a/crypt/debug/oxz_crucible_de376149-aa45-4660-9ae6-15e8ba4a4233/oxide-crucible-downstairs:downstairs-60bb712a-798b-43e2-87e2-dae6a68f94fa.log.1723488094 | looker
18:38:53.998Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:53.998Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:53.998Z INFO crucible: connection closed; disconnection
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:56076
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:56076): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.443Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing
BRM42220013 # cat /pool/ext/a1856609-b49f-4d30-a0e3-2e8154dbcdbf/crypt/debug/oxz_crucible_475140fa-a5dc-4ec1-876d-751c48adfc37/oxide-crucible-downstairs:downstairs-ee2c2eda-9fc7-4e25-83f9-773fd33d478b.log.1723488094 | looker
18:38:53.998Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:53.998Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:53.998Z INFO crucible: connection closed; disconnection
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:53.998Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:55290
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:55290): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.486Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing
BRM42220035 # cat /pool/ext/95220093-e3b8-4f7f-9f5a-cb32cb75180a/crypt/debug/oxz_crucible_3e598962-ef8c-4cb6-bdfe-ec8563939d6a/oxide-crucible-downstairs:downstairs-b5b90b8d-53b7-410a-81f2-f99bc9c41c2b.log.1723488094 | looker
18:38:54.001Z INFO crucible: upstairs disconnected
    id = 0
    task = recv
18:38:54.001Z WARN crucible: recv_task sending ConnectionClosed
    id = 0
    task = recv
18:38:54.001Z INFO crucible: connection closed; disconnection
18:38:54.001Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) removed, 0 jobs left
18:38:54.001Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: 06191f6e-dd09-4041-ae1c-ffb9d1e94b60, gen: 1 } (ConnectionId(0)) was previously active, clearing
18:39:39.434Z INFO crucible: accepted connection from [fd00:1122:3344:116::1:a6]:44735
    task = main
18:39:39.434Z INFO crucible: connection ([fd00:1122:3344:116::1:a6]:44735): tasks spawned
    id = 1
18:39:39.434Z INFO crucible: Connection request from e45db0db-f9d8-47e6-9be1-3aef84da229f with version 8
18:39:39.434Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } connected, version 8
18:39:39.434Z INFO crucible: UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } is now active (read-write)
18:39:39.444Z INFO crucible: Current flush_numbers [0..12]: [276, 306, 335, 364, 394, 423, 452, 482, 511, 540, 570, 599]
18:41:34.066Z INFO crucible: upstairs disconnected
    id = 1
    task = recv
18:41:34.066Z WARN crucible: recv_task sending ConnectionClosed
    id = 1
    task = recv
18:41:34.066Z INFO crucible: connection closed; disconnection
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) removed, 0 jobs left
18:41:34.066Z INFO crucible: upstairs UpstairsConnection { upstairs_id: e45db0db-f9d8-47e6-9be1-3aef84da229f, session_id: ef66202d-9830-4a91-9bf4-c22be271f29d, gen: 2 } (ConnectionId(1)) was previously active, clearing

@pfmooney
Copy link
Collaborator

pfmooney commented Sep 1, 2024

Looking at the opcode(s), it would seem that the guest vCPU jumped off into space. This lends more credence to the necessity of #335.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants