test failed in CI: helios/deploy CLI SendError #6771

Open
jgallagher opened this issue Oct 4, 2024 · 20 comments

Labels
Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

Comments

@jgallagher
Contributor

This test failed on a CI run on "main":

https://github.com/oxidecomputer/omicron/runs/31088964323

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ

Excerpt from the log showing the failure:

976	2024-10-04T15:58:09.854Z	+ /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
977	2024-10-04T15:59:47.203Z	sending chunk to thread failed with SendError { .. }
978	2024-10-04T15:59:48.324Z	channel closed
@jgallagher added the "Test Flake" label on Oct 4, 2024
@jgallagher changed the title from "test failed in CI: NAME_OF_TEST" to "test failed in CI: helios/deploy CLI SendError" on Oct 4, 2024
@davepacheco
Collaborator

Figuring that kind of failure might result from a Nexus crash, I took a look at the Nexus logs. I don't see any crashes, but I do see some related-looking request failures. All the import-related request log entries seem to be in this log:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan

and I see a bunch of messages like this one:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan#L35948

2024-10-04T15:59:47.315Z	INFO	nexus (dropshot_external): request completed
    error_message_external = cannot import blocks with a bulk write for disk in state ImportReady
    error_message_internal = cannot import blocks with a bulk write for disk in state ImportReady
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:938
    latency_us = 451297
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:52890
    req_id = e74f85b8-5e08-495e-a56f-2239902eeaab
    response_code = 400
    uri = /v1/disks/debian11-boot/bulk-write?project=images

and a couple where the state is Finalizing or Detached.
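
(For context, these 400s come from a state check on the disk: bulk writes are only accepted while the disk is in its bulk-import state, and requests seen while the disk is in ImportReady, Finalizing, or Detached get rejected with an error like the one above. A minimal Rust sketch of that kind of gate, illustrative only and not the actual Nexus code:)

// Illustrative only: a simplified version of the state gate that produces
// the "cannot import blocks with a bulk write for disk in state ..." error.
#[allow(dead_code)]
#[derive(Debug)]
enum DiskState {
    ImportReady,             // disk created for import, but bulk writes not started
    ImportingFromBulkWrites, // bulk-write-start accepted; bulk writes allowed
    Finalizing,
    Detached,
}

fn check_bulk_write_allowed(state: &DiskState) -> Result<(), String> {
    match state {
        DiskState::ImportingFromBulkWrites => Ok(()),
        other => Err(format!(
            "cannot import blocks with a bulk write for disk in state {:?}",
            other
        )),
    }
}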

@jmpesp
Contributor

jmpesp commented Oct 4, 2024

I'm unable to reproduce this now, of course, though I noticed that the oxide-v0.1.0 binary that is grabbed during the deploy job is quite old:

james@atrium ~ $ curl -sSfL -I http://catacomb.eng.oxide.computer:12346/oxide-v0.1.0
...
Last-Modified: Tue, 05 Dec 2023 21:41:37 GMT
...

@jmpesp
Contributor

jmpesp commented Oct 4, 2024

I was finally able to reproduce this using the old CLI binary:

[00:00:16] [█████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 954.50 MiB/2.00 GiB (75.49 MiB/s, 14s)
sending chunk to thread failed with SendError { .. }
channel closed

The trick was to set a --timeout to something very small:

./oxide-v0.1.0 --timeout 2 ...

causing the .disk_bulk_write_import call to time out, which makes the upload thread drop the receiver. This makes me believe it's related to I/O being very slow in this setup, as the normal timeout without the explicit argument is probably the progenitor default of 15 seconds!
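
(The SendError itself is just what a Rust channel sender reports once every receiver has been dropped; a minimal sketch of that failure mode, using std's mpsc and illustrative names rather than the CLI's actual code:)

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Reader -> uploader channel, loosely mirroring the disk import path.
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // "Upload thread": pretend the bulk-write request timed out, so the
    // thread bails out early; its receiver is dropped when it returns.
    let uploader = thread::spawn(move || {
        let _first_chunk = rx.recv();
        thread::sleep(Duration::from_millis(5)); // stand-in for the timed-out HTTP call
    });

    // Reader side: once the receiver is gone, send() returns a SendError,
    // which is what "sending chunk to thread failed with SendError { .. }"
    // is reporting.
    loop {
        if let Err(e) = tx.send(vec![0u8; 512 * 1024]) {
            eprintln!("sending chunk to thread failed with {e:?}");
            break;
        }
        thread::sleep(Duration::from_millis(1));
    }
    uploader.join().unwrap();
}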

In contrast, the latest CLI responds to a timeout differently:

[00:00:17] [███████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 907.50 MiB/2.00 GiB (52.85 MiB/s, 22s
Communication Error: error sending request for url (http://fancyfeast:8080/v1/disks/crucible-tester-duster-disk?project=myproj): operation timed out: sending chunk to thread failed: channel closed
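
(For reference on where the 15-second figure would come from: progenitor-generated SDK clients wrap a reqwest::Client, and the per-request timeout is whatever that underlying client was built with, which per the comment above is probably 15 seconds by default. Raising it, which is presumably what --timeout does under the hood, looks roughly like the sketch below; the oxide_sdk name and the new_with_client constructor follow the usual progenitor pattern but are assumptions here, not verified against the oxide SDK:)

use std::time::Duration;

// Sketch: build a reqwest client with a larger per-request timeout and hand
// it to the generated SDK client instead of relying on the default.
fn build_http_client(timeout: Duration) -> reqwest::Client {
    reqwest::Client::builder()
        .connect_timeout(Duration::from_secs(15))
        .timeout(timeout) // e.g. Duration::from_secs(60) instead of 15
        .build()
        .expect("valid reqwest client config")
}

// `oxide_sdk` and `new_with_client` are placeholders for the generated client:
// let client = oxide_sdk::Client::new_with_client(base_url, build_http_client(Duration::from_secs(60)));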

@jmpesp
Contributor

jmpesp commented Oct 4, 2024

This is almost certainly the case: Nexus reports that a client disconnected during a request to the bulk write endpoint:

nexus (dropshot_external): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 1546327
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:36568
    req_id = 6f78ea08-7945-43f1-8574-54f7b2b86ab8
    uri = /v1/disks/debian11-boot/bulk-write?project=images

@iliana
Contributor

iliana commented Oct 8, 2024

Do we think updating the CLI binary is useful here, or will it similarly time out?
And/or should we increase the timeout to handle the CI setup not being Gimlet-grade SSDs?

@jmpesp
Contributor

jmpesp commented Oct 8, 2024

Do we think updating the CLI binary is useful here, or will it similarly time out?

We probably should update it so we're testing the latest binary in general, but I suspect it will similarly time out.

should we increase the timeout to handle the CI setup not being Gimlet-grade SSDs?

I'm not sure: .github/buildomat/jobs/deploy.sh says that the target is lab-2.0-opte-0.33, and doesn't that mean that this is running on our hardware? Those shouldn't have disks so bad that it takes 15 seconds to write 512 KiB (!).

If I'm misreading this and it's running on AWS, are these t2.nano instances or something haha? Again, 15 seconds to write 512 KiB is huge, unless it's something like we don't have TRIM enabled and there's a big reclaim going on?

@iliana
Contributor

iliana commented Oct 8, 2024

It is running on lab hardware, I'm just not sure what the SSDs in the machine are. (Also keep in mind that the ZFS pools used by the deployed control plane are file-backed.)

@leftwo
Contributor

leftwo commented Oct 9, 2024

I'm using stock omicron bits on my bench gimlet and I can get this to happen 100% of the time:

EVT22200005 # oxide disk import --project alan --description "test2" --path /alan/cli/jammy-server-cloudimg-amd64.raw --disk test2
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 377.50 MiB/2.20 GiB (3.53 MiB/s, 9m)sending chunk to thread failed with SendError { .. }
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 378.50 MiB/2.20 GiB (3.53 MiB/s, 9m)
channel closed

I'm on omicron commit: 0640bb2
My bench gimlet is:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22861 oxide i386 oxide

And, the cli:

EVT22200005 # oxide version
Oxide CLI 0.6.1+20240710.0
Built from commit: 4a896d13efa760c0e9120e5d82fb2b6b7f5a27da 
Oxide API: 20240821.0

I added a --timeout 60 to my import line, and I did get an upload to work.

@leftwo
Contributor

leftwo commented Oct 9, 2024

I updated my bench gimlet to:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22957 oxide i386 oxide

And, it still fails the same way.

@andrewjstone
Contributor

I'm unclear if this is the same error, but it seems like it might be related:
https://github.com/oxidecomputer/omicron/pull/6810/checks?check_run_id=31332921093

702	2024-10-10T05:40:53.995Z	+ /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
1703	2024-10-10T05:44:03.700Z	one of the upload threads failed
1704	2024-10-10T05:44:04.618Z	one of the upload threads failed

@jgallagher
Contributor Author

I saw the same variant Andrew did on https://buildomat.eng.oxide.computer/wg/0/details/01J9Y2BWQWFB0HPRRFCWC6GHAN/HzxHMKpHk2lzsmDBBvStwfjrwTZfnljEjTH4TLjd2l3KTopv/01J9Y2CEDVND9FA5DK9G0AX5FY. That run is on #6822, which does modify the pantry code path, but I think the modifications it makes (changing how Nexus chooses a pantry) would result in a different kind of failure if they were wrong.

871     2024-10-11T16:02:24.918Z        + /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.1.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
872     2024-10-11T16:05:33.357Z        one of the upload threads failed
873     2024-10-11T16:05:34.447Z        one of the upload threads failed

@jgallagher
Contributor Author

I hit this a few times in a4x2 yesterday. I'll try Alan's workaround of a higher timeout next time I see it and report back.

@hawkw
Member

hawkw commented Oct 17, 2024

Observed this again on #6879 after it merged to main as commit e577765: https://buildomat.eng.oxide.computer/wg/0/details/01JADWA2KEP8QZFHFYKN64S72N/EMg9DuE9YnNp0draUTy1mUsUfXrDkrVzCz53bQfjKaG7Chwz/01JADWASA6T2RC0N248ZNJC5NT#S1312

Seems like the same variant John observed in #6771 (comment), but I figured I'd drop it here anyway in case it's useful.

@bnaecker
Collaborator

I've hit this several times in a row trying to get #6889 merged. Not sure what the best path forward in this case is, but maybe we can bump the timeout as @leftwo suggested, in the helios/deploy job specifically?

@bnaecker
Collaborator

In the interest of getting PRs merged more readily, I've increased the timeout to 60s in d69155e. This obviously sucks, and isn't a substitute for a deeper understanding of and better solution to the issue.

@leftwo
Contributor

leftwo commented Oct 21, 2024

On the bench gimlet setup, I captured the pantry log while reproducing the error. The full log is here: /staff/cores/omicron-6771

Of interest, we have the initial creation and setup request:

15:24:52.335Z INFO crucible-pantry (dropshot): accepted connection                                                                        
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    remote_addr = [fd00:1122:3344:101::a]:38249                                                                                           
15:24:52.335Z INFO crucible-pantry (datafile): no entry exists for volume edda9e6e-c95d-4488-a73d-ec5bf91a121d, constructing...           
15:24:52.335Z INFO crucible-pantry (datafile): Upstairs starts                                                                            
15:24:52.335Z INFO crucible-pantry (datafile): Crucible Version: BuildInfo {                                                              
        version: "0.0.1",                                                                                                                 
        git_sha: "2b88ab88461fb06aaf2aab11c5e381a3cad25eac",                                                                              
        git_commit_timestamp: "2024-09-27T17:26:25.000000000Z",                                                                           
        git_branch: "main",                                                                                                               
        rustc_semver: "1.80.0",                                                                                                           
        rustc_channel: "stable",                                                                                                          
        rustc_host_triple: "x86_64-unknown-illumos",                                                                                      
        rustc_commit_sha: "051478957371ee0084a7c0913941d2a8c4757bb9",                                                                     
        cargo_triple: "x86_64-unknown-illumos",                                                                                           
        debug: false,                                                                                                                     
        opt_level: 3,                                                                                                                     
    }                                                                                                                                     
15:24:52.335Z INFO crucible-pantry (datafile): Upstairs <-> Downstairs Message Version: 11                                                
15:24:52.335Z INFO crucible-pantry (datafile): Using region definition RegionDefinition { block_size: 512, extent_size: Block { value: 131072, shift: 9 }, extent_count: 48, uuid: 00000000-0000-0000-0000-000000000000, encrypted: true, database_read_version: 1, database_write_version: 1 }
15:24:52.336Z INFO crucible-pantry (datafile): Crucible edda9e6e-c95d-4488-a73d-ec5bf91a121d has session id: daff5380-9074-4e67-9d63-2597c356c51f
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:24:52.336Z INFO crucible-pantry (datafile): Upstairs opts: Upstairs UUID: edda9e6e-c95d-4488-a73d-ec5bf91a121d, Targets: [[fd00:1122:3344:101::14]:19000, [fd00:1122:3344:101::12]:19000, [fd00:1122:3344:101::18]:19000], lossy: false, flush_timeout: None, key populated: true,  cert_pem populated: false,  key_pem populated: false,  root_cert_pem populated: false,  Control: None,  read_only: false
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:24:52.336Z INFO crucible-pantry (datafile): Crucible stats registered with UUID: edda9e6e-c95d-4488-a73d-ec5bf91a121d
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:24:52.380Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d constructed ok
15:24:52.380Z INFO crucible-pantry (datafile): The guest has requested activation
15:24:52.380Z INFO crucible-pantry (datafile): edda9e6e-c95d-4488-a73d-ec5bf91a121d active request set
    session_id = daff5380-9074-4e67-9d63-2597c356c51f

Things start up and downstairs connect:

15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::18]:19000) has region UUID 03f3231b-aacc-40c1-a5fa-20dab91ee3f1
     = downstairs
    client = 2
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::12]:19000) has region UUID e5c795fe-4a2e-4585-b9b0-c2d1221bbdfa
     = downstairs
    client = 1
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::14]:19000) has region UUID f475816b-a491-4f4d-befc-1ab264dcc997
     = downstairs
    client = 0
    session_id = daff5380-9074-4e67-9d63-2597c356c51f

We activate, and bulk uploads start:

15:24:52.383Z INFO crucible-pantry (datafile): Set Active after no reconciliation                                                         
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.383Z INFO crucible-pantry (datafile): The guest has finished waiting for activation                                              
15:24:52.383Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d activated ok                                   
15:24:52.383Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d constructed and inserted ok                    
15:24:52.383Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 47954                                                                                                                    
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:38249                                                                                           
    req_id = a413a283-ac89-4558-b1cf-d50fd4a72ea5                                                                                         
    response_code = 200                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d                                                                  
15:24:53.069Z INFO crucible-pantry (dropshot): accepted connection                                                                        
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
15:24:53.071Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1526                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 8c73dbac-fe11-4432-b888-7dd051c86771                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write                                                       
15:24:53.545Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1692                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 99ae709b-c811-4ce5-9fdb-19fe31bd0b26                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write                                                       
15:24:53.706Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1730                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 01820eb1-7fc7-47f7-b338-12dae7ef2635                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write 

The bulk-write requests continue at the expected cadence, but then it appears that the pantry receives a request to disconnect from the downstairs:

15:25:12.018Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1537
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 6ad44cf3-71cf-41c4-b4b9-d0b88f807659
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.344Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1500
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 3c7332a6-6bf6-4fe4-8e70-46e5316f43a3
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.463Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1371
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = c26c225a-1195-4d75-8c43-e98323a1f180
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.600Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1759
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 9070f2c7-968d-4eda-88d6-2cf8c55a33a2
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.887Z INFO crucible-pantry (dropshot): accepted connection
    local_addr = [fd00:1122:3344:101::11]:17000
    remote_addr = [fd00:1122:3344:101::a]:63656
15:25:12.887Z INFO crucible-pantry (datafile): detach removing entry for volume edda9e6e-c95d-4488-a73d-ec5bf91a121d
15:25:12.887Z INFO crucible-pantry (datafile): detaching volume edda9e6e-c95d-4488-a73d-ec5bf91a121d
15:25:12.887Z INFO crucible-pantry (datafile): Request to deactivate this guest
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): checking for deactivation
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [0] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 0
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [1] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 1
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [2] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 2
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate due to state Active
     = downstairs
    client = 0
    session_id = daff5380-9074-4e67-9d63-2597c356c51f

So, now the question is who is requesting the disconnect, and why?

@leftwo
Contributor

leftwo commented Oct 21, 2024

In the Nexus log during the same window (also at /staff/cores/omicron-6771):

I see the disk created okay.

I see the bulk import start:

15:24:53.068Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Modify
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:65319
    req_id = b166c760-4e4c-40a4-956a-c461f965743a
    resource = Disk { parent: Project { parent: Silo { parent: Fleet, key: b1085af4-1d7d-4af7-a113-21237bc973eb, lookup_type: ById(b1085af4-1d7d-4af7-a113-21237bc973eb) }, key: 58985ee6-5289-4e53-89dd-08823170e4d9, lookup_type: ByName("alan") }, key: edda9e6e-c95d-4488-a73d-ec5bf91a121d, lookup_type: ByName("jammysource") }
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.069Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 10485760 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448

I see bulk writes going:

15:24:53.538Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    roles = RoleSet { roles: {(Silo, b1085af4-1d7d-4af7-a113-21237bc973eb, "admin")} }
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.542Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Modify
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    resource = Disk { parent: Project { parent: Silo { parent: Fleet, key: b1085af4-1d7d-4af7-a113-21237bc973eb, lookup_type: ById(b1085af4-1d7d-4af7-a113-21237bc973eb) }, key: 58985ee6-5289-4e53-89dd-08823170e4d9, lookup_type: ByName("alan") }, key: edda9e6e-c95d-4488-a73d-ec5bf91a121d, lookup_type: ByName("jammysource") }
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.542Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 119537664 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448
15:24:53.545Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 471595
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write?project=alan

Here are the last two bulk writes that Nexus sends:

15:25:12.461Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 333447168 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448
...
15:25:12.598Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 257425408 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448

Those look to line up with the pantry logs for the last two writes it receives.

Next, it looks like Nexus has received a request to stop the bulk writes:

15:25:12.614Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    roles = RoleSet { roles: {} }
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.615Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })   
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    resource = Database
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.617Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    roles = RoleSet { roles: {} }
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.617Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })   
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    resource = Database
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan

The remote addr 192.168.1.199 is the machine where I am running the import command, so possibly the CLI has requested the stop?

Eventually, we see Nexus send the detach to the pantry:

15:25:12.850Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): sending detach for disk edda9e6e-c95d-4488-a73d-ec5bf91a121d to endpoint http://[fd00:1122:3344:101::11]:17000
    file = nexus/src/app/sagas/common_storage.rs:103
    saga_id = 7db31a2e-75b9-449e-a2a4-85260b36b39b
    saga_name = finalize-disk

This detach lines up with when the pantry is told to detach the disk.

@leftwo
Contributor

leftwo commented Oct 21, 2024

My CLI version is:

EVT22200005 # oxide version  
Oxide CLI 0.7.0+20240821.0
Built from commit: b5a932c1cd8a3f6e7143eb6ce27a4dc4e277104c 
Oxide API: 20241009.0
