dsc failed to honor a client stop request #1390

Open
leftwo opened this issue Jul 22, 2024 · 3 comments

leftwo commented Jul 22, 2024

In a CI run, test-up-encrypted.sh hung while doing the replace-reconcile test
(original CI link, may be broken: https://github.com/oxidecomputer/crucible/pull/1387/checks?check_run_id=27758312216)

A copy of the logs from the buildomat instance is here: /staff/tmp/buildomat-crucible-20240722/
Looking at the dsc log:

/var/tmp/bins/dsc start --region-count 4 --output-dir /tmp/test_up/dsc --ds-bin /var/tmp/bins/crucible-downstairs --region-dir /var/tmp/test_up
Using existing output directory "/tmp/test_up/dsc"
start ds: 8810
start access at:127.0.0.1:9998
start ds: 8820
start ds: 8830
start ds: 8840
Starting downstairs at port 8820
Starting downstairs at port 8810
Make output file at "/tmp/test_up/dsc/downstairs-8810.txt"
Make output file at "/tmp/test_up/dsc/downstairs-8820.txt"
Starting downstairs at port 8830
Make output file at "/tmp/test_up/dsc/downstairs-8830.txt"
Starting downstairs at port 8840
Make output file at "/tmp/test_up/dsc/downstairs-8840.txt"
[8820][1] reports Starting
[8810][0] reports Starting
[8830][2] reports Starting
[8840][3] reports Starting

Normal so far: we are starting up four downstairs at the given locations.

Downstairs /var/tmp/test_up/8830 port 8830 PID:Some(1233)
[8830][2] reports Running
Downstairs /var/tmp/test_up/8840 port 8840 PID:Some(1235)
[8840][3] reports Running
Downstairs /var/tmp/test_up/8820 port 8820 PID:Some(1236)
[8820][1] reports Running
Jul 22 16:05:29.547 INFO listening, local_addr: 127.0.0.1:9998
Control access at:127.0.0.1:9998

However, only three downstairs make it to Running; we never get that message for 8810 (client 0).

A client then requests a stop of client 0:

Jul 22 16:05:34.651 INFO accepted connection, remote_addr: 127.0.0.1:36794, local_addr: 127.0.0.1:9998
Main task has work to do, go find it
got dsc Stop(0)
stop 0
Jul 22 16:05:34.653 INFO request completed, latency_us: 281, response_code: 204, uri: /stop/cid/0, method: POST, req_id: 1657e1e3-0d78-4c0d-8636-47673d11ec46, remote_addr: 127.0.0.1:36794, local_addr: 127.0.0.1:9998

But dsc does not stop it (I suspect dsc does not consider it running, so it has no action to take).
The client then comes back and keeps asking for the state, waiting for it to be stopped:

Jul 22 16:05:34.653 INFO request completed, latency_us: 231, response_code: 200, uri: /state/cid/0, method: GET, req_id: fd342e3b-d2af-42bb-bddd-65e61b47c55b, remote_addr: 127.0.0.1:36794, local_addr: 127.0.0.1:9998
Jul 22 16:05:38.656 INFO request completed, latency_us: 188, response_code: 200, uri: /state/cid/0, method: GET, req_id: 6ff20b93-9a5e-49cb-8b6e-dbf8e5fbb631, remote_addr: 127.0.0.1:36794, local_addr: 127.0.0.1:9998
Jul 22 16:05:42.658 INFO request completed, latency_us: 204, response_code: 200, uri: /state/cid/0, method: GET, req_id: d33a8c72-3ef9-469b-bd33-f2bbb2664a9d, remote_addr: 127.0.0.1:36794, local_addr: 127.0.0.1:9998

This stopped state never happens, so the test loops forever.
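
For reference, the wait loop on the client side amounts to something like the sketch below. This is illustrative only: it uses reqwest directly rather than the generated dsc client the test actually uses, and the assumption that /state/cid/{cid} returns a JSON-encoded state string is mine. The point is just that the loop cannot exit until dsc reports the client as stopped:

use std::time::Duration;

// Sketch only: wait_for_stop and its reqwest calls are illustrative, not the
// real crutest/dsc client code.
async fn wait_for_stop(cid: u32) -> anyhow::Result<()> {
    let client = reqwest::Client::new();
    let base = "http://127.0.0.1:9998";

    // Ask dsc to stop the downstairs; this is the POST that returned 204 above.
    client
        .post(format!("{base}/stop/cid/{cid}"))
        .send()
        .await?
        .error_for_status()?;

    // Poll the state endpoint until dsc reports the client as stopped.  If dsc
    // never transitions client 0, this loops forever, which is the hang we see.
    loop {
        let state: String = client
            .get(format!("{base}/state/cid/{cid}"))
            .send()
            .await?
            .json()
            .await?;
        if state.eq_ignore_ascii_case("stopped") {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(4)).await;
    }
}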

leftwo commented Jul 22, 2024

On the buildomat instance, we can see that the downstairs processes are all running:

root@w-01J3DJTGEXKW798452554KV3BK:~# ptree $(pgrep -x agent)
9      /lib/svc/bin/svc.startd
  1083   /usr/bin/ctrun -l child -o noorphan,regent /opt/buildomat/lib/agent run
    1084   /opt/buildomat/lib/agent run
      1138   /bin/bash /tmp/2as1nND5Cpns3oYl.sh
        1158   ptime -m bash /input/build/work/scripts/test_up.sh -N unencrypted
          1159   bash /input/build/work/scripts/test_up.sh -N unencrypted
            1231   /var/tmp/bins/dsc start --region-count 4 --output-dir /tmp/test_up/dsc --ds-bin
              1233   /var/tmp/bins/crucible-downstairs run -p 8830 -d /var/tmp/test_up/8830 --mode r
              1234   /var/tmp/bins/crucible-downstairs run -p 8810 -d /var/tmp/test_up/8810 --mode r
              1235   /var/tmp/bins/crucible-downstairs run -p 8840 -d /var/tmp/test_up/8840 --mode r
              1236   /var/tmp/bins/crucible-downstairs run -p 8820 -d /var/tmp/test_up/8820 --mode r
            1238   /var/tmp/bins/crutest replace-reconcile --replacement 127.0.0.1:8840 -c 4 -g 34

And, looking at the logs for both 8810 and 8820, they look to be the same:

alan@atrium:test_up$  cat dsc/downstairs-8810.txt | looker
16:05:29.554Z INFO crucible: Opened existing region file "/var/tmp/test_up/8810/region.json"
16:05:29.555Z INFO crucible: Database read version 1
16:05:29.555Z INFO crucible: Database write version 1
16:05:29.560Z INFO crucible: UUID: 12345678-0000-0000-0000-000000008810
16:05:29.560Z INFO crucible: Blocks per extent:100 Total Extents: 100
16:05:29.601Z INFO crucible: Crucible Version: Crucible Version: 0.0.1
    Commit SHA: c1d472b8361f1d379eb22df5d92f70e429541826
    Commit timestamp: 2024-07-22T15:46:54.000000000Z  branch: main
    rustc: 1.76.0 stable x86_64-unknown-illumos
    Cargo: x86_64-unknown-illumos  Debug: true Opt level: 0
    task = main
16:05:29.601Z INFO crucible: Upstairs <-> Downstairs Message Version: 10
    task = main
16:05:29.602Z INFO crucible: Repair listens on 0.0.0.0:12810 for path:"/var/tmp/test_up/8810"
    task = repair
16:05:29.602Z INFO crucible: listening
    local_addr = 0.0.0.0:12810
    task = repair
16:05:29.602Z INFO crucible: Using repair address: 0.0.0.0:12810
    task = main
16:05:29.602Z INFO crucible: No SSL acceptor configured
    task = main
16:05:29.602Z INFO crucible: downstairs listening on 0.0.0.0:8810
    task = main
16:05:34.551Z INFO crucible: accepted connection from 127.0.0.1:36877
    task = main
16:05:34.552Z INFO crucible: connection (127.0.0.1:36877): tasks spawned
    id = 0
16:05:34.552Z INFO crucible: Connection request from 891c432a-6380-4a2e-a3ae-aa01648d8db4 with version 10
16:05:34.552Z INFO crucible: upstairs UpstairsConnection { upstairs_id: 891c432a-6380-4a2e-a3ae-aa01648d8db4, session_id: c1d88655-e046-4575-a26e-032ca2ac9273, gen: 34 } connected, version 10
16:05:34.553Z INFO crucible: UpstairsConnection { upstairs_id: 891c432a-6380-4a2e-a3ae-aa01648d8db4, session_id: c1d88655-e046-4575-a26e-032ca2ac9273, gen: 34 } is now active (read-write)
16:05:34.554Z INFO crucible: Current flush_numbers [0..12]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
alan@atrium:test_up$ 
alan@atrium:test_up$ 
alan@atrium:test_up$  cat dsc/downstairs-8820.txt | looker
16:05:29.558Z INFO crucible: Opened existing region file "/var/tmp/test_up/8820/region.json"
16:05:29.559Z INFO crucible: Database read version 1
16:05:29.559Z INFO crucible: Database write version 1
16:05:29.565Z INFO crucible: UUID: 12345678-0000-0000-0000-000000008820
16:05:29.565Z INFO crucible: Blocks per extent:100 Total Extents: 100
16:05:29.607Z INFO crucible: Crucible Version: Crucible Version: 0.0.1
    Commit SHA: c1d472b8361f1d379eb22df5d92f70e429541826
    Commit timestamp: 2024-07-22T15:46:54.000000000Z  branch: main
    rustc: 1.76.0 stable x86_64-unknown-illumos
    Cargo: x86_64-unknown-illumos  Debug: true Opt level: 0
    task = main
16:05:29.607Z INFO crucible: Upstairs <-> Downstairs Message Version: 10
    task = main
16:05:29.608Z INFO crucible: Repair listens on 0.0.0.0:12820 for path:"/var/tmp/test_up/8820"
    task = repair
16:05:29.608Z INFO crucible: listening
    local_addr = 0.0.0.0:12820
    task = repair
16:05:29.608Z INFO crucible: Using repair address: 0.0.0.0:12820
    task = main
16:05:29.608Z INFO crucible: No SSL acceptor configured
    task = main
16:05:29.608Z INFO crucible: downstairs listening on 0.0.0.0:8820
    task = main
16:05:34.551Z INFO crucible: accepted connection from 127.0.0.1:64343
    task = main
16:05:34.552Z INFO crucible: connection (127.0.0.1:64343): tasks spawned
    id = 0
16:05:34.552Z INFO crucible: Connection request from 891c432a-6380-4a2e-a3ae-aa01648d8db4 with version 10
16:05:34.552Z INFO crucible: upstairs UpstairsConnection { upstairs_id: 891c432a-6380-4a2e-a3ae-aa01648d8db4, session_id: c1d88655-e046-4575-a26e-032ca2ac9273, gen: 34 } connected, version 10
16:05:34.553Z INFO crucible: UpstairsConnection { upstairs_id: 891c432a-6380-4a2e-a3ae-aa01648d8db4, session_id: c1d88655-e046-4575-a26e-032ca2ac9273, gen: 34 } is now active (read-write)
16:05:34.554Z INFO crucible: Current flush_numbers [0..12]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Which suggests that the downstairs did start up correctly.

leftwo commented Jul 22, 2024

So, this is odd. From dsc/src/main.rs, line 256:

    fn start(&self) -> Result<Child> {
        println!("Make output file at {:?}", self.output_file);

We see that log message for all four downstairs.

        let outputs = File::create(&self.output_file)
            .context("Failed to create output file")?;
        let errors = outputs.try_clone()?;

        let port_value = format!("{}", self.port);

        let mode = if self.read_only {
            "ro".to_string()
        } else {
            "rw".to_string()
        };
    
        let region_dir = self.region_dir.clone();
        let cmd = Command::new(self.ds_bin.clone())
            .args([
                "run",
                "-p",
                &port_value,
                "-d",
                &region_dir,
                "--mode", 
                &mode,
            ])
            .stdout(Stdio::from(outputs))
            .stderr(Stdio::from(errors))
            .spawn()
            .context("Failed trying to run downstairs")?;

        println!(
            "Downstairs {} port {} PID:{:?}",
            region_dir,
            self.port,
            cmd.id()
        );  

But we only see this second log message for clients 1, 2, and 3, not for client 0.

But, the process listing does show downstairs client 0 (8810) as running.
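
One way to confirm where the start task is stuck would be to put a watchdog around the spawn itself. The sketch below is not existing dsc code; it assumes the multi-threaded tokio runtime dsc already uses, so the watchdog task can still run while spawn() is blocked:

use std::time::Duration;
use tokio::process::Command;

// Illustrative only: log a warning if Command::spawn() has not returned after a
// few seconds, so a hang like the one on client 0 shows up in the dsc output
// instead of silently wedging the start task.
async fn spawn_with_watchdog(mut cmd: Command, port: u32) -> anyhow::Result<tokio::process::Child> {
    let watchdog = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(5)).await;
        println!("spawn for downstairs {port} still pending after 5 seconds");
    });

    // This is the call that apparently never returned for client 0 (port 8810).
    let child = cmd.spawn()?;

    // spawn() came back; cancel the watchdog so it never fires.
    watchdog.abort();
    Ok(child)
}

With something like that in place, the dsc log would at least name the client whose spawn is wedged.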

leftwo commented Jul 22, 2024

A pstack on the buildomat host does show one dsc thread waiting for a Command::spawn() to return:

----------------- thread# 10 / lwp# 10 [dsc] -----------------
 fffffc7fef17b2aa read     (18, fffffc7febbfbec0, 8)
 00000000038a884b std::sys::unix::process::process_inner::<impl std::sys::unix::process::process_common::Command>::spawn::h17c9e8acc38ad655 () + 29b
 000000000389723c std::process::Command::spawn::hb33ada688364cbc0 () + 1c
 00000000034b97c9 tokio::process::imp::spawn_child::h5f8f525c15677a49 () + 29
 00000000034d7740 tokio::process::Command::spawn::h30da75f913476e54 () + 30
 0000000001d6729c dsc::DownstairsInfo::start::hf3f0ac949106f123 () + 3fc
 0000000001d2bc92 dsc::start_ds::{{closure}}::h349f7c3cfcd381c8 () + 2b2
 0000000001d2c07e dsc::ds_start_monitor::{{closure}}::hc0bf488ec21ff1e4 () + 1ae
 0000000001d2b943 dsc::start_dsc::{{closure}}::{{closure}}::h4ace2b8c32fa0c56 () + e3
 0000000001ca8856 tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::hd054d50670ef0835 () + 86
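
That read() at the top of the stack is presumably the parent side of std's fork/exec handshake: the parent reads from a close-on-exec pipe, and spawn() only returns once the child either execs (the pipe closes and the read sees EOF) or writes back an exec error. A rough sketch of that pattern, as an assumption about what the blocked thread is doing rather than a copy of the std source:

use std::ffi::CStr;
use std::io;

// Approximation of the fork/exec error-reporting handshake, for illustration
// only (not the actual std implementation; argv is assumed null-terminated).
unsafe fn spawn_handshake_sketch(
    path: &CStr,
    argv: &[*const libc::c_char],
) -> io::Result<libc::pid_t> {
    let mut fds = [0i32; 2];
    if libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC) != 0 {
        return Err(io::Error::last_os_error());
    }
    let (read_fd, write_fd) = (fds[0], fds[1]);

    match libc::fork() {
        -1 => Err(io::Error::last_os_error()),
        0 => {
            // Child: exec.  On success the close-on-exec write end goes away;
            // on failure, report errno back to the parent through the pipe.
            libc::execv(path.as_ptr(), argv.as_ptr());
            let errno = io::Error::last_os_error().raw_os_error().unwrap_or(0);
            libc::write(write_fd, &errno as *const i32 as *const libc::c_void, 4);
            libc::_exit(127);
        }
        pid => {
            // Parent: drop our copy of the write end, then block in read().
            // EOF means the exec succeeded; data means it failed.  If the write
            // end somehow stays open, this read never returns and spawn() hangs
            // even though the child is running.
            libc::close(write_fd);
            let mut buf = [0u8; 4];
            let n = libc::read(read_fd, buf.as_mut_ptr() as *mut libc::c_void, 4);
            libc::close(read_fd);
            if n == 0 {
                Ok(pid)
            } else {
                Err(io::Error::from_raw_os_error(i32::from_ne_bytes(buf)))
            }
        }
    }
}

If something is holding the write end of that pipe open, the read blocks indefinitely even though the child (downstairs 8810) is running, which would match both the process tree and the stack above.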
