Skip to content

Merge feature/vcpu-runnable to feature/numa8 #6540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 245 commits into
base: feature/numa8
Choose a base branch
from

Conversation

gangj
Copy link
Contributor

@gangj gangj commented Jun 18, 2025

Merge feature/vcpu-runnable as there were some little tweaks for RRD1: #6530

edwintorok and others added 30 commits March 4, 2025 18:07
Create a XAPI database with a number of VMs/VDIs/VBDs and measure how long update_allowed_operations takes.
Can't really use 2400 VMs here yet, because even with 240 VMs takes ~15s to initialize the test.

```
╭─────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                             │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  update_allowed_operations/VDI  │         10145.1862 mjw/run│       7412588.8431 mnw/run│       53244625.3769 ns/run│
╰─────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

update_allowed_operations/VDI (ns):
 { monotonic-clock per run = 53244625.376923 (confidence: 53612028.619469 to 53103082.519729);
   r² = Some 0.992851 }
```

Signed-off-by: Edwin Török <[email protected]>
```
╭─────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                             │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  update_allowed_operations/VDI  │         10096.0354 mjw/run│       7412629.8723 mnw/run│       53075833.0400 ns/run│
╰─────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

update_allowed_operations/VDI (ns):
 { monotonic-clock per run = 53075833.040000 (confidence: 53469156.908088 to 52924201.003476);
   r² = Some 0.991453 }
```

Signed-off-by: Edwin Török <[email protected]>
O(log n) instead of O(n) complexity. Also filtering can be done more efficiently.

```
╭─────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                             │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  update_allowed_operations/VDI  │          9874.6062 mjw/run│       7412584.1277 mnw/run│       51364914.0200 ns/run│
╰─────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

update_allowed_operations/VDI (ns):
 { monotonic-clock per run = 51364914.020000 (confidence: 51756944.767313 to 51194994.874696);
   r² = Some 0.990799 }
```

On this test ~2-3% improvement (potentially more on larger lists).

Signed-off-by: Edwin Török <[email protected]>
Perform the cheaper check first, so that it will short-circuit the evaluation when false.

```
╭─────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                             │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  update_allowed_operations/VDI  │          9884.5615 mjw/run│       7412534.5215 mnw/run│       53189355.8923 ns/run│
╰─────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

update_allowed_operations/VDI (ns):
 { monotonic-clock per run = 53189355.892308 (confidence: 53722938.915014 to 52908047.166446);
   r² = Some 0.992578 }
```

Signed-off-by: Edwin Török <[email protected]>
Activate old xapi_vdi.update_allowed_operations optimization: 
get_internal_records_where does a linear scan currently, so operating on N VDIs is O(N^2).

Look at the VBD records directly, like before this 2013 commit which regressed it:
5097475

(We are going to optimize get_record separately so it doesn't go through serialization)

For now only do this when run on the coordinator to avoid potentially large number of VBD
round-trip database fetches.
We'll need to optimize the 'get_internal_record_where' later to also speed up the pool case.

```
╭─────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                             │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  update_allowed_operations/VDI  │          9205.8042 mjw/run│        964577.0228 mnw/run│        2868770.0725 ns/run│
╰─────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

update_allowed_operations/VDI (ns):
 { monotonic-clock per run = 2868770.072546 (confidence: 2947963.590731 to 2834338.835371);
   r² = Some 0.404284 }
```

Compared to the previous commit this is 18x faster.

Signed-off-by: Edwin Török <[email protected]>
…onfigure

Add new host object fields:
  - ssh_enabled
  - ssh_enabled_timeout
  - ssh_expiry
  - console_idle_timeout
Add new host/pool API to enable to set a temporary enabled SSH service timeout
  - set_ssh_enabled_timeout
Add new host/pool API to enable to set console timeout
  - set_console_idle_timeout

Signed-off-by: Lunfan Zhang <[email protected]>
This PR introduces support for Dom0 SSH control, providing the following
capabilities:

Query the SSH status.
Configure a temporary SSH enable timeout for a specific host or all
hosts in the pool.
Configure the console idle timeout for a specific host or all hosts in
the pool.

Changes
New Host Object Fields:

- `ssh_enabled`: Indicates whether SSH is enabled.
- `ssh_enabled_timeout`: Specifies the timeout for temporary SSH
enablement.
- `ssh_expiry`: Tracks the expiration time for temporary SSH enablement.
- `console_idle_timeout`: Configures the idle timeout for the console.

New Host/Pool APIs (This PR only include the change of data model, the
implementation of this API will be include in the next PR):

- `set_ssh_enabled_timeout`: Allows setting a temporary timeout for
enabling the SSH service.
- `set_console_idle_timeout`: Allows configuring the console idle
timeout.
Add "Changed" records for 2 APIs which were missed.

Fix "param_release" for 3 added parameters.

Signed-off-by: Gang Ji <[email protected]>
During pool join, create a new host obj in the remote pool coordinator
DB with the same SSH settings as pool coordinator.

Also configure SSH service locally before xapi restart which will
persist after xapi restart.

Signed-off-by: Gang Ji <[email protected]>
This consists of two parts, splitting the attach_and_activate into
functionally equivalent attach and activate functions, and splitting the
VBD_plug atomic itself's behaviour into two new atomics, VBD_attach and
VBD_activate.

If the new xenopsd_vbd_plug_unplug_legacy flag is true, the only
difference will be that VBD_plug calls attach and activate sequentially
instead of attach_and_activate.

If xenopsd_vbd_plug_unplug_legacy is false, the VBD_attach and
VBD_activate atomics will be used in place of VBD_plug in all places
that it is used. This should still be functionally equivalent.

The purpose of this change is to allow optimisations in this area as
there are some situations where they do not need to be called at the
same time. For example VBD_attach could be called outside of VM migrate
downtime to reduce the overall downtime.

Signed-off-by: Steven Woods <[email protected]>
This consists of two parts, splitting the unplug function into
functionally equivalent deactivate and detach functions, and splitting
the VBD_unplug atomic itself's behaviour into two new atomics:
VBD_deactivate and VBD_detach

If the xenopsd_vbd_plug_unplug_legacy flag is true, the only difference
will be that VBD_unplug calls deactivate and detach sequentially instead
of unplug

If xenopsd_vbd_plug_unplug_legacy is false, the VBD_deactivate and
VBD_detach atomics will be used in place of VBD_unplug in all places
that it is used. This should still be functionally equivalent.

The purpose of this change is to allow optimisations in this area as
there are some situations where they do not need to be called at the
same time. For example we could skip detaching on reboot and only
deactivate and activate again reducing reboot time.

Signed-off-by: Steven Woods <[email protected]>
Useless here, the local DB will be dropped soon as the joinner will
switch to the remote DB of the new coordinator.

And latest_synced_updates_applied will be set to `unknown in host.create
in remote DB as default value.

Signed-off-by: Gang Ji <[email protected]>
After being ejected from a pool, a new host obj will be created with
default settings in DB.

This commit configures SSH service in the ejected host to default state
during pool eject.

Signed-off-by: Gang Ji <[email protected]>
Implemented XAPI APIs:
  - `host.set_console_idle_timeout`
  - `pool.set_console_idle_timeout`

These APIs allow XAPI to configure timeout for idle console sessions.

Signed-off-by: Lunfan Zhang <[email protected]>
Implemented XAPI APIs:
  - `host.set_ssh_enabled_timeout`
  - `pool.set_ssh_enabled_timeout`
These APIs allow XAPI to configure timeout for SSH service.
`host.enable_ssh` now also supports enabling the SSH service with a ssh_enabled_timeout

Signed-off-by: Lunfan Zhang <[email protected]>
Updated `records.ml` file to support `host-param-set/get/list` and `pool-param-set/get/list` for SSH-related fields.

Signed-off-by: Lunfan Zhang <[email protected]>
Implemented XAPI APIs:
  - `set_ssh_enabled_timeout`
  - `set_console_idle_timeout`

These APIs allow XAPI to configure timeouts for the SSH service and idle
console sessions from both host and pool level.

Updated `records.ml` to support `host-param-set/get/list` and
`pool-param-set/get/list` for SSH-related fields.
The error set_console_idle_timeout_failed was added in feature branch
while it is not used anywhere. The error used in
set_console_idle_timeout now is invalid_value.

Signed-off-by: Gang Ji <[email protected]>
Signed-off-by: Zeroday BYTE <[email protected]>
 - Ensure host.enabled_ssh reflects the actual SSH service state on startup, in case it was manually changed by the user.

 - Reschedule the "disable SSH" job if:
   - host.ssh_enabled_timeout is set to a positive value, and
   - host.ssh_expiry is in the future.

 - Disable the SSH if:
   - host.ssh_enabled_timeout is set to a positive value, and
   - host.ssh_expiry is in the past.

Signed-off-by: Lunfan Zhang <[email protected]>
psafont and others added 27 commits June 10, 2025 13:46
Unfortunately mirage-crypto has accumulated breaking changes:
- Cstructs have been replaced with strings
- The digestif library has replaced ad-hoc hash implementation

A deprecation has happened as well:
- RNG initialization has changed

Because there are breaking changes, xs-opam changes need to be
introduced at the same time:
xapi-project/xs-opam#731

Only xapi is affected by the breaking builds, so no other toolstack
repositories have incoming PRs.
I've tested builds with  Smoke and validation tests: SR 218740

This means that the merge will be done as such:
- Both PRs are approved
- First I will merge xs-opam will be emrged (with failing CI)
- Then this PR will be merged with the merge train that runs tests
before actually merging.
- CI is rerun manually on xs-opam to make it green again

After merging both xenserver's CI should create a successful build with
both PR included
The /sys/fs/cgroup/systemd/cgroup.procs file is not always present,
particularly in updated Linux systems with newer cgroup and SystemD. So
fallback to root /sys/fs/cgroup/cgroup.procs.
Also handle and report errors back to Ocaml.

Although SystemD discourage handling cgroups without service
configuration changes the root cgroup is a bit special as receiving
processes from multiple sources, including the kernel.
Also fix the Makefile, so that 'make clean' also deletes the `.o.d` files.

This avoids accidentally adding these files to git
(although normally dune would invoke make in _build,
 only if you manually invoke it would it create these extra files):
```
A ocaml/forkexecd/helper/close_from.o
A ocaml/forkexecd/helper/close_from.o.d
A ocaml/forkexecd/helper/syslog.o
A ocaml/forkexecd/helper/syslog.o.d
A ocaml/forkexecd/helper/vfork_helper
A ocaml/forkexecd/helper/vfork_helper.o
A ocaml/forkexecd/helper/vfork_helper.o.d
```

Signed-off-by: Edwin Török <[email protected]>
Also fix the Makefile, so that 'make clean' also deletes the `.o.d`
files.

This avoids accidentally adding these files to git (although normally
dune would invoke make in _build,
 only if you manually invoke it would it create these extra files):
```
A ocaml/forkexecd/helper/close_from.o
A ocaml/forkexecd/helper/close_from.o.d
A ocaml/forkexecd/helper/syslog.o
A ocaml/forkexecd/helper/syslog.o.d
A ocaml/forkexecd/helper/vfork_helper
A ocaml/forkexecd/helper/vfork_helper.o
A ocaml/forkexecd/helper/vfork_helper.o.d
```
on every toolstack restart on hosts without Nvidia GPUs, xapi complains about a
non-existent directory:

xapi: [error||0 |dbsync (update_env) R:733fc2551767|xapi_vgpu_type]
Failed to create NVidia compat config_file: Sys_error("/usr/share/nvidia/vgx: No such file or directory")

Handle the directory's absence without propagating the error.

Signed-off-by: Andrii Sultanov <[email protected]>
…_line

Otherwise this quite frequently logs something like:

```
xcp-networkd: [error||22 |dbsync (update_env)|network_utils]
Error in read one line of file: /sys/class/net/eth0/device/sriov_totalvfs,
exception Unix.Unix_error(Unix.ENOENT, "open", "/sys/class/net/eth0/device/sriov_totalvfs")
Backtrace ...
```

Signed-off-by: Andrii Sultanov <[email protected]>
non_debug_receive will dump an error after reading the last bits of the header,
which is expected and handled by the caller appropriately:

```
xenopsd-xc: [error||67 |Async.VM.resume R:beac7be348f1|xenguesthelper]
Memory F 6019464 KiB S 0 KiB T 8183 MiB <--- dumping error

xenopsd-xc: [debug||67 |Async.VM.resume R:beac7be348f1|mig64]
Finished emu-manager result processing <---- End_of_file expected and handled
```

Don't pollute the logs and instead just log the same info with 'debug' when the
error is End_of_file.

Signed-off-by: Andrii Sultanov <[email protected]>
It's an in-house joke that most of `xapi`'s errors are "expected"
(either because they're handled by the caller after being logged or
because they are frequent and benign) and knowing which ones are unusual
and actually critical takes a lot of learning. Try to remove at least
some of the most obvious ones from the toolstack startup/VM resume,
either removing the logs completely or downgrading it to `debug` when
appropriate.

Manually tested in the appropriate scenarios.
…ions

While one could potentially filter for this "value", I don't think it's that
useful and adds noise to the completions, like here:

```
$ xe vm-list resident-on=
64c11cad-2c52-4dea-aea6-5fae0e720699
\<not\ in\ database\>
7f566729-0ee7-47c4-853d-2c5f3a195ad4
```

Signed-off-by: Andrii Sultanov <[email protected]>
log needs to be moved one line below the first assignment into the
"description" variable, otherwise it's always going to be an empty string

Signed-off-by: Andrii Sultanov <[email protected]>
We used to rely on Bash's completion removing duplicate entries from the
suggestions, but processing them in the first place is unnecessary (and will
slow down completion since there's usually an 'xe' command run for each entry
in the wordlist).

Signed-off-by: Andrii Sultanov <[email protected]>
Provide a helpful description for some parameters of 'xe vm-list', compare
before:
```
$ xe vm-list resident-on=
64c11cad-2c52-4dea-aea6-5fae0e720699
7f566729-0ee7-47c4-853d-2c5f3a195ad4
```

with after:
```
$ xe vm-list resident-on=
64c11cad-2c52-4dea-aea6-5fae0e720699  - hpmc30
7f566729-0ee7-47c4-853d-2c5f3a195ad4  - hpmc29
```

Signed-off-by: Andrii Sultanov <[email protected]>
No completion was provided before, and it's handled properly now:
```
$ xe vm-list suspend-SR-uuid=
08906228-cbf6-dad4-720d-e581df11510a  - SR1
37b734f0-e594-0e48-2114-cd063241dd36  - SR2
```

Signed-off-by: Andrii Sultanov <[email protected]>
Handle a few more cases, fix logging, reduce slowdown
from https://en.wikipedia.org/wiki/Passwd#Password_file
uid_info as following format

username:password:uid:gid:gecos:homedir:shell

Regarding gecos, it is recommended as follows

Typically, this is a set of comma-separated values including the user's
full name and contact details.

However, this information comes form AD and user may mis-configure
it with `:`, which is used as seperator. In such case, the parse
would failed.

Enhance the parse function to support `:` in gecos, other fields does
not likely contain it.

Signed-off-by: Lin Liu <[email protected]>
from https://en.wikipedia.org/wiki/Passwd#Password_file uid_info as
following format

username:password:uid:gid:gecos:homedir:shell

Regarding gecos, it is recommended as follows

Typically, this is a set of comma-separated values including the user's
full name and contact details.

However, this information comes form AD and user may mis-configure it
with `:`, which is used as seperator. In such case, the parse would
failed.

Enhance the parse function to support `:` in gecos, other fields does
not likely contain it.
Instruments:
- `VM.add`,
- `VM.stat`,
- `VM.exists`,
- `VM.list`.

Signed-off-by: Gabriel Buica <[email protected]>
Instruments `switch_rpc` according to OpenTelemetry standard on
instrumenting rpc calls.
- `server.address` is the name of the message queue.

Intruments sending the message on a queue according to OpenTelemetry
standard on instrumenting messaging.
- `destination` is the name of the message queue.

`Tracing.with_tracing` now accepts an optional argument to set the Span
Kind.

Signed-off-by: Gabriel Buica <[email protected]>
Instrument xenops vm non-atomic functions.
Instruments:
- `VM.add`,
- `VM.stat`,
- `VM.exists`,
- `VM.list`.

Instruments `switch_rpc` according to OpenTelemetry standard on
instrumenting rpc calls.
- `server.address` is the name of the message queue.

Intruments sending the message on a queue according to OpenTelemetry
standard on instrumenting messaging.
- `destination` is the name of the message queue.

`Tracing.with_tracing` now accepts an optional argument to set the Span
Kind.

![image](https://github.com/user-attachments/assets/b58a3309-6e31-4e40-84a5-c053bc00d5c8)
Maintenance mode is entered by running Host.evacuate, followed
by promoting a new pool coordinator and shutting down XAPI.

We only export spans every 30s, so we may miss exporting the span for Host.evacuate.
Ensure that we at least trigger the export when XAPI is about to shutdown.
Do not wait for the export to finish, because this could take a long time
(e.g. when exporting to a remote Jaeger instance).

After this change I now see Host.evacuate properly in the exported trace.

Signed-off-by: Edwin Török <[email protected]>
json_reformat cannot handle newline delimited json, it is easier if we have a command to reformat it ourselves.

This can be useful when debugging why a trace is missing elements. Traces are stored as newline-delimited JSON
in /var/log/dt/zipkinv2/json, however json_reformat cannot process them directly, and the lines can be very long and difficult to read otherwise.

Signed-off-by: Edwin Török <[email protected]>
#6525)

Soon after Host.evacuate XAPI could be restarted (e.g. on coordinator
promotion).
But we only export traces every 30s, so we lose the spans from the last
30s, including the toplevel Host.evacuate span (which although long
running is only emitted on completion).

After this change I'm now able to see Host.evacuate and all the migrate
calls in the exported distributed trace.
The renaming is done to resolve one problem with rrd2csv: rrd2csv
selected both RRD1("runnable") and RRD2("runnable_vcpus") when the RRD1
name is used:

> rrd2csv AVERAGE:vm:<vm-uuid>:runnable
> timestamp, AVERAGE:vm:<vm-uuid>:runnable, AVERAGE:vm:<vm-uuid>:runnable_vcpus

This is because "runnable" is a prefix of "runnable_vcpus".

After the renaming, with rrd2csv:
* can select only RRD1 if we use:
  rrd2csv AVERAGE:vm:<vm-uuid>:runnable_any
* can select only RRD2 if we use:
  rrd2csv AVERAGE:vm:<vm-uuid>:runnable_vcpus
* can select both RRD1 and RRD2 if we use:
  rrd2csv AVERAGE:vm:<vm-uuid>:runnable

And the renaming also makes it clearer as the 'runnable' metric is % of
time that at least one vCPU of the domain is in the runnable state.

Signed-off-by: Gang Ji <[email protected]>
Follow the fix here:
#6493

Signed-off-by: Gang Ji <[email protected]>
@gangj gangj requested a review from mg12 June 18, 2025 04:20
@psafont
Copy link
Member

psafont commented Jun 18, 2025

I think a merge from master to the feature branch would be much better (I actually would prefer a rebase, seeing that this branch is only two patches right now)

@gangj
Copy link
Contributor Author

gangj commented Jun 19, 2025

I think a merge from master to the feature branch would be much better (I actually would prefer a rebase, seeing that this branch is only two patches right now)

Thanks for the review. But feature/vcpu-runnable has not been merged into master yet.
@mg12 , Or do we really need the feature/numa8 branch? As the recent decision is that: only RRD1 and RRD2 will be delivered for XS8, and feature/vcpu-runnable will include all code change for RRD1 and RRD2, I think then feature/numa8 will have no difference to feature/vcpu-runnable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.