diff --git a/doc/content/toolstack/features/DR/dr.png b/doc/content/toolstack/features/DR/dr.png new file mode 100644 index 0000000000..6c91120965 Binary files /dev/null and b/doc/content/toolstack/features/DR/dr.png differ diff --git a/doc/content/toolstack/features/DR/index.md b/doc/content/toolstack/features/DR/index.md new file mode 100644 index 0000000000..a958cb9f27 --- /dev/null +++ b/doc/content/toolstack/features/DR/index.md @@ -0,0 +1,29 @@ ++++ +title = "Disaster Recovery" ++++ + +The [HA](../HA/HA.html) feature will restart VMs after hosts have failed, but what +happens if a whole site (e.g. datacenter) is lost? A disaster recovery +configuration is shown in the following diagram: + +![Disaster recovery maintaining a secondary site](dr.png) + +We rely on the storage array's built-in mirroring to replicate (synchronously +or asynchronously: the admin's choice) between the primary and the secondary +site. When DR is enabled the VM disk data and VM metadata are written to the +storage server and mirrored. The secondary site contains the other side +of the data mirror and a set of hosts, which may be powered off. + +In normal operation, the DR feature allows a "dry-run" recovery where a host +on the secondary site checks that it can indeed see all the VM disk data +and metadata. This should be done regularly, so that admins are familiar with +the process. + +After a disaster, the admin breaks the mirror on the secondary site and triggers +a remote power-on of the offline hosts (either using an out-of-band tool or + the built-in host power-on feature of xapi). The pool master on the secondary +site can connect to the storage and extract all the VM metadata. Finally the +VMs can all be restarted. + +When the primary site is fully recovered, the mirror can be re-synchronised +and the VMs can be moved back. 
diff --git a/doc/content/toolstack/features/HA/HA.configure.msc b/doc/content/toolstack/features/HA/HA.configure.msc new file mode 100644 index 0000000000..43eafd7ff9 --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.configure.msc @@ -0,0 +1,11 @@ +participant slave1 +Note over master: Host.enable_ha\nchoose an SR\nfind or create VDIs\nattach VDIs\nwrite xhad.conf\nha_set_pool_state init +master->slave1: Host.preconfigure_ha +Note over slave1: attach VDIs\nwrite xhad.conf\n +master->slave2: Host.preconfigure_ha +Note over slave2: attach VDIs\nwrite xhad.conf\n +master->slave1: Host.ha_join_liveset +master->slave2: Host.ha_join_liveset +Note over master: ha_propose_master +slave1-->master: wait for master +slave2-->master: wait for master diff --git a/doc/content/toolstack/features/HA/HA.configure.svg b/doc/content/toolstack/features/HA/HA.configure.svg new file mode 100644 index 0000000000..f00fc1e4b6 --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.configure.svg @@ -0,0 +1,11 @@ +participant slave1 +Note over master: Host.enable_ha\nchoose an SR\nfind or create VDIs\nattach VDIs\nwrite xhad.conf\nha_set_pool_state init +master->slave1: Host.preconfigure_ha +Note over slave1: attach VDIs\nwrite xhad.conf\n +master->slave2: Host.preconfigure_ha +Note over slave2: attach VDIs\nwrite xhad.conf\n +master->slave1: Host.ha_join_liveset +master->slave2: Host.ha_join_liveset +Note over master: ha_propose_master +slave1-->master: wait for master +slave2-->master: wait for masterCreated with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/toolstack/features/HA/HA.disable.clean.msc b/doc/content/toolstack/features/HA/HA.disable.clean.msc new file mode 100644 index 0000000000..290ea23e3f --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.disable.clean.msc @@ -0,0 +1,6 @@ +Master Xapi->Master Xhad: ha_set_pool_state Invalid +Master Xhad->Master Xapi: OK +Note over Slave Xhad: heartbeat thread notices\ninvalid state and disarms +Slave 
Xapi-->Slave Xhad: ha_query_liveset +Slave Xhad-->Slave Xapi: Invalid +Note over Slave Xapi: disable HA, cleanup diff --git a/doc/content/toolstack/features/HA/HA.disable.clean.svg b/doc/content/toolstack/features/HA/HA.disable.clean.svg new file mode 100644 index 0000000000..6d47269df2 --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.disable.clean.svg @@ -0,0 +1,7 @@ +Master Xapi->Master Xhad: ha_set_pool_state Invalid +Master Xhad->Master Xapi: OK +Note over Slave Xhad: heartbeat thread notices\ninvalid state and disarms +Slave Xapi-->Slave Xhad: ha_query_liveset +Slave Xhad-->Slave Xapi: Invalid +Note over Slave Xapi: disable HA, cleanup +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/toolstack/features/HA/HA.disable.unclean.msc b/doc/content/toolstack/features/HA/HA.disable.unclean.msc new file mode 100644 index 0000000000..3e5881329b --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.disable.unclean.msc @@ -0,0 +1,5 @@ +Note over Xapi: disable HA recovery\nlogic; user has manual\ncontrol +Xapi->Xhad: ha_disarm_fencing +Xhad->Xapi: OK +Xapi->Xhad: ha_stop_daemon +Xhad->Xapi: OK diff --git a/doc/content/toolstack/features/HA/HA.disable.unclean.svg b/doc/content/toolstack/features/HA/HA.disable.unclean.svg new file mode 100644 index 0000000000..1312851148 --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.disable.unclean.svg @@ -0,0 +1,6 @@ +Note over Xapi: disable HA recovery\nlogic; user has manual\ncontrol +Xapi->Xhad: ha_disarm_fencing +Xhad->Xapi: OK +Xapi->Xhad: ha_stop_daemon +Xhad->Xapi: OK +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/toolstack/features/HA/HA.shutdown.msc b/doc/content/toolstack/features/HA/HA.shutdown.msc new file mode 100644 index 0000000000..1b34d3316b --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.shutdown.msc @@ -0,0 +1,8 @@ +Note over Xapi: all VMs shutdown\nall VDIs unlocked +Xapi->Xhad: ha_disarm_fencing +Xhad->Xapi: OK 
+Xapi->Xhad: ha_stop_daemon +Xhad->Xapi: OK +Note over Xhad: daemon exits +Xapi->Statefile: ha_set_excluded +Note over Statefile: host will not be included\nin liveset calculations until\nafter reboot diff --git a/doc/content/toolstack/features/HA/HA.shutdown.svg b/doc/content/toolstack/features/HA/HA.shutdown.svg new file mode 100644 index 0000000000..a751cbb8a7 --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.shutdown.svg @@ -0,0 +1,8 @@ +Note over Xapi: all VMs shutdown\nall VDIs unlocked +Xapi->Xhad: ha_disarm_fencing +Xhad->Xapi: OK +Xapi->Xhad: ha_stop_daemon +Xhad->Xapi: OK +Note over Xhad: daemon exits +Xapi->Statefile: ha_set_excluded +Note over Statefile: host will not be included\nin liveset calculations until\nafter rebootCreated with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/toolstack/features/HA/HA.start.msc b/doc/content/toolstack/features/HA/HA.start.msc new file mode 100644 index 0000000000..a73c28139b --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.start.msc @@ -0,0 +1,12 @@ +Xapi->Xhad: ha_start_daemon +Note over Xhad: Starts talking to other hosts\nto form or join the liveset +Xapi-->Xhad: ha_query_liveset +Xhad-->Xapi: Starting +Note over Xhad: liveset joined and\nexcluded flag cleared +Xhad->Xapi: OK +Xapi-->Xhad: ha_query_liveset +Xhad-->Xapi: Online +Note over Xapi: If starting HA and am a master\n already or if responding to a failure\nwhere the master may have failed. 
+Xapi->Xhad: ha_propose_master +Note over Xhad: at most one host can be a master +Xhad->Xapi: TRUE/FALSE diff --git a/doc/content/toolstack/features/HA/HA.start.svg b/doc/content/toolstack/features/HA/HA.start.svg new file mode 100644 index 0000000000..68292c4a4a --- /dev/null +++ b/doc/content/toolstack/features/HA/HA.start.svg @@ -0,0 +1,13 @@ +Xapi->Xhad: ha_start_daemon +Note over Xhad: Starts talking to other hosts\nto form or join the liveset +Xapi-->Xhad: ha_query_liveset +Xhad-->Xapi: Starting +Note over Xhad: liveset joined and\nexcluded flag cleared +Xhad->Xapi: OK +Xapi-->Xhad: ha_query_liveset +Xhad-->Xapi: Online +Note over Xapi: If starting HA and am a master\n already or if responding to a failure\nwhere the master may have failed. +Xapi->Xhad: ha_propose_master +Note over Xhad: at most one host can be a master +Xhad->Xapi: TRUE/FALSE +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/toolstack/features/HA/ha.png b/doc/content/toolstack/features/HA/ha.png new file mode 100644 index 0000000000..c7661ca506 Binary files /dev/null and b/doc/content/toolstack/features/HA/ha.png differ diff --git a/doc/content/toolstack/features/HA/index.md b/doc/content/toolstack/features/HA/index.md new file mode 100644 index 0000000000..45918ac926 --- /dev/null +++ b/doc/content/toolstack/features/HA/index.md @@ -0,0 +1,849 @@ ++++ +title = "High-Availability" ++++ + +High-Availability (HA) tries to keep VMs running, even when there are hardware +failures in the resource pool, when the admin is not present. Without HA +the following may happen: + +- during the night someone spills a cup of coffee over an FC switch; then +- VMs running on the affected hosts will lose access to their storage; then +- business-critical services will go down; then +- monitoring software will send a text message to an off-duty admin; then +- the admin will travel to the office and fix the problem by restarting + the VMs elsewhere. 
+ +With HA the following will happen: + +- during the night someone spills a cup of coffee over an FC switch; then +- VMs running on the affected hosts will lose access to their storage; then +- business-critical services will go down; then +- the HA software will determine which hosts are affected and shut them down; then +- the HA software will restart the VMs on unaffected hosts; then +- services are restored; then *on the next working day* +- the admin can arrange for the faulty switch to be replaced. + +HA is designed to handle an emergency and allow the admin time to fix +failures properly. + +Example +======= + +The following diagram shows an HA-enabled pool, before and after a network +link between two hosts fails. + +![High-Availability in action](ha.png) + +When HA is enabled, all hosts in the pool + +- exchange periodic heartbeat messages over the network +- send heartbeats to a shared storage device. +- attempt to acquire a "master lock" on the shared storage. + +HA is designed to recover as much as possible of the pool after a single failure +i.e. it removes single points of failure. When some subset of the pool suffers +a failure then the remaining pool members + +- figure out whether they are in the largest fully-connected set (the + "liveset"); + - if they are not in the largest set then they "fence" themselves (i.e. + force reboot via the hypervisor watchdog) +- elect a master using the "master lock" +- restart all lost VMs. + +After HA has recovered a pool, it is important that the original failure is +addressed because the remaining pool members may not be able to cope with +any more failures. + +Design +====== + +HA must never violate the following safety rules: + +1. there must be at most one master at all times. This is because the master + holds the VM and disk locks. +2. there must be at most one instance of a particular VM at all times. This + is because starting the same VM twice will result in severe filesystem + corruption. 
+ +However to be useful HA must: + +- detect failures quickly; +- minimise the number of false-positives in the failure detector; and +- make the failure handling logic as robust as possible. + +The implementation difficulty arises when trying to be both useful and safe +at the same time. + +Terminology +----------- + +We use the following terminology: + +- *fencing*: also known as I/O fencing, refers to the act of isolating a + host from network and storage. Once a host has been fenced, any VMs running + there cannot generate side-effects observable to a third party. This means + it is safe to restart the running VMs on another node without violating the + safety-rule and running the same VM simultaneously in two locations. +- *heartbeating*: exchanging status updates with other hosts at regular + pre-arranged intervals. Heartbeat messages reveal that hosts are alive + and that I/O paths are working. +- *statefile*: a shared disk (also known as a "quorum disk") on the "Heartbeat" + SR which is mapped as a block device into every host's domain 0. The shared + disk acts both as a channel for heartbeat messages and also as a building + block of a Pool master lock, to prevent multiple hosts becoming masters in + violation of the safety-rule (a dangerous situation also known as + "split-brain"). +- *management network*: the network over which the XenAPI XML/RPC requests + flow and also used to send heartbeat messages. +- *liveset*: a per-Host view containing a subset of the Hosts in the Pool + which are considered by that Host to be alive i.e. responding to XenAPI + commands and running the VMs marked as `resident_on` there. When a Host `b` + leaves the liveset as seen by Host `a` it is safe for Host `a` to assume + that Host `b` has been fenced and to take recovery actions (e.g. restarting + VMs), without violating either of the safety-rules. 
+- *properly shared SR*: an SR which has field `shared=true`; and which has a + `PBD` connecting it to every `enabled` Host in the Pool; and where each of + these `PBD`s has field `currently_attached` set to true. A VM whose disks + are in a properly shared SR could be restarted on any `enabled` Host, + memory and network permitting. +- *properly shared Network*: a Network which has a `PIF` connecting it to + every `enabled` Host in the Pool; and where each of these `PIF`s has + field `currently_attached` set to true. A VM whose VIFs connect to + properly shared Networks could be restarted on any `enabled` Host, + memory and storage permitting. +- *agile*: a VM is said to be agile if all disks are in properly shared SRs + and all network interfaces connect to properly shared Networks. +- *unprotected*: an unprotected VM has field `ha_always_run` set to false + and will never be restarted automatically on failure + or have reconfiguration actions blocked by the HA overcommit protection. +- *best-effort*: a best-effort VM has fields `ha_always_run` set to true and + `ha_restart_priority` set to `best-effort`. + A best-effort VM will only be restarted if (i) the failure is directly + observed; and (ii) capacity exists for an immediate restart. + No more than one restart attempt will ever be made. +- *protected*: a VM is said to be protected if it will be restarted by HA + i.e. has field `ha_always_run` set to true and + field `ha_restart_priority` not set to `best-effort`. +- *survival rule 1*: describes the situation where hosts survive + because they are in the largest network partition with statefile access. + This is the normal state of the `xhad` daemon. +- *survival rule 2*: describes the situation where *all* hosts have lost + access to the statefile but remain alive + while they can all see each other on the network. In this state any further + failure will cause all nodes to self-fence. 
+ This state is intended to cope with the system-wide temporary loss of the + storage service underlying the statefile. + +Assumptions +----------- + +We assume: + +- All I/O used for monitoring the health of hosts (i.e. both storage and + network-based heartbeating) is along redundant paths, so that it survives + a single hardware failure (e.g. a broken switch or an accidentally-unplugged + cable). It is up to the admin to ensure their environment is set up correctly. +- The hypervisor watchdog mechanism will be able to guarantee the isolation + of nodes, once communication has been lost, within a pre-arranged time + period. Therefore no active power fencing equipment is required. +- VMs may only be marked as *protected* if they are fully *agile* i.e. able + to run on any host, memory permitting. No additional constraints of any kind + may be specified e.g. it is not possible to make "CPU reservations". +- Pools are assumed to be homogeneous with respect to CPU type and presence of + VT/SVM support (also known as "HVM"). If a Pool is created with + non-homogeneous hosts using the `--force` flag then the additional + constraints will not be noticed by the VM failover planner, resulting in + runtime failures while trying to execute the failover plans. +- No attempt will ever be made to shut down or suspend "lower" priority VMs + to guarantee the survival of "higher" priority VMs. +- Once HA is enabled it is not possible to reconfigure the management network + or the SR used for storage heartbeating. +- VMs marked as *protected* are considered to have failed if they are offline + i.e. the VM failure handling code is level-sensitive rather than + edge-sensitive. +- VMs marked as *best-effort* are considered to have failed only when the host + where they are resident is declared offline + i.e. the best-effort VM failure handling code is edge-sensitive rather than + level-sensitive. + A single restart is attempted and, if this fails, no further start is + attempted. 
+- HA can only be enabled if all Pool hosts are online and actively responding + to requests. +- when HA is enabled the database is configured to write all updates to + the "Heartbeat" SR, guaranteeing that VM configuration changes are not lost + when a host fails. + +Components +---------- + +The implementation is split across the following components: + +- [xhad](https://github.com/xenserver/xha): the cluster membership daemon + maintains a quorum of hosts through network and storage heartbeats +- [xapi](https://github.com/xapi-project/xen-api): used to configure the + HA policy i.e. which network and storage to use for heartbeating and which + VMs to restart after a failure. +- [xen](http://xenproject.org/): the Xen watchdog is used to reliably + fence the host when the host has been (partially or totally) isolated + from the cluster + +To avoid a "split-brain", the cluster membership daemon must "fence" (i.e. +isolate) nodes when they are not part of the cluster. In general there are +2 approaches: + +- cut the power of remote hosts which you can't talk to on the network + any more. This is the approach taken by most open-source clustering + software since it is simpler. However it has the downside of requiring + the customer buy more hardware and set it up correctly. +- rely on the remote hosts using a watchdog to cut their own power (i.e. + halt or reboot) after a timeout. This relies on the watchdog being + reliable. Most other people [don't trust the Linux watchdog](https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html); + after all the Linux kernel is highly threaded, performs a lot of (useful) + functions and kernel bugs which result in deadlocks do happen. + We use the Xen watchdog because we believe that the Xen hypervisor is + simple enough to reliably fence the host (via triggering a reboot of + domain 0 which then triggers a host reboot). 
+ +xhad +==== + +[xhad](https://github.com/xenserver/xha) is the cluster membership daemon: +it exchanges heartbeats with the other nodes to determine which nodes are +still in the cluster (the "live set") and which nodes have *definitely* +failed (through watchdog fencing). When a host has definitely failed, xapi +will unlock all the disks and restart the VMs according to the HA policy. + +Since Xapi is a critical part of the system, xhad also acts as a +Xapi watchdog. It polls Xapi every few seconds and checks if Xapi can +respond. If Xapi seems to have failed then xhad will restart it. If restarts +continue to fail then xhad will consider the host to have failed and +self-fence. + +xhad is configured via a simple config file written on each host in +`/etc/xensource/xhad.conf`. The file must be identical on each host +in the cluster. To make changes to the file, HA must be disabled and then +re-enabled afterwards. Note it may not be possible to re-enable HA depending +on the configuration change (e.g. if a host has been added but that host has +a broken network configuration then this will block HA enable). + +The xhad.conf file is written in XML and contains + +- pool-wide configuration: this includes a list of all hosts which should + be in the liveset and global timeout information +- local host configuration: this identifies the local host and describes + which local network interface and block device to use for heartbeating. 
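Most of the timing values in this file are described in the field list further down as simple functions of one user-supplied HA timeout `T`. The following is a minimal sketch of that derivation, not xapi's actual code: the record type and the function `derive_timeouts` are hypothetical names, and clamping the interval into the range 2 to 6 is an assumption about how the stated bound `2 <= t <= 6` is enforced.

```ocaml
(* Illustrative only: a hypothetical helper, not part of xapi or xhad. *)
type timeouts = {
  heartbeat_timeout : int;           (* T *)
  statefile_timeout : int;           (* T *)
  heartbeat_interval : int;          (* t = (T + 10) / 10, kept within 2..6 *)
  statefile_interval : int;          (* same as heartbeat_interval *)
  heartbeat_watchdog_timeout : int;  (* T *)
  statefile_watchdog_timeout : int;  (* T + 15 *)
  boot_join_timeout : int;           (* T + 60 *)
  enable_join_timeout : int;         (* T + 60 *)
}

let derive_timeouts (t : int) : timeouts =
  (* The documentation requires T to be bigger than 10. *)
  if t <= 10 then invalid_arg "HA timeout T must be bigger than 10";
  (* Assumption: the 2..6 bound is applied by clamping. *)
  let interval = max 2 (min 6 ((t + 10) / 10)) in
  {
    heartbeat_timeout = t;
    statefile_timeout = t;
    heartbeat_interval = interval;
    statefile_interval = interval;
    heartbeat_watchdog_timeout = t;
    statefile_watchdog_timeout = t + 15;
    boot_join_timeout = t + 60;
    enable_join_timeout = t + 60;
  }
```

With `T = 30` this sketch yields intervals of 4, watchdog timeouts of 30 and 45, and join timeouts of 90, which are the values that appear in the example file.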
+ +The following is an example xhad.conf file: + +```xml + + + + + + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + 694 + + + + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + xxx.xxx.xxx.xx1 + + + + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + xxx.xxx.xxx.xx2 + + + + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + xxx.xxx.xxx.xx3 + + + + + 4 + 30 + 4 + 30 + 30 + 45 + 90 + 90 + 60 + 10 + 1 + 30 + 30 + + + + + + + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2 + xapi1 + bond0 + /dev/statefiledevicename + + + + +``` + +The fields have the following meaning: + +- GenerationUUID: a UUID generated each time HA is reconfigured. This allows + xhad to tell an old host (one which failed, was removed from the + configuration, and was then repaired and restarted) that the world has + changed while it was away. +- UDPport: the port number to use for network heartbeats. It's important + to allow this traffic through the firewall and to make sure the same + port number is free on all hosts (beware of portmap services occasionally + binding to it). +- HostID: a UUID identifying a host in the pool. We would normally use + xapi's notion of a host uuid. +- IPaddress: any IP address on the remote host. We would normally use + xapi's notion of a management network. +- HeartbeatTimeout: if a heartbeat packet is not received for this many + seconds, then xhad considers the heartbeat to have failed. This is + the user-supplied "HA timeout" value, represented below as `T`. + `T` must be bigger than 10; we would normally use 60s. +- StateFileTimeout: if a storage update is not seen for a host for this + many seconds, then xhad considers the storage heartbeat to have failed. + We would normally use the same value as the HeartbeatTimeout `T`. +- HeartbeatInterval: interval between heartbeat packets sent. We would + normally use a value `2 <= t <= 6`, derived from the user-supplied + HA timeout via `t = (T + 10) / 10` +- StateFileInterval: interval between storage updates (also known as + "statefile updates"). 
This would normally be set to the same value as + HeartbeatInterval. +- HeartbeatWatchdogTimeout: If the host does not send a heartbeat for this + amount of time then the host self-fences via the Xen watchdog. We normally + set this to `T`. +- StateFileWatchdogTimeout: If the host does not update the statefile for + this amount of time then the host self-fences via the Xen watchdog. We + normally set this to `T+15`. +- BootJoinTimeout: When the host is booting and joining the liveset (i.e. + the cluster), consider the join a failure if it takes longer than this + amount of time. We would normally set this to `T+60`. +- EnableJoinTimeout: When the host is enabling HA for the first time, + consider the enable a failure if it takes longer than this amount of time. + We would normally set this to `T+60`. +- XapiHealthCheckInterval: Interval between "health checks" where we run + a script to check whether Xapi is responding or not. +- XapiHealthCheckTimeout: Number of seconds to wait before assuming that + Xapi has deadlocked during a "health check". +- XapiRestartAttempts: Number of Xapi restarts to attempt before concluding + Xapi has permanently failed. +- XapiRestartTimeout: Number of seconds to wait for a Xapi restart to + complete before concluding it has failed. +- XapiLicenseCheckTimeout: Number of seconds to wait for a Xapi license + check to complete before concluding that xhad should terminate. + +In addition to the config file, Xhad provides a simple control API in the +form of scripts: + +- `ha_set_pool_state (Init | Invalid)`: sets the global pool state to "Init" (before starting + HA) or "Invalid" (causing all other daemons which can see the statefile to + shut down) +- `ha_start_daemon`: if the pool state is "Init" then the daemon will + attempt to contact other daemons and enable HA. If the pool state is + "Active" then the host will attempt to join the existing liveset. +- `ha_query_liveset`: returns the current state of the cluster. 
+- `ha_propose_master`: returns whether the current node has been + elected pool master. +- `ha_stop_daemon`: shuts down the xhad on the local host. Note this + will not disarm the Xen watchdog by itself. +- `ha_disarm_fencing`: disables fencing on the local host. +- `ha_set_excluded`: when a host is being shut down cleanly, record the + fact that the VMs have all been shut down so that this host can be ignored + in future cluster membership calculations. + +Fencing +------- + +Xhad continuously monitors whether the host should remain alive, or if +it should self-fence. There are two "survival rules" which will keep a host +alive; if neither rule applies (or if xhad crashes or deadlocks) then the +host will fence. The rules are: + +1. Xapi is running; the storage heartbeats are visible; this host is a + member of the "best" partition (as seen through the storage heartbeats) +2. Xapi is running; the storage is inaccessible; all hosts which should + be running (i.e. not those "excluded" by being cleanly shut down) are + online and have also lost storage access (as seen through the network + heartbeats). + +where the "best" partition is the largest one if that is unique, or if there +are multiple partitions of the same size then the one containing the lowest +host uuid is considered best. + +The first survival rule is the "normal" case. The second rule exists only +to prevent the storage from becoming a single point of failure: all hosts +can remain alive until the storage is repaired. Note that if a host has +failed and has not yet been repaired, then the storage becomes a single +point of failure for the degraded pool. HA removes single points of failure, +but multiple failures can still cause problems. It is important to fix +failures properly after HA has worked around them. 
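The two survival rules and the "best" partition tie-break can be sketched as follows. This is an illustration under stated assumptions, not xhad's implementation: all type and function names are hypothetical, partitions are assumed to be non-empty lists of host UUID strings, and "lowest host uuid" is assumed to mean lowest in plain string order.

```ocaml
(* Illustrative only: hypothetical names, not taken from xhad. *)

(* A partition is a non-empty list of host UUIDs that can see each other. *)
type partition = string list

(* Lowest UUID in a non-empty partition (string order is an assumption). *)
let lowest_uuid (p : partition) : string = List.fold_left min (List.hd p) p

(* [better a b]: the larger partition wins; on equal size the partition
   containing the lowest host UUID wins. *)
let better (a : partition) (b : partition) : bool =
  let la = List.length a and lb = List.length b in
  la > lb || (la = lb && lowest_uuid a < lowest_uuid b)

let best_partition (ps : partition list) : partition option =
  match ps with
  | [] -> None
  | p :: rest ->
      Some (List.fold_left (fun best q -> if better q best then q else best) p rest)

(* What one host currently observes. *)
type view = {
  xapi_running : bool;
  statefile_visible : bool;
  in_best_partition : bool;  (* as seen through the storage heartbeats *)
  all_peers_online_without_storage : bool;  (* as seen through the network heartbeats *)
}

(* A host stays alive if either survival rule holds; otherwise it fences. *)
let should_survive (v : view) : bool =
  let rule1 = v.xapi_running && v.statefile_visible && v.in_best_partition in
  let rule2 =
    v.xapi_running && (not v.statefile_visible)
    && v.all_peers_online_without_storage
  in
  rule1 || rule2
```

For example, given two partitions of two hosts each, `best_partition` picks the one containing the lowest UUID, and a host whose Xapi has stopped responding never satisfies either rule.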
+ + +xapi +==== + +[Xapi](https://github.com/xapi-project/xen-api) is responsible for + +- exposing an interface for setting HA policy +- creating VDIs (disks) on shared storage for heartbeating and storing + the pool database +- arranging for these disks to be attached on host boot, before the "SRmaster" + is online +- configuring and managing the `xhad` heartbeating daemon + +The HA policy APIs include + +- methods to determine whether a VM is *agile* i.e. can be restarted in + principle on any host after a failure +- planning for a user-specified number of host failures and enforcing + access control +- restarting failed *protected* VMs in policy order + +The HA policy settings are stored in the Pool database which is written +(synchronously) +to a VDI in the same SR that's being used for heartbeating. This ensures +that the database can be recovered after a host fails and the VMs are +recovered. + +Xapi stores 2 settings in its local database: + +- *ha_disable_failover_actions*: this is set to false when we want nodes + to be able to recover VMs -- this is the normal case. It is set to true + during the HA disable process to prevent a split-brain forming while + HA is only partially enabled. +- *ha_armed*: this is set to true to tell Xapi to start `Xhad` during + host startup and wait to join the liveset. + +Disks on shared storage +----------------------- + +The regular disk APIs for creating, destroying, attaching, detaching (etc) +disks need the `SRmaster` (usually but not always the Pool master) to be +online to allow the disks to be locked. The `SRmaster` cannot be brought +online until the host has joined the liveset. Therefore we have a +cyclic dependency: joining the liveset needs the statefile disk to be attached +but attaching a disk requires being a member of the liveset already. + +The dependency is broken by adding an explicit "unlocked" attach storage +API called `VDI_ATTACH_FROM_CONFIG`. 
Xapi uses the `VDI_GENERATE_CONFIG` API +during the HA enable operation and stores away the result. When the system +boots the `VDI_ATTACH_FROM_CONFIG` is able to attach the disk without the +SRmaster. + +The role of Host.enabled +------------------------ + +The `Host.enabled` flag is used to mean, "this host is ready to start VMs and +should be included in failure planning". +The VM restart planner assumes for simplicity that all *protected* VMs can +be started anywhere; therefore all involved networks and storage must be +*properly shared*. +If a host with an unplugged `PBD` were to become enabled then the corresponding +`SR` would cease to be *properly shared*, all the VMs would cease to be +*agile* and the VM restart logic would fail. + +To ensure the VM restart logic always works, great care is taken to make +sure that Hosts may only become enabled when their networks and storage are +properly configured. This is achieved by: + +- when the master boots and initialises its database it sets all Hosts to + dead and disabled and then signals the HA background thread + ([signal_database_state_valid](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/xapi_ha.ml#L627)) + to wake up from sleep and + start processing liveset information (and potentially setting hosts to live) +- when a slave calls + [Pool.hello](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/xapi_pool.ml#L1019) + (i.e. 
after the slave has rebooted), + the master sets it to disabled, allowing it a grace period to plug in its + storage; +- when a host (master or slave) successfully plugs in its networking and + storage it calls + [consider_enabling_host](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/xapi_host_helpers.ml#L193) + which checks that the + preconditions are met and then sets the host to enabled; and +- when a slave notices its database connection to the master restart + (i.e. after the master `xapi` has just restarted) it calls + `consider_enabling_host`. + +The steady-state +---------------- + +When HA is enabled and all hosts are running normally then each calls +`ha_query_liveset` every 10s. + +Slaves check to see if the host they believe is the master is alive and has +the master lock. If another node has become master then the slave will +rewrite its `pool.conf` and restart. If no node is the master then the +slave will call +[on_master_failure](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/xapi_ha.ml#L129), +proposing itself and, if it is rejected, +checking the liveset to see which node acquired the lock. + +The master monitors the liveset and updates the `Host_metrics.live` flag +of every host to reflect the liveset value. For every host which is not in +the liveset (i.e. has fenced) it enumerates all resident VMs and marks them +as `Halted`. For each protected VM which is not running, the master computes +a VM restart plan and attempts to execute it. If the plan fails then a +best-effort `VM.start` call is attempted. Finally an alert is generated if +the VM could not be restarted. + +Note that XenAPI heartbeats are still sent when HA is enabled, even though +they are not used to drive the values of the `Host_metrics.live` field. 
+Note further that, when a host is being shut down, the host is immediately +marked as dead and its host reference is added to a list used to prevent the +`Host_metrics.live` being accidentally reset back to live again by the +asynchronous liveset query. The Host reference is removed from the list when +the host restarts and calls `Pool.hello`. + +Planning and overcommit +----------------------- + +The VM failover planning code is sub-divided into two pieces, stored in +separate files: + +- [binpack.ml](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/binpack.ml): contains two algorithms for packing items of different sizes + (i.e. VMs) into bins of different sizes (i.e. Hosts); and +- [xapi_ha_vm_failover.ml](https://github.com/xapi-project/xen-api/blob/0bbd4f5ac5fe46f9e982e5d5587ac56ed8427295/ocaml/xapi/xapi_ha_vm_failover.ml): interfaces between the Pool database and the + binpacker; also performs counterfactual reasoning for overcommit protection. + +The inputs to the binpacking algorithms are configuration values which +represent an abstract view of the Pool: + +```ocaml +type ('a, 'b) configuration = { + hosts: ('a * int64) list; (** a list of live hosts and free memory *) + vms: ('b * int64) list; (** a list of VMs and their memory requirements *) + placement: ('b * 'a) list; (** current VM locations *) + total_hosts: int; (** total number of hosts in the pool 'n' *) + num_failures: int; (** number of failures to tolerate 'r' *) +} +``` +Note that: + +- the memory required by the VMs listed in `placement` has already been + subtracted from the total memory of the hosts; it doesn't need to be + subtracted again. +- the free memory of each host has already had per-host miscellaneous + overheads subtracted from it, including that used by unprotected VMs, + which do not appear in the VM list. +- the total number of hosts in the pool (`total_hosts`) is a constant for + any particular invocation of HA. 
+- the number of failures to tolerate (`num_failures`) is the user-settable + value from the XenAPI `Pool.ha_host_failures_to_tolerate`. + + +There are two algorithms which satisfy the interface: + +```ocaml +sig + plan_always_possible: ('a, 'b) configuration -> bool; + get_specific_plan: ('a, 'b) configuration -> 'b list -> ('b * 'a) list +end +``` + +The function `get_specific_plan` takes a configuration and a list of VMs +which have failed (the `'b list` argument). It returns a VM restart plan represented as a VM to Host +association list. This is the function called by the +background HA VM restart thread on the master. + +The function `plan_always_possible` returns true if every sequence of Host +failures of length +`num_failures` (irrespective of whether all hosts failed at once, or in +multiple separate episodes) +would result in calls to `get_specific_plan` which would allow all protected +VMs to be restarted. +This function is heavily used by the overcommit protection logic as well as code in XenCenter which aims to +maximise failover capacity using the counterfactual reasoning APIs: + +```ocaml +Pool.ha_compute_max_host_failures_to_tolerate +Pool.ha_compute_hypothetical_max_host_failures_to_tolerate +``` + +There are two binpacking algorithms: the more detailed but expensive +algorithm is used for smaller/less +complicated pool configurations while the less detailed, cheaper algorithm +is used for the rest. The +choice between algorithms is based only on `total_hosts` (`n`) and +`num_failures` (`r`). +Note that the choice of algorithm will only change if the number of Pool +hosts is varied (requiring HA to be disabled and then enabled) or if the +user requests a new `num_failures` target to plan for. + +The expensive algorithm uses an exhaustive search with a +"biggest-fit-decreasing" strategy that +takes the biggest VMs first and allocates them to the biggest remaining Host. +The implementation keeps the VMs and Hosts as sorted lists throughout.
+There are a number of transformations to the input configuration which are +guaranteed to preserve the existence of a VM to host allocation (even if +the actual allocation is different). These transformations which are safe +are: + +- VMs may be removed from the list +- VMs may have their memory requirements reduced +- Hosts may be added +- Hosts may have additional memory added. + +The cheaper algorithm is used for larger Pools where the state space to +search is too large. It uses the same "biggest-fit-decreasing" strategy +with the following simplifying approximations: + +- every VM that fails is as big as the biggest +- the number of VMs which fail due to a single Host failure is always the + maximum possible (even if these are all very small VMs) +- the largest and most capable Hosts fail + +An informal argument that these approximations are safe is as follows: +if the maximum *number* of VMs fail, each of which is size of the largest +and we can find a restart plan using only the smaller hosts then any real +failure: + +- can never result in the failure of more VMs; +- can never result in the failure of bigger VMs; and +- can never result in less host capacity remaining. + +Therefore we can take this *almost-certainly-worse-than-worst-case* failure +plan and: + +- replace the remaining hosts in the worst case plan with the real remaining + hosts, which will be the same size or larger; and +- replace the failed VMs in the worst case plan with the real failed VMs, + which will be fewer or the same in number and smaller or the same in size. + +Note that this strategy will perform best when each host has the same number +of VMs on it and when all VMs are approximately the same size. If one very big +VM exists and a lot of smaller VMs then it will probably fail to find a plan. +It is more tolerant of differing amounts of free host memory. 
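
The three simplifying approximations can be illustrated with a minimal sketch. This is not the code in `binpack.ml`: the function name `plan_always_possible_approx`, its labelled arguments and the helper `split_at` are hypothetical, introduced only to show why the check becomes cheap once every failed VM is treated as being as big as the biggest one.

```ocaml
(* Hypothetical sketch of the conservative check; NOT the code in binpack.ml. *)

(* Sort a list of int64 values, largest first. *)
let sort_desc xs = List.sort (fun a b -> Int64.compare b a) xs

(* Split a list into its first [n] elements and the remainder. *)
let rec split_at n xs =
  match xs with
  | x :: rest when n > 0 ->
      let taken, left = split_at (n - 1) rest in
      (x :: taken, left)
  | _ -> ([], xs)

(* [host_free]:    free memory of each live host
   [vm_sizes]:     memory requirement of each protected VM
   [vms_per_host]: number of protected VMs currently on each host
   [r]:            number of host failures to tolerate *)
let plan_always_possible_approx ~host_free ~vm_sizes ~vms_per_host ~r =
  match sort_desc vm_sizes with
  | [] -> true (* nothing to restart *)
  | biggest :: _ when Int64.compare biggest 0L <= 0 -> true
  | biggest :: _ ->
      (* Approximation 3: the largest hosts are the ones that fail. *)
      let _failed, survivors = split_at r (sort_desc host_free) in
      (* Approximation 2: the maximum possible number of VMs fails. *)
      let worst_counts, _ =
        split_at r (List.sort (fun a b -> compare b a) vms_per_host)
      in
      let failed_vms = List.fold_left ( + ) 0 worst_counts in
      (* Approximation 1: every failed VM is as big as the biggest, so a
         surviving host has room for [free / biggest] of them. *)
      let slots =
        List.fold_left
          (fun acc free -> acc + Int64.to_int (Int64.div free biggest))
          0 survivors
      in
      failed_vms <= slots
```

Under these assumptions the search collapses to counting slots, which is why this style of check scales to large Pools. It also shows the weakness noted above: a single very large VM shrinks the number of slots on every surviving host.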
+ +Overcommit protection +--------------------- + +Overcommit protection blocks operations which would prevent the Pool being +able to restart *protected* VMs after host failure. +The Pool may become unable to restart protected VMs in two general ways: +(i) by running out of resource i.e. host memory; and (ii) by altering host +configuration in such a way that VMs cannot be started (or the planner +thinks that VMs cannot be started). + +API calls which would change the amount of host memory currently in use +(`VM.start`, `VM.resume`, `VM.migrate` etc) +have been modified to call the planning functions supplying special +"configuration change" parameters. +Configuration change values represent the proposed operation and have type + +```ocaml +type configuration_change = { + (** existing VMs which are leaving *) + old_vms_leaving: (API.ref_host * (API.ref_VM * API.vM_t)) list; + (** existing VMs which are arriving *) + old_vms_arriving: (API.ref_host * (API.ref_VM * API.vM_t)) list; + (** hosts to pretend to disable *) + hosts_to_disable: API.ref_host list; + (** new number of failures to consider *) + num_failures: int option; + (** new VMs to restart *) + new_vms_to_protect: API.ref_VM list; +} +``` + +A VM migration will be represented by saying the VM is "leaving" one host and +"arriving" at another. A VM start or resume will be represented by saying the +VM is "arriving" on a host. + + +Note that no attempt is made to integrate the overcommit protection with the +general `VM.start` host chooser as this would be quite expensive. + +Note that the overcommit protection calls are written as `asserts` called +within the message forwarder in the master, holding the main forwarding lock. + +API calls which would change the system configuration in such a way as to +prevent the HA restart planner being able to guarantee to restart protected +VMs are also blocked. 
These calls include: + +- `VBD.create`: where the disk is not in a *properly shared* SR +- `VBD.insert`: where the CDROM is local to a host +- `VIF.create`: where the network is not *properly shared* +- `PIF.unplug`: when the network would cease to be *properly shared* +- `PBD.unplug`: when the storage would cease to be *properly shared* +- `Host.enable`: when some network or storage would cease to be + *properly shared* (e.g. if this host had a broken storage configuration) + + +xen +=== + +The Xen hypervisor has per-domain watchdog counters which, when enabled, +decrement as time passes and can be reset by a hypercall from the domain. +If the domain fails to make the hypercall and the timer reaches zero then +the domain is immediately shut down with reason reboot. We configure Xen +to reboot the host when domain 0 enters this state. + +High-level operations +===================== + + +Enabling HA +----------- + +Before HA can be enabled the admin must take care to configure the +environment properly. In particular: + +- NIC bonds should be available for network heartbeats; +- multipath should be configured for the storage heartbeats; +- all hosts should be online and fully-booted. + +The XenAPI client can request a specific shared SR to be used for +storage heartbeats, otherwise Xapi will use the Pool's default SR. +Xapi will use `VDI_GENERATE_CONFIG` to ensure the disk will be attached +automatically on system boot before the liveset has been joined. + +Note that extra effort is made to re-use any existing heartbeat VDIs +so that + +- if HA is disabled with some hosts offline, when they are rebooted they + stand a higher chance of seeing a well-formed statefile with an explicit + *invalid* state. If the VDIs were destroyed on HA disable then hosts which + boot up later would fail to attach the disk and it would be harder to + distinguish between a temporary storage failure and a permanent HA disable.
+- the heartbeat SR can be created on expensive low-latency high-reliability + storage and made as small as possible (to minimise infrastructure cost), + safe in the knowledge that if HA enables successfully once, it won't run + out of space and fail to enable in the future. + +The Xapi-to-Xapi communication looks as follows: + +![Configuring HA around the Pool](HA.configure.svg) + +The Xapi Pool master calls `Host.ha_join_liveset` on all hosts in the +pool simultaneously. Each host +runs the `ha_start_daemon` script +which starts Xhad. Each Xhad starts exchanging heartbeats over the network +and storage defined in the `xhad.conf`. + +Joining a liveset +----------------- + +![Starting up a host](HA.start.svg) + +The Xhad instances exchange heartbeats and decide which hosts are in +the "liveset" and which have been fenced. + +After joining the liveset, each host clears +the "excluded" flag which would have +been set if the host had been shutdown cleanly before -- this is only +needed when a host is shutdown cleanly and then restarted. + +Xapi periodically queries the state of xhad via the `ha_query_liveset` +command. The state will be `Starting` until the liveset is fully +formed at which point the state will be `Online`. + +When the `ha_start_daemon` script returns then Xapi will decide +whether to stand for master election or not. Initially when HA is being +enabled and there is a master already, this node will be expected to +stand unopposed. Later when HA notices that the master host has been +fenced, all remaining hosts will stand for election and one of them will +be chosen. + +Shutting down a host +-------------------- + +![Shutting down a host](HA.shutdown.svg) + +When a host is to be shutdown cleanly, it can be safely "excluded" +from the pool such that a future failure of the storage heartbeat will +not cause all pool hosts to self-fence (see survival rule 2 above). 
+When a host is "excluded" all other hosts know that the host does not +consider itself a master and has no resources locked i.e. no VMs are +running on it. An excluded host will never allow itself to form part +of a "split brain". + +Once a host has given up its master role and shutdown any VMs, it is safe +to disable fencing with `ha_disarm_fencing` and stop xhad with +`ha_stop_daemon`. Once the daemon has been stopped the "excluded" +bit can be set in the statefile via `ha_set_excluded` and the +host safely rebooted. + +Restarting a host +----------------- + +When a host restarts after a failure Xapi notices that *ha_armed* is +set in the local database. Xapi + +- runs the `attach-static-vdis` script to attach the statefile and + database VDIs. This can fail if the storage is inaccessible; Xapi will + retry until it succeeds. +- runs the ha_start_daemon to join the liveset, or determine that HA + has been cleanly disabled (via setting the state to *Invalid*). + +In the special case where Xhad fails to access the statefile and the +host used to be a slave then Xapi will try to contact the previous master +and find out + +- who the new master is; +- whether HA is enabled on the Pool or not. + +If Xapi can confirm that HA was disabled then it will disarm itself and +join the new master. Otherwise it will keep waiting for the statefile +to recover. + +In the special case where the statefile has been destroyed and cannot +be recovered, there is an emergency HA disable API the admin can use to +assert that HA really has been disabled, and it's not simply a connectivity +problem. Obviously this API should only be used if the admin is totally +sure that HA has been disabled. + +Disabling HA +------------ + +There are 2 methods of disabling HA: one for the "normal" case when the +statefile is available; and the other for the "emergency" case when the +statefile has failed and can't be recovered. 
+ +Disabling HA cleanly +-------------------- + +![Disabling HA cleanly](HA.disable.clean.svg) + +HA can be shut down cleanly when the statefile is working i.e. when hosts +are alive because of survival rule 1. First the master Xapi tells the local +Xhad to mark the pool state as "invalid" using `ha_set_pool_state`. +Every xhad instance will notice this state change the next time it performs +a storage heartbeat. The Xhad instances will shut down and Xapi will notice +that HA has been disabled the next time it attempts to query the liveset. + +If a host loses access to the statefile (or if none of the hosts have +access to the statefile) then HA can be disabled uncleanly. + +Disabling HA uncleanly +---------------------- + +The Xapi master first calls `Host.ha_disable_failover_actions` on each host +which sets `ha_disable_failover_decisions` in the local database. This +prevents the node rebooting, gaining statefile access, acquiring the +master lock and restarting VMs when other hosts have disabled their +fencing (i.e. a "split brain"). + +![Disabling HA uncleanly](HA.disable.unclean.svg) + +Once the master is sure that no host will suddenly start recovering VMs +it is safe to call `Host.ha_disarm_fencing` which runs the script +`ha_disarm_fencing` and then shuts down the Xhad with `ha_stop_daemon`. + +Add a host to the pool +---------------------- + +We assume that adding a host to the pool is an operation the admin will +perform manually, so it is acceptable to disable HA for the duration +and to re-enable it afterwards. If a failure happens during this operation +then the admin will take care of it by hand. diff --git a/doc/content/toolstack/features/VGPU/index.md b/doc/content/toolstack/features/VGPU/index.md new file mode 100644 index 0000000000..83c5ea41fa --- /dev/null +++ b/doc/content/toolstack/features/VGPU/index.md @@ -0,0 +1,190 @@ ++++ +title = "vGPU" ++++ + +XenServer has supported passthrough for GPU devices since XenServer 6.0.
Since +the advent of NVIDIA's vGPU-capable GRID K1/K2 cards it has been possible to +carve up a GPU into smaller pieces yielding a more scalable solution to +boosting graphics performance within virtual machines. + +The K1 has four GK104 GPUs and the K2 two GK107 GPUs. Each of these will be exposed through Xapi so a host with a single K1 card will have access to four independent PGPUs. + +Each of the GPUs can then be subdivided into vGPUs. For each type of PGPU, +there are a few options of vGPU type which consume different amounts of the +PGPU. For example, K1 and K2 cards can currently be configured in the following +ways: + +![Possible VGX configurations](vgx-configs.png) + +Note that this diagram is not to scale; the PGPU resource required by each +vGPU type is as follows: + +| vGPU type | PGPU kind | vGPUs / PGPU | +| --------- | --------- | ------------ | +| k100 | GK104 | 8 | +| k140Q | GK104 | 4 | +| k200 | GK107 | 8 | +| k240Q | GK107 | 4 | +| k260Q | GK107 | 2 | + +Currently each physical GPU (PGPU) only supports *homogeneous vGPU +configurations* but different configurations are supported on different PGPUs +across a single K1/K2 card. This means that, for example, a host with a K1 card +can run 32 VMs with k100 vGPUs (8 on each of its four PGPUs). + +## XenServer's vGPU architecture +A new display type has been added to the device model: + +```udiff +@@ -4519,6 +4522,7 @@ static const QEMUOption qemu_options[] = + + /* Xen tree options: */ + { "std-vga", 0, QEMU_OPTION_std_vga }, ++ { "vgpu", 0, QEMU_OPTION_vgpu }, + { "videoram", HAS_ARG, QEMU_OPTION_videoram }, + { "d", HAS_ARG, QEMU_OPTION_domid }, /* deprecated; for xend compatibility */ + { "domid", HAS_ARG, QEMU_OPTION_domid }, +``` + +With this in place, `qemu` can now be started using a new option that will +enable it to communicate with a new display emulator, `vgpu`, to expose the +graphics device to the guest.
The `vgpu` binary is responsible for handling the +VGX-capable GPU and, once it has been successfully passed through, the in-guest +drivers can be installed in the same way as when it detects new hardware. + +The diagram below shows the relevant parts of the architecture for this +project. + +![XenServer's vGPU architecture](vgpu-arch.png) + +### Relevant code +* In Xenopsd: [Xenops_server_xen][1] is where +Xenopsd gets the vGPU information from the values passed from Xapi; +* In Xenopsd: [Device.__start][2] is where the `vgpu` process is started, if +necessary, before Qemu. + +## Xapi's API and data model + +A lot of work has gone into the toolstack to handle the creation and management +of VMs with vGPUs. We revised our data model, introducing a semantic link +between `VGPU` and `PGPU` objects to help with utilisation tracking; we +maintained the `GPU_group` concept as a pool-wide abstraction of PGPUs +available for VMs; and we added **`VGPU_types`** which are configurations for +`VGPU` objects. + +![Xapi's vGPU datamodel](vgpu-datamodel.png) + +**Aside:** The VGPU type in Xapi's data model predates this feature and was +synonymous with GPU-passthrough. A VGPU is simply a display device assigned to +a VM which may be a vGPU (this feature) or a whole GPU (a VGPU of type +_passthrough_). + +**`VGPU_types`** can be enabled/disabled on a **per-PGPU basis** allowing for +reservation of particular PGPUs for certain workloads. VGPUs are allocated on +PGPUs within their GPU group in either a _depth-first_ or _breadth-first_ +manner, which is configurable on a per-group basis. + +**`VGPU_types`** are created by xapi at startup depending on the available +hardware and config files present in dom0. They exist in the pool database, and +a primary key is used to avoid duplication. In XenServer 6.x the tuple of +`(vendor_name, model_name)` was used as the primary key, however this was not +ideal as these values are subject to change. 
XenServer 7.0 switched to a +[new primary key]({{site.baseurl}}/xapi/futures/vgpu-type-identifiers.html) +generated from static metadata, falling back to the old method for backwards +compatibility. + +A **`VGPU_type`** will be garbage collected when there is no VGPU of that type +and there is no hardware which supports that type. On VM import, all VGPUs and +VGPU_types will be created if necessary - if this results in the creation of a +new VGPU_type then the VM will not be usable until the required hardware and +drivers are installed. + +### Relevant code +* In Xapi: [Xapi_vgpu_type][3] contains the type definitions and parsing logic +for vGPUs; +* In Xapi: [Xapi_pgpu_helpers][4] defines the functions used to allocate vGPUs +on PGPUs. + +## Xapi <-> Xenopsd interface + +In XenServer 6.x, all VGPU config was added to the VM's `platform` field at +startup, and this information was used by xenopsd to start the display emulator. +See the relevant code [here][5]. + +In XenServer 7.0, to facilitate support of VGPU on Intel hardware in parallel +with the existing NVIDIA support, VGPUs were made first-class objects in the +xapi-xenopsd interface. The interface is described +[here]({{site.baseurl}}/features/futures/gpu-support-evolution.html). + +## VM startup + +On the pool master: + +* Assuming no WLB, all VM.start tasks pass through + [Xapi_vm_helpers.choose_host_for_vm_no_wlb][6]. If the VM has a vGPU, the list + of all hosts in the pool is split into a list of lists, where the first list + is the most optimal in terms of the GPU group's allocation mode and the PGPU + availability on each host. +* Each list of hosts in turn is passed to [Xapi_vm_placement.select_host][7], + which checks storage, network and memory availability, until a suitable host + is found. +* Once a host has been chosen, [allocate_vm_to_host][8] will set the + `VM.scheduled_to_be_resident_on` and `VGPU.scheduled_to_be_resident_on` + fields. 
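
The "list of lists" search in the first two steps above can be sketched as follows. This is a simplified illustration, not xapi's implementation: `choose_host` and the `suitable` predicate are hypothetical names, with `suitable` standing in for the storage, network and memory checks, and the sketch simply takes the first suitable host in a group, which may differ from how the real code picks among candidates.

```ocaml
(* Hypothetical sketch, NOT the xapi implementation.
   [groups] is the list of host lists, most preferred group first;
   [suitable] stands in for the storage, network and memory checks. *)
let choose_host ~suitable groups =
  (* List.find_map (OCaml >= 4.10) returns the first [Some] it is given, so a
     later group is only examined when no host in any earlier group matches. *)
  List.find_map (fun group -> List.find_opt suitable group) groups
```

Keeping the groups ordered by GPU preference means the GPU allocation policy takes priority, while the other resource checks only decide between hosts that are equally good from the GPU point of view.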
+ +The task is then ready to be forwarded to the host on which the VM will start: + +* If the VM has a VGPU, the startup task is wrapped in + [Xapi_gpumon.with_gpumon_stopped][9]. This makes sure that the NVIDIA driver + is not in use so that it can be loaded or unloaded from physical GPUs as required. +* The VM metadata, including VGPU metadata, is passed to xenopsd. The creation + of the VGPU metadata is done by [vgpus_of_vm][10]. Note that at this point + passthrough VGPUs are represented by the PCI device type, and metadata is + generated by [pcis_of_vm][11]. +* As part of starting up the VM, xenopsd should report a [VGPU event][12] or a + [PCI event][13], which xapi will use to indicate that the xapi VGPU object can + be marked as `currently_attached`. + +## Usage + +To create a VGPU of a given type you can use `vgpu-create`: + +```bash +$ xe vgpu-create vm-uuid=... gpu-group-uuid=... vgpu-type-uuid=... +``` + +To see a list of VGPU types available for use on your XenServer, run the +following command. Note: these will only be populated if you have installed the +relevant NVIDIA RPMs and if there is hardware installed on that host that supports +each type. Using `params=all` will display more information such as the maximum +number of heads supported by that VGPU type and which PGPUs have this type +enabled and supported. + +```bash +$ xe vgpu-type-list [params=all] +``` + +To access the new and relevant parameters on a PGPU (i.e. +`supported_VGPU_types`, `enabled_VGPU_types`, `resident_VGPUs`) you can use +`pgpu-param-get` with `param-name=supported-vgpu-types`, +`param-name=enabled-vgpu-types` and `param-name=resident-vgpus` respectively. +Alternatively, you can use the following command to list all the parameters +for a given PGPU, including the types it supports and has enabled: + +```bash +$ xe pgpu-list uuid=...
params=all +``` + +[1]: https://github.com/xapi-project/xenopsd/blob/8d06778db2/xc/xenops_server_xen.ml#L1107-L1113 +[2]: https://github.com/xapi-project/xenopsd/blob/8d06778db2/xc/device.ml#L1696-L1708 +[3]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_vgpu_type.ml +[4]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_pgpu_helpers.mli +[5]: https://github.com/xenserver/xen-api/blob/50bce20546/ocaml/xapi/vgpuops.ml#L149-L165 +[6]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_vm_helpers.ml#L618-L651 +[7]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_vm_placement.ml#L81-L97 +[8]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/message_forwarding.ml#L811-L828 +[9]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_vm.ml#L214-L220 +[10]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_xenops.ml#L698-L733 +[11]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_xenops.ml#L598-618 +[12]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_xenops.ml#L1841-L1854 +[13]: https://github.com/xapi-project/xen-api/blob/8a71a4aaaa/ocaml/xapi/xapi_xenops.ml#L1777-L1801 diff --git a/doc/content/toolstack/features/VGPU/vgpu-arch.png b/doc/content/toolstack/features/VGPU/vgpu-arch.png new file mode 100644 index 0000000000..b83a8ae2be Binary files /dev/null and b/doc/content/toolstack/features/VGPU/vgpu-arch.png differ diff --git a/doc/content/toolstack/features/VGPU/vgpu-datamodel.png b/doc/content/toolstack/features/VGPU/vgpu-datamodel.png new file mode 100644 index 0000000000..0357c7ec07 Binary files /dev/null and b/doc/content/toolstack/features/VGPU/vgpu-datamodel.png differ diff --git a/doc/content/toolstack/features/VGPU/vgx-configs.png b/doc/content/toolstack/features/VGPU/vgx-configs.png new file mode 100644 index 0000000000..4defcdb621 Binary files /dev/null and 
b/doc/content/toolstack/features/VGPU/vgx-configs.png differ diff --git a/doc/content/toolstack/features/XSM/index.md b/doc/content/toolstack/features/XSM/index.md new file mode 100644 index 0000000000..0fee862253 --- /dev/null +++ b/doc/content/toolstack/features/XSM/index.md @@ -0,0 +1,24 @@ ++++ +title = "Xapi Storage Migration" ++++ + +The Xapi Storage Migration (XSM) also known as "Storage Motion" allows + +- a running VM to be migrated within a pool, between different hosts + and different storage simultaneously; +- a running VM to be migrated to another pool; +- a disk attached to a running VM to be moved to another SR. + +The following diagram shows how XSM works at a high level: + +![Xapi Storage Migration](xsm.png) + +The slowest part of a storage migration is migrating the storage, since virtual +disks can be very large. Xapi starts by taking a snapshot and copying that to +the destination as a background task. Before the datapath connecting the VM +to the disk is re-established, xapi tells `tapdisk` to start mirroring all +writes to a remote `tapdisk` over NBD. From this point on all VM disk writes +are written to both the old and the new disk. +When the background snapshot copy is complete, xapi can migrate the VM memory +across. Once the VM memory image has been received, the destination VM is +complete and the original can be safely destroyed. 
diff --git a/doc/content/toolstack/features/XSM/xsm.png b/doc/content/toolstack/features/XSM/xsm.png new file mode 100644 index 0000000000..580a7342fc Binary files /dev/null and b/doc/content/toolstack/features/XSM/xsm.png differ diff --git a/doc/content/toolstack/features/_index.md b/doc/content/toolstack/features/_index.md new file mode 100644 index 0000000000..4ebf11cc3c --- /dev/null +++ b/doc/content/toolstack/features/_index.md @@ -0,0 +1,7 @@ ++++ +title = "Features" +weight = 50 ++++ + +{{% children %}} + diff --git a/doc/content/toolstack/features/snapshots/coalesce1.graffle b/doc/content/toolstack/features/snapshots/coalesce1.graffle new file mode 100644 index 0000000000..a854bea135 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce1.graffle differ diff --git a/doc/content/toolstack/features/snapshots/coalesce1.png b/doc/content/toolstack/features/snapshots/coalesce1.png new file mode 100644 index 0000000000..4c475d16fd Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce1.png differ diff --git a/doc/content/toolstack/features/snapshots/coalesce2.graffle b/doc/content/toolstack/features/snapshots/coalesce2.graffle new file mode 100644 index 0000000000..0543ecd985 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce2.graffle differ diff --git a/doc/content/toolstack/features/snapshots/coalesce2.png b/doc/content/toolstack/features/snapshots/coalesce2.png new file mode 100644 index 0000000000..458b7138ff Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce2.png differ diff --git a/doc/content/toolstack/features/snapshots/coalesce3.graffle b/doc/content/toolstack/features/snapshots/coalesce3.graffle new file mode 100644 index 0000000000..3e97b125ee Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce3.graffle differ diff --git a/doc/content/toolstack/features/snapshots/coalesce3.png 
b/doc/content/toolstack/features/snapshots/coalesce3.png new file mode 100644 index 0000000000..e3da83ae51 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/coalesce3.png differ diff --git a/doc/content/toolstack/features/snapshots/index.md b/doc/content/toolstack/features/snapshots/index.md new file mode 100644 index 0000000000..789f35fa1f --- /dev/null +++ b/doc/content/toolstack/features/snapshots/index.md @@ -0,0 +1,168 @@ ++++ +title = "Snapshots" ++++ + +Snapshots represent the state of a VM, or a disk (VDI) at a point in time. They can be used for: + +- backups (hourly, daily, weekly etc) +- experiments (take snapshot, try something, revert back again) +- golden images (install OS, get it just right, clone it 1000s of times) + +Read more about [the Snapshot APIs](../../xen-api/snapshots.html). + +Disk snapshots +============== + +Disks are represented in the XenAPI as VDI objects. Disk snapshots are represented +as VDI objects with the flag `is_a_snapshot` set to true. Snapshots are always +considered read-only, and should only be used for backup or cloning into new +disks. Disk snapshots have a lifetime independent of the disk they are a snapshot +of i.e. if someone deletes the original disk, the snapshots remain. This contrasts +with some storage arrays in which snapshots are "second class" objects which are +automatically deleted when the original disk is deleted. + +Disks are implemented in Xapi via "Storage Manager" (SM) plugins. 
The SM plugins +conform to an api (the SMAPI) which has operations including + +- vdi_create: make a fresh disk, full of zeroes +- vdi_snapshot: create a snapshot of a disk + + +File-based vhd implementation +============================= + +The existing "EXT" and "NFS" file-based Xapi SM plugins store disk data in +trees of .vhd files as in the following diagram: + +![Relationship between VDIs and vhd files](vhd-trees.png) + +From the XenAPI point of view, we have one current VDI and a set of snapshots, +each taken at a different point in time. These VDIs correspond to leaf vhds in +a tree stored on disk, where the non-leaf nodes contain all the shared blocks. + +The vhd files are always thinly-provisioned which means they only allocate new +blocks on an as-needed basis. The snapshot leaf vhd files only contain vhd +metadata and therefore are very small (a few KiB). The parent nodes containing +the shared blocks only contain the shared blocks. The current leaf initially +contains only the vhd metadata and therefore is very small (a few KiB) and will +only grow when the VM writes blocks. + +File-based vhd implementations are a good choice if a "gold image" snapshot +is going to be cloned lots of times. + +Block-based vhd implementation +============================== + +The existing "LVM", "LVMoISCSI" and "LVMoHBA" block-based Xapi SM plugins store +disk data in trees of .vhd files contained within LVM logical volumes: + +![Relationship between VDIs and LVs containing vhd data](lun-trees.png) + +Non-snapshot VDIs are always stored full size (a.k.a. thickly-provisioned). +When parent nodes are created they are automatically shrunk to the minimum size +needed to store the shared blocks. The LVs corresponding with snapshot VDIs +only contain vhd metadata and by default consume 8MiB. Note: this is different +to VDI.clones which are stored full size. 
+ +Block-based vhd implementations are not a good choice if a "gold image" snapshot +is going to be cloned lots of times, since each clone will be stored full size. + +Hypothetical LUN implementation +=============================== + +A hypothetical Xapi SM plugin could use LUNs on an iSCSI storage array +as VDIs, and the array's custom control interface to implement the "snapshot" +operation: + +![Relationship between VDIs and LUNs on a hypothetical storage target](luns.png) + +From the XenAPI point of view, we have one current VDI and a set of snapshots, +each taken at a different point in time. These VDIs correspond to LUNs on the +same iSCSI target, and internally within the target these LUNs are comprised of +blocks from a large shared copy-on-write pool with support for dedup. + +Reverting disk snapshots +======================== + +There is no current way to revert in-place a disk to a snapshot, but it is +possible to create a writable disk by "cloning" a snapshot. + +VM snapshots +============ + +Let's say we have a VM, "VM1" that has 2 disks. Concentrating only +on the VM, VBDs and VDIs, we have the following structure: + +![VM objects](vm.png) + +When we take a snapshot, we first ask the storage backends to snapshot +all of the VDIs associated with the VM, producing new VDI objects. +Then we copy all of the metadata, producing a new 'snapshot' VM +object, complete with its own VBDs copied from the original, but now +pointing at the snapshot VDIs. We also copy the VIFs and VGPUs +but for now we will ignore those. + +This process leads to a set of objects that look like this: + +![VM and snapshot objects](vm-snapshot.png) + +We have fields that help navigate the new objects: ```VM.snapshot_of```, +and ```VDI.snapshot_of```. These, like you would expect, point to the +relevant other objects. + +Deleting VM snapshots +===================== + +When a snapshot is deleted Xapi calls the SM API `vdi_delete`. 
The Xapi SM +plugins which use vhd format data do not reclaim space immediately; instead +they mark the corresponding vhd leaf node as "hidden" and, at some point later, +run a garbage collector process. + +The garbage collector will first determine whether a "coalesce" should happen i.e. +whether any parent nodes have only one child i.e. the "shared" blocks are only +"shared" with one other node. In the following example the snapshot delete leaves +such a parent node and the coalesce process copies blocks from the redundant +parent's only child into the parent: + +![We coalesce parent blocks into grand parent nodes](coalesce1.png) + +Note that if the vhd data is being stored in LVM, then the parent node will +have had to be expanded to full size to accommodate the writes. Unfortunately +this means the act of reclaiming space actually consumes space itself, which +means it is important to never completely run out of space in such an SR. + +Once the blocks have been copied, we can now cut one of the parents out of the +tree by relinking its children into their grandparent: + +![Relink children into grand parent](coalesce2.png) + +Finally the garbage collector can remove unused vhd files / LVM LVs: + +![Clean up](coalesce3.png) + +Reverting VM snapshots +====================== + +The XenAPI call `VM.revert` overwrites the VM metadata with the snapshot VM +metadata, deletes the current VDIs and replaces them with clones of the +snapshot VDIs. Note there is no "vdi_revert" in the SMAPI. + +Revert implementation details +----------------------------- + +This is the process by which we revert a VM to a snapshot. The +first thing to notice is that there is some logic that is called +from [message_forwarding.ml](https://github.com/xapi-project/xen-api/blob/ce6d3f276f0a56ef57ebcf10f45b0f478fd70322/ocaml/xapi/message_forwarding.ml#L1528), +which uses some low-level database magic to turn the current VM +record into one that looks like the snapshot object. 
We then go +to the rest of the implementation in [xapi_vm_snapshot.ml](https://github.com/xapi-project/xen-api/blob/ce6d3f276f0a56ef57ebcf10f45b0f478fd70322/ocaml/xapi/xapi_vm_snapshot.ml#L403). +First, +we shut down the VM if it is currently running. Then, we revert +all of the [VBDs, VIFs and VGPUs](https://github.com/xapi-project/xen-api/blob/ce6d3f276f0a56ef57ebcf10f45b0f478fd70322/ocaml/xapi/xapi_vm_snapshot.ml#L270). +To revert the VBDs, we need to deal with the VDIs underneath them. +In order to create space, the first thing we do is [delete all of +the VDIs](https://github.com/xapi-project/xen-api/blob/ce6d3f276f0a56ef57ebcf10f45b0f478fd70322/ocaml/xapi/xapi_vm_snapshot.ml#L287) currently attached via VBDs to the VM. +We then _clone_ the disks from the snapshot. Note that there is +no SMAPI operation 'revert' currently - we simply clone from +the snapshot VDI. It's important to note that cloning +creates a _new_ VDI object: it is not the one we started with, which has been deleted. diff --git a/doc/content/toolstack/features/snapshots/lun-trees.graffle b/doc/content/toolstack/features/snapshots/lun-trees.graffle new file mode 100644 index 0000000000..01403f9aed Binary files /dev/null and b/doc/content/toolstack/features/snapshots/lun-trees.graffle differ diff --git a/doc/content/toolstack/features/snapshots/lun-trees.png b/doc/content/toolstack/features/snapshots/lun-trees.png new file mode 100644 index 0000000000..0e22963864 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/lun-trees.png differ diff --git a/doc/content/toolstack/features/snapshots/luns.graffle b/doc/content/toolstack/features/snapshots/luns.graffle new file mode 100644 index 0000000000..9af3ba8a84 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/luns.graffle differ diff --git a/doc/content/toolstack/features/snapshots/luns.png b/doc/content/toolstack/features/snapshots/luns.png new file mode 100644 index 0000000000..cd2dcbdcf7 Binary files /dev/null and
b/doc/content/toolstack/features/snapshots/luns.png differ diff --git a/doc/content/toolstack/features/snapshots/vhd-trees.graffle b/doc/content/toolstack/features/snapshots/vhd-trees.graffle new file mode 100644 index 0000000000..f62ebce4eb Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vhd-trees.graffle differ diff --git a/doc/content/toolstack/features/snapshots/vhd-trees.png b/doc/content/toolstack/features/snapshots/vhd-trees.png new file mode 100644 index 0000000000..09f30619de Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vhd-trees.png differ diff --git a/doc/content/toolstack/features/snapshots/vm-snapshot.graffle b/doc/content/toolstack/features/snapshots/vm-snapshot.graffle new file mode 100644 index 0000000000..b311e83de2 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vm-snapshot.graffle differ diff --git a/doc/content/toolstack/features/snapshots/vm-snapshot.png b/doc/content/toolstack/features/snapshots/vm-snapshot.png new file mode 100644 index 0000000000..4b193fc830 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vm-snapshot.png differ diff --git a/doc/content/toolstack/features/snapshots/vm.graffle b/doc/content/toolstack/features/snapshots/vm.graffle new file mode 100644 index 0000000000..e8a47a6f88 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vm.graffle differ diff --git a/doc/content/toolstack/features/snapshots/vm.png b/doc/content/toolstack/features/snapshots/vm.png new file mode 100644 index 0000000000..c7c4f89837 Binary files /dev/null and b/doc/content/toolstack/features/snapshots/vm.png differ diff --git a/doc/content/xapi/_index.md b/doc/content/xapi/_index.md new file mode 100644 index 0000000000..4e185809f9 --- /dev/null +++ b/doc/content/xapi/_index.md @@ -0,0 +1,117 @@ ++++ +title = "Xapi" +weight = 20 ++++ + +Xapi is the [xapi-project](http://github.com/xapi-project) host and cluster manager. 
+ +Xapi is responsible for: + +- providing a stable interface (the XenAPI) +- allowing one client to manage multiple hosts +- hosting the "xe" CLI +- authenticating users and applying role-based access control +- locking resources (in particular disks) +- allowing storage to be managed through plugins +- planning and coping with host failures ("High Availability") +- storing VM and host configuration +- generating alerts +- managing software patching + +## Principles + +1. The XenAPI interface must remain backwards compatible, allowing older + clients to continue working +2. Xapi delegates all Xenstore/libxc/libxl access to Xenopsd, so Xapi could + be run in an unprivileged helper domain +3. Xapi delegates the low-level storage manipulation to SM plugins. +4. Xapi delegates setting up host networking to xcp-networkd. +5. Xapi delegates monitoring performance counters to xcp-rrdd. + +## Overview + +The following diagram shows the internals of Xapi: + +![Internals of xapi](xapi.png) + +The top of the diagram shows the XenAPI clients: XenCenter, XenOrchestra, +OpenStack and CloudStack using XenAPI and HTTP GET/PUT over ports 80 and 443 to +talk to xapi. These XenAPI (JSON-RPC or XML-RPC over HTTP POST) and HTTP +GET/PUT are always authenticated using either PAM (by default using the local +passwd and group files) or through Active Directory. + +The APIs are classified into categories: + +- coordinator-only: these are the majority of current APIs. The coordinator + should be called and relied upon to forward the call to the right place with + the right locks held. +- normally-local: these are performance special cases + such as disk import/export and console connection which are sent directly to + hosts which have the most efficient access to the data. 
+- emergency: these deal with scenarios where the coordinator is offline + +If the incoming API call should be resent to the coordinator then a XenAPI +`HOST_IS_SLAVE` error message containing the coordinator's IP is sent to the +client. + +Once past the initial checks, API calls enter the "message forwarding" layer which + +- locks resources (via the `current_operations` mechanism) +- decides which host should execute the request. + +If the request should run locally then a direct function call is used; +otherwise the message forwarding code makes a synchronous API call to a +specific other host. Note: Xapi currently employs a "thread per request" model +which causes one full POSIX thread to be created for every request. Even when a +request is forwarded the full thread persists, blocking for the result to +become available. + +If the XenAPI call is a VM lifecycle operation then it is converted into a +Xenopsd API call and forwarded over a Unix domain socket. Xapi and Xenopsd have +similar notions of cancellable asynchronous "tasks", so the current Xapi task +(all operations run in the context of a task) is bound to the Xenopsd task, so that +cancellation is passed through and progress updates are received.
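
The two steps of the "message forwarding" layer described above (lock the resource, then pick the host) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the record type, `with_operation`, `forward` and `remote_call` are hypothetical names invented for illustration and do not correspond to the real code in `message_forwarding.ml`.

```ocaml
(* Toy model of the message forwarding layer. Every name here is
   hypothetical; this is not the code used by xapi. *)
type host = string

type vm = {
  name : string;
  resident_on : host;
  (* stands in for the real current_operations field *)
  mutable current_operations : string list;
}

exception Operation_in_progress of string

(* Record the operation as in progress while [f] runs, then clear it,
   even if [f] raises. A second operation on a busy VM is rejected. *)
let with_operation vm op f =
  if vm.current_operations <> [] then raise (Operation_in_progress vm.name);
  vm.current_operations <- [op];
  Fun.protect ~finally:(fun () -> vm.current_operations <- []) f

(* Run [impl] directly when the VM lives on this host; otherwise block
   on [remote_call], mimicking the synchronous call to another host. *)
let forward ~localhost ~remote_call vm op impl =
  with_operation vm op (fun () ->
      if vm.resident_on = localhost then impl vm
      else remote_call vm.resident_on op vm)

let () =
  let remote_call host op vm =
    Printf.sprintf "forwarded %s(%s) to %s" op vm.name host
  in
  let impl vm = Printf.sprintf "ran locally on %s" vm.name in
  let vm1 = {name = "vm1"; resident_on = "host-a"; current_operations = []} in
  let vm2 = {name = "vm2"; resident_on = "host-b"; current_operations = []} in
  print_endline (forward ~localhost:"host-a" ~remote_call vm1 "clean_reboot" impl);
  print_endline (forward ~localhost:"host-a" ~remote_call vm2 "clean_reboot" impl)
```

In this model the calling thread simply blocks inside `remote_call`, which is the behaviour the "thread per request" note describes.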
+ +If the XenAPI call is a storage operation then the "storage access" layer + +- verifies that the storage objects are in the correct state (SR + attached/detached; VDI attached/activated read-only/read-write) +- invokes the relevant operation in the Storage Manager API (SMAPI) v2 + interface; +- depending on the type of SR: + - uses the SMAPIv2 to SMAPIv1 converter to generate the necessary command-line + to talk to the SMAPIv1 plugin (EXT, NFS, LVM etc) and to execute it + - uses the SMAPIv2 to SMAPIv3 converter daemon xapi-storage-script to + execute the necessary SMAPIv3 command (GFS2) +- persists the state of the storage objects (including the result of a + `VDI.attach` call) to persistent storage + +Internally the SMAPIv1 plugins use privileged access to the Xapi database to +directly set fields (e.g. VDI.virtual_size) that would be considered read-only +to other clients. The SMAPIv1 plugins also rely on Xapi for + +- knowledge of all hosts which may access the storage +- locking of disks within the resource pool +- safely executing code on other hosts via the "Xapi plugin" mechanism + +The Xapi database contains Host and VM metadata and is shared pool-wide. The +coordinator keeps a copy in memory, and all other nodes send remote queries to the +coordinator. The database associates each object with a generation count which +is used to implement the XenAPI `event.next` and `event.from` APIs. The +database is routinely flushed to disk asynchronously in XML format. If the +"redo-log" is enabled then all database writes are made synchronously as deltas +to a shared block device. Without the redo-log, recent updates may be lost if +Xapi is killed before a flush. + +High-Availability refers to planning for host failure, monitoring host liveness +and then following through on the plans. Xapi defers to an external host +liveness monitor called `xhad`.
When `xhad` confirms that a host has failed -- +and has been isolated from the storage -- then Xapi will restart any VMs which +have failed and which have been marked as "protected" by HA. Xapi can also +impose admission control to prevent the pool becoming too overloaded to cope +with `n` arbitrary host failures. + +The `xe` CLI is implemented in terms of the XenAPI, but for efficiency the +implementation is linked directly into Xapi. The `xe` program remotes its +command-line to Xapi, and Xapi sends back a series of simple commands (prompt +for input; print line; fetch file; exit etc). diff --git a/doc/content/xapi/guides/_index.md b/doc/content/xapi/guides/_index.md new file mode 100644 index 0000000000..146d051f26 --- /dev/null +++ b/doc/content/xapi/guides/_index.md @@ -0,0 +1,7 @@ ++++ +title = "Guides" +weight = 1000 ++++ +Helpful guides for xapi developers. + +{{% children depth="3" sort="Weight" %}} diff --git a/doc/content/xapi/guides/howtos/_index.md b/doc/content/xapi/guides/howtos/_index.md new file mode 100644 index 0000000000..6f5b5b01e3 --- /dev/null +++ b/doc/content/xapi/guides/howtos/_index.md @@ -0,0 +1,3 @@ ++++ +title = "How to add...." ++++ \ No newline at end of file diff --git a/doc/content/xapi/guides/howtos/add-api-extension.md b/doc/content/xapi/guides/howtos/add-api-extension.md new file mode 100644 index 0000000000..487c499764 --- /dev/null +++ b/doc/content/xapi/guides/howtos/add-api-extension.md @@ -0,0 +1,79 @@ ++++ +title = "Adding a XenAPI extension" ++++ + +A XenAPI extension is a new RPC which is implemented as a separate executable +(i.e. it is not part of `xapi`) +but which still benefits from `xapi` parameter type-checking, multi-language +stub generation, documentation generation, authentication etc. +An extension can be backported to previous versions by simply adding the +implementation, without having to recompile `xapi` itself. + +A XenAPI extension is in two parts: + +1. 
a declaration in the [xapi datamodel](https://github.com/xapi-project/xen-api/blob/07056d661bbf58b652e1da59d9adf67a778a5626/ocaml/idl/datamodel.ml#L5608). +This must use the `~forward_to:(Extension "filename")` parameter. The filename must be unique, and +should be the same as the XenAPI call name. +2. an implementation executable in the dom0 filesystem with path `/etc/xapi.d/extensions/filename` + +To define an extension +---------------------- + +First write the declaration in the datamodel. The act of specifying the +types and writing the documentation will help clarify the intended meaning +of the call. + +Second create a prototype of your implementation and put an executable file +in `/etc/xapi.d/extensions/filename`. The calling convention is: + +- the file must be executable +- `xapi` will parse the XMLRPC call arguments received over the network and check the `session_id` is + valid +- `xapi` will execute the named executable +- the XMLRPC call arguments will be sent to the executable on `stdin` and + `stdin` will be closed afterwards +- the executable will run and print an XMLRPC response on `stdout` +- `xapi` will read the response and return it to the client. + +See the [basic example](https://github.com/xapi-project/xen-api/blob/07056d661bbf58b652e1da59d9adf67a778a5626/scripts/extensions/Test.test). + +Third make a [pull request](https://github.com/xapi-project/xen-api/pulls) +containing only the datamodel definitions (it is not necessary to include the +prototype too). +This will attract review comments which will help you improve your API further. +Once the pull request is merged, the API call name and extension are officially +yours and you may use them on any xapi version which supports the extension mechanism. + +Packaging your extension +------------------------ + +Your extension `/etc/xapi.d/extensions/filename` (and dependencies) should be +packaged for your target distribution (for XenServer dom0 this would be a CentOS +RPM).
Once the package is unpacked on the target machine, the extension should +be immediately callable via the XenAPI, provided the `xapi` version supports +the extension mechanism. Note the `xapi` version does not need to know about +the specific extension in advance: it will always look in `/etc/xapi.d/extensions/` for +all RPC calls whose name it does not recognise. + +Limitations +----------- + +On type-checking + +- if the `xapi` version is new enough to know about your specific extension: + `xapi` will type-check the call arguments for you +- if the `xapi` version is too old to know about your specific extension: + the extension will still be callable but the arguments will not be type-checked. + +On access control + +- if the `xapi` version is new enough to know about your specific extension: + you can declare that a user must have a particular role (e.g. 'VM admin') +- if the `xapi` version is too old to know about your specific extension: + the extension will still be callable but the client must have the 'Pool admin' role. + +Since a `xapi` which knows about your specific extension is stricter than an older +`xapi`, it's a good idea to develop against the new `xapi` and then test older +`xapi` versions later. + + diff --git a/doc/content/xapi/guides/howtos/add-class.md b/doc/content/xapi/guides/howtos/add-class.md new file mode 100644 index 0000000000..9e4680059d --- /dev/null +++ b/doc/content/xapi/guides/howtos/add-class.md @@ -0,0 +1,452 @@ ++++ +title = "Adding a Class to the API" ++++ + +This document describes how to add a new class to the data model that +defines the Xen Server API. It complements two other documents that +describe how to extend an existing class: + +* [Adding a Field]({{< ref add-field.md >}}) +* [Adding a Function]({{< ref add-function.md >}}) + +As a running example, we will use the addition of a class that is part +of the design for the PVS Direct feature. PVS Direct introduces +proxies that serve VMs with disk images. 
This class was added via commit +[CP-16939] to Xen API. + +## Example: PVS_server + +In the world of Xen Server, each important concept like a virtual +machine, interface, or user is represented by a class in the data model. +A class defines methods and instance variables. At runtime, all class +instances are held in an in-memory database. For example, part of [PVS +Direct] is a class `PVS_server`, representing a resource that provides +block-level data for virtual machines. The design document defines it to +have the following important properties: + +### Fields + +* `(string set) addresses` (RO/constructor) IPv4 addresses of the + server. + +* `(int) first_port` (RO/constructor) First UDP port accepted by the + server. + +* `(int) last_port` (RO/constructor) Last UDP port accepted by the + server. + +* `(PVS_farm ref) farm` (RO/constructor) Link to the farm that this + server is included in. A PVS_server object must always have a valid + farm reference; the PVS_server will be automatically GC'ed by xapi + if the associated PVS_farm object is removed. + +* `(string) uuid` (RO/runtime) Unique identifier/object reference. + Allocated by the server. + +### Methods (or Functions) + +* `(PVS_server ref) introduce (string set addresses, int first_port, + int last_port, PVS_farm ref farm)` Introduce a new PVS server into + the farm. Allowed at any time, even when proxies are in use. The + proxies will be updated automatically. + +* `(void) forget (PVS_server ref self)` Remove a PVS server from the + farm. Allowed at any time, even when proxies are in use. The + proxies will be updated automatically.
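
To make the design above more concrete, the fields and methods can be modelled as a plain record and two functions over an in-memory table. This is a minimal sketch and not the code that xapi generates from the data model: the table `servers`, the uuid format and the `gc` helper are assumptions made purely for illustration.

```ocaml
(* Hypothetical in-memory model of PVS_server; not the generated xapi code. *)
type pvs_server = {
  uuid : string;            (* allocated at runtime *)
  addresses : string list;  (* IPv4 addresses of the server *)
  first_port : int;         (* first UDP port accepted *)
  last_port : int;          (* last UDP port accepted *)
  farm : string;            (* reference to the owning PVS_farm *)
}

(* Stand-in for the in-memory database, keyed by uuid. *)
let servers : (string, pvs_server) Hashtbl.t = Hashtbl.create 8

let counter = ref 0

(* Create a record, store it and return its reference (here: the uuid). *)
let introduce ~addresses ~first_port ~last_port ~farm =
  incr counter;
  let uuid = Printf.sprintf "pvs-server-%d" !counter in
  Hashtbl.replace servers uuid {uuid; addresses; first_port; last_port; farm};
  uuid

(* Remove a server from the table. *)
let forget ~self = Hashtbl.remove servers self

(* Mimic the GC rule: drop every server whose farm no longer exists. *)
let gc ~live_farms =
  Hashtbl.filter_map_inplace
    (fun _ s -> if List.mem s.farm live_farms then Some s else None)
    servers

let () =
  let s =
    introduce ~addresses:["10.0.0.1"] ~first_port:6910 ~last_port:6930
      ~farm:"farm-1"
  in
  let _ =
    introduce ~addresses:["10.0.0.2"] ~first_port:6910 ~last_port:6930
      ~farm:"farm-2"
  in
  gc ~live_farms:["farm-1"];
  Printf.printf "%d server(s) after gc\n" (Hashtbl.length servers);
  forget ~self:s;
  Printf.printf "%d server(s) after forget\n" (Hashtbl.length servers)
```

In the real implementation the constructor-only fields would be set once by `introduce` and exposed read-only afterwards, which is why the record above has no mutable fields.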
+ + +### Implementation Overview + +The implementation of a class is distributed over several files: + +* `ocaml/idl/datamodel.ml` -- central class definition +* `ocaml/idl/datamodel_types.ml` -- definition of releases +* `ocaml/xapi/cli_frontend.ml` -- declaration of CLI operations +* `ocaml/xapi/cli_operations.ml` -- implementation of CLI operations +* `ocaml/xapi/records.ml` -- getters and setters +* `ocaml/xapi/OMakefile` -- refers to `xapi_pvs_farm.ml` +* `ocaml/xapi/api_server.ml` -- refers to `xapi_pvs_farm.ml` +* `ocaml/xapi/message_forwarding.ml` +* `ocaml/xapi/xapi_pvs_farm.ml` -- implementation of methods, new file + +### Data Model + +The data model `ocaml/idl/datamodel.ml` defines the class. To keep the +name space tidy, most helper functions are grouped into an internal +module: + + (* datamodel.ml *) + + let schema_minor_vsn = 103 (* line 21 -- increment this *) + let _pvs_farm = "PVS_farm" (* line 153 *) + + module PVS_farm = struct (* line 8658 *) + let lifecycle = [Prototyped, rel_dundee_plus, ""] + + let introduce = call + ~name:"introduce" + ~doc:"Introduce new PVS farm" + ~result:(Ref _pvs_farm, "the new PVS farm") + ~params: + [ String,"name","name of the PVS farm" + ] + ~lifecycle + ~allowed_roles:_R_POOL_OP + () + + let forget = call + ~name:"forget" + ~doc:"Remove a farm's meta data" + ~params: + [ Ref _pvs_farm, "self", "this PVS farm" + ] + ~errs:[ + Api_errors.pvs_farm_contains_running_proxies; + Api_errors.pvs_farm_contains_servers; + ] + ~lifecycle + ~allowed_roles:_R_POOL_OP + () + + + let set_name = call + ~name:"set_name" + ~doc:"Update the name of the PVS farm" + ~params: + [ Ref _pvs_farm, "self", "this PVS farm" + ; String, "value", "name to be used" + ] + ~lifecycle + ~allowed_roles:_R_POOL_OP + () + + let add_cache_storage = call + ~name:"add_cache_storage" + ~doc:"Add a cache SR for the proxies on the farm" + ~params: + [ Ref _pvs_farm, "self", "this PVS farm" + ; Ref _sr, "value", "SR to be used" + ] + ~lifecycle + 
~allowed_roles:_R_POOL_OP + () + + let remove_cache_storage = call + ~name:"remove_cache_storage" + ~doc:"Remove a cache SR for the proxies on the farm" + ~params: + [ Ref _pvs_farm, "self", "this PVS farm" + ; Ref _sr, "value", "SR to be removed" + ] + ~lifecycle + ~allowed_roles:_R_POOL_OP + () + + let obj = + let null_str = Some (VString "") in + let null_set = Some (VSet []) in + create_obj (* <---- creates class *) + ~name: _pvs_farm + ~descr:"machines serving blocks of data for provisioning VMs" + ~doccomments:[] + ~gen_constructor_destructor:false + ~gen_events:true + ~in_db:true + ~lifecycle + ~persist:PersistEverything + ~in_oss_since:None + ~messages_default_allowed_roles:_R_POOL_OP + ~contents: + [ uid _pvs_farm ~lifecycle + + ; field ~qualifier:StaticRO ~lifecycle + ~ty:String "name" ~default_value:null_str + "Name of the PVS farm. Must match name configured in PVS" + + ; field ~qualifier:DynamicRO ~lifecycle + ~ty:(Set (Ref _sr)) "cache_storage" ~default_value:null_set + ~ignore_foreign_key:true + "The SR used by PVS proxy for the cache" + + ; field ~qualifier:DynamicRO ~lifecycle + ~ty:(Set (Ref _pvs_server)) "servers" + "The set of PVS servers in the farm" + + + ; field ~qualifier:DynamicRO ~lifecycle + ~ty:(Set (Ref _pvs_proxy)) "proxies" + "The set of proxies associated with the farm" + ] + ~messages: + [ introduce + ; forget + ; set_name + ; add_cache_storage + ; remove_cache_storage + ] + () + end + let pvs_farm = PVS_farm.obj + +The class is defined by a call to `create_obj` and it defines the +fields and messages (methods) belonging to the class. Each field has a +name, a type, and some meta information. Likewise, each message +(or method) is created by `call` that describes its parameters. + +The `PVS_farm` has additional getter and setter methods for accessing +its fields. These are not declared here as part of the messages +but are automatically generated. 
+ +To make sure the new class is actually used, it is important to enter it +into two lists: + + (* datamodel.ml *) + let all_system = (* line 8917 *) + [ + ... + vgpu_type; + pvs_farm; + ... + ] + + let expose_get_all_messages_for = [ (* line 9097 *) + ... + _pvs_farm; + _pvs_server; + _pvs_proxy; + +When a field refers to another object that itself refers back to it, +these two need to be entered into the `all_relations` list. For example, +`_pvs_server` refers to a `_pvs_farm` value via `"farm"`, which, in +turn, refers to the `_pvs_server` value via its `"servers"` field. + + let all_relations = + [ + (* ... *) + (_sr, "introduced_by"), (_dr_task, "introduced_SRs"); + (_pvs_server, "farm"), (_pvs_farm, "servers"); + (_pvs_proxy, "farm"), (_pvs_farm, "proxies"); + ] + + +## CLI Conventions + +The CLI provides access to objects from the command line. The following +conventions exist for naming fields: + +* A field in the data model uses an underscore (`_`) but a hyphen (`-`) + in the CLI: what is `cache_storage` in the data model becomes + `cache-storage` in the CLI. + +* When a field contains a reference or multiple, like `proxies`, it + becomes `proxy-uuids` in the CLI because references are always + referred to by their UUID. + +## CLI Getters and Setters + +All fields can be read from the CLI and some fields can also be set via +the CLI. These getters and setters are mostly generated automatically +and need to be connected to the CLI through a function in +`ocaml/xapi/records.ml`. 
Note that field names here use the +naming convention for the CLI: + + (* ocaml/xapi/records.ml *) + let pvs_farm_record rpc session_id pvs_farm = + let _ref = ref pvs_farm in + let empty_record = + ToGet (fun () -> Client.PVS_farm.get_record rpc session_id !_ref) in + let record = ref empty_record in + let x () = lzy_get record in + { setref = (fun r -> _ref := r ; record := empty_record) + ; setrefrec = (fun (a,b) -> _ref := a; record := Got b) + ; record = x + ; getref = (fun () -> !_ref) + ; fields= + [ make_field ~name:"uuid" + ~get:(fun () -> (x ()).API.pVS_farm_uuid) () + ; make_field ~name:"name" + ~get:(fun () -> (x ()).API.pVS_farm_name) + ~set:(fun name -> + Client.PVS_farm.set_name rpc session_id !_ref name) () + ; make_field ~name:"cache-storage" + ~get:(fun () -> (x ()).API.pVS_farm_cache_storage + |> List.map get_uuid_from_ref |> String.concat "; ") + ~add_to_set:(fun sr_uuid -> + let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in + Client.PVS_farm.add_cache_storage rpc session_id !_ref sr) + ~remove_from_set:(fun sr_uuid -> + let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in + Client.PVS_farm.remove_cache_storage rpc session_id !_ref sr) + () + ; make_field ~name:"server-uuids" + ~get:(fun () -> (x ()).API.pVS_farm_servers + |> List.map get_uuid_from_ref |> String.concat "; ") + ~get_set:(fun () -> (x ()).API.pVS_farm_servers + |> List.map get_uuid_from_ref) + () + ; make_field ~name:"proxy-uuids" + ~get:(fun () -> (x ()).API.pVS_farm_proxies + |> List.map get_uuid_from_ref |> String.concat "; ") + ~get_set:(fun () -> (x ()).API.pVS_farm_proxies + |> List.map get_uuid_from_ref) + () + ] + } + +## CLI Interface to Methods + +Methods accessible from the CLI are declared in +`ocaml/xapi/cli_frontend.ml`. Each declaration refers to the real +implementation of the method, like `Cli_operations.PVS_farm.introduce`: + + (* cli_frontend.ml *) + let rec cmdtable_data : (string*cmd_spec) list = + (* ...
*) + "pvs-farm-introduce", + { + reqd=["name"]; + optn=[]; + help="Introduce new PVS farm"; + implementation=No_fd Cli_operations.PVS_farm.introduce; + flags=[]; + }; + "pvs-farm-forget", + { + reqd=["uuid"]; + optn=[]; + help="Forget a PVS farm"; + implementation=No_fd Cli_operations.PVS_farm.forget; + flags=[]; + }; + +## CLI Implementation of Methods + +Each CLI operation that is not a getter or setter has an implementation +in `cli_operations.ml` which is implemented in terms of the real +implementation: + + (* cli_operations.ml *) + module PVS_farm = struct + let introduce printer rpc session_id params = + let name = List.assoc "name" params in + let ref = Client.PVS_farm.introduce ~rpc ~session_id ~name in + let uuid = Client.PVS_farm.get_uuid rpc session_id ref in + printer (Cli_printer.PList [uuid]) + + let forget printer rpc session_id params = + let uuid = List.assoc "uuid" params in + let ref = Client.PVS_farm.get_by_uuid ~rpc ~session_id ~uuid in + Client.PVS_farm.forget rpc session_id ref + end + +Fields that should show up in the CLI interface by default are declared +in the `gen_cmds` value: + + (* cli_operations.ml *) + let gen_cmds rpc session_id = + let mk = make_param_funs in + List.concat + [ (*...*) + ; Client.Pool.(mk get_all get_all_records_where + get_by_uuid pool_record "pool" [] + ["uuid";"name-label";"name-description";"master" + ;"default-SR"] rpc session_id) + ; Client.PVS_farm.(mk get_all get_all_records_where + get_by_uuid pvs_farm_record "pvs-farm" [] + ["uuid";"name";"cache-storage";"server-uuids"] rpc session_id) + + +## Error messages + +Error messages used by an implementation are introduced in two files: + + (* ocaml/xapi-consts/api_errors.ml *) + let pvs_farm_contains_running_proxies = "PVS_FARM_CONTAINS_RUNNING_PROXIES" + let pvs_farm_contains_servers = "PVS_FARM_CONTAINS_SERVERS" + let pvs_farm_sr_already_added = "PVS_FARM_SR_ALREADY_ADDED" + let pvs_farm_sr_is_in_use = "PVS_FARM_SR_IS_IN_USE" + let sr_not_in_pvs_farm = 
"SR_NOT_IN_PVS_FARM" + let pvs_farm_cant_set_name = "PVS_FARM_CANT_SET_NAME" + + (* ocaml/idl/datamodel.ml *) + (* PVS errors *) + error Api_errors.pvs_farm_contains_running_proxies ["proxies"] + ~doc:"The PVS farm contains running proxies and cannot be forgotten." (); + + error Api_errors.pvs_farm_contains_servers ["servers"] + ~doc:"The PVS farm contains servers and cannot be forgotten." + (); + + error Api_errors.pvs_farm_sr_already_added ["farm"; "SR"] + ~doc:"Trying to add a cache SR that is already associated with the farm" + (); + + error Api_errors.sr_not_in_pvs_farm ["farm"; "SR"] + ~doc:"The SR is not associated with the farm." + (); + + error Api_errors.pvs_farm_sr_is_in_use ["farm"; "SR"] + ~doc:"The SR is in use by the farm and cannot be removed." + (); + + error Api_errors.pvs_farm_cant_set_name ["farm"] + ~doc:"The name of the farm can't be set while proxies are active." + () + +## Method Implementation + +The implementation of methods lives in a module in `ocaml/xapi`: + + (* ocaml/xapi/api_server.ml *) + module PVS_farm = Xapi_pvs_farm + +The file below is typically a new file and needs to be added to +`ocaml/xapi/OMakefile`. + + (* ocaml/xapi/xapi_pvs_farm.ml *) + module D = Debug.Make(struct let name = "xapi_pvs_farm" end) + module E = Api_errors + + let api_error msg xs = raise (E.Server_error (msg, xs)) + + let introduce ~__context ~name = + let pvs_farm = Ref.make () in + let uuid = Uuid.to_string (Uuid.make_uuid ()) in + Db.PVS_farm.create ~__context + ~ref:pvs_farm ~uuid ~name ~cache_storage:[]; + pvs_farm + + (* ... *) + + +Messages received on a slave host may or may not be executed there. 
In +the simple case, each method executes locally: + + (* ocaml/xapi/message_forwarding.ml *) + module PVS_farm = struct + let introduce ~__context ~name = + info "PVS_farm.introduce %s" name; + Local.PVS_farm.introduce ~__context ~name + + let forget ~__context ~self = + info "PVS_farm.forget"; + Local.PVS_farm.forget ~__context ~self + + let set_name ~__context ~self ~value = + info "PVS_farm.set_name %s" value; + Local.PVS_farm.set_name ~__context ~self ~value + + let add_cache_storage ~__context ~self ~value = + info "PVS_farm.add_cache_storage"; + Local.PVS_farm.add_cache_storage ~__context ~self ~value + + let remove_cache_storage ~__context ~self ~value = + info "PVS_farm.remove_cache_storage"; + Local.PVS_farm.remove_cache_storage ~__context ~self ~value + end + + + +[CP-16939]: https://github.com/xenserver/xen-api/commit/78fe558dad19458a89519fe196069317d57eac58 +[Adding a Field]: add-field.html +[Adding a Function]: add-function.html diff --git a/doc/content/xapi/guides/howtos/add-field.md b/doc/content/xapi/guides/howtos/add-field.md new file mode 100644 index 0000000000..f8f69c5c43 --- /dev/null +++ b/doc/content/xapi/guides/howtos/add-field.md @@ -0,0 +1,156 @@ ++++ +title = "Adding a field to the API" ++++ +This page describes how to add a field to XenAPI. A field is a parameter of a class that can be used in functions and read from the API. + +Bumping the database schema version +----------------------------------- +Whenever a field is added to or removed from the API, the database schema version needs +to be increased. XAPI relies on this version to +detect that an automatic database upgrade is necessary or to find out that the +new schema is incompatible with the existing database. If the schema version is +not bumped, XAPI will start failing in unpredictable ways. Note that bumping +the version is not necessary when adding functions, only when adding fields.
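
The decision that the schema version drives can be sketched as follows. This is a simplified illustration, not the actual xapi upgrade code: the `decision` type and the `decide` function are hypothetical names, and treating every version other than the current one and the last release as incompatible is an assumption. The version numbers are borrowed from the ActiveDirectory example patch on this page.

```ocaml
(* Simplified sketch of the upgrade decision; not the actual xapi code. *)
type version = int * int (* (major, minor) *)

type decision =
  | Up_to_date   (* database already uses the current schema *)
  | Upgrade      (* database matches the last release: upgrade it *)
  | Incompatible (* assumption: anything else cannot be upgraded *)

let decide ~(db : version) ~(current : version) ~(last_release : version) =
  if db = current then Up_to_date
  else if db = last_release then Upgrade
  else Incompatible

let to_string = function
  | Up_to_date -> "up to date"
  | Upgrade -> "attempting upgrade"
  | Incompatible -> "incompatible"

let () =
  let current = (5, 56) and last_release = (5, 55) in
  List.iter
    (fun db ->
      Printf.printf "db %d.%d: %s\n" (fst db) (snd db)
        (to_string (decide ~db ~current ~last_release)))
    [(5, 56); (5, 55); (5, 35)]
```

The sketch also shows why forgetting to bump the version is dangerous: a database written with the old field set would compare equal to the current version and no upgrade would be attempted.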
+ +The current version number is kept at the top of the file +`ocaml/idl/datamodel_common.ml` in the variables `schema_major_vsn` and +`schema_minor_vsn`, of which only the latter should be incremented (the major +version only exists for historical reasons). When moving to a new XenServer +release, also update the variable `last_release_schema_minor_vsn` to the schema +version of the last release. To keep track of the schema versions of recent +XenServer releases, the file contains variables for these, such as +`miami_release_schema_minor_vsn`. After starting a new version of Xapi on an +existing server, the database is automatically upgraded if the schema version +of the existing database matches the value of `last_release_schema_*_vsn` in the +new Xapi. + +As an example, the patch below shows how the schema version was bumped when the +new API fields used for ActiveDirectory integration were added: + + --- a/ocaml/idl/datamodel.ml Tue Nov 11 16:17:48 2008 +0000 + +++ b/ocaml/idl/datamodel.ml Tue Nov 11 15:53:29 2008 +0000 + @@ -15,17 +15,20 @@ open Datamodel_types + open Datamodel_types + + (* IMPORTANT: Please bump schema vsn if you change/add/remove a _field_. + You do not have to dump vsn if you change/add/remove a message *) + + let schema_major_vsn = 5 + -let schema_minor_vsn = 55 + +let schema_minor_vsn = 56 + + (* Historical schema versions just in case this is useful later *) + let rio_schema_major_vsn = 5 + let rio_schema_minor_vsn = 19 + + +let miami_release_schema_major_vsn = 5 + +let miami_release_schema_minor_vsn = 35 + + + (* the schema vsn of the last release: used to determine whether we can + upgrade or not.. *) + let last_release_schema_major_vsn = 5 + -let last_release_schema_minor_vsn = 35 + +let last_release_schema_minor_vsn = 55 + +### Setting the schema hash + +The file `ocaml/idl/schematest.ml` contains the value `last_known_schema_hash`. It needs to be updated to the new hash after the schema version has been bumped.
Get the new hash by running `make test`: the correct hash is reported in the error message. + +Adding the new field to some existing class +------------------------------------------- + +### ocaml/idl/datamodel.ml + +Add a new "field" line to the class in the file `ocaml/idl/datamodel.ml` or `ocaml/idl/datamodel_[class].ml`. The new field might require +a suitable default value. This default value is used in case the user does not +provide a value for the field. + +A field has a number of parameters: + +- The lifecycle parameter, which shows how the field has evolved over time. +- The qualifier parameter, which controls access to the field. The following + values are possible: + +| Value | Meaning | +| --------- | --------------------------------------------- | +| StaticRO | Field is set statically at install-time. | +| DynamicRO | Field is computed dynamically at run time. | +| RW | Field is read/write. | + +- The ty parameter for the type of the field. +- The default_value parameter. +- The name of the field. +- A documentation string. + +Example of a field in the pool class: + + field ~lifecycle:[Published, rel_orlando, "Controls whether HA is enabled"] + ~qualifier:DynamicRO ~ty:Bool + ~default_value:(Some (VBool false)) "ha_enabled" "true if HA is enabled on the pool, false otherwise"; + +See datamodel_types.ml for information about other parameters. + +## Changing Constructors + +Adding a field changes the constructors for the class (the functions +Db.*.create), and therefore any references to these in the code need to be +updated. In the example, the argument ~ha_enabled:false should be added to any +call to Db.Pool.create. + +Examples of these calls can be found in `ocaml/tests/common/test_common.ml` and `ocaml/xapi/xapi_[class].ml`. + +### CLI Records + +If you want this field to show up in the CLI (which you probably do), you will +also need to modify the Records module, in the file +`ocaml/xapi-cli-server/records.ml`.
Find the record function for the class which +you have modified, add a new entry to the fields list using make_field. This type can be found in the same file. + +The only required parameters are name and get (and unit, of course ). +If your field is a map or set, then you will need to pass in get_{map,set}, and +optionally set_{map,set}, if it is a RW field. The hidden parameter is useful +if you don't want this field to show up in a *_params_list call. As an example, +here is a field that we've just added to the SM class: + + make_field ~name:"versioned-capabilities" + ~get:(fun () -> Record_util.s2sm_to_string "; " (x ()).API.sM_versioned_capabilities) + ~get_map:(fun () -> (x ()).API.sM_versioned_capabilities) + ~hidden:true (); + +Testing +------- +The new fields can be tested by copying the newly compiled xapi binary to a +test box. After the new xapi service is started, the file +*/var/log/xensource.log* in the test box should contain a few lines reporting the +successful upgrade of the metadata schema in the test box: + + [...|xapi] Db has schema major_vsn=5, minor_vsn=57 (current is 5 58) (last is 5 57) + [...|xapi] Database schema version is that of last release: attempting upgrade + [...|sql] attempting to restore database from /var/xapi/state.db + [...|sql] finished parsing xml + [...|sql] writing db as xml to file '/var/xapi/state.db'. + [...|xapi] Database upgrade complete, restarting to use new db + +Making this field accessible as a CLI attribute +----------------------------------------------- +XenAPI functions to get and set the value of the new field are generated +automatically. It requires some extra work, however, to enable such operations +in the CLI. + +The CLI has commands such as host-param-list and host-param-get. To make a new +field accessible by these commands, the file `xapi-cli-server/records.ml` needs to +be edited. 
For the pool.ha-enabled field, the pool_record function in this file +contains the following (note the convention to replace underscores by hyphens +in the CLI): + + let pool_record rpc session_id pool = + ... + [ + ... + make_field ~name:"ha-enabled" ~get:(fun () -> string_of_bool (x ()).API.pool_ha_enabled) (); + ... + ]} + +NB: the ~get parameter must return a string so include a relevant function to convert the type of the field into a string i.e. `string_of_bool` + +See `xapi-cli-server/records.ml` for examples of handling field types other than Bool. diff --git a/doc/content/xapi/guides/howtos/add-function.md b/doc/content/xapi/guides/howtos/add-function.md new file mode 100644 index 0000000000..07ef3cebfd --- /dev/null +++ b/doc/content/xapi/guides/howtos/add-function.md @@ -0,0 +1,303 @@ ++++ +title = "Adding a function to the API" ++++ +This page describes how to add a function to XenAPI. + +Add message to API +------------------ +The file `idl/datamodel.ml` is a description of the API, from which the +marshalling and handler code is generated. + +In this file, the `create_obj` function is used to define a class which may +contain fields and support operations (known as "messages"). For example, the +identifier host is defined using create_obj to encapsulate the operations which +can be performed on a host. + +In order to add a function to the API, we need to add a message to an existing +class. This entails adding a function in `idl/datamodel.ml` or one of the other datamodel files to describe the new +message and adding it to the class's list of messages. In this example, we are adding to `idl/datamodel_host.ml`. 
+ +The function to describe the new message will look something like the following: + + let host_price_of = call ~flags:[`Session] + ~name:"price_of" + ~in_oss_since:None + ~in_product_since:rel_orlando + ~params:[(Ref _host, "host", "The host containing the price information"); + (String, "item", "The item whose price is queried")] + ~result:(Float, "The price of the item") + ~doc:"Returns the price of a named item." + ~allowed_roles:_R_POOL_OP + () + +By convention, the name of the function is formed from the name of the class +and the name of the message: host and price_of, in the example. An entry for +host_price_of is added to the messages of the host class: + + let host = + create_obj ... + ~messages: [... + host_price_of; + ] + ... + +The parameters passed to call are all optional (except ~name and ~in_product_since). + +- The ~flags parameter is used to set conditions for the use of the message. + For example, `Session is used to indicate that the call must be made in the + presence of an existing session. + +- The value of the ~in_product_since parameter is a string, taken from + `idl/datamodel_types.ml`, that indicates the XenServer release in which this + message was first introduced. + +- The ~params parameter describes a list of the formal parameters of the message. + Each parameter is described by a triple. The first component of the triple is + the type (from type ty in `idl/datamodel_types.ml`); the second is the name + of the parameter, and the third is a human-readable description of the parameter. + The first triple in the list is conventionally the instance of the class on + which the message will operate. In the example, this is a reference to the host. + +- Similarly, the ~result parameter describes the message's return type, although + it is a single (type, description) pair rather than a list of values. If no + ~result is specified, the default is unit. + +- The ~doc parameter describes what the message is doing.
+ +- The bool ~hide_from_docs parameter prevents the message from being included in the documentation when generated. + +- The bool ~pool_internal parameter is used to indicate if the message should be callable by external systems or only internal hosts. + +- The ~errs parameter is a list of possible exceptions that the message can raise. + +- The parameter ~lifecycle takes in an array of (Status, version, doc) to indicate the lifecycle of the message type. This takes over from ~in_oss_since which indicated the release that the message type was introduced. NOTE: Leave this parameter empty, it will be populated on build. + +- The ~allowed_roles parameter is used for access control (see below). + + +Compiling `xen-api.(hg|git)` will cause the code corresponding to this message +to be generated and output in `ocaml/xapi/server.ml`. In the example above, a +section handling an incoming call host.price_of appeared in `ocaml/xapi/server.ml`. +However, after this was generated, the rest of the build failed because this +call expects a price_of function in the Host object. + +Expected values in parameter ~in_product_since +---------------------------------------------- + +In the example above, the value of the parameter ~in_product_since informs that +the message host_price_of was added during the rel_orlando release cycle. If a +new release cycle is required, then it needs to be added in the file +`idl/datamodel_types.ml`. The patch below shows how the new rel_george release +identifier was added. Any class, message, etc. added during the rel_george +release cycle should contain ~in_product_since:rel_george entries. 
+(obs: the release and upgrade infrastructure can handle only one new +`rel_*` identifier -- in this case, rel_george -- in each release) + + --- a/ocaml/idl/datamodel_types.ml Tue Nov 11 15:17:48 2008 +0000 + +++ b/ocaml/idl/datamodel_types.ml Tue Nov 11 15:53:29 2008 +0000 + @@ -27,14 +27,13 @@ + (* useful constants for product vsn tracking *) + let oss_since_303 = Some "3.0.3" + +let rel_george = "george" + let rel_orlando = "orlando" + let rel_orlando_update_1 = "orlando-update-1" + let rel_symc = "symc" + let rel_miami = "miami" + let rel_rio = "rio" + -let release_order = [engp:rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1] + +let release_order = [engp:rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1; rel_george] + +Update expose_get_all_messages_for list +--------------------------------------- + +If you are adding a new class, do not forget to add your new class \_name to +the expose_get_all_messages_for list, at the bottom of datamodel.ml, in +order to have automatically generated get_all and get_all_records functions +attached to it. + +Update the RBAC field containing the roles expected to use the new API call +--------------------------------------------------------------------------- + +After the RBAC integration, Xapi provides by default a set of static roles +associated to the most common subject tasks. + +The api calls associated with each role are defined by a new `~allowed_roles` +parameter in each api call, which specifies the list of static roles that +should be able to execute the call. 
Each role in this list must be one of +the following names, defined in `datamodel.ml`: + +- role_pool_admin +- role_pool_operator +- role_vm_power_admin +- role_vm_admin +- role_vm_operator +- role_read_only + +So, for instance, + + ~allowed_roles:[role_pool_admin,role_pool_operator] (* this is not the recommended usage, see example below *) + +would be a valid list (though it is not the recommended way of using +allowed_roles, see below), meaning that subjects belonging to either +role_pool_admin or role_pool_operator can execute the api call. + +The RBAC requirements define a policy where the roles in the list above are +supposed to be totally-ordered by the set of api-calls associated with each of +them. That means that any api-call allowed to role_pool_operator should also be +in role_pool_admin; any api-call allowed to role_vm_power_admin should also be +in role_pool_operator and also in role_pool_admin; and so on. Datamodel.ml +provides shortcuts for expressing this totally-ordered role policy +for each api-call: + +- \_R_POOL_ADMIN, equivalent to [role_pool_admin] +- \_R_POOL_OP, equivalent to [role_pool_admin,role_pool_operator] +- \_R_VM_POWER_ADMIN, equivalent to [role_pool_admin,role_pool_operator,role_vm_power_admin] +- \_R_VM_ADMIN, equivalent to [role_pool_admin,role_pool_operator,role_vm_power_admin,role_vm_admin] +- \_R_VM_OP, equivalent to [role_pool_admin,role_pool_operator,role_vm_power_admin,role_vm_admin,role_vm_op] +- \_R_READ_ONLY, equivalent to [role_pool_admin,role_pool_operator,role_vm_power_admin,role_vm_admin,role_vm_op,role_read_only] + +The `~allowed_roles` parameter should use one of the shortcuts in the list above, +instead of directly using a list of roles, because the shortcuts above make sure +that the roles in the list are in a total order regarding the api-calls +permission sets. Creating an api-call with e.g.
+allowed_roles:[role_pool_admin,role_vm_admin] would be wrong, because that +would mean that a pool_operator cannot execute the api-call that a vm_admin can, +breaking the total-order policy expected in the RBAC 1.0 implementation. +In the future, this requirement might be relaxed. + +So, the example above should instead be used as: + + ~allowed_roles:_R_POOL_OP (* recommended usage via pre-defined totally-ordered role lists *) + +and so on. + +How to determine the correct role of a new api-call: +---------------------------------------------------- + +- if only xapi should execute the api-call, i.e. it is an internal call: _R_POOL_ADMIN +- if it is related to subject, role, external-authentication: _R_POOL_ADMIN +- if it is related to accessing Dom0 (via console, ssh, whatever): _R_POOL_ADMIN +- if it is related to the pool object: _R_POOL_OP +- if it is related to the host object, licenses, backups, physical devices: _R_POOL_OP +- if it is related to managing VM memory, snapshot/checkpoint, migration: _R_VM_POWER_ADMIN +- if it is related to creating, destroying, cloning, importing/exporting VMs: _R_VM_ADMIN +- if it is related to starting, stopping, pausing etc VMs or otherwise accessing/manipulating VMs: _R_VM_OP +- if it is related to being able to log in, manipulate own tasks and read values only: _R_READ_ONLY + +Update message forwarding +------------------------- + +The "message forwarding" layer describes the policy of whether an incoming API +call should be forwarded to another host (such as another member of the pool) +or processed on the host which receives the call. This policy may be +non-trivial to describe and so cannot be auto-generated from the data model. + +In `xapi/message_forwarding.ml`, add a function to the relevant module to +describe this policy.
In the running example, we add the following function to +the Host module: + + let price_of ~__context ~host ~item = + info "Host.price_of for item %s" item; + let local_fn = Local.Host.price_of ~host ~item in + do_op_on ~local_fn ~__context ~host + (fun session_id rpc -> Client.Host.price_of ~rpc ~session_id ~host ~item) + +After the ~__context parameter, the parameters of this new function should +match the parameters we specified for the message. In this case, that is the +host and the item to query the price of. + +The do_op_on function takes a function to execute locally and a function to +execute remotely and performs one of these operations depending on whether the +given host is the local host. + +The local function references Local.Host.price_of, which is a function we will +write in the next step. + +Implement the function +---------------------- + +Now we write the function to perform the logic behind the new API call. +For a host-based call, this will reside in `xapi/xapi_host.ml`. For other +classes, other files with similar names are used. + +We add the following function to `xapi/xapi_host.ml`: + + let price_of ~__context ~host ~item = + if item = "fish" then 3.14 else 0.00 + +We also need to add the function to the interface `xapi/xapi_host.mli`: + + val price_of : + __context:Context.t -> host:API.ref_host -> item:string -> float + +Congratulations, you've added a function to the API! + +Add the operation to the CLI +---------------------------- + +Edit `xapi-cli-server/cli_frontend.ml`. Add a block to the definition of cmdtable_data as +in the following example: + + "host-price-of", + { + reqd=["host-uuid"; "item"]; + optn=[]; + help="Find out the price of an item on a certain host."; + implementation= No_fd Cli_operations.host_price_of; + flags=[]; + }; + +Include here the following: + +- The names of required (*reqd*) and optional (*optn*) parameters. +- A description to be displayed when calling *xe help \* in the help field. 
+- The *implementation* should use *With_fd* if any communication with the + client is necessary (for example, showing the user a warning, sending the + contents of a file, etc.) Otherwise, *No_fd* can be used as above. +- The *flags* field can be used to set special options: + + - *Vm_selectors*: adds a "vm" parameter for the name of a VM (rather than a UUID) + - *Host_selectors*: adds a "host" parameter for the name of a host (rather than a UUID) + - *Standard*: includes the command in the list of common commands displayed by *xe help* + - *Neverforward:* + - *Hidden:* + - *Deprecated of string list:* + +Now we must implement `Cli_operations.host_price_of`. This is done in +`xapi-cli-server/cli_operations.ml`. This function typically extracts the parameters and +forwards them to the internal implementation of the function. Other arbitrary +code is permitted. For example: + + let host_price_of printer rpc session_id params = + let host = Client.Host.get_by_uuid rpc session_id (List.assoc "host-uuid" params) in + let item = List.assoc "item" params in + let price = string_of_float (Client.Host.price_of ~rpc ~session_id ~host ~item) in + printer (Cli_printer.PList [price]) + +Tab Completion in the CLI +------------------------- + +The CLI features tab completion for many of its commands' parameters. +Tab completion is implemented in the file `ocaml/xe-cli/bash-completion`, which +is installed on the host as `/etc/bash_completion.d/cli`, and is done on a +parameter-name rather than on a command-name basis. The main portion of the +bash-completion file is a case statement that contains a section for each of +the parameters that benefit from completion. There is also an entry that +catches all parameter names ending at -uuid, and performs an automatic lookup +of suitable UUIDs. The host-uuid parameter of our new host-price-of command +therefore automatically gains completion capabilities. 
+ +Executing the CLI operation +--------------------------- + +Recompile `xapi` with the changes described above and install it on a test machine. + +Execute the following command to see if the function exists: + + xe help host-price-of + +Invoke the function itself with the following command: + + xe host-price-of host-uuid= item=fish + +and you should find out the price of fish. diff --git a/doc/content/xapi/memory/index.md b/doc/content/xapi/memory/index.md new file mode 100644 index 0000000000..c36f8b953b --- /dev/null +++ b/doc/content/xapi/memory/index.md @@ -0,0 +1,66 @@ ++++ +title = "Host memory accounting" +menuTitle = "Memory" ++++ + +Memory is used for many things: + +- the hypervisor code: this is the Xen executable itself +- the hypervisor heap: this is needed for per-domain structures and per-vCPU + structures +- the crash kernel: this is needed to collect information after a host crash +- domain RAM: this is the memory the VM believes it has +- shadow memory: for HVM guests running on hosts without hardware assisted + paging (HAP) Xen uses shadow to optimise page table updates. For all guests + shadow is used during live migration for tracking the memory transfer. +- video RAM for the virtual graphics card + +Some of these are constants (e.g. hypervisor code) while some depend on the VM +configuration (e.g. domain RAM). Xapi calls the constants "host overhead" and +the variables due to VM configuration as "VM overhead". There is no low-level +API to query this information, therefore xapi will sample the host overheads +at system boot time and model the per-VM overheads. + +Host overhead +------------- + +The host overhead is not managed by xapi, instead it is sampled. After the host +boots and before any VMs start, xapi asks Xen how much memory the host has in +total, and how much memory is currently free. Xapi subtracts the free from the +total and stores this as the host overhead. 
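
The subtraction described above can be sketched as follows. This is only an illustration: the function name `host_overhead` and its labelled arguments are hypothetical and are not taken from xapi's source, and the two inputs are assumed to have already been obtained from Xen as byte counts.

```ocaml
(* Hypothetical helper, not xapi's actual code:
   host overhead = total host memory - free memory,
   both sampled after boot and before any VM has started. *)
let host_overhead ~(total_bytes : int64) ~(free_bytes : int64) : int64 =
  Int64.sub total_bytes free_bytes

let () =
  (* Illustrative numbers only: a 16 GiB host reporting 15 GiB free. *)
  let gib = Int64.shift_left 1L 30 in
  let mib = Int64.shift_left 1L 20 in
  let overhead =
    host_overhead
      ~total_bytes:(Int64.mul 16L gib)
      ~free_bytes:(Int64.mul 15L gib)
  in
  Printf.printf "host overhead: %Ld MiB\n" (Int64.div overhead mib)
```

Because the value is sampled once rather than tracked, anything that allocates host memory before the sample is taken ends up counted as overhead.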
+ +VM overhead +------------ + +The inputs to the model are + +- `VM.memory_static_max`: the maximum amount of RAM the domain will be able to use +- `VM.HVM_shadow_multiplier`: allows the shadow memory to be increased +- `VM.VCPUs_max`: the maximum number of vCPUs the domain will be able to use + +First the shadow memory is calculated, in MiB + +![Shadow memory in MiB](shadow.svg) + +Second the VM overhead is calculated, in MiB + +![Memory overhead in MiB](overhead.svg) + +Memory required to start a VM +----------------------------- + +If ballooning is disabled, the memory required to start a VM is the same as the VM +overhead above. + +If ballooning is enabled then the memory calculation above is modified to use the +`VM.memory_dynamic_max` rather than the `VM.memory_static_max`. + +Memory required to migrate a VM +------------------------------- + +If ballooning is disabled, the memory required to receive a migrating VM is the same +as the VM overhead above. + +If ballooning is enabled, then the VM will first be ballooned down to `VM.memory_dynamic_min` +and then it will be migrated across. If the VM fails to balloon all the way down, then +correspondingly more memory will be required on the receiving side. 
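
The choice of which memory field is used in the two sections above can be sketched like this. The record type `vm_memory` and both function names are hypothetical, chosen for illustration; the overhead formula itself lives in the figures and is deliberately not reproduced here.

```ocaml
(* Hypothetical record holding the VM memory fields named in the text,
   all in bytes. Not xapi's actual data structure. *)
type vm_memory = {
  memory_static_max : int64;
  memory_dynamic_max : int64;
  memory_dynamic_min : int64;
}

(* Memory value fed into the overhead model when starting a VM:
   dynamic_max when ballooning is enabled, static_max otherwise. *)
let start_input ~ballooning vm =
  if ballooning then vm.memory_dynamic_max else vm.memory_static_max

(* Best-case memory value when receiving a migrating VM: with ballooning
   the VM is first ballooned down towards dynamic_min. If it cannot get
   all the way down, the real requirement is correspondingly higher. *)
let migrate_input ~ballooning vm =
  if ballooning then vm.memory_dynamic_min else vm.memory_static_max
```

The result of either function would replace `VM.memory_static_max` as the input to the overhead model.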
diff --git a/doc/content/xapi/memory/overhead.svg b/doc/content/xapi/memory/overhead.svg new file mode 100644 index 0000000000..c0979ebf9f --- /dev/null +++ b/doc/content/xapi/memory/overhead.svg @@ -0,0 +1,249 @@ +[SVG markup not preserved: figure "Memory overhead in MiB"] diff --git a/doc/content/xapi/memory/shadow.svg b/doc/content/xapi/memory/shadow.svg new file mode 100644 index 0000000000..ba7fcf2385 --- /dev/null +++ b/doc/content/xapi/memory/shadow.svg @@ -0,0 +1,246 @@ +[SVG markup not preserved: figure "Shadow memory in MiB"] diff --git a/doc/content/xapi/xapi.png b/doc/content/xapi/xapi.png new file mode 100644 index 0000000000..5041f7c0ac Binary files /dev/null and b/doc/content/xapi/xapi.png differ diff --git a/doc/content/xcp-networkd/_index.md b/doc/content/xcp-networkd/_index.md new file mode 100644 index 0000000000..e3f389795c --- /dev/null +++ b/doc/content/xcp-networkd/_index.md @@ -0,0 +1,119 @@ ++++ +title = "Networkd" +weight = 40 ++++ + +The `xcp-networkd` daemon (hereafter simply called "networkd") is a component in the xapi toolstack that is responsible for configuring network interfaces
and virtual switches (bridges) on a host. + +The code is in `ocaml/networkd`. + + +Principles +---------- + +1. **Distro-agnostic**. Networkd is meant to work on at least CentOS/RHEL as well as Debian/Ubuntu based distros. It therefore should not use any network configuration features specific to those distros. + +2. **Stateless**. By default, networkd should not maintain any state. If you ask networkd about a property of a network interface, a bridge, or any other network sub-system (e.g. an IP address), it will always query the underlying system rather than returning any cached state. However, if you want networkd to configure networking at host boot time, then you can ask it to remember the configuration you have set for any interface or bridge you choose. + +3. **Idempotent**. It should be possible to call any networkd function multiple times without breaking things. For example, calling a function to set an IP address on an interface twice in a row should have the same outcome as calling it just once. + +4. **Do no harm**. Networkd should only configure what you ask it to configure. This means that it can co-exist with other network managers. + + +Usage +----- + +Networkd is a daemon that is typically started at host-boot time. In the same way as the other daemons in the xapi toolstack, it is controlled by RPC requests. It typically receives requests from the xapi daemon, on behalf of which it configures host networking. + +Networkd's RPC API is fully described by the [network_interface.ml](https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi-idl/network/network_interface.ml) file. The API has two main namespaces: `Interface` and `Bridge`, which are implemented in two modules in [network_server.ml](https://github.com/xapi-project/xen-api/blob/master/ocaml/networkd/bin/network_server.ml). + +In line with other xapi daemons, all API functions take an argument of type `debug_info` (a string) as their first argument.
The debug string appears in any log lines that are produced as a side effect of calling the function. + +Network Interface API +--------------------- + +The Interface API has functions to query and configure properties of Linux network devices, such as IP addresses, and to bring them up or down. Most Interface functions take a `name` string as a reference to a network interface as their second argument, which is expected to be the name of the Linux network device. There is also a special function, called `Interface.make_config`, that is able to configure a number of interfaces at once. It takes an argument called `config` of type `(iface * interface_config_t) list`, where `iface` is an interface name, and `interface_config_t` is a compound type containing the full configuration for an interface (as far as networkd is able to configure them), currently defined as follows: + +``` +type interface_config_t = { + ipv4_conf: ipv4; + ipv4_gateway: Unix.inet_addr option; + ipv6_conf: ipv6; + ipv6_gateway: Unix.inet_addr option; + ipv4_routes: (Unix.inet_addr * int * Unix.inet_addr) list; + dns: Unix.inet_addr list * string list; + mtu: int; + ethtool_settings: (string * string) list; + ethtool_offload: (string * string) list; + persistent_i: bool; +} +``` + +When the function returns, it should have completely configured the interface, and have brought it up. The idempotency principle applies to this function, which means that it can be used to successively modify interface properties; any property that has not changed will effectively be ignored. In fact, `Interface.make_config` is the main function that xapi uses to configure interfaces, e.g. as a result of a `PIF.plug` or a `PIF.reconfigure_ip` call. + +Also note the `persistent_i` property in the interface config. When an interface is made "persistent", this means that any configuration that is set on it is remembered by networkd, and the interface config is written to disk.
When networkd is started, it will read the persistent config and call `Interface.make_config` on it in order to apply it (see Startup below). + +_The full networkd API should be documented separately somewhere on this site._ + +Bridge API +---------- + +The Bridge API functions are all about the management of virtual switches, also known as "bridges". The shape of the Bridge API roughly follows that of the Open vSwitch in that it treats a bridge as a collection of "ports", where a port can contain one or more "interfaces". + +NIC bonding and VLANs are all configured on the Bridge level. There are functions for creating and destroying bridges, adding and removing ports, and configuring bonds and VLANs. Like interfaces, bridges and ports are addressed by name in the Bridge functions. Analogous to the Interface function with the same name, there is a `Bridge.make_config` function, and bridges can be made `persistent`. + +``` +type port_config_t = { + interfaces: iface list; + bond_properties: (string * string) list; + bond_mac: string option; +} +type bridge_config_t = { + ports: (port * port_config_t) list; + vlan: (bridge * int) option; + bridge_mac: string option; + other_config: (string * string) list; + persistent_b: bool; +} +``` + +Backends +-------- + +Networkd currently has two different backends: the "Linux bridge" backend and the "Open vSwitch" backend. The former is the "classic" backend based on the bridge module that is available in the Linux kernel, plus additional standard Linux functionality for NIC bonding and VLANs. The latter backend is newer and uses the [Open vSwitch (OVS)](http://www.openvswitch.org) for bridging as well as other functionality. Which backend is currently in use is defined by the file `/etc/xensource/network.conf`, which is read by networkd when it starts. The choice of backend (currently) only affects the Bridge API: every function in it has a separate implementation for each backend. 
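
The idea that every Bridge function has one implementation per backend can be pictured as a pattern match on a backend variant. The sketch below is hypothetical and is not networkd's code: the type and function names are invented, the assumption that `network.conf` holds a single keyword (`bridge` or `openvswitch`) should be checked against a real host, and the commands are returned as lists rather than executed.

```ocaml
(* Hypothetical model of the two backends described above. *)
type backend = Linux_bridge | Openvswitch

(* Assumption: the configuration file contains a single keyword naming
   the backend. Anything else is reported as unrecognised. *)
let backend_of_string s =
  match String.trim s with
  | "bridge" -> Some Linux_bridge
  | "openvswitch" -> Some Openvswitch
  | _ -> None

(* One Bridge operation, one implementation per backend. The command line
   is only built here, not run; networkd's real invocations may differ. *)
let create_bridge_cmd backend name =
  match backend with
  | Linux_bridge -> [ "brctl"; "addbr"; name ]
  | Openvswitch -> [ "ovs-vsctl"; "add-br"; name ]
```

Keeping the backend as a variant means the compiler flags any Bridge function that forgets to handle one of the backends.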
+ +Low-level Interfaces +-------------------- + +Networkd uses standard networking commands and interfaces that are available in most modern Linux distros, rather than relying on any distro-specific network tools (see the distro-agnostic principle). These are tools such as `ip` (iproute2), `dhclient` and `brctl`, as well as the `sysfs` file system, and `netlink` sockets. To control the OVS, the `ovs-*` command line tools are used. All low-level functions are called from [network_utils.ml](https://github.com/xapi-project/xen-api/blob/master/ocaml/networkd/lib/network_utils.ml). + +Configuration on Startup +------------------------ + +Networkd, periodically as well as on shutdown, writes the current configuration of all bridges and interfaces (see above) in a JSON format to a file called `networkd.db` (currently in `/var/lib/xcp`). The contents of the file are completely described by the following type: + +``` +type config_t = { + interface_config: (iface * interface_config_t) list; + bridge_config: (bridge * bridge_config_t) list; + gateway_interface: iface option; + dns_interface: iface option; +} +``` + +The `gateway_interface` and `dns_interface` in the config are global host-level options to define from which interfaces the default gateway and DNS configuration is taken. This is especially important when multiple interfaces are configured by DHCP. + +When networkd starts up, it first reads `network.conf` to determine the network backend. It subsequently attempts to parse `networkd.db`, and tries to call `Bridge.make_config` and `Interface.make_config` on it, with a special option to only apply the config for `persistent` bridges and interfaces, as well as bridges related to those (for example, if a VLAN bridge is configured, then its parent bridge must also be configured). + +Networkd also supports upgrades from older versions of XenServer that used a network configuration script called `interface-configure`.
If `networkd.db` is not found on startup, then networkd attempts to call this tool (via the `/etc/init.d/management-interface` script) in order to set up networking at boot time. This is normally followed immediately by a call from xapi instructing networkd to take over. + +Finally, if no network config (old or new) is found on disk at all, networkd looks for a XenServer "firstboot" data file, which is written by XenServer's host installer, and tries to apply it to set up the management interface. + +Monitoring +---------- + +Besides the ability to configure bridges and network interfaces, networkd has facilities for monitoring interfaces and bonds. When networkd starts, a monitor thread is started, which does several things (see [network_monitor_thread.ml](https://github.com/xapi-project/xen-api/blob/master/ocaml/networkd/bin/network_monitor_thread.ml)): + +* Every 5 seconds, it gathers send/receive counters and link state of all network interfaces. It then writes these stats to a shared-memory file, to be picked up by other components such as `xcp-rrdd` and `xapi` (see documentation about "xenostats" elsewhere). +* It monitors NIC bonds, and sends alerts through xapi in case of link state changes within a bond. +* It uses `ip monitor address` to watch for IP address changes. When one occurs, it calls xapi (`Host.signal_networking_change`) so that xapi updates the IP addresses of the PIFs in its database that were configured by DHCP.
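
The bond-monitoring step in the list above amounts to comparing successive link-state snapshots. The following is a minimal sketch of that comparison only; the `link_states` type and `changed_links` function are hypothetical and do not come from `network_monitor_thread.ml`, and how the snapshots are gathered or how alerts are sent is left out.

```ocaml
(* Hypothetical snapshot: interface name paired with "link is up". *)
type link_states = (string * bool) list

(* Return the interfaces whose link state differs between two snapshots,
   together with their new state. Interfaces that were not present in the
   previous snapshot are ignored rather than reported as changes. *)
let changed_links (previous : link_states) (current : link_states) =
  List.filter_map
    (fun (name, up) ->
      match List.assoc_opt name previous with
      | Some was_up when was_up <> up -> Some (name, up)
      | _ -> None)
    current
```

A monitor loop could call `changed_links` on each polling round and raise one alert per returned entry.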