Skip to content

Commit

Permalink
feat: Single process optimization (#5489)
Browse files Browse the repository at this point in the history
Python interpreter initialization and module import time 
contributes a significant amount of wall clock time to
cloud-init's runtime (and therefore to total boot time).

Cloud-init has four stages. Each stage starts its own Python
interpreter and loads the same libraries. To eliminate the
redundant work of starting an interpreter and loading libraries,
this changes cloud-init to run as a single process. Systemd
service ordering is retained by using the existing cloud-init
services as shims which use a synchronization protocol to start
each cloud-init stage and to communicate that each stage is
complete to the init system. Since multiple cloud-init processes
sit in the critical chain of starting the system, this reduces
boot time (including time to ssh login and time to cloud-init
completion).

Currently only systemd is supported, but the synchronization
protocol should be capable of supporting other init systems
as well with minor changes.

Note: This enables many additional follow-on improvements that
eliminate redundant work. However, these potential improvements
are temporarily ignored. This commit has been structured to
minimize the changes required to capture the majority of primary
performance savings while preserving correctness and the ability
to preserve backwards compatibility.

Since this changes the semantics of the existing cloud-init unit
files, this change takes the opportunity to rename one of its
systemd units which causes frequent user confusion. The unit named
cloud-init.service is often mistaken by users for being the only
cloud-init service, when it is simply one of four stages. This
stage is documented as the "network" stage, so this service will
be renamed to "cloud-init-network.service". A new notify service
is added as part of this implementation which contains the
cloud-init process. This unit is named "cloud-init-main.service".
 
Synchronization protocol
========================

- create one Unix socket for each systemd service stage
- send sd_notify()
- For each of the four stages (local, network, config, final):
   - when init system sends "start" to the Unix socket, start the
     stage
   - when running stage is complete, send "done" to Unix socket

File changes
============

socket.py (new)
---------------

- define a systemd-notify helper function
- define a context manager which implements a multi-socket
  synchronization protocol

cloud-init.service -> cloud-init-network.service (renamed)
----------------------------------------------------------

- renamed to cloud-network.service

cloud-{init-local,init-network,config,final}.services
-------------------------------------------

- change ExecStart to use netcat to connect to Unix socket and:
  - send a start message
  - wait for completion response
- note: a pure Python equivalent is possible for any downstreams
  which do not package openbsd's netcat

cloud-init-main.service (new)
-----------------------------

 - use service type to 'notify'
 - invoke cloud-init in single process mode
 - adopt systemd ordering requirements from cloud-init-local.service
 - adopt KillMode from cloud-final.service

main.py
-------

 - Add command line flag to indicate "all stages" mode
 - In this mode run each stage followed by an IPC
   synchronization protocol step

cloud-final.services
--------------------

- drop KillMode

cloud-init-local.services
-------------------------

- drop dependencies made redundant by ordering after
  cloud-init-main.service

Performance Impact
==================

On Ubuntu 24.04, Python's wall clock start up time as measured with
`time python3 -c 'import cloudinit.cmd.main' on a few cloud types:

lxc container: 0.256s
QEMU machine:  0.300s
gce instance:  0.367s
ec2 instance:  0.491s

This change eliminates x1 this start up time from time to ssh.
This change eliminates x3 this start up time from cloud-init's total
completion. Total benefit varies based on the platform that the
instance is hosted by, but all platforms will measurably benefit from
this change.

BREAKING_CHANGE: Run all four cloud-init services as a single systemd service.
  • Loading branch information
holmanb committed Aug 6, 2024
1 parent bb475c6 commit 86c076a
Show file tree
Hide file tree
Showing 25 changed files with 601 additions and 47 deletions.
122 changes: 105 additions & 17 deletions cloudinit/cmd/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
from cloudinit import netinfo
from cloudinit import signal_handler
from cloudinit import sources
from cloudinit import socket
from cloudinit import stages
from cloudinit import url_helper
from cloudinit import util
Expand All @@ -38,7 +39,12 @@
from cloudinit.config.schema import validate_cloudconfig_schema
from cloudinit import log
from cloudinit.reporting import events
from cloudinit.settings import PER_INSTANCE, PER_ALWAYS, PER_ONCE, CLOUD_CONFIG
from cloudinit.settings import (
PER_INSTANCE,
PER_ALWAYS,
PER_ONCE,
CLOUD_CONFIG,
)

# Welcome message template
WELCOME_MSG_TPL = (
Expand Down Expand Up @@ -362,8 +368,11 @@ def main_init(name, args):
outfmt = None
errfmt = None
try:
close_stdin(lambda msg: early_logs.append((logging.DEBUG, msg)))
outfmt, errfmt = util.fixup_output(init.cfg, name)
if not args.skip_log_setup:
close_stdin(lambda msg: early_logs.append((logging.DEBUG, msg)))
outfmt, errfmt = util.fixup_output(init.cfg, name)
else:
outfmt, errfmt = util.get_output_cfg(init.cfg, name)
except Exception:
msg = "Failed to setup output redirection!"
util.logexc(LOG, msg)
Expand All @@ -375,8 +384,9 @@ def main_init(name, args):
"Logging being reset, this logger may no longer be active shortly"
)
log.reset_logging()
log.setup_logging(init.cfg)
apply_reporting_cfg(init.cfg)
if not args.skip_log_setup:
log.setup_logging(init.cfg)
apply_reporting_cfg(init.cfg)

# Any log usage prior to setup_logging above did not have local user log
# config applied. We send the welcome message now, as stderr/out have
Expand Down Expand Up @@ -633,8 +643,9 @@ def main_modules(action_name, args):
mods = Modules(init, extract_fns(args), reporter=args.reporter)
# Stage 4
try:
close_stdin()
util.fixup_output(mods.cfg, name)
if not args.skip_log_setup:
close_stdin()
util.fixup_output(mods.cfg, name)
except Exception:
util.logexc(LOG, "Failed to setup output redirection!")
if args.debug:
Expand All @@ -643,8 +654,9 @@ def main_modules(action_name, args):
"Logging being reset, this logger may no longer be active shortly"
)
log.reset_logging()
log.setup_logging(mods.cfg)
apply_reporting_cfg(init.cfg)
if not args.skip_log_setup:
log.setup_logging(mods.cfg)
apply_reporting_cfg(init.cfg)

# now that logging is setup and stdout redirected, send welcome
welcome(name, msg=w_msg)
Expand Down Expand Up @@ -804,9 +816,10 @@ def status_wrapper(name, args):
)

v1[mode]["start"] = float(util.uptime())
preexisting_recoverable_errors = next(
handler = next(
filter(lambda h: isinstance(h, log.LogExporter), root_logger.handlers)
).export_logs()
)
preexisting_recoverable_errors = handler.export_logs()

# Write status.json prior to running init / module code
atomic_helper.write_json(status_path, status)
Expand Down Expand Up @@ -847,11 +860,8 @@ def status_wrapper(name, args):
v1["stage"] = None

# merge new recoverable errors into existing recoverable error list
new_recoverable_errors = next(
filter(
lambda h: isinstance(h, log.LogExporter), root_logger.handlers
)
).export_logs()
new_recoverable_errors = handler.export_logs()
handler.clean_logs()
for key in new_recoverable_errors.keys():
if key in preexisting_recoverable_errors:
v1[mode]["recoverable_errors"][key] = list(
Expand Down Expand Up @@ -953,9 +963,19 @@ def main(sysv_args=None):
default=False,
)

parser.add_argument(
"--all-stages",
dest="all_stages",
action="store_true",
help=(
"Run cloud-init's stages under a single process using a "
"syncronization protocol. This is not intended for CLI usage."
),
default=False,
)

parser.set_defaults(reporter=None)
subparsers = parser.add_subparsers(title="Subcommands", dest="subcommand")
subparsers.required = True

# Each action and its sub-options (if any)
parser_init = subparsers.add_parser(
Expand Down Expand Up @@ -1143,8 +1163,76 @@ def main(sysv_args=None):

status_parser(parser_status)
parser_status.set_defaults(action=("status", handle_status_args))
else:
parser.error("a subcommand is required")

args = parser.parse_args(args=sysv_args)
setattr(args, "skip_log_setup", False)
if not args.all_stages:
return sub_main(args)
return all_stages(parser)


def all_stages(parser):
"""Run all stages in a single process using an ordering protocol."""
LOG.info("Running cloud-init in single process mode.")

# this _must_ be called before sd_notify is called otherwise netcat may
# attempt to send "start" before a socket exists
sync = socket.SocketSync("local", "network", "config", "final")

# notify systemd that this stage has completed
socket.sd_notify("READY=1")
# wait for cloud-init-local.service to start
with sync("local"):
# set up logger
args = parser.parse_args(args=["init", "--local"])
args.skip_log_setup = False
# run local stage
sync.systemd_exit_code = sub_main(args)

# wait for cloud-init-network.service to start
with sync("network"):
# skip re-setting up logger
args = parser.parse_args(args=["init"])
args.skip_log_setup = True
# run init stage
sync.systemd_exit_code = sub_main(args)

# wait for cloud-config.service to start
with sync("config"):
# skip re-setting up logger
args = parser.parse_args(args=["modules", "--mode=config"])
args.skip_log_setup = True
# run config stage
sync.systemd_exit_code = sub_main(args)

# wait for cloud-final.service to start
with sync("final"):
# skip re-setting up logger
args = parser.parse_args(args=["modules", "--mode=final"])
args.skip_log_setup = True
# run final stage
sync.systemd_exit_code = sub_main(args)

# signal completion to cloud-init-main.service
if sync.experienced_any_error:
message = "a stage of cloud-init exited non-zero"
if sync.first_exception:
message = f"first exception received: {sync.first_exception}"
socket.sd_notify(
f"STATUS=Completed with failure, {message}. Run 'cloud-init status"
" --long' for more details."
)
socket.sd_notify("STOPPING=1")
# exit 1 for a fatal failure in any stage
return 1
else:
socket.sd_notify("STATUS=Completed")
socket.sd_notify("STOPPING=1")


def sub_main(args):

# Subparsers.required = True and each subparser sets action=(name, functor)
(name, functor) = args.action
Expand Down
3 changes: 2 additions & 1 deletion cloudinit/cmd/status.py
Original file line number Diff line number Diff line change
Expand Up @@ -318,8 +318,9 @@ def systemd_failed(wait: bool) -> bool:
for service in [
"cloud-final.service",
"cloud-config.service",
"cloud-init.service",
"cloud-init-network.service",
"cloud-init-local.service",
"cloud-init-main.service",
]:
try:
stdout = query_systemctl(
Expand Down
2 changes: 1 addition & 1 deletion cloudinit/config/cc_mounts.py
Original file line number Diff line number Diff line change
Expand Up @@ -524,7 +524,7 @@ def handle(name: str, cfg: Config, cloud: Cloud, args: list) -> None:
# fs_spec, fs_file, fs_vfstype, fs_mntops, fs-freq, fs_passno
uses_systemd = cloud.distro.uses_systemd()
default_mount_options = (
"defaults,nofail,x-systemd.after=cloud-init.service,_netdev"
"defaults,nofail,x-systemd.after=cloud-init-network.service,_netdev"
if uses_systemd
else "defaults,nobootwait"
)
Expand Down
4 changes: 2 additions & 2 deletions cloudinit/config/schemas/schema-cloud-config-v1.json
Original file line number Diff line number Diff line change
Expand Up @@ -2022,12 +2022,12 @@
},
"mount_default_fields": {
"type": "array",
"description": "Default mount configuration for any mount entry with less than 6 options provided. When specified, 6 items are required and represent ``/etc/fstab`` entries. Default: ``defaults,nofail,x-systemd.after=cloud-init.service,_netdev``",
"description": "Default mount configuration for any mount entry with less than 6 options provided. When specified, 6 items are required and represent ``/etc/fstab`` entries. Default: ``defaults,nofail,x-systemd.after=cloud-init-network.service,_netdev``",
"default": [
null,
null,
"auto",
"defaults,nofail,x-systemd.after=cloud-init.service",
"defaults,nofail,x-systemd.after=cloud-init-network.service",
"0",
"2"
],
Expand Down
3 changes: 3 additions & 0 deletions cloudinit/log.py
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,9 @@ def emit(self, record: logging.LogRecord):
def export_logs(self):
return copy.deepcopy(self.holder)

def clean_logs(self):
self.holder = defaultdict(list)

def flush(self):
pass

Expand Down
Loading

0 comments on commit 86c076a

Please sign in to comment.