cloud-init-network.service should have alias to cloud-init.service #5684

Open
sshedi opened this issue Sep 10, 2024 · 12 comments
Labels
bug (Something isn't working correctly), enhancement (New feature or request)

Comments

@sshedi
Contributor

sshedi commented Sep 10, 2024

Bug report

cloud-init.service is a widely used service name, and many downstream scripts use it to restart the service.
In the v24.3 release, it was renamed to cloud-init-network.service. I think this is going to break a lot of CI/CD pipelines and scripts. Renaming such a widely used service is an invasive change; if the rename is really needed, the service file should at least carry an alias with the old name.

Please consider this request and provide some insights on renaming the service.

Environment details

  • Cloud-init version: 24.3
  • Operating System Distribution: PhotonOS
sshedi added the labels bug (Something isn't working correctly) and new (An issue that still needs triage) on Sep 10, 2024
@holmanb
Member

holmanb commented Sep 10, 2024

Hey @sshedi, thanks for filing this issue.

cloud-init.service is a widely used service name, and many downstream scripts use it to restart the service.
In the v24.3 release, it was renamed to cloud-init-network.service. I think this is going to break a lot of CI/CD pipelines and scripts. Renaming such a widely used service is an invasive change; if the rename is really needed, the service file should at least carry an alias with the old name.

An alias might be reasonable in the short term until any services ordered after cloud-init.service can be updated; however, this sort of band-aid is likely to be forgotten, so I would recommend updating the affected downstream scripts rather than carrying the alias forward into the future.
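
For reference, a minimal sketch of what such a short-term alias could look like in the renamed unit (illustrative only, not a committed change; the Alias= entry only takes effect through the cloud-init.service symlink that systemctl enable creates, or that a package ships directly):

  [Install]
  # hypothetical short-term compatibility alias: the cloud-init.service symlink
  # is created by `systemctl enable` (or shipped directly by the package)
  Alias=cloud-init.service

Units that merely reference cloud-init.service in After=/Before= ordering would then resolve to the renamed unit.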

As for your concerns regarding CI/CD pipelines and scripts, do you have any examples of this usage? Cloud-init is not a long-running service, and nowhere in the cloud-init documentation is it recommended to start, stop, or restart cloud-init services - there shouldn't be a need for it. Cloud-init runs automatically as part of first boot on a cloud instance. Restarting services may be possible, but that's not how users are expected to interact with cloud-init. Users shouldn't need to "run cloud-init".

Ultimately we don't want to break anyone who is doing unexpected things with cloud-init without reason, but we also don't want improvements to be held back by use cases that are "off the beaten path", so to speak.

Since this doesn't appear to be a bug or user complaint, I'm going to change this from a bug report to a feature request.

provide some insights on renaming the service.

I'm not sure if you saw the announcement of this change on IRC, in the release, the changelog, the mailing list, the discourse announcement, or on the breaking changes documentation page. If not, I would recommend those as a starting point. If you already read those, is there a specific piece of information that is missing?

holmanb added the label enhancement (New feature or request) and removed the label new (An issue that still needs triage) on Sep 10, 2024
@sshedi
Contributor Author

sshedi commented Sep 11, 2024

Thanks for the detailed explanation @holmanb

In VMware solutions, we do GOSC (Guest OS Customisation) using open-vm-tools. We take a .cab file that contains some VM customisation settings, convert it into a yaml file, and apply it using cloud-init init --file <yml-file>. With cloud-init v24.3, I see this warning:

2024-09-11 07:39:12,452 - lifecycle.py[DEPRECATED]: Unsupported configuration: boot stage called by PID [791] outside of systemd is deprecated in 24.3 and scheduled to be removed in 29.3. Triggering cloud-init boot stages outside of intial system boot is not a fully supported operation which can lead to incomplete or incorrect configuration. As such, cloud-init is deprecating this feature in the future. If you currently use cloud-init in this way, please file an issue describing in detail your use case so that cloud-init can better support your needs: https://github.com/canonical/cloud-init/issues/new

We do customisations like password setting, network settings, hostname, etc. using this approach. This is a legacy solution. If direct invocation of cloud-init gets deprecated, it will cause a lot of trouble.

Another question, in a different context:
All the services now use the nc.openbsd command and a socket mechanism when starting. These binaries are not present in our distro, or even in a major distro like Fedora; this seems like an Ubuntu-specific solution in the template files right now. For now I'm thinking of using https://github.com/canonical/cloud-init/blob/ubuntu/noble/debian/patches/no-single-process.patch and deferring the noise. BSD netcat depends on libbsd, libmd, and libretls, so cloud-init will end up installing four additional packages.
I'm also having a hard time understanding the order of service initiation now. Is it OK to restart the cloud-init-local service using systemctl restart cloud-init-local? But it reports Sep 11 02:28:46 phdev sh[730]: nc.openbsd: /run/cloud-init/share/local.sock: No such file or directory, so should we start the cloud-init-main service before starting any other cloud-init service?

When I ran systemctl enable cloud-init-main and rebooted my VM, I saw an ordering cycle in PhotonOS (I will debug this further):

Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Found ordering cycle on cloud-init-main.service/start
Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Found dependency on basic.target/start
Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Found dependency on sockets.target/start
Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Found dependency on dbus.socket/start
Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Sep 11 02:51:00 phdev systemd[1]: sysinit.target: Job cloud-init-main.service/start deleted to break ordering cycle starting with sysinit.target/start
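
(For anyone digging into a cycle like this, standard systemd tooling can usually narrow it down; the unit name below is just the one from the log above:)

  # report ordering/requirement problems systemd sees for the unit
  systemd-analyze verify cloud-init-main.service
  # walk the unit's ordering dependencies to spot the loop back to sysinit.target
  systemctl list-dependencies --after cloud-init-main.service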

@holmanb
Member

holmanb commented Sep 11, 2024

In VMware solutions, we do GOSC (Guest OS Customisation) using open-vm-tools. We take a .cab file that contains some VM customisation settings, convert it into a yaml file, and apply it using cloud-init init --file

Why isn't this configuration being provided by the VMware datasource at boot time? Is there something that this can do that the datasource cannot? Can you please provide a link to this code? This should really be a separate issue from the other things mentioned here, so a separate issue would be more appropriate.

These binaries are not present in our distro, or even in a major distro like Fedora

On Fedora[1], like Ubuntu and Debian, openbsd's netcat is actually the default netcat. Perhaps Photon could follow the other major distros? Alternatively, like the docs linked mention, this is trivially implemented using a Python one-liner. I can dig up the one I was using before if that would be helpful for you.
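
(For illustration only, and not the exact snippet referenced above: the "send a short message over a unix datagram socket" part of what the nc.openbsd call does could be approximated roughly as follows. The socket path and message are assumptions, and any reply handshake that cloud-init-main expects is omitted.)

  # hypothetical stand-in for the nc.openbsd call: send one datagram to a unix socket
  python3 -c 'import socket, sys; s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM); s.connect(sys.argv[1]); s.send(sys.argv[2].encode())' /run/cloud-init/share/local.sock start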

I'm also having a hard time understanding the order of service initiation now. Is it OK to restart the cloud-init-local service using systemctl restart cloud-init-local?

What is your end goal? What are you trying to accomplish by restarting cloud-init? Can you please provide a link to this code? Like I said before, this shouldn't be required, but if for some reason it is required this might be a gap that we need to address.

I saw an ordering cycle in PhotonOS

Trying to carry distro-specific systemd orderings upstream is really hard to test and maintain - especially with bigger changes like this. We tried not to break the systemd ordering, but in this case this change clearly did.

I'm happy to help fix this for PhotonOS. Are you building this from upstream source or do you have patches applied? If patches, can you please include a link?

[1] on Fedora

[root@fedora ~]# dnf info netcat
Last metadata expiration check: 0:15:17 ago on Wed Sep 11 14:55:31 2024.
Installed Packages
Name         : netcat
Version      : 1.226
Release      : 3.fc40
Architecture : x86_64
Size         : 62 k
Source       : netcat-1.226-3.fc40.src.rpm
Repository   : @System
From repo    : fedora
Summary      : OpenBSD netcat to read and write data across connections using TCP or UDP
URL          : https://man.openbsd.org/nc.1
License      : BSD-3-Clause AND BSD-2-Clause
Description  : The OpenBSD nc (or netcat) utility can be used for just about anything involving
             : TCP, UDP, or UNIX-domain sockets. It can open TCP connections, send UDP packets,
             : listen on arbitrary TCP and UDP ports, do port scanning, and deal with both IPv4
             : and IPv6. Unlike telnet(1), nc scripts nicely, and separates error messages onto
             : standard error instead of sending them to standard output, as telnet(1) might do
             : with some.

@sshedi
Contributor Author

sshedi commented Sep 12, 2024

Why isn't this configuration being provided by the VMware datasource at boot time? Is there something that this can do that the datasource cannot? Can you please provide a link to this code? This should really be a separate issue from the other things mentioned here, so a separate issue would be more appropriate.

I will create a new issue. In a nutshell, the original settings from the .cab file come in ini format; we convert them to yaml by parsing the ini at run time and ultimately feed that to cloud-init.
You can extract the tarball from this link https://packages.vmware.com/photon/photon_sources/1.0/gosc-scripts-1.3.2.tar.gz
and check the RunCloudConfig and GenerateCloudInitConfig functions.

On Fedora[1], like Ubuntu and Debian, openbsd's netcat is actually the default netcat. Perhaps Photon could follow the other major distros? Alternatively, like the docs linked mention, this is trivially implemented using a Python one-liner. I can dig up the one I was using before if that would be helpful for you.

Yes, I have done that. My concern was with the binary name nc.openbsd; Fedora just uses nc or netcat, so they would probably need to create a symlink or another alternatives entry for this.
Fedora's netcat spec: https://src.fedoraproject.org/rpms/netcat/blob/rawhide/f/netcat.spec#_71
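
(A stopgap for distros that only ship nc/netcat could be a compatibility symlink; the paths below are illustrative:)

  # hypothetical compat shim: expose the distro's OpenBSD netcat under the Ubuntu-style name
  ln -s /usr/bin/nc /usr/bin/nc.openbsd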

What is your end goal? What are you trying to accomplish by restarting cloud-init? Can you please provide a link to this code? Like I said before, this shouldn't be required, but if for some reason it is required this might be a gap that we need to address.

It relates to my previous request about GOSC: we use a yaml file and feed it to vmtoolsd (open-vm-tools), and in our CI/CD pipelines we just do basic sanity tests for these services by simply restarting them and checking the status of the service.

We also carry a downstream patch: https://github.com/vmware/photon/blob/master/SPECS/cloud-init/0003-Patch-VMware-DS-to-handle-network-settings-from-vmto.patch
and in our CI/CD we test this by setting these values explicitly. Example:

  vmtoolsd --cmd "info-set guestinfo.hostname photon-123"
  vmtoolsd --cmd "info-set guestinfo.interface.0.name eth0"
  vmtoolsd --cmd "info-set guestinfo.interface.0.address 10.1.1.1"
  vmtoolsd --cmd "info-set guestinfo.interface.0.route.0 10.2.2.2, 10.3.3.3"
  vmtoolsd --cmd "info-set guestinfo.dns.servers 10.4.4.4, 10.5.5.5"

And later we simply restart the cloud-init services and check network file creation and the hostname.
In general, we don't send config yamls at boot time for our testing; we feed them to cloud-init directly or set them using vmtoolsd and restart the services.
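
(Roughly, the pre-24.3 check looks like the following; the service names and paths are illustrative and depend on the distro:)

  # restart the relevant cloud-init services after setting the guestinfo keys
  systemctl restart cloud-init-local.service cloud-init.service
  # then verify the expected artifacts (paths illustrative for a systemd-networkd based distro)
  hostnamectl status
  ls /etc/systemd/network/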

As for the cyclic ordering issue, I will check further. It's not your fault, I just wanted to let you know; you can ignore it for now.

Thanks for all the answers.

@holmanb
Member

holmanb commented Sep 12, 2024

Why isn't this configuration being provided by the VMware datasource at boot time? Is there something that this can do that the datasource cannot? Can you please provide a link to this code? This should really be a separate issue from the other things mentioned here, so a separate issue would be more appropriate.

I will create a new issue. In a nutshell, the original settings from the .cab file come in ini format; we convert them to yaml by parsing the ini at run time and ultimately feed that to cloud-init. You can extract the tarball from this link https://packages.vmware.com/photon/photon_sources/1.0/gosc-scripts-1.3.2.tar.gz and check the RunCloudConfig and GenerateCloudInitConfig functions.

Thanks for the extra context @sshedi. I'll wait for the new bug to continue the conversation there.

On Fedora[1], like Ubuntu and Debian, openbsd's netcat is actually the default netcat. Perhaps Photon could follow the other major distros? Alternatively, like the docs linked mention, this is trivially implemented using a Python one-liner. I can dig up the one I was using before if that would be helpful for you.

Yes, I have done that. My concern was with the binary name nc.openbsd; Fedora just uses nc or netcat, so they would probably need to create a symlink or another alternatives entry for this. Fedora's netcat spec: https://src.fedoraproject.org/rpms/netcat/blob/rawhide/f/netcat.spec#_71

Good point, the current implementation is using the Ubuntu binary name. There is no reason we can't use nc or netcat. I proposed this change.

What is your end goal? What are you trying to accomplish by restarting cloud-init? Can you please provide a link to this code? Like I said before, this shouldn't be required, but if for some reason it is required this might be a gap that we need to address.

It relates to my previous request about GOSC: we use a yaml file and feed it to vmtoolsd (open-vm-tools), and in our CI/CD pipelines we just do basic sanity tests for these services by simply restarting them and checking the status of the service.
We also carry a downstream patch: https://github.com/vmware/photon/blob/master/SPECS/cloud-init/0003-Patch-VMware-DS-to-handle-network-settings-from-vmto.patch and in our CI/CD we test this by setting these values explicitly. Example:

  vmtoolsd --cmd "info-set guestinfo.hostname photon-123"
  vmtoolsd --cmd "info-set guestinfo.interface.0.name eth0"
  vmtoolsd --cmd "info-set guestinfo.interface.0.address 10.1.1.1"
  vmtoolsd --cmd "info-set guestinfo.interface.0.route.0 10.2.2.2, 10.3.3.3"
  vmtoolsd --cmd "info-set guestinfo.dns.servers 10.4.4.4, 10.5.5.5"

And later we simply restart the cloud-init services and check network file creation and the hostname. In general, we don't send config yamls at boot time for our testing; we feed them to cloud-init directly or set them using vmtoolsd and restart the services.

Integration testing cloud-init is tricky. The reason that we don't do manual service restarts in our upstream integration tests is because this isn't representative of how users use cloud-init, and it can have unintended side effects. One example of this is that many cloud-config modules only run once per instance.

The approach that we take in the upstream integration tests is something along the lines of this:

  1. install a test package
  2. run cloud-init clean --logs --reboot
  3. wait for instance to reboot (cloud-init status --wait)
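
In shell terms, that flow is roughly the following (the package-installation step and the exact flags are illustrative, not the literal upstream test code):

  # 1. install the cloud-init package under test (rpm shown as an example)
  rpm -Uvh ./cloud-init-test.rpm
  # 2. wipe logs and per-instance state, then reboot so the next boot behaves like a first boot
  cloud-init clean --logs --reboot
  # 3. once the instance is back up, block until cloud-init finishes and report the result
  cloud-init status --wait --long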

@sshedi Do you think that this approach could work for you? Also, do you think that the patch would benefit other distros?

fyi: cloud-init's integration test backbone, pycloudlib, gained support for VMWare last year. I don't think we currently use it in CI, but if you have interest in learning more about it, we could probably share some guidance to help get that set up. Running the upstream integration tests in your CI would certainly give you lots of test coverage.

As for the cyclic ordering issue, I will check further. It's not your fault, I just wanted to let you know; you can ignore it for now.

Okay, let me know what we can do to assist.

In an ideal world, cloud-init would have a single systemd ordering which behaves correctly on all distros. Cloud-init's ordering is complex, but it tries to behave correctly regardless of which components a distro is made up of. I'd be curious to know why PhotonOS uses DefaultDependencies=no in cloud-init-network.service.tmpl. @sshedi do you know why that is and if we might be able to drop that requirement? Cloud-init should manually add the other dependencies that it requires so I don't think that this is needed - is there some other dependency that cloud-init doesn't already include? Part of the reason for running earlier is so that cloud-init can start running time consuming operations earlier so that it is less of a bottleneck. I think that this would actually improve PhotonOS's boot time if it works.

@sshedi
Contributor Author

sshedi commented Sep 13, 2024

Integration testing cloud-init is tricky. The reason that we don't do manual service restarts in our upstream integration tests is because this isn't representative of how users use cloud-init, and it can have unintended side effects. One example of this is that many cloud-config modules only run once per instance.

I'm sorry, but rebooting the VM every time seems like overkill. We just want to try some combinations of yaml configs, nothing else. Also, we will try out different network settings, so we might not get the connection back after a reboot. Rebooting for each test configuration will increase the time and complexity severalfold. And in our test scripts, I run cloud-init clean -ls; rm -rf /run/cloud /usr/lib/cloud-init/ before starting the services, so cloud-init will assume that it's a fresh deployment and a first boot.

Also, I don't fully understand the rationale behind not allowing a service to be manually restarted. IMO, this flexibility should be there and it will help big time while debugging issues and trying out things in production instances where rebooting is not an option. In the current implementation, if needed, we can simply share a yaml config with a customer and give commands to feed it to cloud-init without asking them reboot their instance.

One of the standout features of cloud-init is its flexibility with configurations, allowing us to quickly test changes without needing to reboot machines. Unlike critical services like audit or dbus, where manual restarts can have serious consequences, cloud-init has always supported explicit invocation and manual service restarts without issue. Suddenly shifting to a new model that restricts this capability feels like losing a vital tool in our toolkit.

I'd be curious to know why PhotonOS uses DefaultDependencies=no in cloud-init-network.service.tmpl. @sshedi do you know why that is and if we might be able to drop that requirement? Cloud-init should manually add the other dependencies that it requires so I don't think that this is needed - is there some other dependency that cloud-init doesn't already include? Part of the reason for running earlier is so that cloud-init can start running time consuming operations earlier so that it is less of a bottleneck. I think that this would actually improve PhotonOS's boot time if it works.

Agree. I just kept it in alignment with RHEL in my initial PR while adding PhotonOS support to cloud-init. Thanks for this.

@holmanb
Member

holmanb commented Sep 13, 2024

Integration testing cloud-init is tricky. The reason that we don't do manual service restarts in our upstream integration tests is because this isn't representative of how users use cloud-init, and it can have unintended side effects. One example of this is that many cloud-config modules only run once per instance.

I'm sorry, but rebooting the VM every time seems like overkill.

Perhaps for your testing purposes it is overkill, but for cloud-init upstream this is absolutely vital. It's slow, but our CI gives us results daily so it works.

We just want to try some combinations of yaml configs, nothing else. Also, we will try out different network settings, so we might not get the connection back after a reboot. Rebooting for each test configuration will increase the time and complexity severalfold. And in our test scripts, I run cloud-init clean -ls; rm -rf /run/cloud /usr/lib/cloud-init/ before starting the services, so cloud-init will assume that it's a fresh deployment and a first boot.

We do have at least one integration test that manually calls the old cloud-init entry points (cloud-init init --local, cloud-init init, etc). I do think that we will probably want to preserve this capability because it is useful in some cases for testing. We haven't updated those tests to use the single process entry point, however I think that we will probably want that to work in the future for the same test. I think we might need to make a change to make this work at some point, but when we do, I expect that this should cover your use case, albeit using the cli not using systemd. Does that sound reasonable?

Some thoughts on how to implement this -> currently when the single process flag is used and stdin is a tty (which happens when invoked by a shell), cloud-init skips the socket synchronization logic. This was done to allow cloud-init to run under pdb without requiring a developer to manually send the expected data over the sockets. To implement supporting this for testing, we could reuse this code path somehow (not sure exactly how, maybe a new flag --skip-synchronization could be added which forces this code path).

Also, I don't fully understand the rationale behind not allowing a service to be manually restarted.

Semantic question: by this do you only mean "run it again"? Or do you also mean "if running, kill it and run again"?

IMO, this flexibility should be there and it will help big time while debugging issues and trying out things in production instances where rebooting is not an option. In the current implementation, if needed, we can simply share a yaml config with a customer and give commands to feed it to cloud-init without asking them reboot their instance.

One of the standout features of cloud-init is its flexibility with configurations, allowing us to quickly test changes without needing to reboot machines. Unlike critical services like audit or dbus, where manual restarts can have serious consequences, cloud-init has always supported explicit invocation and manual service restarts without issue. Suddenly shifting to a new model that restricts this capability feels like losing a vital tool in our toolkit.

Thanks for engaging, these are all fair points. The reason for going this direction is that cloud-init imparts many side-effects on a system, yet has little testing of this feature. The code that makes restarting cloud-init possible is trivial, but since cloud-init makes many decisions at runtime based on image artifacts and system state, it's difficult to make promises about what cloud-init actually does when run this way. Cloud-init was implemented without idempotency in mind which further complicates things. It's easy to see how this functionality adds value, but it's also difficult to reason about and maintain.

I'm not strongly opposed to keeping the ability to run cloud-init with --files indefinitely, but in my mind calling it unsupported is more about not wanting to make promises about expected behaviors when it runs, because seemingly inconsequential changes which cause no change in behavior for a first-boot scenario could cause a behavior change in a restart scenario (which is unacceptable for downstream stable releases). In most cases it probably does what you might expect. Maybe a label such as "unstable" or "experimental" or "expert mode" or something would be more appropriate? @sshedi what are your thoughts?

Agree. I just kept it in alignment with RHEL in my initial PR while adding PhotonOS support to cloud-init. Thanks for this.

Happy to help!

@raharper
Collaborator

Hi @holmanb

Cloud-init was implemented without idempotency in mind which further
complicates things.

Cloud-init is definitely implemented with idempotency in mind.

We call out the importance in https://docs.cloud-init.io/en/latest/explanation/vendordata.html

Since users trust you, please take care to make sure that any vendor data is
safe, atomic, idempotent and does not put your users at risk.

And in places where we know we can't (or are not yet), it's noted:

Mileage may vary trying to re-run each cloud-config module, as some are not idempotent.

https://docs.cloud-init.io/en/latest/reference/cli.html

I say this not to derail the discussion here, but to point out that if we
could guarantee idempotency everywhere I believe we would; it's a very useful
tool for deployment. We should proceed with caution here.

I'm not strongly opposed to keeping the ability to run cloud-init with --files indefinitely

Good. I would urge a path that does keep the existing functionality; it's been
around since the beginning and it's very useful in many scenarios. I suspect
it's heavily in use in the "create a new template image" use case and surely
other scenarios. Users of the feature learn to work around any quirks there
may be without an explicit promise of idempotency.

but in my mind calling it unsupported is more about not wanting to make
promises about expected behaviors when it runs, because seemingly
inconsequential changes which cause no change in behavior for a
first-boot scenario could cause a behavior change in a restart scenario
(which is unacceptable for downstream stable releases).

I'm trying to understand the concern, given that cloud-init already behaves
this way, folks are currently using it this way, and they are raising concerns
about this behavior going away.

Could you expand on what potential (or existing) scenarios would be
problematic by leaving the current capabilities in place?

In most cases it probably does what you might expect. Maybe a label such as
"unstable" or "experimental" or "expert mode" or something would be more
appropriate?

I don't think users stumble upon the --files mode; rather they are
scratching a specific itch so I don't think a label will dissuade folks
from using it.

If you label it one of the above, what does that mean for upstream (and
potentially downstream)? Will bug reports be rejected as WONTFIX?
Some clarity on what upstream is planning here will be helpful to
users of the current capabilities.

It would also be helpful to understand what, if anything, would be
prevented/restricted if the feature is not removed; i.e., are --files and
cloud-init single holding new features/functions back? What are the
trade-offs under discussion (if any)?

@sshedi
Contributor Author

sshedi commented Sep 16, 2024

We do have at least one integration test that manually calls the old cloud-init entry points (cloud-init init --local, cloud-init init, etc). I do think that we will probably want to preserve this capability because it is useful in some cases for testing. We haven't updated those tests to use the single process entry point, however I think that we will probably want that to work in the future for the same test. I think we might need to make a change to make this work at some point, but when we do, I expect that this should cover your use case, albeit using the cli not using systemd. Does that sound reasonable?

Yes, this should work. I don't have a hard requirement on restarting services; all I need is a way to trigger the init-local stage and the other stages manually.

Semantic question: by this do you only mean "run it again"? Or do you also mean "if running, kill it and run again"?

Just restart, no need to kill the process. But the newer configuration set by vmtoolsd or given by the yaml file should take effect.

I'm not strongly opposed to keeping the ability to run cloud-init with --files indefinitely, but in my mind calling it unsupported is more about not wanting to make promises about expected behaviors when it runs, because seemingly inconsequential changes which cause no change in behavior for a first-boot scenario could cause a behavior change in a restart scenario (which is unacceptable for downstream stable releases). In most cases it probably does what you might expect. Maybe a label such as "unstable" or "experimental" or "expert mode" or something would be more appropriate? @sshedi what are your thoughts?

This option has been there for a long time now. I think there should be a way to maintain backward compatibility. It could be a build-time flag, or a runtime flag like --expert-mode or --legacy-option, something like that.

Cloud-init is a crucial component for our use cases, and this proposed change feels too disruptive, making things significantly more challenging for us. I understand the complexity of maintaining legacy code, but with such a critical service, it becomes somewhat inevitable. If there's any way I can assist in this process, please don't hesitate to reach out.

@holmanb
Member

holmanb commented Sep 17, 2024

Hey @OddBloke thanks for engaging.

Cloud-init was implemented without idempotency in mind which further
complicates things.

Cloud-init is definitely implemented with idempotency in mind.

We call out the importance in https://docs.cloud-init.io/en/latest/explanation/vendordata.html

Since users trust you, please take care to make sure that any vendor data is
safe, atomic, idempotent and does not put your users at risk.

And in places where we know we can't (or are not yet), it's noted:

Mileage may vary trying to re-run each cloud-config module, as some are not idempotent.

https://docs.cloud-init.io/en/latest/reference/cli.html

I say this not to derail the discussion here, but to point out that if we could guarantee idempotency everywhere I believe we would; it's a very useful tool for deployment. We should proceed with caution here.

I meant to say "cloud-init is not idempotent", but it doesn't change the point I was making. We agree that idempotency is important, and that there should be more of it. I mentioned idempotency because the lack of idempotency complicates maintenance. Cloud-init would benefit greatly from tests that explicitly check for idempotency not just in cloud-config modules but additionally in many other parts of the code.

I'm not strongly opposed to keeping the ability to run cloud-init with --files indefinitely

Good. I would urge a path that does keep the existing functionality; it's been around since the beginning and it's very useful in many scenarios. I suspect it's heavily in use in the "create a new template image" use case and surely other scenarios. Users of the feature learn to work around any quirks there may be without an explicit promise of idempotency.

The last statement underscores the deficiencies of this feature. A tool which is thoughtfully designed, intuitive, ergonomic, and foolproof shouldn't require learning to work around the quirks.

but in my mind calling it unsupported is more about not wanting to make
promises about expected behaviors when it runs, because seemingly
inconsequential changes which cause no change in behavior for a
first-boot scenario could cause a behavior change in a restart scenario
(which is unacceptable for downstream stable releases).

I'm trying to understand the concern, given that cloud-init already behaves this way, folks are currently using it this way, and they are raising concerns about this behavior going away.

Could you expand on what potential (or existing) scenarios would be problematic by leaving the current capabilities in place?

In most cases it probably does what you might expect. Maybe a label such as
"unstable" or "experimental" or "expert mode" or something would be more
appropriate?

I don't think users stumble upon the --files mode; rather they are scratching a specific itch so I don't think a label will dissuade folks from using it.

First boot initialization is cloud-init's bread and butter, and it is also where cloud-init really shines. Cloud-init provides features which overlap heavily with configuration management tools such as cfengine, puppet, chef, etc. These alternative tools are not a fair 1:1 comparison to cloud-init, because cloud-init's core offering is first-boot configuration; however, cloud-init can be used more as a configuration management tool using this feature. "Apply this configuration to an already running system" is one of the most basic tasks of configuration management tooling, and --files with the cli commands is cloud-init's answer to that. If a new user wants to compare how to do that with cloud-init vs other tools, their experience with cloud-init is not likely to be a pleasant one. If a user wants to apply user-data, it requires an understanding of cloud-init's internals, which is unreasonable to expect of a new user.

Manually applying user-data after initial boot has many potentially surprising outcomes, including the following handful:

  • configures the system automatically in many ways that are not governed by user-defined configuration
  • behaves differently from one platform to the next due to behaviors driven by datasource implementation code and vendor-data
  • behaves differently from one run to the next due to behaviors being gated on frequency
  • requires a user to run anywhere between three and six commands to apply all of user-data (depending on system state and what the user wants to happen); see the sketch after this list
  • isn't well documented in tutorials, howtos, or explanation docs, and doesn't have supported debugging instructions
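
As a rough illustration of that command count, applying user-data by hand today means stringing together something like the following; the exact set and order depend on system state and on what the user-data contains, and this is a sketch rather than documented guidance:

  cloud-init clean --logs           # optional: clear per-instance state so modules re-run
  cloud-init init --local
  cloud-init init
  cloud-init modules --mode config
  cloud-init modules --mode final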

I appreciate that users who already know how cloud-init works want to use cloud-init as their tool of choice - I don't want to stop them from doing so. I also think that cloud-init has the potential to provide a much improved configuration management user experience over the current state, especially with respect to the points listed above. A potential future with a more complete offering would include a UI that is intuitive, ergonomic, and well documented. This would also require configuration management behaviors that are more foolproof, intuitive, and have reasonable debugging workflows. A well-designed tool should not require the user to understand implementation details in order to use it. In the long term, I think that cloud-init can do better than just continuing to scratch the itches of those existing users of its existing --files offering, and should instead seek to evolve that offering into one that more reliably exhibits intuitive behaviors.

If you label it one of the above, what does that mean for upstream (and potentially downstream)? Will bug reports be rejected as WONTFIX? Some clarity on what upstream is planning here will be helpful to users of the current capabilities.

It would also be helpful to understand what, if anything, would be prevented/restricted if the feature is not removed; i.e., are --files and cloud-init single holding new features/functions back? What are the trade-offs under discussion (if any)?

I need to shift focus to other things for now, so I'm going to momentarily sidestep the maintenance-related questions. We aren't enacting any immediate changes, and the community's voice is important before we do.

@raharper
Collaborator

Hey @OddBloke thanks for engaging.

Err @raharper ; no worries.

We aren't enacting any immediate changes, and the community's voice is
important before we do.

Great. My initial read on this thread was that --files was going to
go away without a replacement. I look forward to the discussion on
improvements.

@holmanb
Member

holmanb commented Sep 18, 2024

Hey @OddBloke thanks for engaging.

Err @raharper ; no worries.

/facepalm

Oops, sorry

We aren't enacting any immediate changes, and the community's voice is
important before we do.

Great. My initial read on this thread was that --files was going to
go away without a replacement. I look forward to the discussion on
improvements.

Agreed, thanks for the discussion!
