Is there any interest in perfecting deployment tools? #33

Open
lessless opened this issue May 18, 2021 · 14 comments

@lessless

Background

One of the core differentiators of OTP is HCU, which is becoming more important as stateful services, such as those built with LiveView, grow more popular. At the same time, the tooling around HCU is outdated and not really maintained.

Objective(s)

I'd like to understand Erlef's stance on HCU and the chances of making them more accessible.

@lessless added the "Agenda Item" label (item to be discussed at WG meeting) on May 18, 2021
@ferd
Member

ferd commented May 18, 2021

Assuming HCU stands for Hot Code Upgrade, they are supported on the Erlang side of things via rebar3 and relx, but they've not been as thoroughly supported on the Elixir side, and I can't speak to that side's intentions.

There's certainly room for significant work in making relups as they stand easier to use. We can discuss this in more detail at the next meeting if you'd like this to remain an agenda item to be discussed in person.

This is certainly the right working group for this.

@josevalim
Contributor

Speaking from the Elixir side, this is something we would love to see tooling for. Perhaps a starting point is to extract what is in Distillery into something that can be used with mix release and, as that progresses, we can consider making it part of Elixir Core itself.

@starbelly
Member

I'd love to see this and would be happy to get involved.

@tsloughter
Collaborator

@lessless are you also referring to tools like edeliver? I spend some time in the Elixir Slack's #deployment channel, and edeliver seems to cause issues for people there. Few can help, because people have moved from Distillery to mix release or are using containers, and edeliver itself is behind (unmaintained?) on keeping up with things like mix release.

I believe there are some good relup tools for rebar3 that are maintained -- if we want to start a list for rebar3 and mix I can go look those up -- but they've always been outside of rebar3 core itself so can fall behind as well.

I've long wanted to spend time improving the dev UX for installing release upgrades in rebar3 (well, technically in relx), but I never had a job that used them, so it hasn't been a priority. The one job I've had that did hot upgrades didn't use release upgrades :). To be clear, I don't mean something like edeliver; here I'm just referring to the scripts that rebar3 provides the user in their release, which have commands like bin/<relname> upgrade <vsn>.

But if mix decides to add release upgrade support it would be good to work together on the install part to have a consistent dev user experience.

Hm, now that I think about it, sharing a common lib between mix and rebar3 for appup generation would probably be useful, so that is another area where there could be useful collaboration.

For deployment itself I've mostly spent my time on container-related work, and it has improved a lot in the last couple of years (most of the work was actually done by the OTP team, I just asked for the features :):

  • The VM now detects cgroup/cgroupv2 usage and sets the number of active schedulers based on the CPU allotment
  • When a shutdown signal, like from Kubernetes, is sent, the release will now properly shut down with init:stop()
  • rebar3 generated release scripts were updated to make it easy to run in a --read-only container
  • Improvements to allow running a distributed node without EPMD easily, without any third party library/tool
  • No more zombies when running rpc calls against a running node from within the container (this had been a big issue for those who used bin/<relname> rpc <...> or bin/<relname> status in a monitoring check that ran every 15 seconds and would eventually crash the container)
  • Some others I'm failing to recall at the moment...
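
To illustrate the first bullet, here is a rough Python sketch of how a CPU allotment can be read from the contents of a cgroup v2 cpu.max file and turned into a scheduler count. This is not the VM's actual implementation: the function name schedulers_from_cpu_max, the rounding-up choice, and the cap at the host CPU count are assumptions made purely for illustration.

```python
import math


def schedulers_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """Derive a scheduler count from the contents of a cgroup v2 cpu.max file.

    cpu_max looks like "200000 100000" (quota and period, in microseconds)
    or "max 100000" when no quota is set.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        # No CPU limit configured: fall back to the host CPU count.
        return host_cpus
    # Assumption: round a fractional allotment up, never exceed the host,
    # and always keep at least one scheduler.
    allotted = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, allotted))
```

For example, a quota of 200000 with a period of 100000 corresponds to an allotment of two CPUs in this sketch.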

I'm sure there is more to be done, but it may in large part be documentation.

Anyway... A couple of initial steps, if you are interested in this, are to detail the scope and then to collect a list of existing tools and see if their maintainers are interested in being involved.

@lessless
Author

@tsloughter edeliver is maintained in "we accept your PRs" mode; I will also fix critical bugs.
I proposed a redesign at https://elixirforum.com/t/thoughts-about-edeliver-2-0/19328 but couldn't get any feedback back when I had time to work on it.

Overall, I imagine having the experience that @ferd described in https://ferd.ca/a-pipeline-made-of-airbags.html:

You'd copy paste the script on the production instance you were on, call UpgradeNode(), see if it worked, then call RollingUpgrade(...) as aggressively or carefully as you thought was warranted. If you wanted, in a few milliseconds, dozens or hundreds of instances got live-deployed without losing a single connection. If you preferred, you could take it slow and do it in stages and carefully monitor things.
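
The staged approach in the quote above can be sketched roughly as follows. This is a hypothetical Python illustration, not the script from the blog post: rolling_upgrade, upgrade_node, and batch_size are names invented here, and upgrade_node stands in for whatever actually applies the upgrade on one instance and reports whether it worked.

```python
from typing import Callable, List, Sequence


def rolling_upgrade(
    nodes: Sequence[str],
    upgrade_node: Callable[[str], bool],
    batch_size: int,
) -> List[str]:
    """Upgrade nodes batch by batch; stop after the first batch with a failure.

    Returns the nodes that were upgraded successfully.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    upgraded: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        # upgrade_node is a hypothetical callback: True means the upgrade held.
        results = [(node, upgrade_node(node)) for node in batch]
        upgraded.extend(node for node, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # leave the remaining nodes untouched for inspection
    return upgraded
```

A batch_size equal to the fleet size corresponds to the aggressive end of the spectrum, while batch_size=1 is the careful, one-node-at-a-time end.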

The pinnacle of HCU in edeliver is its relup scripts, particularly https://github.com/edeliver/edeliver/blob/master/lib/edeliver/relup/phoenix_modification.ex

Having mix release support them is what I would consider a total success.

@ferd
Member

ferd commented May 20, 2021

I've had ideas for a while for a small agent that could run within a release and do something somewhat similar to what Nerves apparently does. Build your release tarballs (with relups), put them in a registry, and let the agent running your production node go into the registry and fetch signed packages it can then apply live (even from within a docker container) from any sort of programmatic interface.

The idea for me would be to find a way to bring the benefits of hot code loading into the world of immutable infrastructure so you can eat your cake and have it.

@lessless
Author

@ferd this seems to open a potential attack surface, and I know from your talks that you have worked in regulated fields.
How do you envision mitigating the security risks? Should that registry be self-hosted?

@ferd
Member

ferd commented May 20, 2021

Let's imagine the following case for Amazon deployments:

  1. you bundle the agent app in your release the way you bundle SASL (the agent must depend on SASL anyway)
  2. your CI pipeline builds your Docker images and creates containers they push onto a registry like they do today. One of the steps is to create the .tar release (and unpack it) for the final image
  3. an extra build step will take that tarball, wrap it in a signed (or encrypted, or both) envelope with credentials given to the CI pipeline
  4. the signed tarball is uploaded to a private S3 bucket
  5. the runtime instances (running the pods) are given read access to the S3 buckets and configured to have the read key/validation key for the releases
  6. upon any call to "fetch latest" or "fetch $version" to the agent's handler (which you can automate from an RPC call, some cron, some internal timer, or whatever), the latest copy is gotten from the bucket, downloaded to /tmp, and verified
  7. if the signature is valid, the release is unpacked. It can then be run and/or installed.
  8. you have hot code updates!
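
Steps 3, 6, and 7 above could be sketched as follows. This is a minimal Python illustration under loud assumptions: it uses an HMAC with a shared key from the standard library as a stand-in for a real signature scheme (a real setup would more likely use asymmetric signatures or a KMS), the function names sign_release and verify_and_unpack are hypothetical, and fetching the envelope from the bucket is left out entirely.

```python
import hashlib
import hmac
import io
import tarfile
from pathlib import Path

TAG_SIZE = 32  # length of an HMAC-SHA256 digest in bytes


def sign_release(tarball: bytes, key: bytes) -> bytes:
    """CI side (step 3): prepend an HMAC-SHA256 tag to the release tarball."""
    tag = hmac.new(key, tarball, hashlib.sha256).digest()
    return tag + tarball


def verify_and_unpack(envelope: bytes, key: bytes, dest: Path) -> bool:
    """Agent side (steps 6 and 7): unpack only if the tag checks out."""
    tag, tarball = envelope[:TAG_SIZE], envelope[TAG_SIZE:]
    expected = hmac.new(key, tarball, hashlib.sha256).digest()
    # Constant-time comparison, so the check does not leak timing information.
    if not hmac.compare_digest(tag, expected):
        return False
    # The archive is opened only after verification succeeded.
    with tarfile.open(fileobj=io.BytesIO(tarball)) as archive:
        archive.extractall(dest)
    return True
```

Installing the unpacked release as a live upgrade is a separate step that this sketch does not cover.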

You could keep the registry S3-specific, or you could have adapters for any storage or a custom server. Even the signing could be deferred to KMS (either in AWS or GCP), but having a homebrew thing means you can test further with limited costs. OTOH, standard cloud-provided mechanisms usually have known patterns to work from with Terraform or whatever other tools DevOps teams prefer these days.

In any case, the "safety" comes from the signing and access control. This mechanism could arguably work with embedded devices with access to the network as well. There's definitely a higher attack surface possible, but the stuff I outlined above has a risk factor that's similar to everything else someone would already use for their deployments in the cloud.

@lessless
Author

Terrific! Absolutely love how it combines a foundation already provided by OTP and novel developments. A great mix!
If we then come up with the orchestration protocol to run migrations, hooks, and rollbacks that would be an absolute game-changer for many of us!

@ferd
Member

ferd commented May 20, 2021

Yeah, it takes care of a few things:

  1. the mechanism of sending and unpacking the release, which is manual and annoying for folks
  2. the way of making it safe (from a security perspective)
  3. making it compose with modern infrastructure

A key challenge IMO is going to be figuring out a way to have CI tests for "can the relup be safely applied", because that's where a huge part of the cost remains. Relups come with a change of habits in how you structure code changes, favoring lots of smaller continuous deployments over big-bang releases, and providing a short feedback loop would be essential for that.

@lessless
Author

I think we can try to leverage bootleg’s experience with integration tests https://github.com/elixir-deploy/bootleg/blob/master/test/bootleg_functional_test.exs#L108-L125 (this is my fork of labzero/bootleg where I plan to merge in some changes)

@ferd removed the "Agenda Item" label (item to be discussed at WG meeting) on Jan 4, 2022
@lessless
Author

Hi guys,

The more I think about it, the more I believe it could be a killer feature for the IoT sector, think Nerves and GRiSP.
Is there any meeting that I can join to discuss the work that needs to be done and what support I can get along with that?

@tsloughter
Collaborator

We have monthly meetings https://erlef.org/wg/build-and-packaging/calendar

It might also be something you want to bring up with the Embedded WG https://erlef.org/wg/embedded

@lessless
Author

@tsloughter thank you, I'll try to join the next one!
