Cleaning up Mesos from paasta readthedocs - PAASTA-18313 (#3954)
* Cleaning up Mesos from paasta readthedocs

* Address reviews

* Address more reviews

* Addressing yelpsoa files reviews
EmanElsaban authored Oct 16, 2024
1 parent 979ea52 commit f9e83e6
Showing 16 changed files with 213 additions and 718 deletions.
84 changes: 40 additions & 44 deletions docs/source/about/glossary.rst
@@ -1,37 +1,65 @@
Glossary
========

**App**
~~~~~~~~

Marathon app. A unit of configuration in Marathon. During normal
operation, one service "instance" maps to one Marathon app, but during
deploys there may be more than one app. Apps contain Tasks.

**Docker**
~~~~~~~~~~

Container `technology <https://www.docker.com/whatisdocker/>`_ that
PaaSTA uses.

**Kubernetes**
~~~~~~~~~~~~~~

`Kubernetes <https://kubernetes.io/>`_ (a.k.a. k8s) is the open-source system on which Yelp runs many compute workloads.
In Kubernetes, the control plane distributes tasks to servers called Kubelets (a.k.a. kube nodes or Kubernetes agents), which run them.

**Kubernetes Deployment**
~~~~~~~~~~~~~~~~~~~~~~~~~

A Kubernetes resource that represents a collection of pods running the same application. A Deployment is responsible for creating and updating instances of your application.
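
For a feel of what such a resource contains, here is a minimal, hand-written Deployment manifest. This is illustrative only -- PaaSTA generates the real manifests from soa-configs, and the names and image below are hypothetical::

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service-main        # hypothetical name
    spec:
      replicas: 3                       # desired number of identical Pods
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service        # must match the selector above
        spec:
          containers:
            - name: main
              image: example-registry/example-service:latest   # hypothetical image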

**Kubernetes Node**
~~~~~~~~~~~~~~~~~~~

A node is a worker machine in a Kubernetes cluster that runs Pods.
In our case, it's usually a virtual machine provisioned via AWS EC2 Fleets or AutoScalingGroups.

**Kubernetes Horizontal Pod Autoscaler (HPA)**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A Kubernetes feature that automatically scales the number of pods in a deployment based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics).
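
A sketch of the underlying ``autoscaling/v2`` resource, shown only to illustrate the mechanism (PaaSTA creates and manages these objects for you; all names and numbers here are hypothetical)::

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example-service-main        # hypothetical name
    spec:
      scaleTargetRef:                   # the Deployment being scaled
        apiVersion: apps/v1
        kind: Deployment
        name: example-service-main
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70    # scale to hold average CPU near 70%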

**clustername**
~~~~~~~~~~~~~~~

A shortname used to describe a PaaSTA cluster. Use ``paasta list-clusters``
to see them all.

**Kubernetes Pod**
~~~~~~~~~~~~~~~~~~~

Atomic deployment unit for PaaSTA workloads at Yelp and all Kubernetes clusters. A Pod can be thought of as one or more related
containers that share a network namespace. At Yelp, each Pod is an individual instance of one of our services; many Pods can run on each server.

**Kubernetes Namespace**
~~~~~~~~~~~~~~~~~~~~~~~~

Provides a mechanism for isolating groups of resources within a single cluster. Each Kubernetes Namespace can contain resources like
Pods and Deployments, and it allows management and access controls to be applied at the Namespace level.
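
A minimal Namespace manifest, for illustration (the name is hypothetical)::

    apiVersion: v1
    kind: Namespace
    metadata:
      name: paastasvc-example-service   # hypothetical; one service's resources could live here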

**instancename**
~~~~~~~~~~~~~~~~

Logical collection of Kubernetes Pods that comprise an application (a Kubernetes Deployment) deployed on Kubernetes. service
name + instancename = Kubernetes Deployment name. Examples: main, canary. Each instance represents a running
version of a service with its own configuration and resources.
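
As a sketch, two instances of one service might be declared side by side in soa-configs (the cluster file name and values are hypothetical)::

    # kubernetes-norcal-devc.yaml (hypothetical cluster file)
    main:
      instances: 10    # the "main" instance: one Deployment running ten Pods
    canary:
      instances: 1     # the "canary" instance: a separate Deployment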

**namespace**
~~~~~~~~~~~~~

An haproxy/SmartStack concept grouping backends that listen on a
particular port. A namespace may route to many healthy PaaSTA
instances. By default, the namespace in which a PaaSTA instance appears is
its instancename.

**Nerve**
@@ -40,32 +68,6 @@ its instancename.
A service announcement `daemon <https://github.com/airbnb/nerve>`_
that registers services in zookeeper to be discovered.

**Marathon**
~~~~~~~~~~~~

A `Mesos Framework <https://mesosphere.github.io/marathon/>`_
designed to deploy stateless services.

**Mesos**
~~~~~~~~~

A `Cluster/Scheduler <http://mesos.apache.org/>`_ that interacts
with other `Framework <https://docs.mesosphere.com/frameworks/>`_
software to run things on nodes.

**Mesos Master**
~~~~~~~~~~~~~~~~

A machine running a Mesos Master process, responsible for coordination
but not responsible for actually running Marathon or Tron jobs. There
are several Masters, coordinating as a quorum via Zookeeper.

**Mesos Slave**
~~~~~~~~~~~~~~~

A machine running a Mesos Slave process, responsible for running
Marathon or Tron jobs as assigned by the Mesos Master.

**PaaSTA**
~~~~~~~~~~

@@ -87,12 +89,6 @@ The brand name for Airbnb’s Nerve + Synapse service discovery solution.

A local haproxy daemon that runs on yocalhost

**Task**
~~~~~~~~

Marathon task. A process (usually inside a Docker container) running on
a machine (a Mesos Slave). One or more Tasks constitutes an App.

**soa-configs**
~~~~~~~~~~~~~~~

@@ -107,5 +103,5 @@ services.
**Zookeeper**
~~~~~~~~~~~~~

A distributed key/value store used by PaaSTA for coordination and
persistence.
2 changes: 1 addition & 1 deletion docs/source/about/paasta_principles.rst
Original file line number Diff line number Diff line change
@@ -54,7 +54,7 @@ a particular app in a theoretical PaaS:
+=============================================+=====================================+
| :: | :: |
| | |
| $ cat >kubernetes-cluster.yaml <<EOF | |
| web: | |
| env: | |
| PRODUCTION: true | $ paas config:set PRODUCTION=true |
161 changes: 14 additions & 147 deletions docs/source/about/smartstack_interaction.rst
Original file line number Diff line number Diff line change
@@ -1,15 +1,13 @@
SmartStack Service Discovery and PaaSTA Integration
===================================================

This document assumes some prior knowledge about SmartStack; see http://nerds.airbnb.com/smartstack-service-discovery-cloud/ for more information.

.. contents:: Table of Contents
:depth: 2

SmartStack Service Discovery and Latency Zones
----------------------------------------------

In SmartStack, a service can be configured to be *discovered* at a particular
latency zone.
@@ -35,143 +33,12 @@ A-C. This is great for latency -- only talk to habitats that are
topographically "nearby" -- but reduces availability since only three habitats
can be reached.

What Would Happen if PaaSTA Were Not Aware of SmartStack
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PaaSTA uses `Marathon <https://mesosphere.github.io/marathon/>`_ to deploy
long-running services. At Yelp, PaaSTA clusters are deployed at the
``superregion`` level. This means that a service could potentially be deployed
on any available host in that ``superregion`` that has resources to run it. If
PaaSTA were unaware of the Smartstack ``discover:`` settings, Marathon would
naively deploy tasks in a potentially "unbalanced" manner:

.. image:: unbalanced_distribution.svg
:width: 700px

With the naive approach, there is a total of six tasks for the superregion, but
four landed in ``region 1``, and two landed in ``region 2``. If
the ``discover`` setting were set to ``habitat``, there would be habitats
**without** tasks available to serve anything, likely causing an outage.

In a world with configurable SmartStack discovery settings, the deployment
system (Marathon) must be aware of these and deploy accordingly.

What A SmartStack-Aware Deployment Looks Like
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By taking advantage of
`Marathon Constraint Language <https://mesosphere.github.io/marathon/docs/constraints.html>`_
, specifically the
`GROUP_BY <https://mesosphere.github.io/marathon/docs/constraints.html#group_by-operator>`_
operator, Marathon can deploy tasks in such a way as to ensure a balanced number
of tasks in each latency zone.

Example: Balanced deployment to every habitat
*********************************************

For example, if the SmartStack setting
were ``discover: habitat`` [1]_, Marathon could enforce the constraint
``["habitat", "GROUP_BY"]``, which asks Marathon to distribute tasks
evenly between the habitats [2]_:

.. image:: balanced_distribution.svg
:width: 700px

Example: Deployment balanced to each region
*******************************************

Similarly, if the ``discover`` setting were set to ``region``, the equivalent
Marathon constraint would ensure an equal number of tasks distributed to each region.

.. image:: balanced_distribution_region.svg
:width: 700px

Even though there are some habitats in this diagram that lack the service, the
``discover: region`` setting allows clients to utilize *any* process as long
as it is in the local region. The Marathon constraint of ``["region", "GROUP_BY"]``
ensures that tasks are distributed equally over the regions, in this case three
in each.


.. [1] Technically PaaSTA should be using the smallest value of the ``advertise``
setting, tracked in `PAASTA-1253 <https://jira.yelpcorp.com/browse/PAASTA-1253>`_.
.. [2] Currently the ``instances:`` count represents the total number of
instances in the cluster. Eventually with `PAASTA-1254 <https://jira.yelpcorp.com/browse/PAASTA-1254>`_
the instance count will be a per-discovery-location setting, meaning there
will always be an equal number of instances per location. (With ``instances: 6``
and a ``discovery: habitat``, and three habitats, the total task count would be
18, 6 in each habitat.)

How SmartStack Settings Influence Monitoring
--------------------------------------------

If a service is in SmartStack, PaaSTA uses the same ``discover`` setting
referenced above to decide how the service should be monitored. When a service
author sets a particular setting, say ``discover: region``, it implies that the
system should enforce availability of that service in every region. If there
are regions that lack tasks to serve that service, then PaaSTA should alert.

Example: Checking Each Habitat When ``discover: habitat``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If SmartStack is configured to ``discover: habitat``, PaaSTA configures
Marathon to balance tasks to each habitat. But what if it is unable to do that?

.. image:: replication_alert_habitat.svg
:width: 700px

In this case, there are no tasks in habitat F. This is a problem because
``discover: habitat`` implies that any clients in habitat F will not
be able to find the service. It is *down* in habitat F.

To detect and alert on this, PaaSTA uses the ``discover`` setting to decide
which unique locations to look at (e.g. ``habitat``). PaaSTA iterates over
each unique location (e.g. habitats A-F) and inspects the replication levels
in each location. It finds that there is at least one habitat with too few
instances (habitat F, which has 0 out of 1) and alerts.

The output of the alert or ``paasta status`` looks something like this::

Smartstack:
habitatA - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatB - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatC - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatD - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatE - Healthy - in haproxy with (1/1) total backends UP in this namespace.
habitatF - Critical - in haproxy with (0/1) total backends UP in this namespace.

In this case the service authors have a few actions they can take:

- Increase the total instance count to have more tasks per habitat.
(In this example, each habitat contains a single point of failure!)
- Change the ``discovery`` setting to ``region`` to increase availability
at the cost of latency.
- Investigate *why* tasks can't run in habitat F.
(Lack of resources? Improper configs? Missing service dependencies?)

Example: Checking Each Region When ``discover: region``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If SmartStack is configured to ``discover: region``, PaaSTA configures
Marathon to balance tasks to each region. But what if it is unable to launch
all of the tasks, while some tasks are still running in that region?

.. image:: replication_noalert_region.svg
:width: 700px

The output of the alert or ``paasta status`` looks something like this::

Smartstack:
region1 - Healthy - in haproxy with (3/3) total backends UP in this namespace.
region2 - Warning - in haproxy with (2/3) total backends UP in this namespace.

Assuming a threshold of 50%, an alert would not be sent to the team in this case.

Even if some habitats do not have tasks for this service, ``discover: region``
ensures that clients can be satisfied by tasks in the same region if not by
tasks in the same habitat.
PaaSTA's SmartStack Unawareness and Pod Spreading Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PaaSTA is not natively aware of SmartStack. To make it aware (or, more specifically, to make the Kubernetes scheduler aware), we can use Pod Topology Spread Constraints.
To balance Pods across Availability Zones (AZs) in Kubernetes, we use `topology spread constraints <https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/>`_,
assigned to each instance of a service via the ``topology_spread_constraints`` key in soa-configs, as sketched below.

The Relationship Between Nerve "namespaces" and PaaSTA "instances"
------------------------------------------------------------------
@@ -189,9 +56,9 @@ components of the same service on different ports. In PaaSTA we call these
    api:
      proxy_port: 20002

The corresponding Kubernetes configuration in PaaSTA might look like this::

    #kubernetes.yaml
    main:
      instances: 10
      cmd: myserver.py
@@ -214,7 +81,7 @@ the same Nerve namespace. Consider this example::
    main:
      proxy_port: 20001

    #kubernetes.yaml
    main:
      instances: 10
      cmd: myserver.py
@@ -238,7 +105,7 @@ Sharding is another use case for using alternative namespaces::
    main:
      proxy_port: 20001

    #kubernetes.yaml
    shard1:
      instances: 10
      registrations: ['service.main']
10 changes: 5 additions & 5 deletions docs/source/autoscaling.rst
@@ -2,7 +2,7 @@
Autoscaling PaaSTA Instances
====================================

PaaSTA allows programmatic control of the number of replicas (Pods) a service has.
It uses Kubernetes' Horizontal Pod Autoscaler (HPA) to watch a service's load and scale up or down.

How to use autoscaling
@@ -24,9 +24,9 @@ This behavior may mean that your service is scaled up unnecessarily when you fir
Don't worry - the autoscaler will soon learn what the actual load on your service is, and will scale back down to the appropriate level.

If you use autoscaling it is highly recommended that you make sure your service has a readiness probe.
If your service is registered in Smartstack, each Pod automatically gets a readiness probe that checks whether that Pod is available in the service mesh.
Non-smartstack services may want to configure a ``healthcheck_mode``, and either ``healthcheck_cmd`` or ``healthcheck_uri`` to ensure they have a readiness probe.
The HPA will ignore the load on your Pods between when they first start up and when they are ready.
This ensures that the HPA doesn't incorrectly scale up due to this warm-up CPU usage.

Autoscaling parameters are stored in an ``autoscaling`` attribute of your instances as a dictionary.
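A minimal sketch of what that dictionary might look like in soa-configs (the ``metrics_provider`` and ``setpoint`` keys and every value here are illustrative assumptions rather than a definitive schema)::

    # kubernetes-norcal-devc.yaml (hypothetical cluster file)
    main:
      instances: 5
      autoscaling:
        metrics_provider: cpu   # one of the providers described below
        setpoint: 0.7           # utilization target the autoscaler tries to hold
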
@@ -66,7 +66,7 @@ The currently available metrics providers are:
Measures the CPU usage of your service's container.

:uwsgi:
With the ``uwsgi`` metrics provider, PaaSTA will configure your Pods to be scraped from your uWSGI master via its `stats server <http://uwsgi-docs.readthedocs.io/en/latest/StatsServer.html>`_.
With the ``uwsgi`` metrics provider, Paasta will configure your Pods to be scraped from your uWSGI master via its `stats server <http://uwsgi-docs.readthedocs.io/en/latest/StatsServer.html>`_.
We currently only support uwsgi stats on port 8889, and Prometheus will attempt to scrape that port.

.. note::
@@ -75,7 +75,7 @@


:gunicorn:
With the ``gunicorn`` metrics provider, PaaSTA will configure your Pods to run an additional container with the `statsd_exporter <https://github.com/prometheus/statsd_exporter>`_ image.
With the ``gunicorn`` metrics provider, Paasta will configure your Pods to run an additional container with the `statsd_exporter <https://github.com/prometheus/statsd_exporter>`_ image.
This sidecar will listen on port 9117 and receive stats from the gunicorn service. The ``statsd_exporter`` will translate the stats into Prometheus format, which Prometheus will scrape.

:active-requests: