diff --git a/docs/source/about/glossary.rst b/docs/source/about/glossary.rst
index 1bfd27857c..1ab62159a0 100644
--- a/docs/source/about/glossary.rst
+++ b/docs/source/about/glossary.rst
@@ -1,37 +1,65 @@
Glossary
========
-**App**
-~~~~~~~~
-
-Marathon app. A unit of configuration in Marathon. During normal
-operation, one service "instance" maps to one Marathon app, but during
-deploys there may be more than one app. Apps contain Tasks.
-
**Docker**
~~~~~~~~~~

Container `technology `_ that PaaSTA uses.

+**Kubernetes**
+~~~~~~~~~~~~~~
+
+`Kubernetes `_ (a.k.a. k8s) is the open-source system on which Yelp runs many compute workloads.
+In Kubernetes, tasks are scheduled by the Kubernetes control plane onto servers running the Kubelet agent (a.k.a. kube nodes or Kubernetes agents).
+
+**Kubernetes Deployment**
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A Kubernetes resource that represents a collection of Pods running the same application. A Deployment is responsible for creating and updating instances of your application.
+
+**Kubernetes Node**
+~~~~~~~~~~~~~~~~~~~
+
+A node is a worker machine in a Kubernetes cluster that runs Pods.
+In our case, it's usually a virtual machine provisioned via AWS EC2 Fleets or AutoScalingGroups.
+
+**Kubernetes Horizontal Pod Autoscaler (HPA)**
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A Kubernetes feature that automatically scales the number of Pods in a Deployment based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics).
+
**clustername**
~~~~~~~~~~~~~~~

A shortname used to describe a PaaSTA cluster. Use \`paasta list-clusters\` to see them all.

+**Kubernetes Pod**
+~~~~~~~~~~~~~~~~~~~
+
+The atomic deployment unit for PaaSTA workloads at Yelp (and for Kubernetes workloads in general): a collection of one or more related containers that share a network namespace.
+At Yelp, each Pod is an individual instance of one of our services, and many Pods can run on each server.
+
+**Kubernetes Namespace**
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A mechanism for isolating groups of resources within a single cluster. Each K8s Namespace can contain resources like
+Pods and Deployments, and it allows management and access controls to be applied at the Namespace level.
+
**instancename**
~~~~~~~~~~~~~~~~

-Logical collection of Mesos tasks that comprise a Marathon app. service
-name + instancename = Marathon app name. Examples: main, canary.
+Logical collection of Kubernetes Pods that comprise an application (a Kubernetes Deployment) deployed on Kubernetes. service
+name + instancename = Kubernetes Deployment name. Examples: main, canary. Each instance represents a running
+version of a service with its own configuration and resources.

**namespace**
~~~~~~~~~~~~~

An haproxy/SmartStack concept grouping backends that listen on a
-particular port. A namespace may route to many healthy Marathon
-instances. By default, the namespace in which a Marathon job appears is
+particular port. A namespace may route to many healthy PaaSTA
+instances. By default, the namespace in which a PaaSTA instance appears is
its instancename.

**Nerve**
@@ -40,32 +68,6 @@ its instancename.

A service announcement `daemon `_ that registers
services in zookeeper to be discovered.

-**Marathon**
-~~~~~~~~~~~~
-
-A `Mesos Framework `_
-designed to deploy stateless services.
-
-**Mesos**
-~~~~~~~~~
-
-A `Cluster/Scheduler `_ that interacts
-with other `Framework `_
-software to run things on nodes.
- -**Mesos Master** -~~~~~~~~~~~~~~~~ - -A machine running a Mesos Master process, responsible for coordination -but not responsible for actually running Marathon or Tron jobs. There -are several Masters, coordinating as a quorum via Zookeeper. - -**Mesos Slave** -~~~~~~~~~~~~~~~ - -A machine running a Mesos Slave process, responsible for running -Marathon or Tron jobs as assigned by the Mesos Master. - **PaaSTA** ~~~~~~~~~~ @@ -87,12 +89,6 @@ The brand name for Airbnb’s Nerve + Synapse service discovery solution. A local haproxy daemon that runs on yocalhost -**Task** -~~~~~~~~ - -Marathon task. A process (usually inside a Docker container) running on -a machine (a Mesos Slave). One or more Tasks constitutes an App. - **soa-configs** ~~~~~~~~~~~~~~~ @@ -107,5 +103,5 @@ services. **Zookeeper** ~~~~~~~~~~~~~ -A distributed key/value store used by Mesos for coordination and +A distributed key/value store used by PaaSTA for coordination and persistence. diff --git a/docs/source/about/paasta_principles.rst b/docs/source/about/paasta_principles.rst index ee7fbe404c..7ad5baac39 100644 --- a/docs/source/about/paasta_principles.rst +++ b/docs/source/about/paasta_principles.rst @@ -54,7 +54,7 @@ a particular app in a theoretical PaaS: +=============================================+=====================================+ | :: | :: | | | | -| $ cat >marathon-cluster.yaml <kubernetes-cluster.yaml <`_ to deploy -long-running services. At Yelp, PaaSTA clusters are deployed at the -``superregion`` level. This means that a service could potentially be deployed -on any available host in that ``superregion`` that has resources to run it. If -PaaSTA were unaware of the Smartstack ``discover:`` settings, Marathon would -naively deploy tasks in a potentially "unbalanced" manner: - -.. image:: unbalanced_distribution.svg - :width: 700px - -With the naive approach, there is a total of six tasks for the superregion, but -four landed in ``region 1``, and two landed in ``region 2``. If -the ``discover`` setting were set to ``habitat``, there would be habitats -**without** tasks available to serve anything, likely causing an outage. - -In a world with configurable SmartStack discovery settings, the deployment -system (Marathon) must be aware of these and deploy accordingly. - -What A SmartStack-Aware Deployment Looks Like -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -By taking advantage of -`Marathon Constraint Language `_ -, specifically the -`GROUP_BY `_ -operator, Marathon can deploy tasks in such a way as to ensure a balanced number -of tasks in each latency zone. - -Example: Balanced deployment to every habitat -********************************************* - -For example, if the SmartStack setting -were ``discover: habitat`` [1]_, we Marathon could enforce the constraint -``["habitat", "GROUP_BY"]``, which will ask Marathon to distribute tasks -evenly between the habitats[2]_: - -.. image:: balanced_distribution.svg - :width: 700px - -Example: Deployment balanced to each region -******************************************* - -Similarly, if the ``discover`` setting were set to ``region``, the equivalent -Marathon constraint would ensure an equal number of tasks distributed to each region. - -.. image:: balanced_distribution_region.svg - :width: 700px - -Even though there some habitats in this diagram that lack the service, the -``discover: region`` setting allows clients to utilize *any* process as long -as it is in the local region. 
The Marathon constraint of ``["region", "GROUP_BY"]`` -ensures that tasks are distributed equally over the regions, in this case three -in each. - - -.. [1] Technically PaaSTA should be using the smallest value of the ``advertise`` - setting, tracked in `PAASTA-1253 `_. -.. [2] Currently the ``instances:`` count represents the total number of - instances in the cluster. Eventually with `PAASTA-1254 `_ - the instance count will be a per-discovery-location setting, meaning there - will always be an equal number of instances per location. (With ``instances: 6`` - and a ``discovery: habitat``, and three habitats, the total task count would be - 18, 6 in each habitat.) - - -How SmartStack Settings Influence Monitoring --------------------------------------------- - -If a service is in SmartStack, PaaSTA uses the same ``discover`` setting -referenced above to decide how the service should be monitored. When a service -author sets a particular setting, say ``discover: region``, it implies that the -system should enforce availability of that service in every region. If there -are regions that lack tasks to serve that service, then PaaSTA should alert. - -Example: Checking Each Habitat When ``discover: habitat`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If SmartStack is configured to ``discover: habitat``, PaaSTA configures -Marathon to balance tasks to each habitat. But what if it is unable to do that? - -.. image:: replication_alert_habitat.svg - :width: 700px - -In this case, there are no tasks in habitat F. This is a problem because -``discover: habitat`` implies that any clients in habitat F will not -be able to find the service. It is *down* in habitat F. - -To detect and alert on this, PaaSTA uses the ``discover`` setting to decide -which unique locations to look at (e.g. ``habitat``). Paasta iterates over -each unique location (e.g. habitats A-F) and inspects the replication levels -in each location. It finds that there is at least one habitat with too few -instances (habitat F, which has 0 out of 1) and alerts. - -The output of the alert or ``paasta status`` looks something like this:: - - Smartstack: - habitatA - Healthy - in haproxy with (1/1) total backends UP in this namespace. - habitatB - Healthy - in haproxy with (1/1) total backends UP in this namespace. - habitatC - Healthy - in haproxy with (1/1) total backends UP in this namespace. - habitatD - Healthy - in haproxy with (1/1) total backends UP in this namespace. - habitatE - Healthy - in haproxy with (1/1) total backends UP in this namespace. - habitatF - Critical - in haproxy with (0/1) total backends UP in this namespace. - -In this case the service authors have a few actions they can take: - -- Increase the total instance count to have more tasks per habitat. - (In this example, each habitat contains a single point of failure!) -- Change the ``discovery`` setting to ``region`` to increase availability - at the cost of latency. -- Investigate *why* tasks can't run in habitat F. - (Lack of resources? Improper configs? Missing service dependencies?) - -Example: Checking Each Region When ``discover: region`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If SmartStack is configured to ``discover: region``, PaaSTA configures -Marathon to balance tasks to each region. But what if it is unable to launch -all the tasks, but there were tasks running in that region? - -.. 
image:: replication_noalert_region.svg
-   :width: 700px
-
-The output of the alert or ``paasta status`` looks something like this::
-
-    Smartstack:
-      region1 - Healthy - in haproxy with (3/3) total backends UP in this namespace.
-      region2 - Warning - in haproxy with (2/3) total backends UP in this namespace.
-
-Assuming a threshold of 50%, an alert would not be sent to the team in this case.
-
-Even if some habitats do not have tasks for this service, ``discover: region``
-ensures that clients can be satisfied by tasks in the same region if not by
-tasks in the same habitat.
+PaaSTA's SmartStack Unawareness and Pod Spreading Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+PaaSTA is not natively aware of SmartStack; to make it (or, more specifically, the Kubernetes scheduler) aware, we can use Pod Topology Spread Constraints.
+To balance Pods across Availability Zones (AZs) in Kubernetes, we use `topology spread constraints `_, assigned
+per instance of a service via the "topology_spread_constraints" key in soa-configs.

The Relationship Between Nerve "namespaces" and PaaSTA "instances"
------------------------------------------------------------------
@@ -189,9 +56,9 @@ components of the same service on different ports. In PaaSTA we call these
      api:
        proxy_port: 20002

-The corresponding Marathon configuration in PaaSTA might look like this::
+The corresponding Kubernetes configuration in PaaSTA might look like this::

-    #marathon.yaml
+    #kubernetes.yaml
    main:
      instances: 10
      cmd: myserver.py
@@ -214,7 +81,7 @@ the same Nerve namespace. Consider this example::
      main:
        proxy_port: 20001

-    #marathon.yaml
+    #kubernetes.yaml
    main:
      instances: 10
      cmd: myserver.py
@@ -238,7 +105,7 @@ Sharding is another use case for using alternative namespaces::
      main:
        proxy_port: 20001

-    #marathon.yaml
+    #kubernetes.yaml
    shard1:
      instances: 10
      registrations: ['service.main']
diff --git a/docs/source/autoscaling.rst b/docs/source/autoscaling.rst
index d4d10ab35e..821f8f3696 100644
--- a/docs/source/autoscaling.rst
+++ b/docs/source/autoscaling.rst
@@ -2,7 +2,7 @@
Autoscaling PaaSTA Instances
====================================

-PaaSTA allows programmatic control of the number of replicas (pods) a service has.
+PaaSTA allows programmatic control of the number of replicas (Pods) a service has.
It uses Kubernetes' Horizontal Pod Autoscaler (HPA) to watch a service's load and scale up or down.

How to use autoscaling
@@ -24,9 +24,9 @@ This behavior may mean that your service is scaled up unnecessarily when you fir
Don't worry - the autoscaler will soon learn what the actual load on your service is, and will scale back down to the appropriate level.

If you use autoscaling it is highly recommended that you make sure your service has a readiness probe.
-If your service is registered in Smartstack, each pod automatically gets a readiness probe that checks whether that pod is available in the service mesh.
+If your service is registered in Smartstack, each Pod automatically gets a readiness probe that checks whether that Pod is available in the service mesh.
Non-smartstack services may want to configure a ``healthcheck_mode``, and either ``healthcheck_cmd`` or ``healthcheck_uri`` to ensure they have a readiness probe.
-The HPA will ignore the load on your pods between when they first start up and when they are ready.
+The HPA will ignore the load on your Pods between when they first start up and when they are ready.
This ensures that the HPA doesn't incorrectly scale up due to this warm-up CPU usage.
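+For example, a non-Smartstack instance that autoscales on CPU could combine the two ideas above. The following is only a sketch with hypothetical values (``healthcheck_mode`` and ``healthcheck_uri`` give the Pod its readiness probe; the ``autoscaling`` dictionary is described below)::
+
+    # kubernetes-<clustername>.yaml
+    main:
+      min_instances: 2
+      max_instances: 10
+      healthcheck_mode: http
+      healthcheck_uri: /status
+      autoscaling:
+        metrics_provider: cpu
+        setpoint: 0.8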
Autoscaling parameters are stored in an ``autoscaling`` attribute of your instances as a dictionary. @@ -66,7 +66,7 @@ The currently available metrics providers are: Measures the CPU usage of your service's container. :uwsgi: - With the ``uwsgi`` metrics provider, Paasta will configure your pods to be scraped from your uWSGI master via its `stats server `_. + With the ``uwsgi`` metrics provider, Paasta will configure your Pods to be scraped from your uWSGI master via its `stats server `_. We currently only support uwsgi stats on port 8889, and Prometheus will attempt to scrape that port. .. note:: @@ -75,7 +75,7 @@ The currently available metrics providers are: :gunicorn: - With the ``gunicorn`` metrics provider, Paasta will configure your pods to run an additional container with the `statsd_exporter `_ image. + With the ``gunicorn`` metrics provider, Paasta will configure your Pods to run an additional container with the `statsd_exporter `_ image. This sidecar will listen on port 9117 and receive stats from the gunicorn service. The ``statsd_exporter`` will translate the stats into Prometheus format, which Prometheus will scrape. :active-requests: diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst index 3264867a0b..b1d1133621 100644 --- a/docs/source/contributing.rst +++ b/docs/source/contributing.rst @@ -22,14 +22,14 @@ You can run ``make itest`` to execute them. Example Cluster ^^^^^^^^^^^^^^^^^ There is a docker compose configuration based on our itest containers that you -can use to run the paasta code against a semi-realistic cluster whilst you are +can use to run the PaaSTA code against a semi-realistic cluster whilst you are developing. More instructions `here <./installation/example_cluster.html>`_ System Package Building / itests ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PaaSTA is distributed as a debian package. This package can be built and tested -with ``make itest_xenial``. These tests make assertions about the +with ``make itest_``. These tests make assertions about the packaging implementation. @@ -71,12 +71,3 @@ it is a little tricky. * ``eval "$(.tox/py27/bin/register-python-argcomplete ./tox/py27/bin/paasta)"`` * There is a simple integration test. See the itest/ folder. - -Upgrading Components --------------------- - -As things progress, there will come a time that you will have to upgrade -PaaSTA components to new versions. - -* See `Upgrading Mesos `_ for how to upgrade Mesos safely. -* See `Upgrading Marathon `_ for how to upgrade Marathon safely. diff --git a/docs/source/generated/paasta_tools.rst b/docs/source/generated/paasta_tools.rst index 7ab576c7c9..de15c7ecf5 100644 --- a/docs/source/generated/paasta_tools.rst +++ b/docs/source/generated/paasta_tools.rst @@ -13,7 +13,6 @@ Subpackages paasta_tools.frameworks paasta_tools.instance paasta_tools.kubernetes - paasta_tools.mesos paasta_tools.metrics paasta_tools.monitoring paasta_tools.paastaapi @@ -71,9 +70,6 @@ Submodules paasta_tools.log_task_lifecycle_events paasta_tools.long_running_service_tools paasta_tools.mac_address - paasta_tools.marathon_dashboard - paasta_tools.mesos_maintenance - paasta_tools.mesos_tools paasta_tools.monitoring_tools paasta_tools.monkrelaycluster_tools paasta_tools.nrtsearchservice_tools diff --git a/docs/source/index.rst b/docs/source/index.rst index 2227f36658..4e9152eabc 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -34,7 +34,7 @@ PaaSTA Development .. 
toctree::
   :maxdepth: 2

   paasta_development
   contributing
   style_guide
   upgrading_marathon
diff --git a/docs/source/installation/example_cluster.rst b/docs/source/installation/example_cluster.rst
index a612783e22..2c7383da37 100644
--- a/docs/source/installation/example_cluster.rst
+++ b/docs/source/installation/example_cluster.rst
@@ -24,11 +24,6 @@ everything with ``docker-compose down && docker-compose run playground``.
Getting Started
---------------
-Mesos
-~~~~~
-To launch a running Mesos cluster, then run ``docker-compose run playground``
-and you'll be dropped into a shell with the paasta\_tools package installed in development mode.
-
Kubernetes
~~~~~~~~~~
To instead launch a Kubernetes cluster, run
@@ -47,9 +42,7 @@ Try it out
The cluster includes a git remote and docker registry. The git remote contains
an example repo but you can add more if you want.

-The mesos and marathon webuis are exposed on your docker host
-on port 5050, 8080, 8081. So load them up if you want to watch. Then in
-the playground container:
+In the playground container:

::

@@ -63,9 +56,8 @@ the playground container:
Scaling The Cluster
-------------------
-If you want to add more capacity to the cluster, you can increase the number of Mesos agents/Kubernetes Nodes:
+If you want to add more capacity to the cluster, you can increase the number of Kubernetes Nodes:
-``docker-compose scale mesosslave=4``
or
``docker-compose scale kubernetes=4``

@@ -79,9 +71,8 @@ Some but not all of the paasta command line tools should work.
Try:

   paasta status -s hello-world

Scribe is not included with this example cluster. If you are looking for
-logs, check ``/var/logs/paasta_logs`` and syslog on the mesosmaster for
-the output from cron. Also note that all the slaves share the host's
-docker daemon.
+logs, check syslog on the Kubernetes node that the Pod is running on for the output from cron.
+You can get the host the Pod is running on by adding "-v" to the command above.

Cleanup
-------
diff --git a/docs/source/installation/getting_started.rst b/docs/source/installation/getting_started.rst
index 13d562de2d..0a2bfc555b 100644
--- a/docs/source/installation/getting_started.rst
+++ b/docs/source/installation/getting_started.rst
@@ -33,9 +33,7 @@ are currently not available, so one must build them and install them manually::
    make itest_xenial
    sudo dpkg -i dist/paasta-tools*.deb

-This package must be installed anywhere the PaaSTA CLI and on the Mesos/Marathon
-masters. If you are using SmartStack for service discovery, then the package must
-be installed on the Mesos Slaves as well so they can query the local API.
+This package must be installed anywhere the PaaSTA CLI is needed and on the kube nodes.

Once installed, ``paasta_tools`` reads global configuration from ``/etc/paasta/``.
This configuration is in key/value form encoded as JSON. All files in ``/etc/paasta``
@@ -76,7 +74,7 @@ Docker and a Docker Registry

PaaSTA uses `Docker `_ to build and distribute code for each service. PaaSTA
assumes that a single registry is available and that the associated components
-(Docker commands, unix users, mesos slaves, etc) have the correct credentials
+(Docker commands, unix users, Kubernetes Nodes, etc) have the correct credentials
to use it.

The docker registry needs to be defined in a config file in ``/etc/paasta/``. The
@@ -91,34 +89,24 @@ filename is irrelevant, but here would be an example
There are many registries available to use, or you can `host your own `_.
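+For reference, such a config file (e.g. a hypothetical ``/etc/paasta/docker_registry.json``) might contain nothing more than the ``docker_registry`` key described in the system configs documentation::
+
+    {
+      "docker_registry": "docker-paasta.yelpcorp.com:443"
+    }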
-Mesos
------
-
-PaaSTA uses Mesos to do the heavy lifting of running the actual services on
-pools of machines. See the `official documentation `_
-on how to get started with Mesos.
-
-Marathon
---------
+Kubernetes
+----------

-PaaSTA uses `Marathon `_ for supervising long-running services running in Mesos.
-See the `official documentation `__ for how to get started with Marathon.
-Then, see the `PaaSTA documentation <../yelpsoa_configs.html#marathon-clustername-yaml>`_ for how to define Marathon
-jobs.
+PaaSTA uses `Kubernetes `_ to manage and orchestrate its containerized services.
+See the `PaaSTA documentation <../yelpsoa_configs.html#kubernetes-clustername-yaml>`_ for how to define PaaSTA
+services in Kubernetes.

-Once Marathon jobs are defined in soa-configs, there are a few tools provided by PaaSTA
-that interact with the Marathon API:
+Once PaaSTA services are defined in soa-configs, there are a few tools provided by PaaSTA
+that interact with the Kubernetes API:

-* ``deploy_marathon_services``: Does the initial sync between soa-configs and the Marathon API.
-  This is the tool that handles "bouncing" to new version of code, and resizing Marathon applications when autoscaling
+* ``setup_kubernetes_job``: Does the initial sync between soa-configs and the Kubernetes API.
+  This is the tool that handles "bouncing" to a new version of code, and resizing Kubernetes Deployments when autoscaling
   is enabled.
-  This is idempotent, and should be run periodically on a box with a ``marathon.json`` file in the
-  `system paasta config <../system_configs.html>`_ directory (Usually ``/etc/paasta``).
-  We recommend running this frequently - delays between runs of this command will limit how quickly new versions of
-  services or changes to soa-configs are picked up.
-* ``cleanup_marathon_jobs``: Cleans up lost or abandoned services. This tool
-  looks for Marathon jobs that are *not* defined in soa-configs and removes them.
-* ``check_marathon_services_replication``: Iterates over all Marathon services
+  This is idempotent, and is run periodically on a box with a ``deployments.json`` file in the
+  ``/nail/etc/services`` directory, updating or creating the Kubernetes Deployment object representing the modified service instance.
+* ``cleanup_kubernetes_jobs``: Cleans up lost or abandoned services. This tool
+  looks for Kubernetes instances that are *not* defined in soa-configs and removes them.
+* ``check_kubernetes_services_replication``: Iterates over all Kubernetes services
   and inspects their health. This tool integrates with the monitoring
   infrastructure and will alert the team responsible for the service if it
   becomes unhealthy to the point where manual intervention is required.
@@ -128,7 +116,7 @@ SmartStack and Hacheck

`SmartStack `_ is a dynamic service discovery system
that allows clients to find and route to
-healthy mesos tasks for a particular service.
+healthy Kubernetes Pods for a particular service.
Smartstack consists of two agents: `nerve `_ and `synapse `_.
Nerve is responsible for health-checking services and registering them in ZooKeeper. Synapse then reads
that data from ZooKeeper and configures an HAProxy instance.
@@ -137,7 +125,7 @@ To manage the configuration of nerve (detecting which services are running on a
we have a package called `nerve-tools `_.
This repo builds a .deb package, and should be installed on all slaves.
Each slave should run ``configure_nerve`` periodically.
-We recommend this runs quite frequently (we run it every 5s), since Marathon tasks created by Paasta are not available
+We recommend this runs quite frequently (we run it every 5s), since Kubernetes Pods created by PaaSTA are not available
to clients until nerve is reconfigured.

Similarly, to manage the configuration of synapse, we have a package called `synapse-tools `_.
diff --git a/docs/source/isolation.rst b/docs/source/isolation.rst
index f118361f19..1f692db17b 100644
--- a/docs/source/isolation.rst
+++ b/docs/source/isolation.rst
@@ -1,27 +1,25 @@
==============================================
-Resource Isolation in PaaSTA, Mesos and Docker
+Resource Isolation in PaaSTA, Kubernetes and Docker
==============================================

PaaSTA instance definitions include fields that specify the required resources
-for your service. The reason for this is two-fold: firstly, so that whichever
-Mesos framework can evaluate which Mesos agent making
-offers have enough capacity to run the task (and pick one of the agents
-accordingly); secondly, so that tasks can be protected from especially noisy
-neighbours on a box. That is, if a task under-specifies the resources it
+for your service. The reason for this is two-fold: firstly, so that the Kubernetes scheduler
+can evaluate which Kubernetes nodes in the specified cluster have enough capacity to schedule the Kubernetes Pods (representing PaaSTA instances) on;
+secondly, so that the Pods can be protected from especially noisy
+neighbours on a box. That is, if a Pod under-specifies the resources it
requires to run, or in another case, has a bug that means that it consumes far
-more resources than it *should* require, then the offending tasks can be
+more resources than it *should* require, then the offending Pods can be
isolated effectively, preventing them from having a negative impact on its
neighbours.

-This document is designed to give a more detailed review of how Mesos
-Frameworks such as Marathon use these requirements to run tasks on
-different Mesos agents, and how these isolation mechanisms are implemented.
+This document is designed to give a more detailed review of how Kubernetes
+uses these requirements to run Pods on different Kubernetes nodes, and how these isolation mechanisms are implemented.

Note: Knowing the details of these systems isn't a requirement of using PaaSTA;
most service authors may never need to know the details of such things. In
fact, one of PaaSTA's primary design goals is to *hide* the minutiae of
schedulers and resource isolation. However, this may benefit administrators
-of PaaSTA (and, more generally, Mesos clusters), and the simply curious.
+of PaaSTA (and, more generally, Kubernetes clusters), and the simply curious.

Final note: The details herein may, nay, will contain (unintended) inaccuracies.
If you notice such a thing, we'd be super grateful if you could open a pull
@@ -31,64 +29,51 @@ How Tasks are Scheduled on Hosts
--------------------------------

To first understand how these resources are used, one must understand how
-a task is run on a Mesos cluster.
-
-Mesos can run in two modes: Master and Agent. When a node is running Mesos in
-Master mode, it is responsible for communicating between agent processes and
-frameworks. A Framework is a program which wants to run tasks on the Mesos
-cluster.
-
-A master is responsible for presenting frameworks with resource offers.
-Resource offers are compute resource free for a framework to run a task.
The -details of that compute resource comes from the agent nodes, which regularly -tell the Master agent the resources it has available for running tasks. Using -the correct parlance, Mesos agents make 'offers' to the master. - -Once a master node receives offers from an agent, it forwards it to -a framework. Resource offers are split between frameworks according to -the master's configuration - there may be particular priority given -to some frameworks. - -At Yelp, we treat the frameworks we run (at the time of writing, Marathon and -Tron) equally. That means that frameworks *should* have offers distributed -between them evenly, and all tasks are considered equal. - -It is then up to the framework to decide what it wants to do with an offer. -The framework may decide to: - - * Reject the offer, if the framework has no tasks to run. - * Reject the offer, if the resources included in the offer are not enough to - match those required by the application. - * Reject the offer, if attributes on the slave conflict with any constraints - set by the task. - * Accept the offer, if there is a task that requires resources less than or - equal to the resources offered by the Agent. - -When rejecting an offer, the framework may apply a 'filter' to the offer. This -filter is then used by the Mesos master to ensure that it does *not* resend -offers that are 'filtered' by a framework. The default filter applied includes -a timeout - a Master will not resend an offer to a framework for a period of 5 -seconds. - -If a framework decides it wants to accept a resource offer, it then tells the -master to run a task on the agent. The details of the 'acceptance' include a -detail of the task to be run, and the 'executor' used to run the task. - -By default, PaaSTA uses the 'Docker' executor everywhere. This means that *all* -tasks launched by Marathon and Tron are done so with a Docker container. - -How Tasks are isolated from each other. ---------------------------------------- - -Given that a slave may run multiple tasks, we need to ensure that tasks cannot +a Pod is run on a Kubernetes cluster. + +Kubernetes has two types of nodes: Master and worker nodes. The master nodes are +responsible for managing the cluster. + +The master node contains the following components: + + * API Server: Exposes the Kubernetes API. It is the front-end for the Kubernetes control plane. + * Scheduler: Responsible for distributing workloads across multiple nodes. + * Controller Manager: Responsible for regulating the state of the cluster. + +Worker nodes are the machines that run the workload. Each worker node runs the following components +to manage the execution and networking of containers: + + * Kubelet: An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod. + * Kube-proxy: Maintains network rules on nodes. These network rules allow network communication to Pods from network sessions inside or outside of the cluster. + * Container runtime: The software that is responsible for running containers. Kubernetes supports several container runtimes: Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI (Container Runtime Interface). + + +When a new Pod (representing a PaaSTA instance) is created, the Kubernetes scheduler (kube-scheduler) will assign it to the best node for it to run on. +The scheduler will take into account the resources required by the Pod, the resources available on the nodes, and any constraints that are specified. 
It takes the following
+criteria into account when selecting a node to have the Pod run on:
+
+  * Resource requirements: Checks if nodes have enough CPU, memory, and other resources requested by the Pod.
+  * Node affinity: Checks if the Pod should be scheduled on a node that has a specific label.
+  * Inter-Pod affinity/anti-affinity: Checks if the Pod should be scheduled near or far from another Pod.
+  * Taints and tolerations: Checks if the Pod should be scheduled on a node that has a specific taint.
+  * Node selectors: Checks if the Pod should be scheduled on a node that has a specific label.
+  * Custom policies: Any custom scheduling policies or priorities, such as the Pod Topology Spread Constraints set by the "topology_spread_constraints" key.
+
+The scheduler will then score each node that can host the Pod, based on the criteria above and any custom policies, and then select the node
+with the highest score to run the Pod on. If multiple nodes have the same highest score then one of them is chosen randomly. Once a node is selected, the scheduler assigns
+the Pod to the node and the decision is then communicated back to the API server, which in turn notifies the Kubelet on the chosen node to start the Pod.
+For more information on how the scheduler works, see the `Kubernetes documentation <https://kubernetes.io/docs/concepts/scheduling/scheduling-framework/>`_.
+
+How PaaSTA services are isolated from each other
+------------------------------------------------
+
+Given that a node may run multiple Pods for PaaSTA services, we need to ensure that Pods cannot
'interfere' with one another. We do this on a file system level using Docker -
processes launched in Docker containers are protected from each other and the
host by using kernel namespaces. Note that the use of kernel namespaces is a
-feature of Docker - PaaSTA doesn't do anything 'extra' to enable this. It's
-also worth noting that there are other 'container' technologies that could
-provide this - the native Mesos 'containerizer' included.
+feature of Docker - PaaSTA doesn't do anything 'extra' to enable this.

-However, these tasks are still running and consuming resources on the same
+However, these Pods are still running and consuming resources on the same
host. The next section aims to explain how PaaSTA services are protected from
so-called 'noisy neighbours' that can starve others from resources.

@@ -130,21 +115,20 @@ If the processes in the cgroup reaches the ``memsw.limit_in_bytes`` value ,
then the kernel will invoke the OOM killer, which in turn will kill off one of
the processes in the cgroup (often, but not always, this is the biggest
contributor to the memory usage). If this is the only process running in the
-Docker container, then the container will die. The mesos framework which
-launched the task may or may not decide to try and start the same task
-elsewhere.
+Docker container, then the container will die. Kubernetes will restart the container
+as the RestartPolicy for the container is set to "Always".

CPUs
""""

CPU enforcement is implemented slightly differently. Many people expect the
value defined in the ``cpus`` field in a service's soa-configs to map to a
-number of cores that are reserved for a task. However, isolating CPU time like
+number of cores that are reserved for a Pod. However, isolating CPU time like
this can be particularly wasteful; unless a task spends 100% of its time on
-CPU (and thus has *no* I/O), then there is no need to prevent other tasks from
+CPU (and thus has *no* I/O), then there is no need to prevent other Pods from
running on the spare CPU time available.

-Instead, the CPU value is used to give tasks a relative priority. This priority
+Instead, the CPU value is used to give Pods a relative priority. This priority
is used by the Linux Scheduler decide the order in which to run waiting
threads.

@@ -170,17 +154,11 @@ Some notes on this:
   against the share available for another. The result of this may be that a
   higher number of 'skinny' containers may be preferable to 'fat' containers.

-This is different from how Mesos and Marathon use the CPU value when evaluating
-whether a task 'fits' on a host. Yelp configures agents to advertise the number
-of cores on the box, and Marathon will only schedule containers on agents where
-there is enough 'room' on the host, when in reality, there is no such limit.
-
Disk
"""""

-Unfortunately, the isolator provided by Mesos does not support isolating disk
-space used by Docker containers; that is, we have no way of limiting the amount
-of disk space used by a task. Our best effort is to ensure that the disk space
-is part of the offer given by a given Mesos agent to frameworks, and ensure
-that any services we know to use high disk usage (such as search indexes) have
-the ``disk`` field set appropriately in their configuration.
+Kubernetes supports disk resource isolation through the use of storage quotas. Kubernetes
+will periodically poll for usage, so it is possible to temporarily exceed the configured
+limit. When Kubernetes sees that a container has exceeded its limit, it will evict (i.e., kill) the offending Pod, thereby deleting the container's filesystem and reclaiming the used disk.
+
+NOTE: this usage calculation takes into consideration node-level container logs (i.e., container logs for stdout/stderr stored on-host to power things like ``kubectl logs``) - if an application is particularly "chatty" with its output, the ``disk`` allocation in soa-configs will need to take this into account.
diff --git a/docs/source/paasta_development.rst b/docs/source/paasta_development.rst
index cf74c49bd3..a961fad452 100644
--- a/docs/source/paasta_development.rst
+++ b/docs/source/paasta_development.rst
@@ -107,7 +107,7 @@ If you didn't run ``setup_kubernetes_job`` to deploy ``compute-infra-test-servic

1. Using ``launch.json`` file

-   1. From the ``Run and Debug`` tab in VS Code, press on ``paasta playground``. This will run all PaaSTA components.
+   1. From the ``Run and Debug`` tab in VS Code, press on ``PaaSTA playground``. This will run all PaaSTA components.

2. Using make targets
diff --git a/docs/source/soa_configs.rst b/docs/source/soa_configs.rst
index 83054be5a4..932ea17777 100644
--- a/docs/source/soa_configs.rst
+++ b/docs/source/soa_configs.rst
@@ -22,15 +22,13 @@ directory. There is one folder per service. Here is an example tree::

    ├── api
    │   ├── adhoc-prod.yaml
    │   ├── deploy.yaml
-    │   ├── marathon-dev.yaml
-    │   ├── marathon-prod.yaml
    │   ├── monitoring.yaml
    │   ├── service.yaml
    │   ├── smartstack.yaml
    │   └── tron-prod.yaml
    ...

-See the `paasta-specific soa-configs documentation `_ for more information
+See the `PaaSTA-specific soa-configs documentation `_ for more information
about the structure and contents of some example files in soa-configs that PaaSTA uses.
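+Long-running instances for a service like ``api`` above are defined in ``kubernetes-[clustername].yaml`` files. A minimal sketch with hypothetical values (the available keys are covered in the yelpsoa-configs documentation)::
+
+    # api/kubernetes-norcal-devc.yaml (cluster name is only an example)
+    main:
+      instances: 3
+      cpus: 1
+      mem: 2048
+      deploy_group: norcal-devc.main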
For more information about why we chose this method of config distribution, diff --git a/docs/source/style_guide.rst b/docs/source/style_guide.rst index 0c0eb5941e..8157422f27 100644 --- a/docs/source/style_guide.rst +++ b/docs/source/style_guide.rst @@ -47,9 +47,9 @@ Bad: * Anything going to scribe should ALSO go to stdout. Good: - * setup_marathon_job => general output to stdout, app-specific output to scribe + * setup_kubernetes_job => general output to stdout, app-specific output to scribe Bad: - * setup_marathon_job | stdint2scribe (no selective filtering, raw stdout dump) + * setup_kubernetes_job | stdint2scribe (no selective filtering, raw stdout dump) Good: * paasta itest => Sends summary of pass or fail to scribe event log. Sends full output of the run to the scribe debug log @@ -79,7 +79,7 @@ Event Level General Guidelines: * All event-level scribe logs should be as terse as possible while still providing a high level summary of the events occurring in the infrastructure. * All state changing events MUST have at least one event-level scribe log line emitted. -* It is not necessary to repeat redundant information, like service name, as all paasta log invocations already are service-specific anyway. +* It is not necessary to repeat redundant information, like service name, as all PaaSTA log invocations already are service-specific anyway. * All event level logs SHOULD use active verbs to indicate the action that took place. * Log lines SHOULD NOT contain the log level that they are using *in* the log line. Don't try to emulate syslog. * If an external URL with more context is available, the log line SHOULD reference it, but only if an error or warning is detected. @@ -104,7 +104,7 @@ Debug Level Debug Level General Guidelines: -* Viewing Debug level logs SHOULD NOT be necessary under normal paasta operation. +* Viewing Debug level logs SHOULD NOT be necessary under normal PaaSTA operation. * Debug logs are for providing additional context when things go wrong. * Debug logs should still use active verbs and not repeat redundant information if possible. * All debug-level logs should also go to stderr. diff --git a/docs/source/system_configs.rst b/docs/source/system_configs.rst index 9edfdf4e43..bca64f44b9 100644 --- a/docs/source/system_configs.rst +++ b/docs/source/system_configs.rst @@ -2,7 +2,7 @@ System Paasta Configs ===================== The "System Paasta Configs" inform Paasta about your environment and cluster setup, such as how to connect to -Marathon/hacheck/etc, what the cluster name is, etc. +Kubernetes/hacheck/etc, what the cluster name is, etc. Structure @@ -26,10 +26,7 @@ Configuration options These are the keys that may exist in system configs: - * ``zookeeper``: A zookeeper connection url, used for discovering where the Mesos leader is, and some locks. - Example: ``"zookeeper": "zk://zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/mesos"``. - - * ``docker_registry``: The name of the docker registry where paasta images will be stored. This can optionally + * ``docker_registry``: The name of the docker registry where PaaSTA images will be stored. 
This can optionally be set on a per-service level as well, see `yelpsoa_configs `_ Example: ``"docker_registry": "docker-paasta.yelpcorp.com:443"`` @@ -44,9 +41,8 @@ These are the keys that may exist in system configs: Example:: "dashboard_links": { - "uswest1-prod": { - "Mesos": "http://mesos.paasta-uswest1-prod.yelpcorp.com", - "Cluster charts": "http://kibana.yelpcorp.com/something", + "norcal-devc": { + "Tron": "http://y/tron-norcal-devc", } } @@ -97,23 +93,13 @@ These are the keys that may exist in system configs: Example: ``"sensu_port": 3031`` - * ``dockercfg_location``: A URI of a .dockercfg file, to allow mesos slaves - to authenticate with the docker registry. - Defaults to ``file:///root/.dockercfg``. - While this must be set, this file can contain an empty JSON dictionary (``{}``) if your docker registry does not - require authentication. - May use any URL scheme supported by Mesos's `fetcher module. `_ - - Example: ``"dockercfg_location": "http://somehost/somepath"`` - * ``synapse_port``: The port that haproxy-synapse exposes its status on. Defaults to ``3212``. Example: ``"synapse_port": 3213`` - * ``synapse_host``: The default host that paasta should interrogate for haproxy-synapse state. + * ``synapse_host``: The default host that PaaSTA should interrogate for haproxy-synapse state. Defaults to ``localhost``. - Primarily used in `check_marathon_services_replication `_. Example: ``"synapse_host": 169.254.255.254`` diff --git a/docs/source/workflow.rst b/docs/source/workflow.rst index 5aae3605d1..21db17c935 100644 --- a/docs/source/workflow.rst +++ b/docs/source/workflow.rst @@ -7,9 +7,9 @@ Ways That PaaSTA Can Run Services Long Running Services ^^^^^^^^^^^^^^^^^^^^^ -Long running services are are processes that are expected to run continuously +Long running services are processes that are expected to run continuously and usually have the same process id throughout. PaaSTA uses -`Marathon `_ to configure how these +`Kubernetes `_ to configure how these services should run. These services often serve network traffic, usually HTTP. PaaSTA integrates with @@ -61,68 +61,12 @@ Deployment A yelpsoa-configs master runs `generate_deployments_for_service `_ frequently. The generated ``deployments.json`` appears in ``/nail/etc/services/service_name`` throughout the cluster. -Marathon masters run `deploy_marathon_services `_, -a thin wrapper around ``setup_marathon_job``. -These scripts parse ``deployments.json`` and the current cluster state, -then issue commands to Marathon to put the cluster into the right state --- cluster X should be running version Y of service Z. - How PaaSTA Runs Docker Containers --------------------------------- -Marathon launches the Docker containers that comprise a PaaSTA service. - -Docker images are run by Mesos's native Docker executor. PaaSTA composes the -configuration for the running image: - -* ``--attach``: stdout and stderr from running images are sent to logs that end - up in the Mesos sandbox (currently unavailable). - -* ``--cpu-shares``: This is the value set in ``marathon.yaml`` as "cpus". - -* ``--memory``: This is the value set in ``marathon.yaml`` as "mem". - -* ``--memory-swap``: Total memory limit (memory + swap). We set this to the same value - as "mem", rounded up to the nearest MB, to prevent containers being able to swap. - -* ``--net``: PaaSTA uses bridge mode to enable random port allocation. - -* ``--env``: Any environment variables specified in the ``env`` section will be here. 
Additional - ``PAASTA_``, ``MARATHON_``, and ``MESOS_`` environment variables will also be injected, see the - `related docs `_ for more information. - -* ``--publish``: Mesos picks a random port on the host that maps to and exposes - port 8888 inside the container. This random port is announced to Smartstack - so that it can be used for load balancing. +Kubernetes launches the Docker containers that comprise a PaaSTA service. Once a Pod is scheduled to start, the kubelet on the node running the Pod interacts with the container runtime +through the Container Runtime Interface (CRI) to start the container defined in the Pod specification. -* ``--privileged``: Containers run by PaaSTA are not privileged. - -* ``--restart``: No restart policy is set on PaaSTA containers. Restarting - tasks is left as a job for the Framework (Marathon). - -* ``--rm``: Mesos containers are rm'd after they finish. - -* ``--tty``: Mesos containers are *not* given a tty. - -* ``--volume``: Volume mapping is controlled via the paasta_tools - configuration. PaaSTA uses the volumes declared in ``/etc/paasta/volumes.json`` - as well as per-service volumes declared in ``extra_volumes`` declared - in the `soa-configs `_. - -* ``--workdir``: Mesos containers are launched in a temporary "workspace" - directory on disk. Use the workdir sparingly and try not to output files. - -Mesos is the actual system that runs the docker images. In Mesos land these are -called "TASKS". PaaSTA-configured tasks use exponential backoff to prevent -unhealthy tasks from continuously filling up disks and logs -- the more times -that your service has failed to start, the longer Mesos will wait before -trying to start it again. - -Mesos *will* healthcheck the task based on the same healthcheck that SmartStack -uses, in order to prune unhealthy tasks. This pruning is less aggressive than -SmartStack's checking, so a dead task will go DOWN in SmartStack before it is -reaped by Marathon. By default the healthchecks occur every 10 seconds, and a service -must fail 30 times before that task is pruned and a new one is launched in its place. -This means a task had 5 minutes by default to properly respond to its healthchecks. +Note: Kubernetes supports containerd as the Container Runtime. Time Zones In Docker Containers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -180,15 +124,11 @@ Monitoring PaaSTA gives you a few `Sensu `_-powered monitoring checks for free: -* `setup_marathon_job `_: - Alerts when a Marathon service cannot be deployed or bounced for some reason. - It will resolve when a service has been successfully deployed/bounced. - -* `check_marathon_services_replication `_: +* **check_kubernetes_services_replication**: runs periodically and sends an alert if fewer than 50% of the requested instances are deployed on a cluster. If the service is registered in Smartstack it will look in Smartstack to count the available instances. Otherwise it - counts the number of healthy tasks in Mesos. + counts the number of healthy Pods in Kubernetes. The PaaSTA command line @@ -197,7 +137,7 @@ The PaaSTA command line The PaaSTA command line interface, ``paasta``, gives users of PaaSTA the ability to inspect the state of services, as well as stop and start existing services. See the man pages for a description and detail of options for any -individual paasta command. Some of the most frequently used commands are +individual PaaSTA command. 
Some of the most frequently used commands are
listed below:

* ``paasta start`` - sets the desired state of the service instance to
@@ -214,4 +154,4 @@ listed below:
**NB**: ``paasta stop`` is a temporary measure; that is, it's effect only
lasts until you deploy a new version of your service. That means that if you
run ``paasta stop`` and push a version of the docker image serving your service, then
-   paasta will reset the effect of ``paasta stop``.
+   PaaSTA will reset the effect of ``paasta stop``.
diff --git a/docs/source/yelpsoa_configs.rst b/docs/source/yelpsoa_configs.rst
index a247eca71b..de0519c97e 100644
--- a/docs/source/yelpsoa_configs.rst
+++ b/docs/source/yelpsoa_configs.rst
@@ -14,7 +14,7 @@ so you are free to use them for YAML templates.

**Note** that service names (the name of the folder where your config file is located) should be no more
than 63 characters. For kubernetes services(config files with kubernetes as prefix), the instance names should be no more than 63 characters as well.
-_ is counted as two character. We convert _ to -- because underscore is not allowed in kubernetes pod names.
+_ is counted as two characters. We convert _ to -- because underscore is not allowed in kubernetes Pod names.

Example::

@@ -41,26 +41,21 @@ All configuration files that define something to launch on a PaaSTA Cluster can
specify the following options:

-  * ``cpus``: Number of CPUs an instance needs. Defaults to 1. CPUs in Mesos
-    are "shares" and represent a minimal amount of a CPU to share with a task
-    relative to the other tasks on a host. A task can burst to use any
-    available free CPU, but is guaranteed to get the CPU shares specified. For
-    a more detailed read on how this works in practice, see the docs on `isolation `_.
+  * ``cpus``: Number of CPUs an instance needs. Defaults to 1. CPUs in Kubernetes
+    are "shares" and represent a minimal amount of a CPU to share with a Pod
+    relative to the other Pods on a host. For a more detailed read on
+    how this works in practice, see the docs on `isolation `_.

  * ``cpu_burst_add``: Maximum number of additional CPUs an instance may use
    while bursting; if unspecified, PaaSTA defaults to 1 for long-running services,
    and 0 for scheduled jobs (Tron). For example, if a service specifies that it
    needs 2 CPUs normally and 1 for burst, the service may go up to 3 CPUs, if needed.

-  * ``mem``: Memory (in MB) an instance needs. Defaults to 4096 (4GB). In Mesos
-    memory is constrained to the specified limit, and tasks will reach
+  * ``mem``: Memory (in MB) an instance needs. Defaults to 4096 (4GB). In Kubernetes
+    memory is constrained to the specified limit, and containers will reach
    out-of-memory (OOM) conditions if they attempt to exceed these limits, and
-    then be killed. There is currently not way to detect if this condition is
-    met, other than a ``TASK_FAILED`` message. For more a more detailed read on
+    then be killed. For a more detailed read on
    how this works, see the docs on `isolation `_

-  * ``disk``: Disk (in MB) an instance needs. Defaults to 1024 (1GB). Disk limits
-    may or may not be enforced, but services should set their ``disk`` setting
-    regardless to ensure the scheduler has adequate information for distributing
-    tasks.
+  * ``disk``: Disk (in MB) an instance needs. Defaults to 1024 (1GB).

  * ``env``: A dictionary of environment variables that will be made available
    to the container.
PaaSTA additionally will inject the following variables automatically (keep in mind all environment variables are strings in a shell): @@ -74,11 +69,13 @@ specify the following options: * ``PAASTA_GIT_SHA``: The short git sha of the code the container has * ``PAASTA_DEPLOY_GROUP``: The `deploy group `_ specified * ``PAASTA_MONITORING_TEAM``: The team that is configured to get alerts. - * ``PAASTA_LAUNCHED_BY``: May not be present. If present, will have the username of the user who launched the paasta container + * ``PAASTA_LAUNCHED_BY``: May not be present. If present, will have the username of the user who launched the PaaSTA container * ``PAASTA_RESOURCE_CPUS``: Number of cpus allocated to a container * ``PAASTA_RESOURCE_MEM``: Amount of ram in MB allocated to a container * ``PAASTA_RESOURCE_DISK``: Amount of disk space in MB allocated to a container * ``PAASTA_RESOURCE_GPUS``: Number of GPUS (if requested) allocated to a container + * ``PAASTA_IMAGE_VERSION``: The version of the docker image + * ``PAASTA_INSTANCE_TYPE``: The instance type of the service (e.g: tron, kubernetes, eks, etc) * ``extra_volumes``: An array of dictionaries specifying extra bind-mounts @@ -112,7 +109,7 @@ Placement Options ----------------- Placement options provide control over how PaaSTA schedules a task, whether it -is scheduled by Marathon (on Mesos), Kubernetes, Tron, or ``paasta remote-run``. +is scheduled by Kubernetes, Tron, or ``paasta remote-run``. Most commonly, it is used to restrict tasks to specific locations. .. _general-placement-options: @@ -120,7 +117,7 @@ Most commonly, it is used to restrict tasks to specific locations. General ^^^^^^^ -These options are applicable to tasks scheduled through Mesos or Kubernetes. +These options are applicable to tasks scheduled through Kubernetes. * ``deploy_blacklist``: A list of lists indicating a set of locations to *not* deploy to. For example: @@ -256,7 +253,7 @@ For more information on selector operators, see the official Kubernetes documentation on `node affinities `_. - * ``pod_management_policy``: An option for applications managed with `StatefulSets `_ to determine if the pods are managed in parallel or in order. + * ``pod_management_policy``: An option for applications managed with `StatefulSets `_ to determine if the Pods are managed in parallel or in order. The default value is `OrderedReady `_. It can be set to `Parallel `_. For example:: @@ -264,31 +261,6 @@ documentation on `node affinities pod_management_policy: Parallel -.. _mesos-placement-options: - -Mesos -^^^^^ - -These options are applicable only to tasks scheduled on Mesos. - - * ``constraints``: Overrides the default placement constraints for services. - Should be defined as an array of arrays (E.g ``[["habitat", "GROUP_BY"]]`` - or ``[["habitat", "GROUP_BY"], ["hostname", "UNIQUE"]]``). Defaults to - ``[[", "GROUP_BY"], ["pool", "LIKE", ], - [, "UNLIKE", ], ...]`` - where ```` is defined by the ``discover`` attribute - in ``smartstack.yaml``, ```` is defined by the ``pool`` attribute in - ``marathon.yaml``, and ``deploy_blacklist_type`` and - ``deploy_blacklist_value`` are defined in the ``deploy_blacklist`` attribute - in marathon.yaml. For more details and other constraint types, see the - official `Marathon constraint documentation - `_. - - * ``extra_constraints``: Adds to the default placement constraints for - services. This acts the same as ``constraints``, but adds to the default - constraints instead of replacing them. 
See ``constraints`` for details on - format and the default constraints. - ``kubernetes-[clustername].yaml`` ------------------------------- @@ -400,7 +372,7 @@ instance MAY have: Default value is 0.8. * ``desired_active_requests_per_replica``: Only valid for the ``active-requests`` metrics provider. The - target number of requests per second each pod should be receiving. + target number of requests per second each Pod should be receiving. * ``max_instances_alert_threshold``: If the autoscaler has scaled your service to ``max_instances``, and the service's utilization (as measured by your ``metrics_provider``) is above this value, you'll get an alert. @@ -457,7 +429,7 @@ instance MAY have: A failing readiness probe will not restart the instance, it will however be removed from the mesh and not receive any new traffic. - To add an additional delay after the pod has started and before probes should + To add an additional delay after the Pod has started and before probes should start, see ``min_task_uptime``. * ``healthcheck_interval_seconds``: Kubernetes will wait this long between @@ -474,6 +446,10 @@ instance MAY have: Defaults to the same uri specified in ``smartstack.yaml``, but can be set to something different here. + * ``net``: Specify which kind of + `networking mode `_ + adhoc containers of this service should be launched using. Defaults to ``'bridge'``. + * ``prometheus_shard``: Optional name of Prometheus shard to be configured to scrape the service. This shard should already exist and will not be automatically created. @@ -489,12 +465,12 @@ instance MAY have: accessed externally. This option is implied when registered to smartstack or when specifying a ``prometheus_port``. Defaults to ``false`` - * ``weight``: Load balancer/service mesh weight to assign to pods belonging to this instance. - Pods should receive traffic proportional to their weight, i.e. a pod with - weight 20 should receive 2x as much traffic as a pod with weight 10. + * ``weight``: Load balancer/service mesh weight to assign to Pods belonging to this instance. + Pods should receive traffic proportional to their weight, i.e. a Pod with + weight 20 should receive 2x as much traffic as a Pod with weight 10. Defaults to 10. Must be an integer. - This only makes a difference when some pods in the same load balancer have different weights than others, such as when you have two or more instances with the same ``registration`` but different ``weight``. + This only makes a difference when some Pods in the same load balancer have different weights than others, such as when you have two or more instances with the same ``registration`` but different ``weight``. * ``lifecycle``: A dictionary of additional options that adjust the termination phase of the `pod lifecycle `_: This currently supports two sub-keys: @@ -522,225 +498,6 @@ a container is unhealthy, and the action to take is to completely destroy it and launch it elsewhere. This is a more expensive operation than taking a container out of the load balancer, so it justifies having less sensitive thresholds. -``marathon-[clustername].yaml`` -------------------------------- - -e.g. ``marathon-pnw-prod.yaml``, ``marathon-mesosstage.yaml``. The -clustername is usually the same as the ``superregion`` in which the cluster -lives (``pnw-prod``), but not always (``mesosstage``). It MUST be all -lowercase. 
(non alphanumeric lowercase characters are ignored) - -**Note:** All values in this file except the following will cause PaaSTA to -`bounce `_ the service: - -* ``min_instances`` -* ``instances`` -* ``max_instances`` -* ``backoff_seconds`` - -Top level keys are instance names, e.g. ``main`` and ``canary``. Each -instance MAY have: - - * Anything in the `Common Settings`_. - - * Anything from :ref:`General Placement Options ` - and :ref:`Mesos Placement Options `. - - * ``cap_add``: List of capabilities that are passed to Docker. Defaults - to empty list. Example:: - - "cap_add": ["IPC_LOCK", "SYS_PTRACE"] - - * ``instances``: Marathon will attempt to run this many instances of the Service - - * ``min_instances``: When autoscaling, the minimum number of instances that - marathon will create for a service. Defaults to 1. - - * ``max_instances``: When autoscaling, the maximum number of instances that - marathon will create for a service - - * ``registrations``: A list of SmartStack registrations (service.namespace) - where instances of this PaaSTA service ought register in. In SmartStack, - each service has difference pools of backend servers that are listening on - a particular port. In PaaSTA we call these "Registrations". By default, the - Registration assigned to a particular instance in PaaSTA has the *same name*, - so a service ``foo`` with a ``main`` instance will correspond to the - ``foo.main`` Registration. This would correspond to the SmartStack - namespace defined in the Registration service's ``smartstack.yaml``. This - ``registrations`` option allows users to make PaaSTA instances appear - under an *alternative* namespace (or even service). For example - ``canary`` instances can have ``registrations: ['foo.main']`` to route - their traffic to the same pool as the other ``main`` instances. - - * ``backoff_factor``: PaaSTA will automatically calculate the duration of an - application's backoff period in case of a failed launch based on the number - of instances. For each consecutive failure that duration is multiplied by - ``backoff_factor`` and added to the previous value until it reaches - ``max_launch_delay_seconds``. See `Marathon's API docs `_ - for more information. Defaults to 2. - - * ``max_launch_delay_seconds``: The maximum time marathon will wait between attempts - to launch an app that previously failed to launch. See `Marathon's API docs - `_ for more information. Defaults to 300 seconds. - - .. _net: - - * ``net``: Specify which kind of - `networking mode `_ - instances of this service should be launched using. Defaults to ``'bridge'``. - - * ``container_port``: Specify the port to expose when in ``bridge`` mode. - Defaults to ``8888``. - - * ``bounce_method``: Controls the bounce method; see `bounce_lib `_ - - * ``bounce_health_params``: A dictionary of parameters for get_happy_tasks. - - * ``check_haproxy``: Boolean indicating if PaaSTA should check the local - haproxy to make sure this task has been registered and discovered - (Defaults to ``True`` if service is in SmartStack) - - * ``min_task_uptime``: Minimum number of seconds that a task must be - running before we consider it healthy (Disabled by default) - - * ``haproxy_min_fraction_up``: if ``check_haproxy`` is True, we check haproxy on up to 20 boxes to see whether a task is available. - This fraction of boxes must agree that the task is up for the bounce to treat a task as healthy. - Defaults to 1.0 -- haproxy on all queried boxes must agree that the task is up. 
- - * ``bounce_margin_factor``: proportionally increase the number of old instances - to be drained when the crossover bounce method is used. - 0 < bounce_margin_factor <= 1. Defaults to 1 (no influence). - This allows bounces to proceed in the face of a percentage of failures. - It doesn’t affect any other bounce method but crossover. - See `the bounce docs `_ for a more detailed description. - - * ``bounce_start_deadline``: a floating point number of seconds to add to the deadline when deployd notices a change - to soa-configs or the marked-for-deployment version of an instance. - Defaults to 0. (deadline = now) - When deployd has a queue of instances to process, it will choose to process instances with a lower deadline first. - Set this to a large positive number to allow deployd to process other instances before this one, even if their - soa-configs change or mark-for-deployment happened after this one. - This setting only affects the first time deployd processes an instance after a change -- - instances that need to be reprocessed will be reenqueued normally. - - * ``drain_method``: Controls the drain method; see `drain_lib - `_. Defaults to ``noop`` for - instances that are not in Smartstack, or ``hacheck`` if they are. - - * ``drain_method_params``: A dictionary of parameters for the specified - drain_method. Valid parameters are any of the kwargs defined for the - specified bounce_method in `drain_lib `_. - - * ``cmd``: The command that is executed. Can be used as an alternative to - args for containers without an `entrypoint - `_. This value is - wrapped by Mesos via ``/bin/sh -c ${app.cmd}``. Parsing the Marathon config - file will fail if both args and cmd are specified [#note]_. - - * ``args``: An array of docker args if you use the `"entrypoint" - `_ functionality. - Parsing the Marathon config file will fail if both args and cmd are - specified [#note]_. - - * ``monitoring``: See the `monitoring.yaml`_ section for details. - - * ``autoscaling``: See the `autoscaling docs `_ for valid options and how they work - - * ``metrics_provider``: Which method PaaSTA will use to determine a service's utilization. - - * ``decision_policy``: Which method PaaSTA will use to determine when to autoscale a service. - - * ``deploy_group``: A string identifying what deploy group this instance belongs - to. The ``step`` parameter in ``deploy.yaml`` references this value - to determine the order in which to build & deploy deploy groups. Defaults to - ``clustername.instancename``. See the deploy group doc_ for more information. - - * ``replication_threshold``: An integer representing the percentage of instances that - need to be available for monitoring purposes. If less than ``replication_threshold`` - percent instances of a service's backends are not available, the monitoring - scripts will send a CRITICAL alert. - -In addition, each instancename MAY configure additional Marathon healthcheck -options (Read the official -`mesos documentation `_ -for more low-level details: - - * ``healthcheck_mode``: One of ``cmd``, ``tcp``, ``http``, or ``https``. - If set to ``http`` or ``https``, a ``curl`` command will be executed - inside the container. - - If set to ``cmd`` then PaaSTA will execute ``healthcheck_cmd`` and - examine the return code. It must return 0 to be considered healthy. - - If the service is registered in SmartStack, the healthcheck_mode will - automatically use the same setings specified by ``smartstack.yaml``. 
- - If not in smartstack, the default healthcheck is "None", which means - the container is considered healthy unless it crashes. - - A http healthcheck is considered healthy if it returns a 2xx or 3xx - response code. - - * ``healthcheck_cmd``: If ``healthcheck_mode`` is set to ``cmd``, then this - command is executed inside the container as a healthcheck. It must exit - with status code 0 to signify a successful healthcheck. Any other exit code - is treated as a failure. This is a required field if ``healthcheck_mode`` - is ``cmd``. - - * ``healthcheck_grace_period_seconds``: Marathon will wait this long for a - service to come up before counting failed healthchecks. Defaults to 60 - seconds. - - * ``healthcheck_interval_seconds``: Marathon will wait this long between - healthchecks. Defaults to 10 seconds. - - * ``healthcheck_timeout_seconds``: Marathon will wait this long for a - healthcheck to return before considering it a failure. Defaults to 10 - seconds. - - * ``healthcheck_max_consecutive_failures``: Marathon will kill the current - task if this many healthchecks fail consecutively. Defaults to 30 attempts. - - * ``healthcheck_uri``: The url of the service to healthcheck if using http. - Defaults to the same uri specified in ``smartstack.yaml``, but can be - set to something different here. - -**Note**: Although many of these settings are inherited from ``smartstack.yaml``, -their thresholds are not the same. The reason for this has to do with control -loops and infrastructure stability. The load balancer tier can be pickier -about which copies of a service it can send requests to, compared to Mesos. - -A load balancer can take a container out of service and put it back in a few -seconds later. Minor flaps and transient errors are tolerated. - -The healthchecks specified here in this file signal to the infrastructure that -a container is unhealthy, and the action to take is to completely destroy it and -launch it elsewhere. This is a more expensive operation than taking a container -out of the load balancer, so it justifies having less sensitive thresholds. - -**Footnotes**: - -.. [#note] The Marathon docs and the Docker docs are inconsistent in their - explanation of args/cmd: - - The `Marathon docs - `_ - state that it is invalid to supply both cmd and args in the same app. - - The `Docker docs `_ - do not state that it's incorrect to specify both args and cmd. Furthermore, - they state that "Command line arguments to docker run will be - appended after all elements in an exec form ENTRYPOINT, and will override - all elements specified using CMD" which implies that both cmd and args can - be provided, but cmd will be silently ignored. - - To avoid issues resulting from this discrepancy, we abide by the stricter - requirements from Marathon and check that no more than one of cmd and args - is specified. If both are specified, an exception is thrown with an - explanation of the problem, and the program terminates. - -.. _doc: deploy_groups.html - ``tron-[clustername].yaml`` -------------------------------- @@ -796,8 +553,6 @@ Each Tron **action** of a job MAY specify the following: * Anything in the `Common Settings`_. * Anything from :ref:`General Placement Options ` - and :ref:`Mesos Placement Options ` (currently, Tron - only supports Mesos workloads). * ``service``: Uses a docker image from different service. When ``service`` is set for an action, that setting takes precedence over what is set for the job. 
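For illustration, a minimal sketch of how ``service`` might be used to run one action of a job on another service's Docker image. The cluster, service, job, and action names below are hypothetical, and the ``schedule`` value is purely illustrative; only the shape of the job/actions layout is assumed::

    # tron-pnw-prod.yaml (hypothetical example)
    nightly_batch:
      schedule: "daily 04:00:00"
      actions:
        extract:
          # runs on this service's own image
          command: python -m my_service.extract
        load_results:
          # pulls in the image of another service, overriding
          # whatever is configured at the job level
          service: data_warehouse_tools
          command: python -m warehouse.load --source my_service

Here ``extract`` uses the image of the service that owns this soa-configs directory, while ``load_results`` runs on the ``data_warehouse_tools`` image.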
@@ -826,7 +581,7 @@ If a Tron **action** of a job is of executor type ``spark``, it MAY specify the

 * ``spark_args``: Dictionary of spark configurations documented in
   https://spark.apache.org/docs/latest/configuration.html. Note some configurations are non-
-  user-editable as they will be populated by paasta tools. See
+  user-editable as they will be populated by PaaSTA. See
   https://github.com/Yelp/service_configuration_lib/blob/master/service_configuration_lib/spark_config.py#L9
   for a complete list of such configurations.

@@ -851,7 +606,22 @@ Each instance MAY have:

  * ``deploy_group``

-See the `marathon-[clustername].yaml`_ section for details for each of these parameters.
+See the `kubernetes-[clustername].yaml`_ section for details for each of these parameters.
+
+**Footnotes**:
+
+.. [#note] The Docker docs' explanation of using both args and cmd:
+   The `Docker docs `_
+   do not state that it's incorrect to specify both args and cmd. Furthermore,
+   they state that "Command line arguments to docker run will be
+   appended after all elements in an exec form ENTRYPOINT, and will override
+   all elements specified using CMD" which implies that both cmd and args can
+   be provided, but cmd will be silently ignored.
+
+   To avoid issues resulting from this discrepancy, we abide by the stricter
+   interpretation and check that no more than one of cmd and args
+   is specified. If both are specified, an exception is thrown with an
+   explanation of the problem, and the program terminates.

 ``smartstack.yaml``
 -------------------
@@ -872,7 +642,7 @@ Here is an example smartstack.yaml::

 The ``main`` key is the service namespace. Namespaces were introduced for
 PaaSTA services in order to support running multiple daemons from a single
-service codebase. In PaaSTA, each instance in your marathon.yaml maps to a
+service codebase. In PaaSTA, each instance in your kubernetes.yaml maps to a
 smartstack namespace of the same name, unless you specify a different
 ``registrations``.

@@ -890,7 +660,7 @@ Basic HTTP and TCP options

   it will generate synapse discovery files on every host, but no listening
   port will be allocated. This must be unique across all environments where
   PaaSTA (or synapse) runs. At Yelp, we pick from the range [19000, 21000].
-  Feel free to pick the next available value -- paasta fsm will do this for
+  Feel free to pick the next available value -- ``paasta fsm`` will do this for
   you automatically!

 * ``mode``: string of value ``http`` or ``tcp``, specifying whether the service

@@ -1132,12 +902,6 @@ An example of switching from region to superregion discovery:

     - advertise: [region]
     + advertise: [region, superregion]

-1b. When moving from a large grouping to a smaller grouping (like
-moving from superregion => region) you must add an additional constraint
-to ensure Marathon balances the tasks evenly::
-
-    extra_constraints: [['region', 'GROUP_BY', 2]]
-
 2. (Optional) Use zkCli.sh to monitor your new registrations for each
 superregion you are changing::

    [zk: 10.40.5.6:22181(CONNECTED) 0] ls /nerve/superregion:norcal-devc/servicename.main
    [host1-uswest1adevc_0000015910, host2-uswest1cdevc_0000015898, host3-uswest1cdevc_0000015893]
    [zk: 10.40.5.6:22181(CONNECTED) 2]

-2b. Run ``paasta status -v`` to verify that Marathon has balanced services
+2b. Run ``paasta status -v`` to verify that PaaSTA has balanced services
 across the infrastructure as expected.

 3. Once zookeeper shows the proper servers, switch the discovery key::

@@ -1256,7 +1020,7 @@ An example of a service that only pages on a cluster called "prod"::

     team: devs
     page: false

-    # marathon-prod.yaml
+    # kubernetes-prod.yaml
     main:
       instances: 3
       monitoring:
@@ -1275,13 +1039,13 @@ A service that pages everywhere, but only makes a ticket for a tron job::

     page: false
     ticket: true

-A marathon/kubernetes service that overrides options on different instances (canary)::
+A Kubernetes service that overrides options on different instances (canary)::

     # monitoring.yaml
     team: frontend
     page: false

-    # marathon-prod.yaml or kubernetes-prod.yaml
+    # kubernetes-prod.yaml
     main:
       instances: 20
       monitoring: