09 September 2019
These are my notes from reading Kubernetes Up & Running by Kelsey Hightower, Brendan Burns, and Joe Beda. Kelsey Hightower is a Staff Developer Advocate for the Google Cloud Platform. Brendan Burns is a Distinguished Engineer in Microsoft Azure and cofounded the Kubernetes project at Google. Joe Beda is the CTO of Heptio and cofounded the Kubernetes project, as well as Google Compute Engine.
This is a phenomenal book that covers both the whys and hows of Kubernetes. I read the 1st edition, but a 2nd edition is coming out soon. I’m using this as study material for my CKAD and CKA certifications.
This article is part of a series. You can read Part 1 here.
Note to the reader: Throughout these articles, I am loose with my formatting of Kubernetes objects, especially in the case of
Service versus Service. This is to avoid confusion with the common term of service.
Constant width formatting is typically reserved for program elements, such as k8s API objects, parameters, or processes.
Pods represent a collection of application containers and volumes running in the same execution environment. Pods are the most atomic object within Kubernetes, therefore all containers within a Pod will always land on the same node within a cluster. Each container runs in its own
cgroup and we will leverage that using
limits later on in this chapter. All applications running in a Pod share the same hostname, IP address and port space. Containers within a Pod can communicate with one another, but are isolated from containers in other Pods. Since a Pod can be composed of multiple containers, it’s natural to ask which containers should be included in any given Pod. The question to ask yourself is “Will these containers work correctly if they land on separate machines?” If the answer is “no”, then the containers should be placed in the same Pod. As we will see, it’s often the case that containers within the same Pod interact via a shared volume that is also within the Pod.
Pods are managed via YAML or JSON templates known as a Pod manifest. These manifests are text-file representations of the k8s API object. They are sent to
kube-apiserver, which in turn passes them to the
kube-scheduler. The scheduler then places the Pod onto a node that has sufficient resources. A daemon on the node, named
kubelet, creates the containers declared in the Pod manifest and performs any health checks. Once a Pod as been scheduled to a node, there is no rescheduling if the node fails. The Kubernetes scheduler takes this into account and tries to ensure that Pods from the same application are distributed onto different nodes, in order to avoid a single failure domain. Once scheduled, will only change nodes if they are explicitly destroyed and rescheduled. When a Pod is deleted, it has a termination grace period with a default value of 30 seconds. This grace period allows the Pod to finish serving any active requests that it is processing before terminating. Any data stored within the containers on the Pod will be lost when the Pod terminates. The
PersistentVolume object can be used to store data across multiple instances of Pod and their subsequent lifespans.
Once a Pod is running, you may want to access it even if it is not serving traffic on the Internet. You can use port forwarding to create a secure tunnel from your local machine, through the Kubernetes master, to the instance of the Pod running on a worker node.
There are several types of health checks for containers within Kubernetes. The first is a process health check. This health check simply checks if the main process of your application is running. If it isn’t, Kubernetes restarts it. This health check is automatic and does not need to be defined. A simple process check is often insufficient however. Imagine that your process is deadlocked - the health check will be green, but your application will be red. To address this, Kubernetes supports liveness health checks. Liveness health checks are defined within the Pod manifest per container and each container is health checked separately. They run application-specific logic to ensure that the application is not only running, but functioning properly. Lastly, there are readiness health checks. These checks are configured similarly to liveness probes. The difference is that containers which fail a liveness health check are restarted while containers that file a readiness probe are removed from Service load balancers. Combining liveness and readiness probes ensures that all traffic is routed to healthy containers that have capacity to fulfill the request. Kubernetes also supports
tcpSocket health checks, which are useful for databases r other non-HTTP-based APIs. Finally, k8s supports
exec probes. These execute a script in the context of a container - if the script returns zero, the probe succeeds. These are useful for applications where validation logic doesn’t fit neatly into an HTTP request.
One of the most powerful features of Pods is the ability to define the minimum and maximum compute resources available for the containers within the Pod. By rightsizing container requirements,
kube-scheduler is able to efficiently pack Pods onto nodes, thereby driving up utilization. Requests specify the minimum resources needed by a Pod in order to be scheduled - limits specify the maximum resources a container may use. Both are declared at the container level within the Pod manifest. It’s worth mentioning that memory (unlike CPU) cannot be redistributed if it has already been allocated to a container process. Therefore, when the system runs out of memory,
kubelet terminates containers whose memory usage is greater than their requested amount.
Volumes are used for containers within a Pod to share some kind of state information. They can be used to communicate, synchronize, cache, or even persist state data beyond the lifespan of the Pod itself. Volumes are declared within the Pod manifest and container declarations may include
volumeMounts. It’s important to note that containers can mount the same volume to different paths.
emptyDir volumes can be used to create a shared filesystem between containers.
hostDir volumes can grant access to the underlying host filesystem. There are multiple supported protocols for remote network storage volumes that can be used for truly persistent data, including NFS, iSCSI, and cloud network storage like AWS EBS, Azure FDS, and GCP Persistent Disk.
Labels and annotations are types of metadata about Kubernetes API objects. They each serve a different purpose. Labels are used to identify and select sets of objects, while annotations are designed to hold nonidentifying information that can be leveraged by tools and libraries.
Kubernetes uses labels to group and select objects. There are a variety of ways to express the selection logic:
||key is set|
||key is not set|
You can use
kubectl and labels to find and manage objects. Kubernetes itself also uses labels to accomplish the same thing.
Annotations are key-value pairs, much like labels, but they have less restrictions around string formatting. While this makes them exceedingly useful, it also means that there is no guarantee that the data contained in the annotation is valid. The nature/formatting of the data contained in the key-value pair can dictate whether it has to be an annotation or not. Outside of that, it’s a good practice to add information about an object as an annotation and promote it to a label if you find yourself wanting to use it in a selector.
Annotations can be used to:
Service objects are used to expose Pods within a cluster and load-balance across them. Usually, a
Service uses a named label selector to identify the Pods, but it is also possible to define a
Service without selectors. By default, when a
Service is created, k8s assigns it a virtual IP called a cluster IP. The cluster IP load-balances across the selected Pods. It also has a DNS name within
kube-dns. In true k8s-building-on-k8s fashion,
kube-dns is a
Service with a
Service objects are responsible for load-balancing, they also handle the management of readiness checks. Besides
ClusterIP, there are a few other
NodePort builds upon
ClusterIP and is used to serve traffic from outside of the cluster. In addition to a
kube-apiserver assigns a port to the
Service (or you can specify the port yourself) and all nodes will forward traffic to that port to the
Service. This means that if you can reach any node in the cluster, you can contact a service without knowing where that service is running.
LoadBalancer builds upon
NodePort. It integrates with cloud providers to provision a new load balancer and direct it at the nodes in your cluster, resulting in an IP or hostname depending on the cloud provider. Bonus material: there is also
ExternalName if your cluster is running CoreDNS v1.7 or higher.
Service object, Kubernetes creates an
Endpoints object which contains the IP addresses for that service. Kubernetes uses
Endpoints to consume services rather than a cluster IP - which means your applications can too! By talking directly the k8s API, you can retrieve and call service endpoints. Kubernetes can also “watch” objects and be notified when they change, including the IPs associated with a service. Most applications are not written to be Kubernetes-native and therefore don’t leverage this, but it’s important to know that it’s possible.
Regardless of the type, all
Service objects will load-balance. This is made possible by a component named
kube-proxy that runs on every node in the cluster.
kube-proxy watches for new services via
kube-apiserver. It programs a set of
iptables rules (abstracted as
ClusterIP) on in the kernel of the host, which change the destination of packets to one of the endpoints for that service. Whenever the set of endpoints changes, the set of
iptables rules are rewritten. It is possible for a user to specify a specific cluster IP when creating a service. Once created, the cluster IP cannot be changed without deleting and rebuilding the
Service object. In order to be assigned, a cluster IP must come from the cluster IP range defined by
kube-apiserver and not already be in use. It’s possible to configure the service address range using the
--service-cluster-ip-range flag on the
kube-apiserver binary. The range should not overlap with the IP subnets and ranges assigned to each Docker bridge or Kubernetes node.
ReplicaSets are used to manage a set of Pods (even if the set consists of just a single Pod.) They accomplish this by using a very common pattern in Kubernetes called a reconciliation loop. The basic idea of a reconciliation loop is to input a desired state (e.g., the number of Pods we’d like to have running) and then constantly monitor or observe the current state of the k8s environment. If the reconciliation loop finds that
desired == current, no action is taken, but if the opposite is true, it will create or destroy pods as needed until the the desired state is reached. Although, it’s important to understand that ReplicaSets an the Pods that they manage are loosely coupled. While a ReplicaSet may create or delete Pods, it does not own them. It manages them via a label selector defined in the ReplicaSet specification. This is also true of the Services that may load balance to the Pods - everything is very loosely coupled. This may seem daunting at first, but declarative configuration makes managing it much easier. There are also several distinct benefits to having these different k8s API objects being loosely coupled:
Sometimes you may wonder which ReplicaSet is managing a Pod. This kind of discovery is enabled by the ReplicaSet creating an annotation on every Pod it creates. The annotation key is
kubernetes.io/created-by. This annotation is best-effort and only created when the Pod is created by the ReplicaSet - a user can remove the annotation at any time. If you are otherwise interesting in finding the Pods managed by a particular ReplicaSet, you can perform
$ kubectl get pods -l <key1=value1>,<key2=value2>, where the labels are defined in the ReplicaSet spec. This is exactly the same API call that the ReplicaSet itself uses to find and manage its set of Pods.
ReplicaSets support imperative scaling via
$ kubectl scale <replica-set-name> --replicas=4 for example. However, this is a stop gap measure to be sure. All declarative text-file configurations should be updated ASAP to reflect any imperative changes. This is to avoid a situation where the current state is dramatically different from a new, desired state being applied from a text-file configuration. Kubernetes also supports horizontal pod autoscaling (HPA). This is distinctly different from vertical pod autoscaling (giving Pods greater resource requests and limits) and cluster autoscaling (adding more worker nodes to the cluster). HPA requires the presence of the
heapster Pod on your cluster, which included in most k8s installations by default and runs in the
heapster keeps track of resource consumption metrics and provides an API to use when making autoscaling decisions. HPA is a separate k8s object from ReplicaSets and thus is loosely coupled. It is a bad idea to combine autoscaling with imperative or declarative management of the number of replicas. If you are using HPA, just manage the HPA object itself. Interfering with it could result in unexpected behavior.
By default, when you delete a ReplicaSet, you delete the Pods it is managing. You can avoid this by setting
--cascade=false in your command - e.g.,
$ kubectl delete rs galens-rs --cascade=false.