Containers
Containers run applications in isolated namespaces, meaning a containerized application has access only to the resources made available to it by the container runtime. Resource governance means that a container is restricted to a specified share of processor cycles, system memory, and other resources. Containers allow applications to be packaged together with their dependencies in container images, which run the same regardless of the underlying operating system or infrastructure and are downloaded from container registries like Docker Hub. Container registries are not to be confused with repositories, which are subcomponents of registries.
Cgroups
History
"Task Control Groups" were first merged into kernel 2.6.24 with the ability for multiple hierarchies to be created. The logic behind the creation of multiple hierarchies was to enable maximum flexibility in policy creation. However, because a controller could only belong to a single hierarchy, after some years this began to be seen as a design flaw.
This motivated a redesign into a single unified cgroup hierarchy, and v2 was merged in 3.16 and made stable in 4.5.
Control groups or cgroups are a Linux kernel feature that isolates, labels, and manages resources (CPU time, memory, and network bandwidth) for a collection of tasks (processes).
Cgroup subsystems (also called controllers or resource controllers in documentation) each represent a single resource (e.g. io, cpu, memory, devices).
- freezer suspends or resumes tasks in a cgroup
- net_cls tags network packets with a class identifier that allows the Linux traffic controller to identify packets originating from a particular cgroup task
- net_prio allows network traffic to be prioritized per interface
- ns the namespace subsystem
Since cgroups v2, all mounted controllers reside in a single unified hierarchy. A list of these is generated by the Linux kernel at /proc/cgroups.
You can confirm that the cgroup2 filesystem is mounted at /sys/fs/cgroup
mount -l | grep cgroup
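The controllers themselves can be read directly; a sketch assuming a Linux host (file contents vary by kernel and cgroup version):

```shell
# Controllers compiled into the running kernel, with hierarchy info
cat /proc/cgroups
# On a cgroup v2 host, the controllers available in the unified hierarchy
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || true
```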
Kubernetes
History
Kubernetes was first announced by Google in mid-2014. It had been developed by Google after deciding to open-source the Borg system, a cluster and container management system that formed the automation infrastructure that powered the entire Google enterprise. Kubernetes coalesced from a fusion between developers working on Borg and Compute Engine. Borg eventually evolved into Omega.
By that time, Amazon had established a market advantage, and the developers decided to change their approach by introducing a disruptive technology to drive the relevance of the Compute Engine platform they had built. They set out to create a ubiquitous abstraction that could run workloads better than anyone else.
At the time, Google had been trying to engage the Linux kernel team and trying to overcome their skepticism. Internally, the project was framed as offering "Borg as a Service", although there were concerns that Google was in danger of revealing trade secrets.
Google ultimately donated Kubernetes to the Cloud Native Computing Foundation.
Kubernetes's heptagonal logo is an allusion to its original codename "Project 7", a reference to the Star Trek Borg character Seven of Nine.
Kubernetes (Greek for "helmsman", "pilot", or "captain" and "k8s" for short) has emerged as the leading container orchestrator in the industry since 2018. It provides a layer that abstracts infrastructure, including servers, networks, and other infrastructure, for applications deployed on top.
Kubernetes can be visualized as a system built from layers, with each higher layer abstracting the complexity of the lower levels. One server serves as the master, exposing an API for users and clients, assigning workloads, and orchestrating communication between other components. The processes on the master node are also called the control plane. Other machines in the cluster are called nodes or workers and accept and run workloads using available resources. A Kubernetes configuration file is called a kubeconfig.
Kubernetes resources or objects, each associated with a URL, represent the configuration of a cluster. Resource and object are often used interchangeably, but more precisely the resource refers to the URL path that points to the object, and an object may be accessible through multiple resources. Every object type in the Kubernetes API has a controller (i.e. deployment controller, etc.) that reads desired state from the Spec section of the manifest and reports its actual state by writing to the Status section.
An object's manifest, presented in JSON or YAML, represents its declarative configuration, and contains four sections:
- type metadata, specifying the type of resource
- object metadata, specifying name and other identifying information
- spec: desired state of resource
- status: produced strictly by the resource controller and represents the current state of the resource
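The four sections map onto a manifest like this minimal sketch (field values are illustrative):

```yaml
apiVersion: v1        # type metadata
kind: Pod             # type metadata
metadata:             # object metadata
  name: example
  labels:
    app: example
spec:                 # desired state, written by the user
  containers:
  - name: example
    image: nginx
status: {}            # actual state, written only by the controller
```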
An explanation of each field available in the API of any object type can be displayed on the command-line
kubectl explain nodes
kubectl explain no.spec
kubectl get node $NODE -o yaml
kubectl describe node kind-worker-2
A pod is the most atomic unit of work which encompasses one or more tightly-coupled containers that will be deployed together on the same node. All containers in a pod share the same Linux namespace, hostname, IP address, network interfaces, and port space. This means containers in a pod can communicate with each other over localhost, although care must be taken to ensure individual containers do not attempt to use the same port. However their filesystems are isolated from one another unless they share a Volume.
Every Pod occupies one address in a shared range, so communication between Pods is simple.
Compute resources of containers can be limited at pod.spec.containers.resources.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "500m"
      limits:
        memory: "128Mi"
        cpu: "500m"
Kubernetes can monitor Pod health by using probes, which can be categorized by how they measure health:
- Readiness: is the container ready to serve user requests?
- Liveness: is the container running as intended?
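Both probe types are declared per container; a minimal sketch (paths, ports, and timings are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx
    readinessProbe:           # is the container ready to serve requests?
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
    livenessProbe:            # is the container still running as intended?
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
```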
Tasks
GKE
-
Google Kubernetes Engine nodes are actually Google Compute Engine VMs.
Create GKE cluster

gcloud container clusters create hello-cluster --num-nodes=1 # Standard cluster
gcloud compute instances list # (1)
gcloud container clusters create-auto hello-cluster # (2)
gcloud container clusters describe hello-cluster
- If a default zone is set, an Autopilot cluster can't be created without explicitly specifying --region.
Save a Kubernetes cluster's credentials to a kubeconfig.
gcloud container clusters get-credentials my-cluster
Windows Server
-
Windows Server 2016 supports Windows Server Containers, which share the host's kernel, and Hyper-V Containers, which run each container in a lightweight VM with its own copy of the operating system kernel. The "Containers" feature must be installed on Windows Server 2016 hosts, and to create Hyper-V containers the Hyper-V role must also be installed (although the Hyper-V management tools are not necessary if VMs are not going to be created). Windows container hosts need to have Windows installed to C:.
Nano Server once could serve as Docker hosts, but no longer; Nano Servers are now intended to be deployed as containers themselves.
The Powershell Docker module has been deprecated for years.
Commands
cgconfig.service
- cgconfig, part of the libcgroup package, is a service that runs at boot time to reestablish predefined cgroups.
kubectl
-
Show available contexts
kubectl config get-contexts
Switch to a different context

kubectl config use-context $CONTEXT
Display resources

kubectl get nodes
kubectl get pods
kubectl get deployments
kubectl describe nodes/gke-*4ff6f64a-6f4v
Execute a command on a pod with only a single container, returning output

kubectl exec $pod -- env
Get a shell to a running container

kubectl exec --stdin --tty $pod -- /bin/bash
When a pod contains more than one container, the container must be specified with -c/--container.
kubectl exec --stdin --tty $pod --container $container -- /bin/bash
kubectl run nginx --image=nginx
kubectl delete pod nginx
kubectl create deployment nginx --image=nginx
Number of replicas can be set on creation of a deployment by specifying an argument to --replicas
kubectl create deployment nginx --image=nginx --replicas=4
Replica count is set in an existing deployment by scaling
kubectl scale deploy/nginx --replicas=2
Expose a port
kubectl expose deployment/nginx --port=80 --type=LoadBalancer
List Kubernetes objects
kubectl api-resources
Get a description of a resource
kubectl explain nodes.status.addresses.address
podman
-
On RHEL, podman can be installed as a package or as part of a module
dnf module install container-tools
With few exceptions, podman exposes a command-line API that closely imitates that of Docker.
Arch Linux
On Arch, /etc/subuid and /etc/subgid have to be set. These are colon-delimited files that define the ranges for namespaced UIDs and GIDs to be used by each user. Conventionally, these ranges begin at 100,000 (for the first, primary user) and contain 65,536 possible values.
terry:100000:65536
alice:165536:65536
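The start and count fields determine each user's range; a sketch that parses sample entries (the file path and usernames are illustrative, not the real /etc/subuid):

```shell
# Write a sample subuid-style file (user:start:count per line)
cat > /tmp/subuid.sample <<'EOF'
terry:100000:65536
alice:165536:65536
EOF
# Print the last UID in each user's range: start + count - 1
awk -F: '{ printf "%s last=%d\n", $1, $2 + $3 - 1 }' /tmp/subuid.sample
```

For the sample above, terry's range ends at 165535 and alice's at 231071, which is why alice's range conventionally begins right after terry's.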
Then podman has to be migrated
podman system migrate
Podman supports pulling from various repos using aliases that are defined in /etc/containers/registries.conf.d. RHEL and derivative distributions support additional aliases, some of which reference images that require a login.
For example, Red Hat offers a Python 2.7 runtime from the RHSCL (Red Hat Software Collections) repository on registry.access.redhat.com, which does not require authentication. However, Python 3.8 is only available from registry.redhat.io, which does. Interestingly, other Python runtimes are available from the ubi7 and ubi8 repos from unauthenticated registries.
Container images are stored in ~/.local/share/containers/storage.
podman pull rhscl/httpd-24-rhel7 # (1)
- Alias to registry.access.redhat.com/rhscl/httpd-24-rhel7
The Z option is necessary on SELinux systems (like RHEL and derivatives) and tells Podman to label the content with a private unshared label. On systems running SELinux, rootless containers must be explicitly allowed to access bind mounts. Containerized processes are not allowed to access files without a SELinux label.
podman run -d -v=/home/jasper/notes/site:/usr/share/nginx/html:Z -p=8080:80 --name=notes nginx
podman run -d -v=/home/jasper/notes/site:/usr/local/apache2/htdocs:Z -p=8080:80 --name=notes httpd-24
Mapped ports can be displayed
podman port -a
Output a SystemD service file from a container to STDOUT (this must be redirected to a file)
podman generate systemd notes \
  --restart-policy=always \
  --name \ # (3)
  --files \ # (2)
  --new # (1)
- Yield unit files that do not expect containers and pods to exist but rather create them based on their configuration files.
- Generate a file with a name beginning with the prefix (which can be set with --container-prefix or --pod-prefix) and followed by the ID or name (if --name is also specified)
- In conjunction with --files, name the service file after the container and not the ID number.
systemd-cgls
- systemd-cgls recursively shows the contents of the selected cgroup hierarchy in a tree.
Glossary
- apiVersion
-
Kubernetes object field found in Type metadata.
apiVersion is typically v1, but for some object types the API group is included, e.g. for Deployments:
apiVersion: apps/v1
Dockerfile
-
A Docker image consists of read-only layers, each of which represents an instruction that incrementally changes the image being built up. Dockerfiles are used to construct new images with docker build. The build process can be optimized by placing multiple commands in the same RUN instruction. Dockerfiles are named simply "Dockerfile" with no extension or variation.

FROM alpine
RUN apk update && apk add nodejs
COPY . /app
WORKDIR /app
CMD ["node","index.js"]
FROM microsoft/windowsservercore
RUN powershell -command install-windowsfeature dhcp -includemanagementtools
RUN powershell -configurationname microsoft.powershell -command add-dhcpserverv4scope -state active -activatepolicies $true -name scopetest -startrange 10.0.0.100 -endrange 10.0.0.200 -subnetmask 255.255.255.0
RUN md boot
COPY ./bootfile.wim c:/boot/
CMD powershell
FROM microsoft/windowsservercore
MAINTAINER @mike_pfeiffer
RUN powershell.exe -Command Install-WindowsFeature Web-Server
COPY ./websrc c:/inetpub/wwwroot
CMD [ "powershell" ]
- Deployment
- A uniformly managed set of Pod instances, all based on the same container image. The Deployment controller enables release capabilities, the deployment of new Pod versions with no downtime. Exposing a Deployment creates a Service.
- Desired State Management
- The Desired State Management system is used by Kubernetes to describe a cluster's desired state declaratively.
- emptyDir
- Ephemeral Kubernetes volume type that shares the Pod's lifetime. Data is stored on the node's disk by default; with medium: Memory, emptyDir volumes are backed by a tmpfs (RAM-backed) file system.
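A sketch of a RAM-backed emptyDir (names and mount path are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo
spec:
  volumes:
  - name: cache
    emptyDir:
      medium: Memory      # back the volume with tmpfs instead of node disk
  containers:
  - name: app
    image: alpine
    volumeMounts:
    - mountPath: /cache
      name: cache
```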
- ENTRYPOINT
-
Rarely used Docker declaration. When ENTRYPOINT is present, the CMD declaration supplies the default arguments passed to the command in ENTRYPOINT.
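A sketch of the interaction (image and command are illustrative):

```dockerfile
FROM alpine
# The fixed executable for the image
ENTRYPOINT ["ping"]
# Default arguments; arguments given to "docker run" replace them
CMD ["-c", "3", "localhost"]
```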
The Kubernetes --command flag (the pod.spec.containers.command field) can override the contents of ENTRYPOINT.
- etcd
- Distributed key-value data store
- Event
- Kubernetes object type that contains information about what happened to the object.
Events are deleted one hour after creation by default.
Unlike most other objects, Event manifests have no spec or status sections.
kubectl get events
kubectl get ev
Helm
-
Helm is a package manager for Kubernetes.
Helm packages are referred to as charts. Charts are a collection of files and directories that adhere to a specification. A chart is packaged by tarring and gzipping it.
- Chart.yaml contains metadata
- templates/ contains Kubernetes manifests potentially annotated with templating directives
- values.yaml provides default configuration
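A minimal Chart.yaml sketch (name, description, and versions are illustrative; apiVersion, name, and version are the required fields):

```yaml
apiVersion: v2        # v2 indicates a Helm 3 chart
name: foo
description: An illustrative chart
version: 0.1.0        # chart version (SemVer)
appVersion: "1.0.0"   # version of the packaged application
```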
It is managed using the helm CLI utility.
Create a new chart

helm create foo
There is no longer a default Helm repository, although there are many available at the Artifact Hub
- kind
- Kubernetes object field found in the Type metadata which specifies the type of resource, i.e. Node, Deployment, Service, Pod, etc.
- kubeconfig
- YAML configuration file located at $HOME/.kube/config by default.
A colon-delimited list of kubeconfigs can be specified by setting the KUBECONFIG environment variable. A kubeconfig can be explicitly specified with the --kubeconfig flag.
- Label
-
Labels are key-value pairs that are attached to Kubernetes objects.
Config for a Pod with two labels:
apiVersion: v1
kind: Pod
metadata:
  name: label-demo
  labels:
    environment: production
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
- Master node
-
A master node runs 3 processes, called master (control plane) components:
- kube-apiserver exposes a RESTful API and serves as a glue between other Kubernetes components
- kube-scheduler decides which node each workload runs on, balancing container workloads across nodes
- kube-controller-manager performs cluster operations like managing nodes and making changes to desired status
- millicore (m)
- One-thousandth of a vCPU or a CPU core and the preferred measurement unit of compute resources in Kubernetes (e.g. 128m = 12.8% of a CPU core and 2000m = 2 CPU cores).
Namespaces
-
Namespaces wrap global system resources (mount points, network devices, hostnames) in an abstraction that makes it appear to processes within that namespace that they have their own isolated instance of that resource.
Process IDs in the same namespace can have access to one another, whereas those in different namespaces cannot. Spawning a process in a new PID namespace prevents it from seeing the host's context, so an interactive shell like zsh spawned in its own namespace will report its PID as 1, even though the host will assign it its own PID.
- Node
-
A node or worker is any container host that accepts workloads from the master node. Each node is equipped with a container runtime like Docker, which it uses to create and destroy containers according to instructions from the master server.
Each node runs 2 processes:
- kubelet communicates with Kubernetes cluster services
- kube-proxy handles container network routing using iptables rules
- PersistentVolume
- A PersistentVolume is a piece of storage in the cluster that has been provisioned by an administrator or dynamically using Storage Classes.
- PersistentVolumeClaim
- A PersistentVolumeClaim requests either Disk or File storage of a particular StorageClass, access mode, and size. It is bound to a PersistentVolume once an available storage resource has been assigned to the pod requesting it.
- Pod
-
A pod is the most basic unit that K8s deals with, representing one or more tightly-coupled containers that should be controlled as a single application (typically one main container with subsidiary helper containers). Every container should have only a single process, so if several processes need to communicate they should be implemented as separate containers in a pod.
A pod's containers should:
- operate closely together
- share a lifecycle
- always be scheduled on the same node
- Replica
- An instance of a Pod
- ReplicaSet
- ...
- Selector
-
A label selector provides a way to identify a set of objects and is the core grouping primitive supported by Kubernetes. It can be made of multiple requirements that are comma-separated, all of which must be satisfied.
There are two types of selector:
- Equality-based admits the operators =, !=, and ==.
- Set-based admits the operators in, notin, and exists.
environment = production
tier != frontend

environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
- Service
- A Service is an abstraction over a logical set of Pods and a policy by which to access them, i.e. a microservice. Because Pods are mortal, the Service controller keeps track of Pod addresses and publishes this information to the consumers of Services, a function called service discovery.
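A sketch of a Service selecting Pods by label (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:            # equality-based: match Pods labeled app=nginx
    app: nginx
  ports:
  - port: 80           # port exposed by the Service
    targetPort: 80     # port on the selected Pods
```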
- tmpfs
- RAM-backed file system used in Docker containers
- Volume
-
A volume is a special directory on the Docker host that can be mounted into a container to achieve persistent storage.
In Azure, a volume represents a way to store, retrieve, and persist data across pods and through the application lifecycle. In the context of Azure, Kubernetes can use two types of data volume:
- Azure Disks using Azure Premium (SSDs) or Azure Standard (HDDs).
- Azure Files using a SMB 3.0 share backed by an Azure Storage account.
In Kubernetes, Volumes are an abstraction of file systems accessible from within a Pod's containers.
- Network storage devices, such as gcePersistentDisk
- emptyDir, where the data can be stored in RAM using a tmpfs file system
- hostPath, where the volume is located within the node's file system. Because pods are expected to be created and destroyed on any node (and nodes may themselves be destroyed and recreated), hostPath volumes are discouraged.
Volumes are declared in .spec.volumes and mounted into containers in .spec.containers[*].volumeMounts.
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    emptyDir: {}
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    hostPath:
      path: /var/data
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"
apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    gcePersistentDisk:
      pdName: my-disk
      fsType: ext4
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"
- Worker
- see Node