Containers

Containers run applications in isolated namespaces, meaning an application only has access to resources that are made available to it by the container runtime. Resource governance means that a container has access only to a specified amount of processor time, system memory, and other resources. Containers allow applications to be packaged with their dependencies in container images, which run the same regardless of the underlying operating system or infrastructure and are downloaded from container registries like Docker Hub. Container registries are not to be confused with repositories, which are subcomponents of registries.

Cgroups

History

"Task Control Groups" were first merged into kernel 2.6.24 with the ability for multiple hierarchies to be created. The logic behind the creation of multiple hierarchies was to enable maximum flexibility in policy creation. However, because a controller could only belong to a single hierarchy, after some years this began to be seen as a design flaw.

This motivated a redesign into a single unified cgroup hierarchy, and v2 was merged in 3.16 and made stable in 4.5.

Control groups, or cgroups, are a Linux kernel feature that isolates, labels, and manages resources (such as CPU time, memory, and network bandwidth) for collections of tasks (processes).

Cgroup subsystems (also called controllers or resource controllers in documentation) each represent a single resource (e.g. io, cpu, memory, devices).

  • freezer suspends or resumes tasks in a cgroup
  • net_cls tags network packets with a class identifier that allows the Linux traffic controller to identify packets originating from a particular cgroup task
  • net_prio allows network traffic to be prioritized per interface
  • ns is the namespace subsystem

Since cgroups v2, all mounted controllers reside in a single unified hierarchy. A list of these is generated by the Linux kernel at /proc/cgroups.

You can confirm that the cgroup2 filesystem is mounted at /sys/fs/cgroup

mount -l | grep cgroup
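
The controllers available in the unified hierarchy can also be read directly from the cgroup2 filesystem (path assumes the standard mount point):

cat /sys/fs/cgroup/cgroup.controllers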

Kubernetes

History

Kubernetes was first announced by Google in mid-2014. It had been developed at Google after a decision to open-source the Borg system, the cluster and container management system that formed the automation infrastructure powering the entire Google enterprise. Kubernetes coalesced from a fusion between developers working on Borg and Compute Engine. Borg eventually evolved into Omega.

By that time, Amazon had established a market advantage, and the developers decided to change their approach by introducing a disruptive technology that would drive the relevance of the Compute platform Google had built: a ubiquitous abstraction that Google could run better than anyone else.

At the time, Google had been trying to engage the Linux kernel team and overcome its skepticism. Internally, the project was framed as offering "Borg as a Service", although there were concerns that Google was in danger of revealing trade secrets.

Google ultimately donated Kubernetes to the Cloud Native Computing Foundation.

Kubernetes's heptagonal logo alludes to its original name, "Project 7", itself a reference to Star Trek's ex-Borg character Seven of Nine.

Kubernetes (Greek for "helmsman", "pilot", or "captain"; "k8s" for short) has been the industry's leading container orchestrator since 2018. It provides a layer that abstracts infrastructure, including computers, networks, and storage, for applications deployed on top.

Kubernetes can be visualized as a system built from layers, with each higher layer abstracting the complexity of the lower levels. One server serves as the master, exposing an API for users and clients, assigning workloads, and orchestrating communication between other components. The processes on the master node are also called the control plane. Other machines in the cluster are called nodes or workers; they accept and run workloads using available resources. A Kubernetes configuration file is called a kubeconfig.

Kubernetes resources or objects, each associated with a URL, represent the configuration of a cluster. Resource and object are often used interchangeably, but more precisely the resource refers to the URL path that points to the object, and an object may be accessible through multiple resources. Every object type in the Kubernetes API has a controller (e.g. the Deployment controller) that reads desired state from the Spec section of the manifest and reports its actual state by writing to the Status section.

An object's manifest, presented in JSON or YAML, represents its declarative configuration and contains four sections:

  • type metadata, specifying the type of resource
  • object metadata, specifying the name and other identifying information
  • spec, specifying the desired state of the resource
  • status, produced strictly by the resource controller and representing the current state of the resource
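
As a sketch, the four sections map onto a manifest like this (values are illustrative; status is written by the controller, never supplied by the user):

apiVersion: v1   # type metadata
kind: Pod        # type metadata
metadata:        # object metadata
  name: example
spec:            # desired state
  containers:
  - name: example
    image: nginx
status:          # actual state, reported by the controller
  phase: Running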

An explanation of each field available in the API of any object type can be displayed on the command-line

kubectl explain nodes
kubectl explain no.spec

Display the manifest or a human-readable description of a node
kubectl get node $NODE -o yaml
kubectl describe node kind-worker-2

A pod is the most atomic unit of work, encompassing one or more tightly coupled containers that are deployed together on the same node. All containers in a pod share the same Linux namespaces, hostname, IP address, network interfaces, and port space. This means containers in a pod can communicate with each other over localhost, although care must be taken to ensure individual containers do not attempt to use the same port. However, their filesystems are isolated from one another unless they share a Volume.

Every Pod occupies one address in a shared range, so communication between Pods is simple.

Compute resources of containers can be limited at pod.spec.containers.resources.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "500m"
      limits:
        memory: "128Mi"
        cpu: "500m"
Kubernetes can monitor Pod health by using probes, which are categorized by what they measure (a sketch of both follows the list):

  • Readiness: is the container ready to serve user requests?
  • Liveness: is the container running as intended?
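
A sketch of both probe types on a single container (the endpoint path, port, and timings are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: probed
spec:
  containers:
  - name: web
    image: nginx
    readinessProbe:        # failing: Pod is removed from Service endpoints
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 5
    livenessProbe:         # failing: container is restarted
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 10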

Tasks

GKE

Google Kubernetes Engine nodes are actually Google Compute Engine VMs.

Create a GKE cluster
gcloud container clusters create hello-cluster --num-nodes=1    # Standard cluster
gcloud compute instances list # (1)
gcloud container clusters create-auto hello-cluster # (2)
gcloud container clusters describe hello-cluster

  1. GKE nodes appear here as ordinary Compute Engine VM instances.
  2. If a default zone is set, an Autopilot cluster can't be created without explicitly specifying --region.

Save a Kubernetes cluster's credentials to a kubeconfig.

gcloud container clusters get-credentials my-cluster

Windows Server

Windows Server 2016 supports Windows Server Containers, which share the host's kernel, and Hyper-V Containers, which give each container a separate copy of the operating system kernel. The "Containers" feature must be installed on Windows Server 2016 hosts, and to create Hyper-V containers the Hyper-V role must also be installed (although the Hyper-V management tools are not necessary if VMs are not going to be created). Windows container hosts need to have Windows installed to C:.

Nano Server could once serve as a Docker host, but no longer; Nano Server is now intended to be deployed as a container itself.

The PowerShell Docker module has been deprecated for years.

Commands

cgconfig.service

cgconfig.service, part of the libcgroup package, can be run at boot to re-establish predefined cgroups from its configuration file, /etc/cgconfig.conf.
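
A minimal /etc/cgconfig.conf sketch (the group name and limit values are illustrative):

group mygroup {
    cpu {
        cpu.shares = 512;
    }
    memory {
        memory.limit_in_bytes = 256M;
    }
}

Enabling cgconfig.service then re-creates these groups at every boot.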

kubectl

Show available contexts
kubectl config get-contexts
Switch to a different context
kubectl config use-context $CONTEXT
Display resources
kubectl get nodes
kubectl get pods
kubectl get deployments
kubectl describe nodes/gke-*4ff6f64a-6f4v
Execute a command on a pod with only a single container, returning output
kubectl exec $pod -- env
Get a shell to a running container
kubectl exec --stdin --tty $pod -- /bin/bash

When a pod contains more than one container, the container must be specified with -c/--container.

kubectl exec --stdin --tty $pod --container $container -- /bin/bash

kubectl run nginx --image=nginx
kubectl delete pod nginx
kubectl create deployment nginx --image=nginx

The number of replicas can be set when creating a deployment by specifying an argument to --replicas

kubectl create deployment nginx --image=nginx --replicas=4

The replica count of an existing deployment is changed by scaling

kubectl scale deploy/nginx --replicas=2

Expose a port

kubectl expose deployment/nginx --port=80 --type=LoadBalancer
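
The port mapping and external IP assigned by the load balancer can then be read back from the Service:

kubectl get service nginx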

List Kubernetes objects

kubectl api-resources

Get a description of a resource

kubectl explain nodes.status.addresses.address

podman

On RHEL, podman can be installed as a package or as part of a module

dnf module install container-tools

With few exceptions, podman exposes a command-line API that closely imitates that of Docker.

Arch Linux

On Arch, /etc/subuid and /etc/subgid have to be set. These are colon-delimited files that define the ranges for namespaced UIDs and GIDs to be used by each user. Conventionally, these ranges begin at 100,000 (for the first, primary user) and contain 65,536 possible values.

terry:100000:65536
alice:165536:65536

Then podman has to be migrated

podman system migrate

Podman supports pulling from various registries using aliases that are defined in /etc/containers/registries.conf.d. RHEL and derivative distributions support additional aliases, some of which reference images that require a login.

For example, Red Hat offers a Python 2.7 runtime from the RHSCL (Red Hat Software Collections) repository on registry.access.redhat.com, which does not require authentication. However, Python 3.8 is only available from registry.redhat.io, which does. Interestingly, other Python runtimes are available from the ubi7 and ubi8 repos from unauthenticated registries.

Container images are stored in ~/.local/share/containers/storage.

podman pull rhscl/httpd-24-rhel7 # (1)

  1. Alias to registry.access.redhat.com/rhscl/httpd-24-rhel7

On systems running SELinux (like RHEL and derivatives), rootless containers must be explicitly allowed to access bind mounts, because containerized processes are not allowed to access files without an appropriate SELinux label. The Z option tells Podman to label the content with a private unshared label.

podman run -d -v=/home/jasper/notes/site:/usr/share/nginx/html:Z -p=8080:80 --name=notes nginx
podman run -d -v=/home/jasper/notes/site:/usr/local/apache2/htdocs:Z -p=8080:80 --name=notes httpd-24

Mapped ports can be displayed

podman port -a

Output a systemd service file from a container to STDOUT (this must be redirected to a file)

podman generate systemd notes \
    --restart-policy=always \
    --name \ # (3)
    --files \ # (2)
    --new # (1)

  1. Yield unit files that do not expect containers and pods to exist, but rather create them based on their configuration files.
  2. Generate a file named with a prefix (settable with --container-prefix or --pod-prefix) followed by the ID, or by the name if --name is also specified.
  3. In conjunction with --files, name the service file after the container name instead of the ID.

systemd-cgls

systemd-cgls recursively shows the contents of the selected cgroup hierarchy in a tree.

Glossary

apiVersion

Kubernetes object field found in Type metadata.

apiVersion is typically v1, but for some object types the API group is specified, e.g. for Deployments:

apiVersion: apps/v1

Dockerfile

A Docker image consists of read-only layers, each of which represents an instruction that incrementally changes the image being built. Dockerfiles are used to construct new images with docker build. The build process can be optimized by placing multiple commands in the same RUN instruction. Dockerfiles are named simply "Dockerfile", with no extension or variation.

FROM alpine
RUN apk update && apk add nodejs
COPY . /app
WORKDIR /app
CMD ["node","index.js"]

FROM microsoft/windowsservercore
RUN powershell -command install-windowsfeature dhcp -includemanagementtools
RUN powershell -configurationname microsoft.powershell -command add-dhcpserverv4scope -state active -activatepolicies $true -name scopetest -startrange 10.0.0.100 -endrange 10.0.0.200 -subnetmask 255.255.255.0
RUN md boot
COPY ./bootfile.wim c:/boot/
CMD powershell

FROM microsoft/windowsservercore
MAINTAINER @mike_pfeiffer
RUN powershell.exe -Command Install-WindowsFeature Web-Server
COPY ./websrc c:/inetpub/wwwroot
CMD [ "powershell" ]
Deployment
A uniformly managed set of Pod instances, all based on the same container image. The Deployment controller enables release capabilities such as deploying new Pod versions with no downtime. Exposing a Deployment creates a Service.
Desired State Management
The Desired State Management system is used by Kubernetes to describe a cluster's desired state declaratively.
emptyDir
Ephemeral Kubernetes volume type that shares the Pod's lifetime. emptyDir volumes can be backed by a tmpfs (RAM-backed) file system by setting medium: Memory.
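
A sketch of a RAM-backed emptyDir (names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: cache-demo
spec:
  containers:
  - name: app
    image: alpine
    volumeMounts:
    - mountPath: /cache
      name: cache
  volumes:
  - name: cache
    emptyDir:
      medium: Memory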
ENTRYPOINT

Rarely used Dockerfile instruction. When ENTRYPOINT is present, the CMD instruction becomes the default arguments passed to the command in ENTRYPOINT.

The Kubernetes pod.spec.containers.command field can override the contents of ENTRYPOINT.
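
A sketch of the interaction (image contents are illustrative):

FROM alpine
ENTRYPOINT ["echo"]
CMD ["hello"]

Running this image with no arguments prints "hello"; running it as docker run <image> world overrides CMD and prints "world" instead.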

etcd
Distributed key-value data store that serves as the backing store for all Kubernetes cluster state
Event
Kubernetes object type that records what happened to another object. Events are deleted one hour after creation by default.
kubectl get events
kubectl get ev
Unlike most other objects, Event manifests have no spec or status sections.

Helm

Helm is a package manager for Kubernetes.

Helm packages are referred to as charts. Charts are a collection of files and directories that adhere to a specification. A chart is packaged by tarring and gzipping it.

  • Chart.yaml contains metadata
  • templates/ contains Kubernetes manifests potentially annotated with templating directives
  • values.yaml provides default configuration

It is managed using the helm CLI utility.

Create a new chart
helm create foo

There is no longer a default Helm repository, although many are available at the Artifact Hub.
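
For example, a repository listed on the Artifact Hub can be added and a chart installed from it (Bitnami's public repository shown; the release and chart names are illustrative):

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-release bitnami/nginx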

kind
Kubernetes object field found in the Type metadata which specifies the type of resource, i.e. Node, Deployment, Service, Pod, etc.
kubeconfig
YAML configuration file located at $HOME/.kube/config by default. A colon-delimited list of kubeconfigs can be specified by setting the KUBECONFIG environment variable. A kubeconfig can be explicitly specified with the --kubeconfig flag.
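
For example (file names other than the default are illustrative):

export KUBECONFIG=$HOME/.kube/config:$HOME/.kube/dev-config   # merge two kubeconfigs
kubectl config get-contexts                                   # shows contexts from both files
kubectl --kubeconfig=$HOME/.kube/prod-config get nodes        # explicit file for one invocation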
Label

Labels are key-value pairs that are attached to Kubernetes objects.

Config for a Pod with two labels:

apiVersion: v1
kind: Pod
metadata:
  name: label-demo
  labels:
    environment: production
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80

Master node

A master node runs 3 processes, called master (control plane) components:

  • kube-apiserver exposes a RESTful API and serves as a glue between other Kubernetes components
  • kube-scheduler determines how to balance container workloads across nodes based on available resources and constraints
  • kube-controller-manager performs cluster operations like managing nodes and driving actual state toward the desired state
millicore (m)
One-thousandth of a vCPU or a CPU core and the preferred measurement unit of compute resources in Kubernetes (i.e. 128m = 12.8% of a CPU core and 2000m = 2 CPU cores).

Namespaces

Namespaces wrap global system resources (mount points, network devices, hostnames) in an abstraction that makes it appear to processes within that namespace that they have their own isolated instance of that resource.

Processes in the same PID namespace can see one another, whereas those in different namespaces cannot. Spawning a process in a new PID namespace prevents it from seeing the host's context, so an interactive shell like zsh spawned in its own namespace will report its PID as 1, even though the host assigns it an ordinary PID.
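
This can be demonstrated with util-linux's unshare (a sketch; requires root, and the choice of zsh is illustrative):

sudo unshare --pid --fork --mount-proc zsh
echo $$    # prints 1 inside the new PID namespace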

Node

A node or worker is any container host that accepts workloads from the master node. Each node is equipped with a container runtime like Docker, which it uses to create and destroy containers according to instructions from the master server.

Each node runs 2 processes:

  • kubelet communicates with Kubernetes cluster services
  • kube-proxy handles container network routing using iptables rules
PersistentVolume
A PersistentVolume is a piece of storage in the cluster that has been provisioned using Storage Classes.
PersistentVolumeClaim
A PersistentVolumeClaim requests storage of a particular StorageClass, access mode, and size (in Azure, either Disk or File storage). It is bound to a PersistentVolume once an available storage resource has been assigned to the pod requesting it.
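
A minimal PersistentVolumeClaim sketch (the name, class, and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi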
Pod

A pod is the most basic unit that K8s deals with, representing one or more tightly-coupled containers that should be controlled as a single application (typically one main container with subsidiary helper containers). Every container should have only a single process, so if several processes need to communicate they should be implemented as separate containers in a pod.

A pod's containers should:

  • operate closely together
  • share a lifecycle
  • always be scheduled on the same node
Replica
An instance of a Pod
ReplicaSet
Ensures that a specified number of identical Pod replicas are running at any given time. Deployments manage ReplicaSets and are the recommended way to use them.
Selector

A label selector provides a way to identify a set of objects and is the core grouping primitive supported by Kubernetes. It can be made of multiple requirements that are comma-separated, all of which must be satisfied.

There are two types of selector: equality-based, e.g.

environment = production
tier != frontend

and set-based, e.g.

environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
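
Selectors of either type are passed to kubectl with -l/--selector:

kubectl get pods -l environment=production,tier=frontend
kubectl get pods -l 'environment in (production, qa)'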
Service
A Service is an abstraction over a logical set of Pods and a policy by which to access them, i.e. a microservice. Because Pods are mortal, the Service controller keeps track of Pod addresses and publishes this information to the consumers of Services, a function called service discovery.
tmpfs
RAM-backed file system used in Docker containers
Volume

A volume is a special directory in the Docker host that can be mounted into a container to achieve persistent storage.

In Azure, a volume represents a way to store, retrieve, and persist data across pods and through the application lifecycle. In the context of Azure, Kubernetes can use two types of data volume:

  • Azure Disks using Azure Premium (SSDs) or Azure Standard (HDDs).
  • Azure Files using a SMB 3.0 share backed by an Azure Storage account.

In Kubernetes, Volumes are an abstraction of file systems accessible from within a Pod's containers. Common types include:

  • network storage devices, such as gcePersistentDisk
  • emptyDir, where the data can be stored in RAM using a tmpfs file system
  • hostPath, where the volume is located within the node's file system. Because pods are expected to be created and destroyed on any node (and nodes may themselves be destroyed and recreated), hostPath volumes are discouraged.

Volumes are declared in .spec.volumes and mounted into containers in .spec.containers[*].volumeMounts.

apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    emptyDir: {}
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"

apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    hostPath:
      path: /var/data
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"

apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  volumes:
  - name: data
    gcePersistentDisk:
      pdName: my-disk
      fsType: ext4
  containers:
  - name: alpine
    image: alpine
    volumeMounts:
    - mountPath: "/data"
      name: "data"
Worker
see Node