Linux Containers have become increasingly important over the last few years, in particular driven by Cloud applications. But what are containers, and how can they be useful?
This is an expanded, English-language version of a presentation I had to prepare during my studies. As such there are likely to be some misunderstandings. I look forward to being contacted with corrections.
This is a fairly low-level view of container concepts, not a tutorial for a specific technology like Docker.
A container is just an ordinary process with added security features. It is therefore important that we start with a quick recap of the Unix process model and security model.
Under Unix, every process has various attributes. These include the ID of the user running the process, the current working directory, and the priority (nice value), but also resources like the memory address space and open file descriptors.
When a new process is started (e.g. with fork()), it inherits all of these attributes from its parent. The processes therefore form a tree structure.
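This inheritance can be observed directly, e.g. from Python. A minimal sketch (the working directory and the DEMO_VAR variable are just example attributes):

```python
import os

# Set some process attributes in the parent ...
os.chdir("/tmp")
os.environ["DEMO_VAR"] = "inherited"
parent_cwd = os.getcwd()

pid = os.fork()
if pid == 0:
    # ... the child starts as a copy and sees the same attributes.
    ok = os.getcwd() == parent_cwd and os.environ.get("DEMO_VAR") == "inherited"
    os._exit(0 if ok else 1)

_, status = os.waitpid(pid, 0)
exitcode = os.waitstatus_to_exitcode(status)
print("child saw inherited attributes:", exitcode == 0)
```

The child can of course change its own attributes afterwards (e.g. chdir somewhere else) without affecting the parent.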
The top process in the process tree is the init process. This is the process that the kernel starts upon booting. Init can then start further processes and services, for example a login shell. On modern Linux systems, init is usually provided by systemd.
Processes communicate with the kernel via two mechanisms. Firstly via syscalls, a kind of special function call into the kernel. Secondly via pseudo-filesystems such as /proc and /sys. When a file is read from these directories, the kernel doesn't open an actual file; instead, it generates various data and parameters on the fly. Which information can be accessed depends on the permissions of the process reading that pseudo-file. Some of these entries are also writeable, so that processes can send data to the kernel.
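Reading a pseudo-file looks like an ordinary file read, but the contents are generated by the kernel on the fly. A small sketch (Linux-only, hence the guard):

```python
import os

fields = {}
if os.path.exists("/proc/self/status"):  # /proc is Linux-specific
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    # The kernel reports attributes of the *reading* process itself.
    print("Pid reported by the kernel:", fields["Pid"])
    print("matches os.getpid():", int(fields["Pid"]) == os.getpid())
```

Every process reading /proc/self/status sees its own attributes, which illustrates that no actual file is behind this path.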
Unix security model
The Unix permission model is extremely simple. There are two kinds of users: The super-user “root” can do everything. Normal users are restricted.
This permission model is primarily concerned with limiting access to files. Every file belongs to a user and a group. Permissions to read, write, and execute the file can be configured separately for the owning user, the members of the owning group, and for other users.
This permission model is often entirely inadequate, because it is mostly focused on files and not on other resources. And even for files it is very limited: e.g. it is not possible to set different permissions for multiple groups. As a consequence, other permission models like SELinux appeared later.
History of Containers
While Linux Containers are a fairly new development, to my surprise the history of containers reaches back many decades.
In 1979 the Unix chroot() syscall (change root) was implemented. It allows a process to use some directory as its new file system root “/”. Directories outside of that new root are not accessible, so chroot() allows us to choose a particular view of the filesystem. The root directory is a process attribute that is inherited by any new processes started from the chrooted process.
Although it is sometimes used that way, chroot() was never intended as a security feature. Instead, the goal was to create a clean file system within a chroot environment, e.g. to compile software in that clean environment. Many Linux distributions like Debian use chroot-based build systems to create software packages. Because the chroot environment usually contains only the minimal set of files necessary for a working system, any implicit dependencies on other software are revealed.
In 2000 the chroot() concept was developed further into jails by the FreeBSD project. Jails are intended as a secure, isolated environment, not just as an isolated view onto the file system. The security checks were expanded from the file system to all syscalls, so that a jail appears similar to a virtual machine. This monolithic all-or-nothing approach is much simpler than comparable Linux features.
But jails also have a number of restrictions. For example, jails do not have their own networking stack but only an individual IP address. Jails cannot mount their own file systems. And since all jails on a system share a kernel, it is not possible to load kernel modules from within a jail.
Solaris Zones appeared in 2005. They use a similar approach to FreeBSD jails, in the sense that zones present a monolithic security solution, but with a number of improvements. Zones are deeply integrated with the ZFS filesystem. Because ZFS supports copy-on-write (CoW), zones can be cloned quickly without consuming much space at all. Zones can also mount their own filesystems. Another big improvement is that zones can perform virtual networking among each other. A unique feature of Solaris is that the kernel can emulate the syscall interface of other kernels, be it an earlier Solaris release or even a Linux kernel. This allows a Solaris 10 host to run Solaris 9 zones.
Linux does not offer a monolithic container solution such as jails or zones, but has various security features that can be combined to implement containerisation.
The most important of these are Namespaces, most of which were merged into the Linux Kernel in 2006. Namespaces offer an isolated view onto various Kernel resources. This is crucial for any container implementation, and will be discussed later in more detail.
Between 2006 and 2008 Google developed CGroups (Control Groups) for the Linux Kernel. With CGroups, various priorities and resource limits can be managed for a group of processes. For example, this could be the amount of available memory, or a portion of CPU timeslices.
With Namespaces and CGroups implemented, all the features necessary for Linux containers were in place. The LXC project (LinuX Containers) implemented one containerisation solution out of those parts. LXC offers a low-level collection of tools to manage containers.
In 2013 Docker was published. Depending on the configuration, Docker builds upon LXC or other container implementations, but offers a simpler, high-level user interface. The following year Kubernetes appeared, a container cluster orchestration software started by Google.
Different Virtualisation for different use cases
These different approaches to virtualisation and containerisation do not all have the same goals, but cover a whole range of use cases.
For hardware virtualisation or para-virtualisation, fully fledged virtual machines are the only solution. Para-virtualisation describes a scenario where the guest system was modified to communicate with the hypervisor via a special virtualisation interface, rather than requiring the hypervisor to emulate a firmware interface.
This kind of virtualisation has the advantage that there is a very clear boundary between the guest system and hypervisor, which lends itself well to security-sensitive applications, including public-cloud offerings. It also allows the guest system to use a different operating system than the host, e.g. Linux on Windows or vice versa.
If operating system virtualisation is required, then Jails, Zones, or Containers can be used. Container-based VMs are in many cases not distinguishable from real VMs, in particular if the container is running a real init system. But compared to VMs there are also many additional restrictions since containers share the Kernel with their host.
It is often not necessary to virtualise the whole operating system. Instead, we often just want to isolate a particular application. If this isolation is intended to enforce a security boundary, then Containers or SELinux can be used. If the focus lies on dependency isolation and configuration management, then Containers, chroot environments, but also application containers like Snap or Flatpak are viable approaches.
So depending on how they are configured, containers can address many different problems.
What are containers?
A Linux container is an ordinary process that combines various security features of the Linux kernel:
Container = Processes + Namespaces + CGroups + ⋯
In particular, these are the following features:
With Namespaces a process can obtain its own copy of various kernel objects. These namespaces offer isolation with regard to specific kernel subsystems. The available namespaces are:
- Mount – separate file system mount table
- IPC – separate IPC resources such as shared memory
- UTS – individual hostname
- PID – process visibility and separate PID ranges
- Net – separate networking
- User – separate users
- cgroups – separate cgroup hierarchies
The Mount, PID, Net, and User namespaces are discussed later in more detail.
To be precise, a namespace does not always perform an actual copy. Instead, namespaces are abstractions that make changes to a global kernel resource visible only to other members of the same namespace.
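Which namespaces a process belongs to can be inspected under /proc/self/ns, where each entry is a symlink naming the namespace type and an inode number identifying the instance. A small sketch (Linux-only):

```python
import os

namespaces = {}
ns_dir = "/proc/self/ns"
if os.path.isdir(ns_dir):  # Linux-only
    for name in sorted(os.listdir(ns_dir)):
        # e.g. "mnt" -> "mnt:[4026531840]"; two processes with equal
        # inode numbers are members of the same namespace instance.
        namespaces[name] = os.readlink(os.path.join(ns_dir, name))
    for name, ident in namespaces.items():
        print(f"{name:10} {ident}")
```

Outside of a container, all these links point to the initial namespaces that every process shares by default.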
The cgroup (control group) mechanism allows detailed resource limits for process groups, and the cgroup namespace can isolate these policies from one another. For example, cgroups can specify how many CPU timeslices are available to a container.
Capabilities are a fine-grained description of the actions which the root user can perform. Usually, root is completely unrestricted. In containers this is not desirable because these capabilities could be used for privilege escalation. Therefore, the container process drops unneeded capabilities. The capabilities are checked by the kernel on each syscall.
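The capability sets of a process can be read from /proc/self/status, where each capability corresponds to one bit of a mask. A minimal sketch (the bit number for CAP_SYS_ADMIN is taken from linux/capability.h):

```python
import os

cap_eff = None
if os.path.exists("/proc/self/status"):  # Linux-only
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                cap_eff = int(line.split()[1], 16)  # effective capabilities as a bitmask

if cap_eff is not None:
    CAP_SYS_ADMIN = 21  # bit number from linux/capability.h
    print(f"CapEff = {cap_eff:#018x}")
    print("CAP_SYS_ADMIN:", bool(cap_eff & (1 << CAP_SYS_ADMIN)))
```

For an unprivileged process the mask is typically all zeroes; a container runtime grants its containerized root only a small, explicitly chosen subset of bits.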
SELinux and AppArmor are access control mechanisms with a much more fine-grained permission model than the default Unix permission model. In particular, they can limit which resources (file system paths, ports, …) can be accessed by an application.
Chroot may still be a component of a container implementation, though its functionality is generally subsumed by Mount-Namespaces.
Seccomp can filter the allowed syscalls of a process. With seccomp, dangerous syscalls can be denied within a container. In the STRICT configuration the process is left with only four syscalls: _exit() to terminate itself, sigreturn() to receive signals, and read() and write() on previously opened file descriptors. This allows untrusted code to be executed within a sandbox. Aside from Linux containers, this is also used by the Google Chrome browser to isolate renderer and plugin processes.
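The effect of STRICT mode can be demonstrated via the prctl() syscall. A sketch under some assumptions: PR_SET_SECCOMP and SECCOMP_MODE_STRICT are the constant values from the kernel headers, and the child is killed with SIGKILL as soon as it attempts any syscall outside the allowed set (here, Python's exit path itself already triggers that):

```python
import ctypes
import os
import signal
import sys

status = 0
if sys.platform == "linux":
    libc = ctypes.CDLL(None, use_errno=True)
    PR_SET_SECCOMP, SECCOMP_MODE_STRICT = 22, 1  # from linux/prctl.h, linux/seccomp.h

    pid = os.fork()
    if pid == 0:
        if libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0:
            os._exit(2)  # strict mode unavailable (e.g. a filter is already active)
        # From here on, only read/write/_exit/sigreturn are permitted.
        # os._exit() uses the exit_group syscall, which is NOT on the
        # strict allowlist, so the kernel kills this child.
        os._exit(0)

    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        print("child killed by signal:", signal.Signals(os.WTERMSIG(status)).name)
    else:
        print("child exited with code:", os.WEXITSTATUS(status))
```

A real sandbox would arrange for the untrusted code to only ever need the four allowed syscalls, e.g. by communicating exclusively over pre-opened pipes.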
The Kernel also contains a number of file systems that are useful for containers. Union-filesystems can overlay multiple file systems which allows parts of container images to be shared.
To start a containerized process, the container implementation will take the following steps:
A new process is started, typically with the clone() syscall. clone() is similar to fork(), but is Linux-specific; its flags allow new namespaces to be created for the child.
The new process configures its security environment. This involves mounting the desired filesystems, setting up the correct users, and finally dropping unneeded privileges. In this step Namespaces, Capabilities, and cgroups are utilized.
Afterwards, the actual containerized process can be started. It inherits the container environment that was just configured. This new process would then be the init process for the container.
There are also some alternatives. For example, with setns() an existing process can join an existing namespace.
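These steps can be sketched with fork() plus unshare(), which applies the namespace-creating clone() flags to an already running process. The helper below is hypothetical, and creating a UTS namespace requires CAP_SYS_ADMIN, so an unprivileged run fails cleanly:

```python
import ctypes
import os
import sys

CLONE_NEWUTS = 0x04000000  # from linux/sched.h

def run_with_new_uts_ns(argv):
    """Fork a child, move it into a new UTS namespace, then exec argv.
    Returns the child's exit code (111 if unshare() was not permitted)."""
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        # Steps 1+2: create the namespace and configure the environment.
        if libc.unshare(CLONE_NEWUTS) != 0:
            os._exit(111)  # requires CAP_SYS_ADMIN
        # (a real implementation would also set up mounts, users,
        #  cgroups, and drop capabilities here)
        # Step 3: start the actual containerized process.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(112)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

if sys.platform == "linux":
    rc = run_with_new_uts_ns(["/bin/echo", "hello from a new UTS namespace"])
    print("exit code:", rc)
```

Run as root, the echoed process would see (and could change) a hostname independent of the host's; unprivileged, unshare() returns an error and the sketch reports exit code 111.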
A Mount Namespace creates a copy of the mount table. The mount table describes which file systems are mounted at which directories. The current table can be viewed under /proc/self/mounts; the /proc directory is itself the mount point for the “proc” pseudo-filesystem.
At first, this namespace looks similar to a chroot environment. The difference is that a chroot environment just changes the view onto an existing directory structure. In contrast, a Mount Namespace can freely change which file systems are mounted where without affecting processes outside of the namespace.
Container images are just file systems. The image can be mounted as the new root filesystem inside a mount namespace. If this image is mounted as read-only, this file system can be shared between multiple container instances. This image reuse can save significant amounts of storage (and bandwidth when transferring images).
Many processes will nevertheless have to create or write files within their containers, and need a writeable file system for that. In some cases an ephemeral file system like tmpfs can be mounted over some directory, but that hides the original contents of the directory.
Union file systems like AUFS or OverlayFS present a solution. These combine a (read-only) lower layer with a writable upper layer, each backed by its own file system.
The upper layer only contains files that are changed relative to the lower layer. When a file is read, it is opened from the upper layer if it exists there, otherwise from the lower layer. When a file is written, the write always goes to the upper layer; that may require the file to first be copied up from the lower layer. The upper layer can also contain whiteouts that mark a lower-layer file as deleted. With all these operations the lower layer is never modified.
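These lookup rules can be illustrated with a toy model, where plain dictionaries stand in for the two layers (this is of course not a real filesystem):

```python
# Upper layer wins; a whiteout marker hides a lower-layer file; the
# lower layer itself is never modified.
WHITEOUT = object()

def union_read(upper, lower, path):
    if path in upper:
        entry = upper[path]
        return None if entry is WHITEOUT else entry
    return lower.get(path)

def union_write(upper, path, data):
    upper[path] = data          # writes always go to the upper layer

def union_delete(upper, path):
    upper[path] = WHITEOUT      # mark as deleted without touching the lower layer

lower = {"etc/hostname": "base-image"}   # read-only base image
upper = {}                               # per-container writable layer

union_write(upper, "etc/hostname", "my-container")
print(union_read(upper, lower, "etc/hostname"))   # the changed file

union_delete(upper, "etc/hostname")
print(union_read(upper, lower, "etc/hostname"))   # None: the whiteout hides it
print(lower["etc/hostname"])                      # base image is unchanged
```

Because the lower dictionary is never touched, any number of containers could share it while each keeps its own upper layer.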
A drawback of union file systems is the decreased write performance, since changed files first have to be copied up from the lower layer. There are also some details that make union file systems not fully POSIX-compatible: e.g. OverlayFS cannot rename files, and opening the same file for reading and writing may actually access two different files (on the lower and upper layer, respectively).
Alternatively, copy-on-write (CoW) file systems such as Btrfs and ZFS can be used. With a CoW filesystem, a file or directory can be copied without creating a physical copy, similar to a hardlink. But unlike with hardlinks, a copy is created upon modification, whether that modification happens through the original name or through the copy's name. Because this happens within the file system, it can be much more efficient than OverlayFS, e.g. by performing CoW at the block level rather than at the file level.
With both union file systems and CoW file systems, base images can be shared efficiently and stacked over each other. That also means that changes to an upper layer or a CoW copy can be discarded because the unmodified base image still exists.
This encourages the use of immutable images: instead of maintaining a long-running virtual machine that is updated over time, we never modify a running container's image. When the image needs to be changed (e.g. to update software), a new image with those changes is created and deployed.
A consequence is that containers should not store persistent data within their images. They have to store their data externally. This could be a bind mount in Docker, where a directory of the host system is mounted within the container directory structure. In most cases, a containerized service will store persistent data in a database instead.
These immutable images are a crucial feature of Linux containers. When we later discuss use cases for containers, this immutability will be a recurring theme.
The Net Namespace isolates the networking stack. Every namespace can have its own isolated interfaces, IP addresses, ports, routes, and firewall rules. An interface always belongs to exactly one namespace, so it is not possible for multiple containers to share the same physical interface.
Instead of passing ownership of physical interfaces to containers, it is more common to configure a virtual network. The host creates a pair of connected virtual interfaces (a veth pair), one of which is given to the container. The host can then act as a router between the container's network and outside networks.
The User Namespace defines a mapping between users within the container and users on the host system. For example, the container's root user can be mapped to an unprivileged user such as nobody on the host. If a root process were to escape from the container, it would have no permissions on the host.
An important benefit of user namespaces is the ability to strip capabilities from the container's root user. Whereas the host's root user is allowed to do anything, the container's root should only be permitted actions that affect nothing outside of the container. Capabilities are checked on each privileged operation, e.g. a syscall.
The PID of a process is unique only within a PID Namespace. The first process (init) always has PID 1; that is the case both for the host system's init and for the first process within a container. The container can be started with a normal init system, but usually only a single, ordinary process is executed within the container context.
Processes can only see each other within the same PID namespace. This is necessary e.g. to send signals (kill()). When we list the processes within a container with top, we will only see the processes from that container.
The PID Namespaces form a tree structure: every process is not just a member of its immediate namespace, but also of all outer namespaces. In each outer namespace, the process has a different PID. So the init process with PID 1 within a container may actually have PID 1234 on the host system. The containerized processes can't see processes from outer namespaces, but outside processes can see the containerized processes and send them signals.
Limitations of Namespaces
Some Kernel resources are not isolated with namespaces.
In particular, the system time is global and cannot be changed just for one container. In a production environment this is irrelevant, because correct time is very important there, e.g. for cryptography, and should always be obtained from NTP servers. But in a test scenario it would be very helpful to start a container with a different time, e.g. to check behaviour around leap seconds.
The kernel keyring, which holds various keys and security data, and also the syslog are global. The syslog may contain sensitive information that could aid an escaping process. Loading kernel modules would change the kernel for the whole system and could inject privileged code.
These non-containerizable aspects must therefore be denied to the container by withdrawing the corresponding capabilities.
Unlike Jails or Zones, Namespaces offer many isolation options. On the one hand this high complexity can lead to security problems. On the other hand this flexibility allows many interesting use cases.
It is possible to join an existing namespace. The different namespaces like Mount or Net are generally independent from each other.
For example a web server can run in a container with its own namespaces. It is then possible to start another container with a Wireshark process, where the Wireshark container shares the Net Namespace with the server container. Wireshark can then access the same network interfaces as the server and listen in on the communication.
Instead of every user having to combine these namespaces themselves, a container engine such as Docker provides a namespace profile. Docker in particular then offers many options to fine-tune this profile. So starting from reasonable defaults, you can adjust the container configuration to match your specific circumstances.
A full example of a namespace and capabilities configuration is presented in the article Linux Containers in 500 lines of code by Lizzie Dixon. 
Case Study: 12-Factor App
Under what circumstances and for what goals can containers be used effectively? To investigate that, I looked at the 12-Factor App.
“The Twelve-Factor App” is a collection of best practices for the development of web apps or SaaS offerings, but it also seems applicable to internal services. The characteristic feature is that the software is developed and deployed by the same organization.
The goal of these best practices is simple scalability of the software, and a minimal difference between production and development environments. This allows quick and easy deployments, up to Continuous Deployment.
The 12 factors originated from the experience of Heroku, a cloud platform. So they might be biased towards cloud applications. But this collection allows us to discuss how containers can help with a DevOps process. I’ll only discuss the relevant points, not all twelve factors.
Explicitly declare and isolate dependencies
The application should not expect any software to be pre-installed. Instead, all dependencies should be made explicit so that they can be resolved automatically, or the application should bundle all of its dependencies.
In the simplest case this can be done through static linking, so that the application is a single executable file that contains all its libraries. In some cases, the executable may even embed data files and other resources. It's also possible to use package managers or write installation scripts that fetch and install the dependencies.
Container images can also be a solution. The image includes all necessary files, libraries, and services that are required by the application. Because the image is built from a bare-bones system, missing dependencies can be revealed early, before the image is deployed to the production servers.
Store config in the environment
Configuration in the sense of the 12-Factor App is everything that differs between execution environments (dev, staging, prod). This could be authentication keys for backing services, hostnames under which the application should run, or URLs of backend services.
This config should not be stored in a config file, because such a file would have to be edited before each deployment. Instead, the config should be provided implicitly by the environment, e.g. as environment variables. In particular, the deployment mechanism is then responsible for providing the correct config.
Under Docker a part of this configuration can be provided by linking containers with each other. In a Kubernetes cluster, the config would be provided by etcd instead.
Providing config externally is necessary when using container images as a deployment artefact. If you were to update the image with new config values prior to deployment, you would no longer be deploying the same artefact that was tested previously.
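In code this simply means reading the environment instead of a config file. A minimal sketch (the variable names and defaults are hypothetical):

```python
import os

# The deployment mechanism (Docker, Kubernetes, a shell script, ...) sets
# these variables; the application itself stays identical in every environment.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")

print("database:", DATABASE_URL)
print("log level:", LOG_LEVEL)
```

With sensible development defaults as above, the same code runs unchanged on a developer machine and, with the variables set, in production.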
Treat backing services as attached resources
Here, services could be databases, message queues, email gateways, or external providers such as Amazon S3 or Twitter.
The goal is to represent all services via a URL. This makes services easily replaceable. The difference between local and external services disappears, and it becomes less important whether a service is provided internally or by an external provider.
Because containers are isolated from each other, communication over a network is the primary mechanism to connect containers. So a containerized system will satisfy this recommendation by itself. In particular, containers are well suited for implementing a microservice architecture.
Strictly separate build and run stages
The 12-Factor App recommends that build, release, and run are conceptually distinct stages of your deployment process.
- In the build step, an executable system is created from the source code. This resolves all dependencies.
- In the release step, a build is combined with the configuration for a particular environment.
- In the run step, a release is executed in the target environment.
Note that a build is distinct from a release because configuration should be pulled from the environment. I.e. the build will not contain any configuration. Also, the release step is distinct from the run step as you may launch multiple processes of the same release.
The goal of this clear distinction is that you never change the code at run time. Logging into a production server to fix a bug there is completely unacceptable.
A benefit of this approach is that you can easily roll back to a previous release. All releases should be assigned a version number and stored permanently.
This can be implemented easily with immutable images, as encouraged by Docker. The build step would correspond to the docker build command that combines all dependencies into a runnable image. The release step corresponds to selecting a set of arguments for docker run, for example linking a container with a production database. In the run step, one or more container instances of this image with that configuration can be launched. Whenever the deployed release should change, a new container is launched rather than updating the deployed image.
Execute the app as one or more stateless processes
All persistent data belongs into a database or another external service, and should not be stored as part of the application. Non-persistent data such as local caches are OK, as long as they are not treated as a source of truth.
This is necessary if you want to run multiple processes in parallel. By storing the data externally, restarting a process or deploying a new version also becomes much easier.
Again, this corresponds very closely to the concept of immutable images. Because changes to the image are discarded, it is not possible to store any persistent data within the image. Instead, external services or volume mounts have to be used.
Implement services as self-contained servers
Services and apps should not be developed as plugins or modules for another system, but should be independent processes of their own. These processes can then communicate over network protocols such as HTTP, REST, or message queues.
Examples for modules or plugins would be servlets in a Tomcat server, or an Apache module. Similarly, copying some PHP files to an Apache server would be considered problematic because the app is not running as its own process, but as a part of Apache.
The goal here isn’t just that the apps are easy to launch, without having to fuss around with any plugin configuration. The big benefit of communicating only via the network is that an app can be reused as a backing service of another app, i.e. that a service-oriented architecture becomes possible.
Containers are exactly that: processes that can only communicate with each other over the network. So deploying a service as an otherwise isolated container satisfies this recommendation.
Scale out via the process model
A lot of software can scale internally, e.g. by launching multiple threads. But there are limits as to how far this can scale.
Instead, the software should assume that multiple instances of the service are running in parallel. A load balancer can then distribute requests over all instances.
This requires a share-nothing architecture. A process cannot depend on internal locks. Instead any global state must be stored in a common database. Every individual process is then stateless.
Such a scalable system cannot scale itself. Instead, an external process manager decides how many instances of each service should be running at any time. This could be scheduling/orchestration software such as Kubernetes that launches container instances as required.
Instances should be quick to start and quick to shut down. This encourages frequent and painless deployments. Additionally, the system gains resilience against failures, e.g. if a process crashes.
It is helpful to perform external state changes transactionally: they either happen completely or not at all. Every incoming job must be completed exactly once. If a process stops during a transaction, no work gets lost and the job can be run again when the process is restarted. When done consistently, this could even be a crash-only design.
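A sketch of this pattern, using a hypothetical worker loop over an at-least-once queue: a job is only removed after its result is recorded, and a job ID makes reprocessing after a crash idempotent:

```python
from collections import deque

def run_worker(queue, processed, results):
    """Process jobs transactionally: a job is acknowledged (removed) only
    after its result is recorded, and duplicate deliveries are skipped."""
    while queue:
        job = queue[0]                      # peek; a crash here loses nothing
        if job["id"] not in processed:      # idempotence: skip duplicates
            results.append(job["payload"].upper())
            processed.add(job["id"])
        queue.popleft()                     # acknowledge only after success

queue = deque([
    {"id": 1, "payload": "resize image"},
    {"id": 1, "payload": "resize image"},   # redelivered after a simulated crash
    {"id": 2, "payload": "send mail"},
])
processed, results = set(), []
run_worker(queue, processed, results)
print(results)   # each job is applied exactly once
```

In a real system the queue, the processed-ID set, and the results would live in external services that survive a container restart, which is exactly what the stateless-process factor demands.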
Containers can be useful because they can potentially launch faster than a virtual machine. But fundamentally this point is more about the internal architecture of a software than its execution environment. A stateless architecture as discussed above helps here.
Keep development, staging, and production as similar as possible
The objective here is to get smooth deployments, up to Continuous Deployment. If the development environment is too different from production, deployments carry a lot of risk and possibly require extensive adjustments. But if all environments are very similar, problems can be found early and fixed more easily.
An important aspect is that services in all environments use the same software in the same version. So don't use SQLite as a test database when production runs on PostgreSQL.
Containerisation can help here, by providing all services as a container image. The same image can then be run by ops for production and by devs for testing. Differences between the software as developed and the software as deployed can be avoided by deploying through immutable images.
The idea here is that bugs are cheapest to fix when they are detected early, i.e. during development. A bug in production can easily be much more costly than the minor hassle of installing these services locally, especially if all the developer has to do is a docker run.
Conclusion: Why are containers important?
In this article, various properties and use cases of containers were discussed. In summary, this points to three important use cases:
- containers as a lightweight virtual machine,
- containers for configuration management, and
- container images as a deployment-artefact.
We also have to distinguish how containers are useful for single users, in contrast to cluster management.
Containers as a lightweight VM
Compared to VMs, containers tend to launch faster because they don’t have to boot a complete operating system. Of course it’s possible to run a complete init system inside a container, but typically only a single process runs in the container.
Containers also need less resources than a VM, in particular less memory.
For a cluster, this translates into a higher deployment density than with VMs, at a roughly comparable level of isolation. Depending on the size of the containers, this could result in 30% lower costs. On a suitably large scale, those are quite dramatic savings.
A host can also overcommit its CPU resources, so that more processing time is guaranteed to the containers than is actually available in aggregate. This is sensible when the containers only need heavy processing sporadically. This can enable further savings.
For single users, containers might just make virtualisation possible in the first place. Many devices do not have the resources (particularly memory) to run one or more VMs. But since containers are just processes, even the smallest Linux system can run containers (e.g. a Raspberry Pi).
Containers as configuration management
Containers are attractive for configuration management because dependencies can be isolated easily. In particular, host systems don’t have to be “contaminated” with dependencies, and a containerized application cannot interfere with the host system through incompatibilities. Containers can set detailed security policies.
Cluster administration can be simplified through an approach like Immutable Infrastructure. All servers run exactly the same configuration: a container runtime. On top of those servers, different containers can be scheduled. So the running systems are never modified, only tasked with running different container images.
Images as deployment artefact
When images are used as the deployment format, this is quite comfortable for single users because the installation of dependencies is no longer required. A containerized app can therefore be easy to use (once it’s wired up with the correct external services & storage).
For a cluster the uniform administration enabled by container images is a huge advantage: All apps are provided in a common container format. The cluster can then be scheduled via Kubernetes or a similar orchestration software. This software decides how many instances of which container are executed on which node.
As a consequence, containers can then be scaled automatically. In response to increased demand, new container instances can be started in the cluster. The identity of the host running a particular service becomes unimportant, because all nodes provide the same environment. These servers are cattle, not pets: numbered and replaceable, not lovingly nourished and cherished.
Even though containers are an important and versatile instrument, they have many limitations that must not be forgotten.
Containers do not offer perfect isolation. Obviously, all containers on a host share the same Kernel. This requires some additional security restrictions. It also means that any host should only run containers of one organization, not containers from different organizations that don’t trust each other.
Containers only virtualise processes and some kernel resources, but not physical resources such as persistent storage. These resources still have to be provided externally, for example via a Storage Area Network.
Containers offer a wealth of configuration options. That’s not just flexibility, but also complexity that can lead to security-relevant errors.
Containers make it tempting to build a service-oriented architecture. But not every application needs the cloud. Distributed systems are inherently more complex than monolithic systems, and this complexity is often not needed. Even in a cloud scenario some processes cannot be scaled arbitrarily, for example databases or load balancers. Instead of a scale-out approach where more parallel processes are started, a scale-up approach with more performant hardware can also be quite sensible.
Setting up a container infrastructure is not trivial. This requires completely new skillsets from sysadmins. And it doesn’t help that the container ecosystem is still fairly young. Some components are not very mature. The documentation might be spotty or nonexistent. Especially around Kubernetes, some people still think it’s acceptable when “the code is the documentation.”
In particular, it is much more difficult to set up a production-grade container infrastructure than it is to docker run an image locally. Security, performance, and reliability don't appear magically, but are the result of careful work. This doesn't imply that containers are a bad solution, only that a move to containers also has costs and risks. It may therefore be attractive to pay for managed container infrastructure from a public cloud provider.
But when you are aware of these limitations, containers still are a very valuable and versatile tool that can play a crucial role in a DevOps context.