kernel

Primer to Container Security

Ankur Kothiwal — Thu, 05 May 2022 16:00:00 +0000

Containers are considered a standard way of deploying microservices to the cloud. Containers are better than virtual machines in almost all ways except security, which may be the main barrier to their widespread adoption.

This article provides a better understanding of container security and the techniques available for securing containers.

A Linux container can be defined as a process or a set of processes running in userspace that is isolated from the rest of the system by different kernel tools.

Containers are great alternatives to virtual machines (VMs). Although both provide isolation, containers virtualize the operating system rather than the hardware. This makes them lightweight, faster to start, and less memory-hungry.

Because multiple containers share the same kernel, they are less secure than VMs, each of which has its own copy of the OS, libraries, dedicated resources, and applications. That makes VMs very secure, but their large storage footprint and reduced performance limit the total number of VMs that can run simultaneously on a server. VMs also take a long time to boot.

The introduction of microservice architecture has changed the way of developing software. Microservices allow the development of software in small self-contained independent services. This makes the application easier to scale and provides agility.

If a part of the software needs to be rewritten, it can easily be done by changing only that part of the code without interrupting any other service, which wasn't possible with a monolithic architecture.

Protection requirement use cases and solutions

1) Linux Kernel Features

a. Namespaces

Namespaces isolate the resources of processes running in a container from those of other processes. They partition kernel resources so that one set of processes sees one set of resources while another set of processes sees a different one. Processes in different namespaces see different process IDs, hostnames, user IDs, file names, names for network access, and some interprocess communication. For example, each mount namespace has its own private mount table and root directory.
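As a rough illustration of how namespace membership shows up in userspace, the sketch below reads the symbolic links that Linux exposes under /proc/<pid>/ns. Each link target names a namespace type and an inode number, and two processes that report the same inode for a type share that namespace. The helper names (parse_ns_link, read_namespaces, same_namespace) are invented for this example and are not part of any library.

```python
import os
import re

# Each entry in /proc/<pid>/ns is a symlink whose target looks like
# "pid:[4026531836]": the namespace type, then an inode number that
# identifies the namespace instance.
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def read_namespaces(pid="self"):
    """Return {entry name: inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            link = os.readlink(os.path.join(ns_dir, name))
            _kind, inode = parse_ns_link(link)
        except (OSError, ValueError):
            continue  # entry unreadable or in an unexpected format
        result[name] = inode
    return result


def same_namespace(ns_a, ns_b, kind):
    """True if both mappings point at the same namespace of this type."""
    return kind in ns_a and ns_a.get(kind) == ns_b.get(kind)
```

Comparing read_namespaces() for a process inside a container with one on the host would normally show different inodes for entries such as mnt, pid and net, provided you have permission to read the other process's /proc entries.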

Go to Full Article

Oops! Debugging Kernel Panics

Petros Koutoupis — Wed, 07 Aug 2019 23:30:00 +0000

A look into what causes kernel panics and some utilities to help gain more information.

Working in a Linux environment, how often have you seen a kernel panic? When it happens, your system is left in a crippled state until you reboot it completely. And, even after you get your system back into a functional state, you're still left with the question: why? You may have no idea what happened or why it happened. Those questions can be answered though, and the following guide will help you root out the cause of some of the conditions that led to the original crash.

Figure 1. A Typical Kernel Panic

Let's start by looking at a set of utilities known as kexec and kdump. kexec allows you to boot into another kernel from an existing (and running) kernel, and kdump is a kexec-based crash-dumping mechanism for Linux.

Installing the Required Packages

First and foremost, your kernel should have the following components statically built into its image:


CONFIG_RELOCATABLE=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_DEBUG_INFO=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_PROC_VMCORE=y

You can find these settings in /boot/config-`uname -r`.
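Checking six options by eye is easy to get wrong, so here is a minimal sketch that parses the text of a kernel config file and reports which of the options listed above are not set to y. The function names are made up for this example. The parsing assumes the usual format of CONFIG_NAME=value lines, with unset options written as comments.

```python
# The options listed above, all of which should be built in (=y).
REQUIRED_OPTIONS = (
    "CONFIG_RELOCATABLE",
    "CONFIG_KEXEC",
    "CONFIG_CRASH_DUMP",
    "CONFIG_DEBUG_INFO",
    "CONFIG_MAGIC_SYSRQ",
    "CONFIG_PROC_VMCORE",
)


def parse_kernel_config(text):
    """Map option names to values for every 'CONFIG_X=value' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips blanks and '# CONFIG_X is not set' comments
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_builtin(text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y'."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) != "y"]
```

To check the running kernel, read the file into a string and pass it in, for example missing_builtin(open("/boot/config-" + platform.release()).read()) after importing platform. An empty list means every option is built in. An option set to m is reported as missing, because the text above asks for statically built-in components.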

Make sure that your operating system is up to date with the latest-and-greatest package versions:


$ sudo apt update && sudo apt upgrade

Install the following packages (I'm currently using Debian, but the same should and will apply to Ubuntu):


$ sudo apt install gcc make binutils linux-headers-`uname -r` \
    kdump-tools crash linux-image-`uname -r`-dbg

Note: Package names may vary across distributions.

During the installation, you will be prompted with questions to enable kexec to handle reboots (answer whatever you'd like, but I answered "no"; see Figure 2).

Figure 2. kexec Configuration Menu

And to enable kdump to run and load at system boot, answer "yes" (Figure 3).

Figure 3. kdump Configuration Menu

Configuring kdump

Open the /etc/default/kdump-tools file, and at the very top, you should see the following:

Go to Full Article

Documenting Proper Git Usage

Zack Brown — Wed, 07 Aug 2019 22:30:00 +0000

Jonathan Corbet wrote a document for inclusion in the kernel tree, describing best practices for merging and rebasing git-based kernel repositories. As he put it, it represented workflows that were actually in current use, and it was a living document that hopefully would be added to and corrected over time.

The inspiration for the document came from noticing how frequently Linus Torvalds was unhappy with how other people—typically subsystem maintainers—handled their git trees.

It's interesting to note that before Linus wrote the git tool, branching and merging were virtually unheard of in the Open Source world. In CVS, it was a nightmare horror of leechcraft and broken magic. Other tools were not much better. One of the primary motivations behind git, aside from blazing speed, was in fact to make branching and merging trivial operations, and so they have become.

One of the offshoots of branching and merging, Jonathan wrote, was rebasing—altering the patch history of a local repository. The benefits of rebasing are fantastic. They can make a repository history cleaner and clearer, which in turn can make it easier to track down the patches that introduced a given bug. So rebasing has a direct value to the development process.

On the other hand, used poorly, rebasing can make a big mess. For example, suppose you rebase a repository that has already been merged with another, and then merge them again—insane soul death.

So Jonathan explained some good rules of thumb. Never rebase a repository that's already been shared. Never rebase patches that come from someone else's repository. And in general, simply never rebase—unless there's a genuine reason.

Since rebasing changes the history of patches, it relies on a new "base" version, from which the later patches diverge. Jonathan recommended choosing a base version that was generally thought to be more stable rather than less—a new version or a release candidate, for example, rather than just an arbitrary patch during regular development.

Jonathan also recommended, for any rebase, treating all the rebased patches as new code, and testing them thoroughly, even if they had been tested already prior to the rebase.

"If", he said, "rebasing is limited to private trees, commits are based on a well-known starting point, and they are well tested, the potential for trouble is low."

Moving on to merging, Jonathan pointed out that nearly 9% of all kernel commits were merges. There were more than 1,000 merge requests in the 5.1 development cycle alone.

Go to Full Article

Another Episode of "Seems Perfectly Feasible and Then Dies"--Script to Simplify the Process of Changing System Call Tables

Zack Brown — Wed, 07 Aug 2019 20:45:00 +0000

David Howells put in quite a bit of work on a script, ./scripts/syscall-manage.pl, to simplify the entire process of changing the system call tables. With this script, it was a simple matter to add, remove, rename or renumber any system call you liked. The script also would resolve git conflicts, in the event that two repositories renumbered the system calls in conflicting ways.

Why did David need to write this patch? Why weren't system calls already fairly easy to manage? When you create a new system call, you add it to a master list, and then you add it to the system call "tables", which is where the running kernel looks up which kernel function corresponds to which system call number. Kernel developers need to make sure system calls are represented in all relevant spots in the source tree. Renaming, renumbering and making other changes to system calls involves a lot of fiddly little details. David's script simply would do everything right: end of story, no problemo, hasta la vista.
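To make the bookkeeping problem concrete, here is a toy model of a master list and two per-architecture tables, with a check for system calls that were never wired up in one of them. It is not how the kernel or David's script represents the tables. The architecture names, numbers and the open_tree entry are placeholders chosen for illustration.

```python
# Toy data: names, numbers and architectures are invented for this example.
MASTER_LIST = ["read", "write", "open_tree"]

# Each table maps a system call number to the name of its handler.
TABLES = {
    "arch_a": {0: "read", 1: "write", 2: "open_tree"},
    "arch_b": {10: "read", 11: "write"},  # open_tree was never added here
}


def lookup(table, number):
    """Return the system call registered for a number, or None."""
    return table.get(number)


def missing_entries(master, tables):
    """For each table, list master-list system calls that have no entry."""
    return {
        arch: sorted(set(master) - set(table.values()))
        for arch, table in tables.items()
    }
```

Here missing_entries(MASTER_LIST, TABLES) flags open_tree as absent from arch_b, which is the kind of fiddly detail that otherwise has to be caught by hand.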

Arnd Bergmann remarked, "Ah, fun. You had already threatened to add that script in the past. The implementation of course looks fine, I was just hoping we could instead eliminate the need for it first." But, bowing to necessity, Arnd offered some technical suggestions for improvements to the patch.

However, Linus Torvalds swooped in at this particular moment, saying:

Ugh, I hate it.

I'm sure the script is all kinds of clever and useful, but I really think the solution is not this kind of helper script, but simply that we should work at not having each architecture add new system calls individually in the first place.

IOW, we should look at having just one unified table for new system call numbers, and aim for the per-architecture ones to be for "legacy numbering".

Maybe that won't happen, but in the _hope_ that it happens, I really would prefer that people not work at making scripts for the current nasty situation.

And the portcullis came crashing down.

It's interesting that, instead of accepting this relatively obvious improvement to the existing situation, Linus would rather leave it broken and ugly, so that someone someday somewhere might be motivated to do the harder-yet-better fix. And, it's all the more interesting given how extreme the current problem is. Without actually being broken, the situation requires developers to put a tremendous amount of care and effort into something that David's script could make trivial and easy. Even for such an obviously "good" patch, Linus gives thought to the policy and cultural implications, and the future motivations of other people working in that region of code.

Note: if you're mentioned above and want to post a response above the comment section, send a message with your response text to ljeditor@linuxjournal.com.

Go to Full Article

Simplifying Function Tracing for the Modern GCC

Zack Brown — Fri, 26 Jul 2019 11:30:00 +0000

Steven Rostedt wanted to do a little housekeeping, specifically with the function tracing code used in debugging the kernel. Up until then, the kernel could enable function tracing using either GCC's -pg flag or a combination of -pg and -mfentry. In each case, GCC would create a special routine that would execute at the start of each function, so the kernel could track calls to all functions. With just -pg, GCC would create a call to mcount() in all C functions, whereas with -pg coupled with -mfentry, it would create a call to fentry().

Steven pointed out that using -mfentry was generally regarded as superior, so much so that the kernel build system always would choose it over the mcount() alternative by testing GCC at compile time to see if it actually supported that command-line argument.

This is all very normal. Since any user might have any version of a given piece of software in the toolchain, or a variety of different CPUs and so on, each with different capabilities, the kernel build system runs many tests to identify the best available features that the kernel will be able to rely on.

But in this case, Steven noticed that for Linux version 4.19, Linus Torvalds had agreed to bump the minimum supported GCC version to 4.6. Coincidentally, as Steven now pointed out, GCC version 4.6 was the first to support the -mfentry argument. And, this was his point—all supported versions of GCC now supported the better function tracing option, and so there was no need for the kernel build system to cling to the mcount() implementation at all.

Steven posted a patch to rip it out by the roots.

Peter Zijlstra gave his support for this plan, as did Jiri Kosina. And, Jiri in particular spat upon the face of the mcount() solution.

Linus also liked Steven's patch, and he pointed out that with mcount() out of the picture, there were several more areas in the kernel that had existed simply to help choose between mcount() and fentry(), and that those now also could be removed. But Steven replied that, although yes this should be done, he still wanted to split it up into a separate patch, for cleanliness' sake.

Go to Full Article

Extending the Kernel with Built-in Kernel Headers

Joel Fernandes — Wed, 24 Jul 2019 11:30:00 +0000

Note: this article is a followup to Zack Brown's "Android Low Memory Killer—In or Out?"

Linux kernel headers are the unstable, constantly-changing, internal API of the kernel. This includes internal kernel structures (for example, task_struct) as well as helper macros and functions. Unlike the UAPI headers used to build userspace programs, which are stable and backward-compatible, the internal kernel headers can change at any time and in any release. While this allows the kernel unlimited flexibility to evolve and change, it presents some difficulties for code that needs to be loaded into the kernel at runtime and executed in kernel context.

Kernel modules are a prime example of such code. They execute in kernel context and depend on this same unstable API that can change at any time. A module has to be built for the kernel it is running on and may not load on another because an internal API change could break it. Another example is eBPF tracing programs. These programs are dynamically compiled from C to eBPF, loaded into the kernel and executed in kernel space in an in-kernel BPF virtual machine. Since these programs trace the kernel, they need to use the in-kernel API at times, and they have the same challenges as kernel modules as far as internal API changes go. They may need to understand what data structures in the kernel look like or call kernel helper functions.

Kernel headers are usually unavailable on the target where these BPF tracing programs need to be dynamically compiled and run. That is certainly the case with Android, which runs on billions of devices. It is not practical to ship custom kernel headers for every device. My solution to the problem is to embed the kernel headers within the kernel image itself and make them available through the sysfs virtual filesystem (usually mounted at /sys) as a compressed archive file (/sys/kernel/kheaders.tar.xz). This archive can be uncompressed as needed to a temporary directory. This simple change guarantees that the headers are always shipped with the running kernel.
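The "uncompress to a temporary directory" step can be sketched with Python's standard tarfile module, as below. The archive path is the one given above. The function name extract_kheaders and the kheaders- directory prefix are assumptions made for this example, and the code assumes the archive exists and is readable on the running kernel.

```python
import os
import tarfile
import tempfile

KHEADERS_ARCHIVE = "/sys/kernel/kheaders.tar.xz"  # path given in the article


def extract_kheaders(archive=KHEADERS_ARCHIVE, dest=None):
    """Unpack the headers archive and return the directory it went to."""
    if dest is None:
        # Fresh private directory; the caller is responsible for removing it.
        dest = tempfile.mkdtemp(prefix="kheaders-")
    with tarfile.open(archive, mode="r:xz") as tar:
        tar.extractall(dest)
    return dest
```

A tool that compiles BPF programs on the device could call extract_kheaders() once, point its compiler's include path at the returned directory, and delete the directory (for example with shutil.rmtree) when it is done.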

Several kernel developers disagreed with the solution; however, kernel maintainer Greg Kroah-Hartman and many others were supportive, arguing that it is simple and just works. Linus pulled the patches for the v5.2 kernel release.

To enable the embedded kernel headers, build your kernel with the CONFIG_KHEADERS=y kernel option, or =m if you want to save some memory.

The rest of this article looks at challenges with kernel headers, solutions and the limitations.

Challenges with Kernel Headers

Filesystem or Archive?

Go to Full Article

What Does It Take to Make a Kernel?

Petros Koutoupis — Tue, 23 Jul 2019 12:00:00 +0000

The kernel this. The kernel that. People often refer to one operating system's kernel or another without truly knowing what it does or how it works or what it takes to make one. What does it take to write a custom (and non-Linux) kernel?

So, what am I going to do here? In June 2018, I wrote a guide to build a complete Linux distribution from source packages, and in January 2019, I expanded on that guide by adding more packages to the original guide. Now it's time to dive deeper into the custom operating system topic. This article describes how to write your very own kernel from scratch and then boot up into it. Sounds pretty straightforward, right? Now, don't get too excited here. This kernel won't do much of anything. It'll print a few messages onto the screen and then halt the CPU. Sure, you can build on top of it and create something more, but that is not the purpose of this article. My main goal is to provide you, the reader, with a deep understanding of how a kernel is written.

Once upon a time, in an era long ago, embedded Linux was not really a thing. I know that sounds a bit crazy, but it's true! If you worked with a microcontroller, you were given (from the vendor) a specification, a design sheet, a manual of all its registers and nothing more. Translation: you had to write your own operating system (kernel included) from scratch. Although this guide assumes the standard generic 32-bit x86 architecture, a lot of it reflects what had to be done back in the day.

The exercises below require that you install a few packages in your preferred Linux distribution. For instance, on an Ubuntu machine, you will need the following:

binutils
gcc
grub-common
make
nasm
xorriso

An Extreme Crash Course into the Assembly Language

Note: I'm going to simplify things by pretending to work with a not-so-complex 8-bit microprocessor. This doesn't reflect the modern (and possibly past) designs of any commercial processor.

Go to Full Article

Shrinking Linux Attack Surfaces

Zack Brown — Thu, 18 Jul 2019 11:00:00 +0000

Often, a kernel developer will try to reduce the size of an attack surface against Linux, even if it can't be closed entirely. It's generally a toss-up whether such a patch makes it into the kernel. Linus Torvalds always prefers security patches that really close a hole, rather than just give attackers a slightly harder time of it.

Matthew Garrett recognized that userspace applications might have secret data that might be sitting in RAM at any given time, and that those applications might want to wipe that data clean so no one could look at it.

There were various ways to do this already in the kernel, as Matthew pointed out. An application could use mlock() to prevent its memory contents from being pushed into swap, where it might be read more easily by attackers. An application also could use atexit() to cause its memory to be thoroughly overwritten when the application exited, thus leaving no secret data in the general pool of available RAM.
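The exit-time overwrite can be sketched in a few lines. This is only an illustration of the idea in Python, not what a C application calling atexit() would look like. The names wipe and hold_secret are invented. It does not call mlock(), and the Python runtime may keep other copies of the data (including the original bytes object passed in) that this code never touches.

```python
import atexit


def wipe(buffer):
    """Overwrite a mutable buffer with zero bytes, in place."""
    buffer[:] = bytes(len(buffer))


def hold_secret(value):
    """Copy secret bytes into a bytearray that is wiped at normal exit."""
    secret = bytearray(value)
    atexit.register(wipe, secret)  # runs only on a normal interpreter exit
    return secret
```

Handlers registered this way do not run if the process is killed by a signal it does not handle or exits through os._exit(), so an unclean exit leaves the secret in memory.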

The problem, Matthew pointed out, came if an attacker was able to reboot the system at a critical moment—say, before the user's data could be safely overwritten. If attackers then booted into a different OS, they might be able to examine the data still stored in RAM, left over from the previously running Linux system.

As Matthew also noted, the existing way to prevent even that was to tell the UEFI firmware to wipe system memory before booting to another OS, but this would dramatically increase the amount of time it took to reboot. And if the good guys had won out over the attackers, forcing them to wait a long time for a reboot could be considered a denial of service attack—or at least downright annoying.

Ideally, Matthew said, if the attackers were only able to induce a clean shutdown—not simply a cold boot—then there needed to be a way to tell Linux to scrub all data out of RAM, so there would be no further need for UEFI to handle it, and thus no need for a very long delay during reboot.

Matthew explained the reasoning behind his patch. He said:

Unfortunately, if an application exits uncleanly, its secrets may still be present in RAM. This can't be easily fixed in userland (eg, if the OOM killer decides to kill a process holding secrets, we're not going to be able to avoid that), so this patch adds a new flag to madvise() to allow userland to request that the kernel clear the covered pages whenever the page reference count hits zero. Since vm_flags is already full on 32-bit, it will only work on 64-bit systems.

Matthew Wilcox liked this plan and offered some technical suggestions for Matthew G's patch, and Matthew G posted an updated version in response.

Go to Full Article

Address Space Isolation and the Linux Kernel

Zack Brown — Wed, 10 Jul 2019 11:30:00 +0000

Mike Rapoport from IBM launched a bid to implement address space isolation in the Linux kernel. Address space isolation emanates from the idea of virtual memory—where the system maps all its hardware devices' memory addresses into a clean virtual space so that they all appear to be one smooth range of available RAM. A system that implements virtual memory also can create isolated address spaces that are available only to part of the system or to certain processes.

The idea, as Mike expressed it, is that if hostile users find themselves in an isolated address space, even if they find bugs in the kernel that might be exploited to gain control of the system, the system they would gain control over would be just that tiny area of RAM to which they had access. So they might be able to mess up their own local user, but not any other users on the system, nor would they be able to gain access to root level infrastructure.

In fact, Mike posted patches to implement an element of this idea, called System Call Isolation (SCI). This would cause system calls to each run in their own isolated address space. So if, somehow, an attacker were able to modify the return addresses stored on the stack, there would be no useful location to which to return.

His approach was relatively straightforward. The kernel already maintains a "symbol table" with the addresses of all its functions. Mike's patches would make sure that any return addresses that popped off the stack corresponded to entries in the symbol table. And since "attacks are all about jumping to gadget code which is effectively in the middle of real functions, the jumps they induce are to code that doesn't have an external symbol, so it should mostly detect when they happen."

The problem, he acknowledged, was that implementing this would have a speed hit. He saw no way to perform and enforce these checks without slowing down the kernel. For that reason, Mike said, "it should only be activated for processes or containers we know should be untrusted."

There was not much enthusiasm for this patch. As Jiri Kosina pointed out, Mike's code was incompatible with other security projects like retpolines, which try to prevent certain types of data leaks falling into an attacker's hands.

There was no real discussion and no interest was expressed in the patch. The speed hit, the conflict with existing security projects, and the fact that it tried to secure against only hypothetical security holes and not actual flaws in the system probably combined to make this patch set less interesting to kernel developers.

Go to Full Article

Deprecating a.out Binaries

Zack Brown — Tue, 25 Jun 2019 12:00:00 +0000

Remember a.out binaries? They were the executable file format used by Linux until around 1995, when ELF took over. ELF is better. It allows you to load shared libraries anywhere in memory, while a.out binaries need you to register shared library locations. That's fine at small scales, but it gets to be more and more of a headache as you have more and more shared libraries to deal with. But a.out is still supported in the Linux source tree, 25 years after ELF became the standard format.

Recently, Borislav Petkov recommended deprecating it in the source tree, with the idea of removing it if it turned out there were no remaining users. He posted a patch to implement the deprecation. Alan Cox also remarked that "in the unlikely event that someone actually has an a.out binary they can't live with, they can also just write an a.out loader as an ELF program entirely in userspace."

Richard Weinberger had no problem deprecating a.out and gave his official approval of Borislav's patch.

In fact, there's a reason the issue happens to be coming up now, 25 years after the fact. Linus Torvalds pointed out:

I'd prefer to try to deprecate a.out core dumping first....That's the part that is actually broken, no?

In fact, I'd be happy to deprecate a.out entirely, but if somebody _does_ complain, I'd like to be able to bring it back without the core dumping.

Because I think the likelihood that anybody cares about a.out core dumps is basically zero. While the likelihood that we have some odd old binary that is still a.out is slightly above zero.

So I'd be much happier with this if it was a two-stage thing where we just delete a.out core dumping entirely first, and then deprecate even running a.out binaries separately.

Because I think all the known *bugs* we had were with the core dumping code, weren't they?

Removing it looks trivial. Untested patch attached.

Then I'd be much happier with your "let's deprecate a.out entirely" as a second patch, because I think it's an unrelated issue and much more likely to have somebody pipe up and say "hey, I have this sequence that generates executables dynamically, and I use a.out because it's much simpler than ELF, and now it's broken". Or something.

Jann Horn looked over Linus' patch and suggested additional elements of a.out that would no longer be used by anything, if core dumping was coming out. He suggested those things also could be removed with the same git commit, without risking anyone complaining.

Go to Full Article