<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/">
  <channel>
    <title>Memory</title>
    <link>https://www.linuxjournal.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Data in a Flash, Part IV: the Future of Memory Technologies</title>
  <link>https://www.linuxjournal.com/content/data-flash-part-iv-future-memory-technologies</link>
  <description>  &lt;div data-history-node-id="1340747" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/petros-koutoupis" lang="" about="https://www.linuxjournal.com/users/petros-koutoupis" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Petros Koutoupis&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
I have spent the first three parts of this series describing the
evolution and current state of Flash storage. I also described how to configure an NVMe
over Fabric (NVMeoF) storage network to export NVMe volumes across RDMA
over Converged Ethernet (RoCE) and again over native TCP. [See Petros' &lt;a href="https://www.linuxjournal.com/content/data-flash-part-i-evolution-disk-storage-and-introduction-nvme"&gt;"Data
in a Flash, Part I: the Evolution of Disk Storage and an Introduction to
NVMe"&lt;/a&gt;, &lt;a href="https://www.linuxjournal.com/content/data-flash-part-ii-using-nvme-drives-and-creating-nvme-over-fabrics-network"&gt;"Data
in a Flash, Part II: Using NVMe Drives and Creating an NVMe over Fabrics
Network"&lt;/a&gt; and &lt;a href="https://www.linuxjournal.com/content/data-flash-part-iii-nvme-over-fabrics-using-tcp"&gt;"Data
in a Flash, Part III: NVMe over Fabrics Using TCP"&lt;/a&gt;.]
&lt;/p&gt;

&lt;p&gt;
But what does
the future of memory technologies look like? With traditional Flash
technologies that are enabled via NVMe, you should continue to expect
higher capacities. For instance, what comes after QLC (Quad-Level Cell)
NAND technology? Only time will tell. The next-generation NVMe
specification will introduce a protocol standard operating across more PCI
Express lanes and at a higher bandwidth. As memory technologies continue to
evolve, the method by which you plug that technology into your computers will
evolve with it.
&lt;/p&gt;

&lt;p&gt;
Remember, the ultimate goal is to move closer to the CPU and reduce access
times (that is, latencies).
&lt;/p&gt;

&lt;img src="https://www.linuxjournal.com/sites/default/files/u%5Buid%5D/Data%20Performance%20Gap.png" width="717" height="237" alt="The Data Performance Gap" /&gt;&lt;p&gt;
&lt;em&gt;Figure 1. The Data Performance Gap as You Move Further Away from the
CPU&lt;/em&gt;&lt;/p&gt;

&lt;span class="h3-replacement"&gt;
Storage Class Memory&lt;/span&gt;

&lt;p&gt;
For years, vendors have been developing a technology in which you are able
to plug persistent memory into traditional DIMM slots. Yes, these are the
very same slots that volatile DRAM also uses. Storage Class Memory (SCM)
is a newer hybrid storage tier. It's not exactly memory, and it's also not
exactly storage. It lives closer to the CPU and comes in two forms: 1)
traditional DRAM backed by a large capacitor to preserve data to a local
NAND chip (for example, NVDIMM-N) and 2) a complete NAND module (NVDIMM-F). In the
first case, you retain DRAM speeds, but you don't get the capacity:
a DRAM-based NVDIMM typically lags behind the latest traditional DRAM
module sizes. Vendors such
as Viking Technology and Netlist are the main producers of DRAM-based
NVDIMM products.
&lt;/p&gt;
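&lt;p&gt;
On Linux, a DRAM-based NVDIMM typically surfaces as a
&lt;code&gt;/dev/pmem&lt;/code&gt; block device that an application can map directly
into its address space. What follows is a minimal, hypothetical sketch of
that pattern; the &lt;code&gt;/dev/pmem0&lt;/code&gt; path and the one-page mapping
size are assumptions, not a recipe for any particular product.
&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical sketch: map a persistent-memory (NVDIMM) region with
 * mmap() and write through it. Assumes the module is exposed as
 * /dev/pmem0; adjust the path for your own system. */
#include &amp;lt;fcntl.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;sys/mman.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

int main(void)
{
    size_t len = 4096;                 /* map one page for the demo    */
    int fd = open("/dev/pmem0", O_RDWR);
    if (fd &amp;lt; 0) { perror("open"); return 1; }

    char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(pmem, "hello, persistent world");
    msync(pmem, len, MS_SYNC);         /* flush the store to the media */

    munmap(pmem, len);
    close(fd);
    return 0;
}
&lt;/pre&gt;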

&lt;p&gt;
The second form, however, gives you larger capacities, but at speeds
not nearly as fast as DRAM. Here, you will find your standard NAND, the
very same as found in modern Solid State Drives (SSDs), fixed onto your
traditional DIMM modules.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/data-flash-part-iv-future-memory-technologies" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 19 Jul 2019 11:30:00 +0000</pubDate>
    <dc:creator>Petros Koutoupis</dc:creator>
    <guid isPermaLink="false">1340747 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Crazy Compiler Optimizations</title>
  <link>https://www.linuxjournal.com/content/crazy-compiler-optimizations</link>
  <description>  &lt;div data-history-node-id="1340594" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/zack-brown" lang="" about="https://www.linuxjournal.com/users/zack-brown" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Zack Brown&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
Kernel development is always strange. &lt;strong&gt;Andrea Parri&lt;/strong&gt; recently posted a patch to
change the ordering rules for memory reads during multithreaded operation,
such that if one read depended upon another, the dependent read could not
actually occur before the read it depended on.
&lt;/p&gt;
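&lt;p&gt;
The pattern at issue is an address dependency between two loads: the
second read's address is computed from the first read's value, so on real
hardware the second read cannot complete first. The kernel-style sketch
below is purely illustrative (the names are hypothetical, and this is not
Andrea's actual test case); &lt;code&gt;READ_ONCE()&lt;/code&gt; is the kernel's
macro for a marked, volatile load.
&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical sketch of an address-dependent pair of reads: the
 * second READ_ONCE() cannot be satisfied before the first, because
 * its address comes from the first load's result. */
struct foo {
    int data;
};

struct foo *gp;    /* pointer published by some writer thread */

int reader(void)
{
    struct foo *p = READ_ONCE(gp);   /* load 1: fetch the pointer */

    if (!p)
        return -1;
    return READ_ONCE(p-&amp;gt;data);      /* load 2: depends on load 1 */
}
&lt;/pre&gt;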

&lt;p&gt;
The problem with this was that the bug could never actually occur, and the fix
made the kernel's behavior less intuitive for developers. &lt;strong&gt;Peter
Zijlstra&lt;/strong&gt;, in
particular, voted nay to this patch, saying it was impossible to construct a
physical system capable of triggering the bug in question.
&lt;/p&gt;

&lt;p&gt;
And although Andrea agreed with this, he still felt the bug was worth fixing, if
only for its theoretical value. Andrea figured, a bug is a bug is a bug, and they
should be fixed. But Peter objected to having the kernel do extra work to
handle conditions that could never arise. He said, "what I do object to is a
model that's weaker than any possible sane hardware."
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Will Deacon&lt;/strong&gt; sided with Peter on this point, saying that the underlying hardware
behaved a certain way, and the kernel's current behavior mirrored that way. He
remarked, "the majority of developers are writing code with the underlying
hardware in mind and so allowing behaviours in the memory model which are
counter to how a real machine operates is likely to make things more confusing,
rather than simplifying them!"
&lt;/p&gt;

&lt;p&gt;
Still, there were some developers who supported Andrea's patch. &lt;strong&gt;Alan
Stern&lt;/strong&gt;, in
particular, felt that it made sense to fix bugs when they were found, but that
it also made sense to include a comment in the code, explaining the default
behavior and the rationale behind the fix, even while acknowledging the bug
could never be triggered.
&lt;/p&gt;

&lt;p&gt;
But, Andrea wasn't interested in forcing his patch through the outstretched
hands of objecting developers. He was happy enough to back down, having made
his point.
&lt;/p&gt;

&lt;p&gt;
It was actually &lt;strong&gt;Paul McKenney&lt;/strong&gt;, who had initially favored Andrea's patch and had
considered sending it up to Linus Torvalds for inclusion in the kernel, who
identified some of the deeper and more disturbing issues surrounding this whole
debate. Apparently, it cuts to the core of the way kernel code is actually
compiled into machine language. Paul said:
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;
We had some debates about this sort of thing at the C++ Standards Committee
meeting last week.
&lt;/p&gt;

&lt;p&gt;
Pointer provenance and concurrent algorithms, though for once not affecting
RCU! We might actually be on the road to a fix that preserves the relevant
optimizations while still allowing most (if not all) existing concurrent C/C++
code to continue working correctly. (The current thought is that loads and
stores involving inline assembly, C/C++ atomics, or volatile get their
provenance stripped. There may need to be some other mechanisms for plain
C-language loads and stores in some cases as well.)
&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/crazy-compiler-optimizations" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 23 May 2019 11:30:00 +0000</pubDate>
    <dc:creator>Zack Brown</dc:creator>
    <guid isPermaLink="false">1340594 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>CGroup Interactions</title>
  <link>https://www.linuxjournal.com/content/cgroup-interactions</link>
  <description>  &lt;div data-history-node-id="1340595" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/zack-brown" lang="" about="https://www.linuxjournal.com/users/zack-brown" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Zack Brown&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
&lt;strong&gt;CGroups&lt;/strong&gt; are under constant development, partly because they form the core of
many commercial services these days. An amazing thing about this is that they
remain an unfinished project. Isolating and apportioning system elements is an
ongoing effort, with many pieces still to do. And because of security concerns,
it may never be possible to present a virtual system as a fully
independent system. There may always be compromises that have to be made.
&lt;/p&gt;

&lt;p&gt;
Recently, &lt;strong&gt;Andrey Ryabinin&lt;/strong&gt; tried to fix what he felt was a problem with how
CGroups dealt with low-memory situations. In the current kernel, low-memory
situations would cause Linux to reclaim memory from all CGroups equally. But
instead of being fair, this would penalize any CGroup that used memory
efficiently and reward those CGroups that allocated more memory than they
needed.
&lt;/p&gt;

&lt;p&gt;
Andrey's solution to this was to have Linux reclaim unused memory from
CGroups that had it, before reclaiming any from those that were in heavy use.
This would seem to be even less fair than the original behavior, because only
certain CGroups would be targeted and not others.
&lt;/p&gt;
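&lt;p&gt;
The protection Andrey was after is conceptually similar to what the cgroup
v2 &lt;code&gt;memory.low&lt;/code&gt; knob provides: reclaim prefers memory above a
group's protected floor before dipping below it. Here is a hedged sketch of
setting such a floor from C; the cgroup path is an assumption, and this
shows the mainline knob rather than Andrey's patch.
&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical sketch: give a cgroup a 512MB reclaim-protection floor
 * by writing to its cgroup v2 memory.low file. The path is assumed. */
#include &amp;lt;stdio.h&amp;gt;

int main(void)
{
    FILE *f = fopen("/sys/fs/cgroup/mygroup/memory.low", "w");

    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "%llu\n", 512ULL * 1024 * 1024);   /* bytes */
    fclose(f);
    return 0;
}
&lt;/pre&gt;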

&lt;p&gt;
Andrey's idea garnered support from folks like &lt;strong&gt;Rik van Riel&lt;/strong&gt;. But not everyone
was so enthralled. &lt;strong&gt;Roman Gushchin&lt;/strong&gt;, for example, pointed out that the
distinction between active and unused memory was not as clear as Andrey made it
out to be. The two of them debated this issue quite a bit, because the whole
issue of fair treatment hangs in the balance. If Andrey's whole point is to
prevent CGroups from "gaming the system" to ensure more memory for themselves,
then the proper approach to low-memory conditions depends on being able to
identify clearly which CGroups should be targeted for reclamation and which
should be left alone.
&lt;/p&gt;

&lt;p&gt;
At the same time, the situation could be seen as a security concern, with an
absolute need to protect independent CGroups from each other. If so, something
like Andrey's patch would be necessary, and many more security-minded
developers would start to take an interest in getting the precise details
exactly right.
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: if you're mentioned above and want to post a response above the comment section, send a message with your response text to ljeditor@linuxjournal.com.&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/cgroup-interactions" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 14 May 2019 12:00:00 +0000</pubDate>
    <dc:creator>Zack Brown</dc:creator>
    <guid isPermaLink="false">1340595 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Handling Complex Memory Situations</title>
  <link>https://www.linuxjournal.com/content/handling-complex-memory-situations</link>
  <description>  &lt;div data-history-node-id="1340449" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/zack-brown" lang="" about="https://www.linuxjournal.com/users/zack-brown" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Zack Brown&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
&lt;strong&gt;Jérôme Glisse&lt;/strong&gt; felt that the time had come for the Linux kernel
to seriously address the issue of having many different types of memory
installed on a single running system. There was main system memory and
device-specific memory, and associated hierarchies regarding which memory
to use at which time and under which circumstances. This complicated new
situation, Jérôme said, was actually now the norm, and it should be treated
as such.
&lt;/p&gt;

&lt;p&gt;
The physical connections between the various CPUs and devices and RAM
chips—that is, the bus topology—was also relevant, because it could influence
the various speeds of each of those components.
&lt;/p&gt;

&lt;p&gt;
Jérôme wanted to be clear that his proposal went beyond existing efforts
to handle heterogeneous RAM. He wanted to take account of the wide range of
hardware and its topological relationships to eke out the absolute
highest performance from a given system. He said:
&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;
One of the reasons for
radical change is the advance of accelerator
like GPU or FPGA means that CPU is no longer the only piece where
computation happens. It is becoming more and more common for an application
to use a mix and match of different accelerator to perform its computation.
So we can no longer satisfy our self with a CPU centric and flat view of a
system like NUMA and NUMA distance.
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
He posted some patches to accomplish several different things. First, he
wanted to expose the bus topology and memory variety to userspace as a
clear API, so that both the kernel and user applications could make the
best possible use of the particular hardware configuration on a given
system. A part of this, he said, would have to take account of the fact
that not all memory on the system would always be equally available to all
devices, CPUs or users.
&lt;/p&gt;

&lt;p&gt;
To accomplish all this, his patches first identified four basic
elements that could be used to construct an arbitrarily complex graph of
CPU, memory and bus topology on a given system.
&lt;/p&gt;

&lt;p&gt;
These included "targets", which were any sort of memory; "initiators",
which were CPUs or any other device that might access memory; "links",
which were any sort of bus-type connection between a target and an
initiator; and "bridges", which could connect groups of initiators to
remote targets.
&lt;/p&gt;
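&lt;p&gt;
In C terms, these four elements amount to a small vocabulary of node and
edge types from which an arbitrarily complex system graph can be built. The
following type sketch is speculative and purely illustrative; it is not
taken from Jérôme's actual patches.
&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical sketch of the four building blocks as C types.
 * Illustration only; not the structures from the patch set. */
struct hms_target {            /* any sort of memory                    */
    unsigned long size;        /* capacity in bytes                     */
};

struct hms_initiator {         /* a CPU or device that accesses memory  */
    int id;
};

struct hms_link {              /* a bus connecting initiator and target */
    struct hms_initiator *initiator;
    struct hms_target *target;
    unsigned long bandwidth;   /* advertised bandwidth, e.g., in MB/s   */
    unsigned long latency;     /* access latency, e.g., in nanoseconds  */
};

struct hms_bridge {            /* joins initiators to remote targets    */
    struct hms_link **links;
    int nr_links;
};
&lt;/pre&gt;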

&lt;p&gt;
Aspects like bandwidth and latency would be associated with their relevant
links and bridges. And, the whole graph of the system would be exposed to
userspace via files in the &lt;code&gt;sysfs&lt;/code&gt; hierarchy.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/handling-complex-memory-situations" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 20 Mar 2019 12:00:00 +0000</pubDate>
    <dc:creator>Zack Brown</dc:creator>
    <guid isPermaLink="false">1340449 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>NVMe over Fabrics Support Coming to the Linux 4.8 Kernel</title>
  <link>https://www.linuxjournal.com/content/nvme-over-fabrics-support-coming-linux-48-kernel</link>
  <description>  &lt;div data-history-node-id="1339138" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/petros-koutoupis" lang="" about="https://www.linuxjournal.com/users/petros-koutoupis" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Petros Koutoupis&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The Flash Memory Summit recently wrapped up its conferences in Santa
Clara, California, and only one type of Flash technology stole the show:
NVMe over Fabrics (NVMeF). From the many presentations and company
announcements, it was obvious NVMeF was the topic that most interested the
attendees.
&lt;/p&gt;

&lt;p&gt;
With the first industry specifications announced in 2011,
&lt;a href="http://www.nvmexpress.org"&gt;Non-Volatile
Memory Express&lt;/a&gt; (NVMe) quickly rose to
the forefront of Solid State Drive (SSD) technologies. Historically,
SSDs were built on top of Serial ATA (SATA), Serial Attached SCSI
(SAS) and Fibre Channel buses. These interfaces worked well for the
maturing Flash memory technology, but with all the protocol overhead
and bus speed limitations, it did not take long for these drives to
experience performance bottlenecks. Today, modern SAS drives operate
at 12 Gbit/s, while modern SATA drives operate at 6 Gbit/s. This is why
the technology shifted its focus to PCI Express (PCIe). With the bus
closer to the CPU and PCIe capable of performing at increasingly stellar
speeds, SSDs seemed to fit right in. Using four lanes of PCIe 3.0, modern
drives can achieve speeds approaching 32 Gbit/s (nearly 4 GB/s). Leveraging
the benefits of PCIe, it was then that NVMe was conceived. Support for NVMe drives was
integrated into the Linux 3.3 mainline kernel (2012).
&lt;/p&gt;

&lt;p&gt;
What really makes NVMe shine over the operating system's SCSI stack is
its simpler and faster queueing mechanism, built around pairs of Submission
Queues (SQs) and Completion Queues (CQs). Each queue is a circular buffer of
a fixed size; the operating system submits one or more commands through the
SQ, and the NVMe controller posts its results to the paired CQ. One or more
of these queue pairs also can be pinned to specific cores, which allows for
more uninterrupted operations.
&lt;/p&gt;
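&lt;p&gt;
Conceptually, each queue pair is just two fixed-size circular buffers in
host memory plus a tail and head index: the host places commands at the
submission tail and rings a doorbell register, and the controller posts
results to the completion queue. The simplified sketch below is an
illustration only; the type and function names are hypothetical, although
real NVMe submission entries are in fact 64 bytes and completion entries
16 bytes.
&lt;/p&gt;

&lt;pre&gt;
/* Hypothetical, simplified model of an NVMe submission/completion
 * queue pair as a pair of circular buffers. Illustration only. */
#define QUEUE_DEPTH 64

struct sq_entry { unsigned char cmd[64]; };  /* one 64-byte command    */
struct cq_entry { unsigned char cqe[16]; };  /* one 16-byte completion */

struct nvme_queue_pair {
    struct sq_entry sq[QUEUE_DEPTH];
    struct cq_entry cq[QUEUE_DEPTH];
    unsigned int sq_tail;    /* host advances after submitting         */
    unsigned int cq_head;    /* host advances after consuming          */
};

/* Submit one command: copy it in, bump the tail modulo the queue
 * depth, then tell the controller by writing its doorbell register. */
static void submit(struct nvme_queue_pair *q, const struct sq_entry *e)
{
    q-&amp;gt;sq[q-&amp;gt;sq_tail] = *e;
    q-&amp;gt;sq_tail = (q-&amp;gt;sq_tail + 1) % QUEUE_DEPTH;
    /* writel(q-&amp;gt;sq_tail, doorbell_reg);  (memory-mapped write) */
}
&lt;/pre&gt;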

&lt;p&gt;
Almost immediately, the PCIe SSDs were marketed for enterprise-class
computing with a much higher price tag. Although NVMe drives remain more
expensive than their SAS or SATA cousins, the dollar per gigabyte of Flash
memory continues to drop—enough to convince more companies to adopt the
technology. However, there was still a problem. Unlike the SAS or SATA
SSDs, NVMe drives did not scale very well. They were confined to the
server they were plugged in to.
&lt;/p&gt;

&lt;p&gt;
In the world of SAS or SATA, you have the Storage Area Network (SAN). SANs
are designed around SCSI standards. The primary goal of a SAN (or any
other storage network) is to provide access to one or more storage volumes,
across one or more paths, to one or more operating-system hosts
in a network. Today, the most commonly deployed SAN is based on iSCSI,
which is SCSI over TCP/IP. Technically, NVMe drives can be configured
within a SAN environment, although the protocol overhead introduces
latencies that make it a less than ideal implementation. In 2014, the
NVM Express committee was poised to rectify this with the NVMeF standard.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/nvme-over-fabrics-support-coming-linux-48-kernel" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Mon, 22 Aug 2016 16:15:00 +0000</pubDate>
    <dc:creator>Petros Koutoupis</dc:creator>
    <guid isPermaLink="false">1339138 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
