GO’CIRCUIT
A project by Petar Maymounkov.

Closed Circuit for Linux (RFC)

Wed, Jun 5, 2013
The Closed Circuit for Linux is a piece of software that can turn a cluster of Linux machines into a single powerful Linux super-machine. This document is a very preliminary proposal and a Request For Comments.

Problem

The most common situation that companies and institutions face today, with regard to scaled computing, is the transition from a single-computer stack (or an ad-hoc multi-computer stack) to an "industrial-grade", cost-effective, sustainable cloud infrastructure.

While the circuit provides powerful tools for building new distributed software, it alone does not address the question of compatibility with legacy software.

Consider, as an example, a company with a large pre-existing custom software footprint. For instance, a typical numerical computation shop would have a large repository of Python and MATLAB scripts that, when executed concurrently, interact in complex ways via standard POSIX mechanisms (file system, UNIX sockets, network sockets, pipes, etc.).

It is of great value to be able to take such a pre-existing code base and, without change, deploy and execute it on a machine cluster by distributing the workload at process granularity across multiple machines. While doing so, it is imperative that system administrators retain clear control over the placement of data as well as the utilization patterns of network connections.

We embark here on a path to solve this problem, enabled by the circuit as well as by recent technologies like Linux Containers.

Solution

Linux Containers, LXC for short, are a mechanism for creating an isolated Linux OS environment that is a subset of an enclosing Linux environment, additionally constrained by customizable resource limits. An LXC container is a "subset" in that it sees a sandboxed file system which is isolated (by copy-on-write) from the enclosing file system and fully configurable as a subset of it, in a broad sense. An entire instance of Linux, together with any user-space applications, can run within an LXC container, unaware of its parent system and operating under specified resource constraints relative to the resources of the enclosing system. Multiple LXC containers can run side by side, isolated from each other, and LXC containers can themselves create sub-containers.
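To make the isolation mechanism concrete, here is a minimal Go sketch, not part of this proposal and not LXC itself, that runs a command (here /bin/sh, purely as an example) inside fresh mount, PID and hostname namespaces, the kernel facilities that LXC builds upon. It requires Linux and root privileges.

// namespaces.go: run a command in fresh mount, PID and UTS namespaces.
// A minimal illustration of the kernel facilities underlying LXC;
// requires Linux and root privileges.
package main

import (
    "log"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    cmd := exec.Command("/bin/sh") // example payload; any binary works
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    cmd.SysProcAttr = &syscall.SysProcAttr{
        // New mount, PID and hostname namespaces for the child.
        Cloneflags: syscall.CLONE_NEWNS | syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS,
    }
    if err := cmd.Run(); err != nil {
        log.Fatal(err)
    }
}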

We propose a system, the Closed Circuit for Linux, whereby a cluster of Linux hosts can be harnessed into a single Linux supercomputer, able to run standard Linux binaries unchanged.

In particular, while the main goal of this project is to distribute user-space processes across multiple machines, the processes themselves are to have the perception that they execute on the same shared Linux box. By implication, standard system administration utilities (e.g., the bash shell) retain their scripting and other control abilities over the entire supercomputer, unchanged.

To accomplish this, user-space processes are to be executed inside carefully built LXC containers on the various cluster hosts, in a manner that presents non-contradictory views (in time and space) of the global state of the supercomputer. Since most cross-process communication in modern software occurs through the file system (including the special cases of UNIX sockets and pipes), and since we are able to fully intercept those accesses within an LXC container, we are able to equip every executing user-space process, regardless of its physical location, with an enclosing Linux environment that:

Furthermore, the LXC containers running user processes must ensure, by specification, that they present the same semantic and reliability guarantees across the supercomputer Linux instance as one would expect from a traditional single-computer Linux instance. In particular:

Finally, we elect not to conceal (or try to compensate for) cluster host failures. While some applications could benefit from such a heavyweight feature, it would be presumptuous to impose it in many cases. Luckily, there is a natural way to present host failures within the Linux semantics of the supercomputer OS.

Supercomputer enclosure

Understanding the surrounding Linux environment, from the point of view of a running user process, is perhaps the best way to get a concrete feel for the proposed system. Each user-space process executes within a Linux system (its parent LXC container) which presents a certain complete view of the system to the user process. This view is effectively entirely controlled by the contents of the file system presented to the user process.

In this section, we describe an example layout of the LXC file system that demonstrates our main points. It should be understood that the actual layout can easily be changed.

As a starting point, each LXC container gets a root file system, /, which is pre-populated with a custom subset of the file system of the enclosing OS. The purpose of extracting this "subset" is to expose all installed software that might be needed within the container and by the user process. For this purpose, one could elect to map the enclosing directories /bin, /sbin, /usr and /home to identically-named directories within the root of the LXC sandbox file system.

This mapping accommodates a natural use pattern, whereby participating Linux hosts are pre-installed with necessary application software so that the LXC containers of the closed circuit can automatically "inherit" the available software.
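As a rough sketch of such a mapping (the container root location /var/lxc/root and the read-only choice are assumptions made for illustration; the copy-on-write isolation mentioned above would require a union or snapshotting file system and is not shown), one could bind-mount the enclosing directories into the sandbox root:

// bindmap.go: bind-mount selected host directories into a container root,
// so the container "inherits" installed software. Requires Linux and root.
package main

import (
    "log"
    "os"
    "path/filepath"
    "syscall"
)

func main() {
    containerRoot := "/var/lxc/root" // assumed location of the sandbox root

    for _, dir := range []string{"/bin", "/sbin", "/usr", "/home"} {
        target := filepath.Join(containerRoot, dir)
        if err := os.MkdirAll(target, 0755); err != nil {
            log.Fatal(err)
        }
        // Bind-mount the host directory, then remount it read-only so the
        // container cannot modify the host's software installation.
        if err := syscall.Mount(dir, target, "", syscall.MS_BIND, ""); err != nil {
            log.Fatal(err)
        }
        if err := syscall.Mount("", target, "", syscall.MS_BIND|syscall.MS_REMOUNT|syscall.MS_RDONLY, ""); err != nil {
            log.Fatal(err)
        }
    }
}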

In order to enable collaboration among processes, we propose a few simple additional substructures to the file system.

First, we subscribe to the view that the system should be simple and transparent. Toward this end, every LXC container file system contains an identical root directory, /host, whose subdirectories are named after the hosts participating in the cluster, like so:

/host/ludlow-10-123-70-1.gocircuit.org/
/host/ludlow-10-123-70-2.gocircuit.org/
/host/delancey-10-88-0-2.gocircuit.org/
…

Each host-specific subdirectory will contain subdirectories exposing various resources that live on that host.

For example, each host can elect to expose a portion of its local HDD file-system for access from any other host. We would mount access to this space via a /disk subdirectory, e.g.

/host/ludlow-10-123-70-1.gocircuit.org/disk/

To give visibility into running processes, we include a /proc subdirectory, which is effectively a replica of the original Linux procfs file system, sanitized to include only relevant processes, e.g.

/host/ludlow-10-123-70-1.gocircuit.org/proc/

To expose access to a desired subset of a host's /dev file system, we include the /dev subdirectory, e.g.

/host/ludlow-10-123-70-1.gocircuit.org/dev/

And, finally, to enable the creation of UNIX sockets and pipes, we include a /pipe subdirectory. New pipes, created by user-space processes, are to be added to the subdirectory /pipe/server of the host directory where they execute. When pipes are connected, informative connection special files are to appear automatically in /pipe/client within the host directories of both endpoints.
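To make the layout concrete, the sketch below, using the illustrative host names from above and assuming the /host tree is writable by the worker, lays out the per-host subdirectories and registers an example named pipe under /pipe/server, roughly the way a worker might materialize a pipe created by a user process:

// hosttree.go: lay out the /host/<hostname>/{disk,proc,dev,pipe} skeleton
// and create an example named pipe under /pipe/server. Linux only.
package main

import (
    "log"
    "os"
    "path/filepath"
    "syscall"
)

// Illustrative host list; in the real system this comes from configuration.
var hosts = []string{
    "ludlow-10-123-70-1.gocircuit.org",
    "ludlow-10-123-70-2.gocircuit.org",
    "delancey-10-88-0-2.gocircuit.org",
}

func main() {
    for _, h := range hosts {
        base := filepath.Join("/host", h)
        for _, sub := range []string{"disk", "proc", "dev", "pipe/server", "pipe/client"} {
            if err := os.MkdirAll(filepath.Join(base, sub), 0755); err != nil {
                log.Fatal(err)
            }
        }
    }
    // Example: a pipe created by a process running on the first host.
    fifo := filepath.Join("/host", hosts[0], "pipe/server", "example.fifo")
    if err := syscall.Mkfifo(fifo, 0644); err != nil && !os.IsExist(err) {
        log.Fatal(err)
    }
}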

This idea of enabling collaboration through the LXC container file system could be taken much further, but let us stop here, as the example so far suffices.

To tie the system together, we expect to have a specialized interactive shell prompt whose main purpose is to execute new user processes on specified hosts or in specified LXC containers. Since traditional shells, like bash, can be executed on any host and can use the file system to gain visibility into, and write-based control over, the entire supercomputer, the specialized shell need not provide much beyond process execution.

The formalism proposed here is quite preliminary. A design of this kind obviously bears resemblance to systems like Plan 9, and we expect to be able to borrow additional ideas from prior work of this kind.

Architecture

Below we illustrate the software stack of the Closed Circuit for Linux for a compute cluster with a pair of hosts.

Illustration of the Closed Circuit for Linux architecture.

The hosts themselves are hardware, and they are equipped to communicate over the physical/link-layer transport of their communication devices. These communication channels are prone not only to failure (also known as erasure) but also to error (quietly altered messages).

On top of each host sits a traditional Linux distribution, which harnesses the hardware capabilities of the host and, in software, implements a "reliable" communication primitive known as TCP. The term "reliable", as applied to TCP, is standard terminology in the systems literature, but it is misleading: TCP is not as reliable as a UNIX socket, for example, and we need to make this distinction concrete.

TCP is reliable in the sense that, throughout the life of a connection, it ensures that data will not be lost or altered. However, it does not ensure that the life of the connection will not be interrupted prematurely by external factors, even while the endpoint processes are alive. In contrast, a UNIX socket is reliable in the same sense as TCP and, in addition, guarantees that as long as the two processes at its ends are alive, the socket will stay alive as well.

On every host, we designate special user-space worker processes, responsible for creating, overseeing and destroying LXC containers, as well as for executing user-space binaries within them. Every worker process is capable of hosting one or more user processes within its LXC container. Multiple worker processes can be deployed within the same host OS. Since each worker corresponds to an individual LXC container and each such container can have custom resource limits and isolation, co-locating workers on the same host can be used for resource isolation. For example, on a multi-core machine, one worker could be allowed to use one half of the cores, and another — the other half.
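As one concrete way to express such a split (the cgroup mount point and the group name are assumptions, and LXC would normally manage this on the worker's behalf), the cpuset controller can pin a worker's processes to half the cores of an eight-core machine:

// cpuhalf.go: confine a process to CPUs 0-3 via the cgroup cpuset controller.
// Assumes a cgroup-v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset.
package main

import (
    "io/ioutil"
    "log"
    "os"
    "path/filepath"
    "strconv"
)

func main() {
    group := "/sys/fs/cgroup/cpuset/worker-a" // assumed group name
    if err := os.MkdirAll(group, 0755); err != nil {
        log.Fatal(err)
    }
    // Give the group the first half of an 8-core machine and a memory node.
    must(ioutil.WriteFile(filepath.Join(group, "cpuset.cpus"), []byte("0-3"), 0644))
    must(ioutil.WriteFile(filepath.Join(group, "cpuset.mems"), []byte("0"), 0644))
    // Move the current process (standing in for a worker) into the group.
    pid := strconv.Itoa(os.Getpid())
    must(ioutil.WriteFile(filepath.Join(group, "tasks"), []byte(pid), 0644))
}

func must(err error) {
    if err != nil {
        log.Fatal(err)
    }
}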

Worker processes communicate amongst themselves using a TCP-based protocol that has the strong reliability guarantees of UNIX sockets. We call it a high-fidelity transport. To keep connections uninterrupted in the presence of physical network outages, the high-fidelity transport blocks on reads and writes (as far as its user-space endpoints can see) when a network connection is compromised, instead of declaring an error. When network service resumes, reading and writing resume as well. Out-of-band management of the high-fidelity networking system is also envisioned, whereby administrators can force connection failures in some cases, for example when they deliberately wish to take a system down.
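The sketch below conveys the intended blocking behaviour on the write side only; the peer address is a placeholder, and a real high-fidelity transport would additionally need sequencing and buffering on both ends so that no bytes are lost or duplicated across a reconnect.

// hifi.go: a sketch of a write path that blocks and redials on network
// failure instead of surfacing an error to its user-space endpoint.
package main

import (
    "log"
    "net"
    "time"
)

// hifiConn wraps a TCP connection to a fixed peer and hides transient outages.
type hifiConn struct {
    addr string
    conn net.Conn
}

func dialForever(addr string) net.Conn {
    for {
        c, err := net.Dial("tcp", addr)
        if err == nil {
            return c
        }
        time.Sleep(time.Second) // network is down; keep trying quietly
    }
}

// Write blocks, across reconnects, until the payload is handed to TCP.
// NOTE: without sequencing and acknowledgement the peer may see duplicates
// after a mid-write failure; a real high-fidelity transport must add that.
func (h *hifiConn) Write(p []byte) (int, error) {
    for {
        if h.conn == nil {
            h.conn = dialForever(h.addr)
        }
        n, err := h.conn.Write(p)
        if err == nil {
            return n, nil
        }
        h.conn.Close()
        h.conn = nil // connection compromised: block and redial, do not report
    }
}

func main() {
    c := &hifiConn{addr: "ludlow-10-123-70-2.gocircuit.org:9000"} // placeholder
    if _, err := c.Write([]byte("hello across outages\n")); err != nil {
        log.Fatal(err)
    }
}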

User processes execute in their respective containers, chosen for them by the executing administrator or otherwise assigned automatically with generic load balancing in mind. When pairs of user processes located on different hosts communicate by accessing special files (like UNIX pipe files or other device files) on their local LXC file systems, their communication is transparently routed through the high-fidelity transport layer among the workers.

Implementation

The role of the circuit in realizing the proposed solution is as an implementation technology. Albeit a detail at first glance, it is an important one. Below is another metaphorical illustration, trying to capture the place of the circuit within our proposed system implementation.

Illustration of the Closed Circuit for Linux architecture.

We propose that the workers be implemented as Go Circuit workers, which are responsible for:

Fortunately, a recent open-source project, Docker, has made it remarkably easy to manage LXC containers from within the Go programming environment.
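As a sketch of what a worker's container-management step might look like when delegated to Docker, the example below simply shells out to the docker command-line tool; the image name and the mapped directory are illustrative assumptions, not choices made by this proposal:

// runbox.go: a sketch of launching a user binary inside a Docker-managed
// container, mapping a host directory into it. Assumes the docker CLI and
// an "ubuntu" base image are present; both are illustrative choices.
package main

import (
    "log"
    "os"
    "os/exec"
)

func main() {
    cmd := exec.Command("docker", "run",
        "-v", "/home:/home", // expose the host's /home inside the container
        "ubuntu",            // assumed base image
        "/bin/echo", "hello from inside a container")
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    if err := cmd.Run(); err != nil {
        log.Fatal(err)
    }
}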

There are various benefits to using the circuit. First, cross-worker communication can directly benefit from the circuit transport, which is itself high fidelity. Second, the proposed system needs a distributed locking service, which the circuit framework provides by default. Third, the circuit makes extending the worker logic easy and hassle-free.

This last concern is particularly important in light of the fact that the proposed system needs to be instrumented with non-trivial monitoring and logging services (at the least), as described later on. Such extensions would be fairly complex to implement (and improve incrementally) without the fluidity of the circuit programming environment.

Plug-and-play installation and sustenance

It is our intention that this system be elementary to install and fully instrumented for lifetime sustenance.

Without going into much detail, the only configuration required by the closed circuit is the list of hostnames that comprise the supercomputer.
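As an illustration of how little configuration this amounts to, the sketch below reads a newline-separated file of hostnames and launches one worker per host; the file name hosts, the worker binary name circuit-worker and the ssh-based launch are all assumptions made purely for illustration:

// boot.go: read a newline-separated list of hostnames and start one worker
// per host. The "hosts" file name and the ssh-based launch are assumptions.
package main

import (
    "bufio"
    "log"
    "os"
    "os/exec"
    "strings"
)

func main() {
    f, err := os.Open("hosts")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        host := strings.TrimSpace(scanner.Text())
        if host == "" {
            continue
        }
        // Illustrative launch: in practice the closed circuit would use its
        // own deployment mechanism rather than plain ssh.
        if err := exec.Command("ssh", host, "circuit-worker").Start(); err != nil {
            log.Printf("%s: %v", host, err)
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}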

When launched, the system would spawn a worker, ready to receive execution requests, on each host in the cluster. However, in order for such a system to be easily sustained, a few non-trivial background services must be in place:

More generally, the implementation of this system should be a closed circuit, or in other words, it should be self-sustaining without any need for administrator intervention.

Comments? Please join the Google+ discussion.