Muen

An x86/64 Separation Kernel for High Assurance

Checkout source code
This page is served by a MirageOS/Solo5 unikernel running on Muen

τ₀

While separation kernel systems are typically configured statically, there are use cases which require runtime reconfiguration of the system. For Muen, this shall be made possible by the subject Tau0 (τ₀). In contrast to the kernel, Tau0 can configure the system in a non-realtime and non-parallel manner. This reduces complexity both in the kernel and the system configuration logic.

Among others, Tau0 allows to

  • add and remove subjects,

  • modify the page table structures of subjects, kernel instances and VT-d device domains,

  • assign device resources to a subject (interrupts, IO ports, device memory).

For this purpose, Tau0 accepts a command stream, i.e. a stream of instructions that describe how to modify the system’s state. Each command is checked for validity and against the system invariants. It is executed only if the checks are successful, so that each performed command leads again to a sound system state. The following listing shows an example command stream in XML format.

<tau0>
  <commands>

    <addProcessor id="0" apicId="42"/>
    <addIoapic sid="16#f0f8#"/>

    <addMemoryBlock address="16#0000#" size="262144"/>

    <!-- Set up device.
         We can create PCI devices (with B/D/F) and legacy devices. -->
    <createPCIDevice device="0" bus="16#0#" dev="16#1#" func="1" usesMSI="false"/>
    <activateDevice device="0"/>

    <createPCIDevice device="1" bus="16#0#" dev="16#0#" func="0" usesMSI="false"/>
    <!-- Add device IRQs, IO ports, memory.  Device memory is marked
         as reserved. -->
    <addIRQDevice device="1" irq="33"/>
    <addIOPortRangeDevice device="1" from="16#0000#" to="16#000a#"/>
    <addMemoryDevice device="1" address="16#000a_0000#" size="32" caching="WC"/>
    <activateDevice device="1"/>

    <clearPage page="16#2300_0000#"/>
    <clearPage page="16#2300_1000#"/>
    <clearPage page="16#2300_2000#"/>

    <!-- Allocate the VT-d table structures. -->
    <createVTdRootTable page="16#2300_0000#"/>
    <createVTdContextTable page="16#2300_1000#" bus="16#0#"/>

Tau0 has been formally proven to be free of run-time errors; moreover there is work in progress to prove that Tau0 does not violate system invariants (see below).

Modes

Tau0 can be executed in two basic modes:

  • In static mode, Tau0 runs at system integration time (i.e. most likely as a Linux process). In this mode, Tau0 generates a system image from the command stream, which is written to disk and can be booted later on.

  • In dynamic mode, Tau0 runs as a special Muen subject, and may, based on the command stream, modify the runtime state of the system.

This makes the following three scenarios possible:

  • Tau0 integrates the system in static mode, and the system cannot be modified at runtime.

  • Tau0 integrates a minimal Muen system in static mode. At startup, this system comprises only the Muen kernel and a dynamic instance of Tau0. The latter may start other subjects at runtime as described by an included static command stream.

  • As above, but Tau0 is allowed to start and stop other subjects at runtime. In this case, the command stream may originate from some controller subject.

Currently, only static mode is supported.

Command Stream Phases

The command stream processing consists of two distinct phases: setup phase and running phase.

In the setup phase, the basic system configuration is communicated to Tau0. This comprises, i.a., the system hardware configuration. The data supplied in the setup phase is taken to be axiomatically true. This phase is exclusively executed in static mode.

In the running phase, the basic configuration received in the setup phase may no longer be modified. Commands from the command stream are processed in order to modify the state of the system. This phase is part of the dynamic, as well as the static mode.

It is worth reiterating, that in Tau0 static mode, the system state is fixed at integration time and cannot be changed later on.

Data

Tau0 comprises the following runtime data.

Typed Memory

System Memory View

For the purpose of managing the system state, Tau0 has a view on the system memory. The latter is linearly mapped into Tau0’s virtual address space at some offset. In order to prevent and/or detect interference of Tau0 with active parts of the system, subsets of the system memory view can be unmapped from and remapped to Tau0’s virtual address space. We make sure that active pages are in general not accessible (i.e. mapped) by Tau0.

Memory Typization

For each 4KB page of the system memory, Tau0 stores its (basic) typization; i.e. the information whether this page is

  • Undefined — it is free to use and has no defined contents,

  • Zeroed — it is free to use and contains only zero bits,

  • Reserved — it may not be used, read, or written, or

  • Used — it is currently in use, e.g. as a page table, part of a memory region, or otherwise.

The basic typization of a page may be refined to a full typization — for example, the full typization of some page might reveal that the page in question is not just "Used", but a level 2 EPT page table of some subject. Full typizations are not directly computed at runtime (they are SPARK Ghost values), but they figure extensively in contracts and proofs.

Private Pages

There is also the concept of Tau0 private pages: these are pages mapped at an offset that is greater than the maximal physical address supported by the x86 CPU. They are used by static Tau0 to store internal data structures which are not written to the system image. See below for more details.

Devices

Tau0 stores information on the built-in devices of the system. This information comprises the PCI bus/device/function number of the device, and the device-owned interrupts, IO ports, and memory.

The device information is specified in the command stream’s setup phase.

Before the setup phase of Tau0’s command stream can be completed, all created devices must have been activated.

Root Objects

Tau0 also keeps a table of root objects. A root object (or briefly root) is an entity in the system that possesses an own virtual address space. Currently, this comprises

  • Instances of the Muen kernel (one for each CPU)

  • Subjects

  • Memory regions, and

  • VT-d device domains.

Every root object may be active or inactive. An active root object is currently running in the system and may not be modified by Tau0. However, Tau0 can modify the properties, page tables, etc., of inactive root objects.

Just as a root object may be active or inactive, the same holds for the pages and page tables associated to the root object. A page (table) may only be modified by Tau0 if it is inactive. Moreover, a page table may only be active if all pages or page tables it points to are also active.

The fact whether a page (table) P is active is stored in an unused bit of the PTE that points to P.[1] If P is a top-level page table, then the information is instead stored within the corresponding root object (flag PTP_Active).

The lifecycle of a root object is described in the following figure.

                    unused<--------
                      | ^          |
Create (properties)   | |  Destroy |
                      v |          |
                      setup  teardown      (modify root and its page tables)
                          |  ^
            Lock          |  |   Unlock
                          v  |
                         locked            ((de)activate page tables)
                          |  ^
            Activate      |  |   Deactivate
                          v  |
                         active            (may not modify root)

Root Object States

Creation

A root object is created and initially configured with a command of the form Create_*. Depending on the type of the root object, the command expects different parameters.

Destruction

A root object whose virtual address space is empty (i.e. its page table pointer is not present) and which is assigned no system resources (e.g. hardware interrupts etc.), can be destroyed.

Page Table Modification

In states "Setup" and "Teardown", Tau0 may modify

  • the root object’s page tables, and

  • the resources (hardware interrupts etc.) assigned to the root object.

Page tables are created in a top-down manner: first, the top-level page table for the root object is created (its address is stored in the root’s page table pointer), then the page tables pointed to by the root’s top-level page table, and so on.

On the other hand, page tables are destroyed in a bottom-up manner: a page table may only be destroyed if none of its entries are present.

Activation/Deactivation

In state "Locked", the page tables of the root object may be activated or deactivated. If a root’s whole accessible address space has been activated, the root may be activated itself. If the root’s address space is empty, the root can be unlocked.

Activation and deactivation of a root object may involve particular operations depending on the kind of the root. For example, when a device domain is (de)activated, Tau0 modifies the system VT-d tables accordingly.

In dynamic Tau0 mode, when a page is activated, it is unmapped from Tau0’s virtual address space. Thus, we ensure that active pages cannot be modified by Tau0. Moreover, we avoid inconsistencies regarding the caching mode (PAT) of the pages. The page is mapped back to Tau0’s virtual address space as soon as it is deactivated.

As an exception to the above, page tables of memory regions (which conceptually belong to Tau0 and are not accessible by other subjects) are never unmapped from Tau0’s virtual address space. This allows us to perform page translation on a memory region even if it is active.[2]

Root Object Kinds

We describe the different kinds of Tau0 root objects.

Memory Regions

The purpose of a memory region is to collect an arbitrary number of page frames. The virtual address space of a memory region is always of the form {0, …​, n * PAGE_SIZE - 1} for some non-negative integer n. There are commands to append an unused page to a memory region, and to remove (truncate) the uppermost page of a memory region. Note that although the virtual address space of a memory region is a contiguous interval, the mapped physical pages need not be.[3]

Memory regions can be attached to other root objects. When a memory region has been attached to a root R, its page frames may be mapped into the address space of R. It is in this way that we may load programs into memory, create shared memory channels between subjects, and so on. Pages of a memory region may only be mapped if the memory region is active, and the memory region must stay active as long as some of its pages are mapped.

If Tau0 acts merely as a system integrator, the page tables of memory regions are stored in Tau0 private memory, as they will not be required at system runtime.

Subjects

As described above, we may map pages from a memory region to a subject. Moreover, we can assign device resources to a subject (if allowed by policy). Furthermore specific access rights (e.g. access to model-specific-registers) may be granted to subjects by adapting the respective VM management structures accordingly.

Finally execution of subjects may be controlled by adapting one or more scheduling plans.

Kernels

Instances of the Muen kernel (one for each system CPU) are set up similarly to subjects during system integration time. In contrast to subjects, kernels cannot be modified at runtime (as they cannot be deactivated).

VT-d Device Domains

The Intel VT-d system allows to present devices with a virtual address space. For this, a device is assigned to a device domain by setting an entry in a VM table called VT-d context table. This entry contains a page table pointer, as well as an ID for the device domain the device is associated to. Each entry that is associated to the same VT-d device domain shall have the same page table pointer.

Tau0 abstracts from this by viewing the VT-d device domains as root objects. Devices can be added to / removed from a device domain root object in state Setup or Teardown. On activation of a device domain root object, the corresponding entries in the VT-d context tables are written.

Full Typization

As described above, the full typization of a page is a refinement to its basic typization. Currently, only the basic typization "Used" has non-trivial refinements. These are the following. For each full typization, we give a sufficient condition for when it holds.

  • EPT4, EPT3, EPT2, EPT1,

    where EPTi means that the page is an EPT page table of level i (with indices decreasing towards the leaves of the tree).

    Condition: There is a subject with EPT paging such that the page is reachable from the subject in (5 - i) steps.

  • IA32e_PT4, IA32e_PT3, IA32e_PT2, IA32e_PT1,

    as above, but for IA32e page tables.

    Condition: There is a subject with IA32e paging such that the page is reachable from the subject in (5 - i) steps.

  • MPT4, MPT3, MPT2, MPT1,

    as above, but for the page tables of memory regions.

    Condition: There is a memory region with root level k ∈ {1, …​, 5} such that the page is reachable from the memory region in (k - i) steps.

  • VTd_PT4, VTd_PT3, VTd_PT2, VTd_PT1,

    as above, but for the page tables of VT-d device domains.

    Condition: There is a device domain with root level k ∈ {4,5} such that the page is reachable from the domain in (k - i) steps.

  • Tau0_PT4, Tau0_PT3, Tau0_PT2, Tau0_PT1,

    the page tables of Tau0 itself (note that Tau0 is able to access them in dynamic mode only).

    Condition: The page is reachable from Tau0’s PTP in (5 - i) steps.

  • MR_Page,

    the (level-0) page frames of a memory region. Note that these may also be reachable from other root objects if they have been mapped to them.

    Condition: There is a memory region with root level k ∈ {1, …​, 5} from which the page is reachable in k steps.

  • Device_Page,

    a page which is owned by some device.

    Condition: There is a device which owns a memory interval that contains the page.

  • IO_Bitmap_Low, IO_Bitmap_High,

    i.e. the page is one of the two IO bitmaps associated to some subject.

    Condition: There is a subject whose IO_Bitmap_Low or IO_Bitmap_High address is equal to the page.

  • MSR_Bitmap,

    i.e. the page is the MSR bitmap associated to some subject.

    Condition: There is a subject whose MSR_Bitmap address is equal to the page.

  • VTd_Root_Table,

    the page is the VT-d root table (see the section on device domains).

    Condition: The VT-d root table is present and its address is equal to the page.

  • VTd_Context_Table,

    the page is a VT-d context table — i.e. it is referenced in a present entry of the VT-d root table.

    Condition: The VT-d root table is present and it has a present entry that refers to the page.

  • VTd_IR_Table,

    the page is the VT-d interrupt remapping table.

    Condition: The VT-d IR table is present and its address is equal to the page.

Lemmas are provided (currently in Data.Memory.Full_Typization) which can be used to prove that a page is of a certain full typization. Thus, the precondition of a procedure which operates, e.g., only on page tables, can be shown.

System Properties

Properties

  • Tau0 may only read/write accessible pages.

Invariants

We list the following invariants. Some are rather technical, while others are important for Tau0’s security assurances.

  • Whenever a root object M has been assigned to another root object R, M must be of kind memory region and active.

  • No two distinct kernels may be assigned to the same CPU

  • Each device assigned to a subject is active

  • Each device added to a device domain is active, a PCI device, and its VT-d context table has been created.

  • If a page P has been mapped into the virtual address space of a root R, then

    • either P is a memory region page (full typization MR_Page) and R is not a memory region, and the memory region of P has been assigned to R, or

    • P is a device page (full typization Device_Page) and R is a subject (only they may refer device pages), and the device that owns P has been assigned to R.

  • The full typization of every 4K page must be unique. That is, at most one of the conditions listed above is satisfied, and at most by one assignment (i.e. no two distinct subjects may own the same page table etc.)

    Note that this condition precludes "sharing" of page tables and pages between root objects, except for mapped memory regions and device memory.

  • Descendants of active roots/page tables are active.

  • Dynamic mode: Let P be a page with full typization EPTi, PTi, VTd_PTi, MR_Page, or Device_Page. Then P is accessible iff P is not active.

    Since a subject can only be activated if all its pages are active, this ensures that Tau0 cannot modify the memory of running subjects.

Verification

Isabelle Model

Up to now, verification of Tau0 is entirely in SPARK. Properties of page translation, accessibility, and full typization are axiomatized by Ghost procedures.

Later, these axioms will be proven with an Isabelle model of Tau0.

Invariant Preservation

Currently, there remain several procedures whose invariants cannot be proven. This is due to the fact that the invariants depend on abstract state Data.Memory.Memory_State. So whenever there is a write to memory (regardless whether the memory location is relevant to the invariant), the invariant breaks.

This will be addressed in an upcoming release, by providing lemma procedures which specify the memory locations the respective invariants depend on. For example, one of these lemmas might specify that the invariant for VT-d device domains is preserved whenever a memory write does not affect the VT-d root and context tables.

Compilation

Execute make in the working directory of Tau0 to compile the binary and perform proofs.

Using a current Pro version of GNATProve (20.0w, 20190216), 99% of all generated verification conditions are discharged. Results may vary depending on the toolchain. In particular, the version of GNATProve bundled with GNAT Community 2019 proves only 97% of them.

Source Code Structure

  • Tau0

    • Data

      • Roots

        Represent root objects (subjects, device domains, kernel instances, memory regions). Package contains accessor functions.

        • Writers

          This package contains operations which modify root objects. We use a child package to avoid circular package dependencies.

      • Memory

        Represent system memory and its typization

        • VTd_IOMMU

          Represent VTd IOMMU tables

          • Writers

            Operations on VTd IOMMU tables

        • Interrupt_Remapping

          Represent VTd interrupt remapping tables

      • Devices

        Represent devices present in the system

        • Writers

          Operations to add devices, as well as device memory, interrupts, IO ports. The operations may only be called in the command stream’s setup phase.

    • Commands

      Tau0 commands — generally call procedures in package Data. Here, we check the preconditions of the called procedures.

    • Command_Stream

      Provide a stream of commands. In static mode, the commands are extracted from an XML file. In dynamic mode, they will be read from a channel.

    • Command_Processor

      Read from Command_Stream and dispatch to Commands.


1. The reader may observe that if a page is reachable via two distinct page translation paths, then there may arise an inconsistency whether the page is active or not. However, this situation is precluded by the system invariants.
2. For an example why this is necessary, consider the situation when a page of a memory region is to be mapped into another root object. To ensure that the memory region’s page tables are complete, we demand that the memory region is active. But in order to resolve a page from the memory region’s virtual address space into a physical address that can again be mapped into the address space of the other root object, the memory region’s page tables must be accessible.
3. Observe that these two operations ensure that the virtual address space of a memory region is always contiguous as described above.