Admiring the Zircon Part 1: Understanding Minimal Process Creation
Recently I've been taking a look at Google's new open-source microkernel, Zircon. This is the guy that powers Google's shiny Fuchsia OS, which is currently under development.

When speaking about Zircon, we're referring not just to the microkernel binary itself, but also to the user-mode components that make up the base of the OS. It's been difficult for me to ascertain where Zircon ends and Fuchsia begins, but roughly speaking I believe it's the kernel and system components of https://github.com/fuchsia-mirror/zircon/.

The current Zircon documentation seems comprehensive enough for anyone to understand how the thing works, so blog posts like this one might be somewhat superfluous. However, I found that gaining a full understanding takes some digging through the sources and playing around at runtime. So, mainly as a way of documenting my learnings for myself, I've decided to write stuff down, which may or may not be useful for others.

First up, I'm fairly obsessed with Zircon. Having worked with systems-level development in Linux in the past, and now somewhat more extensively with NT in my day job from a security angle, Zircon gets me excited. Code should always run with only the required amount of privilege and unfortunately that's not where we're at in the industry today. Our current OSes were architected years before we really understood the security problem space and have bloated up into these hard-to-defend monolithic beasts. I personally believe that microkernels coupled with innovative ISA extensions can go a long way to reducing the problem space and attack surface. Nintendo have something similar going on with their Switch and sure, that was hacked to shreds (for reasons orthogonal to microkernel design), but I'm still a believer in the microkernel model for general purpose computing if done right.

So diving right in.

Zircon is developed in a 'light touch' C++. No crazy STL everywhere, but C++ objects used, imo, neatly and where it makes sense. I believe (from Google's own docs) that it was originally forked from LK, which is a microkernel designed for lightweight devices. Not being familiar with LK at all, I'm just going to refer to it all as Zircon, but from what I can tell, the Zircon authors seem to have added object wrappers to some LK primitives and built up from there (I didn't really look into the differences between the two).

In Linux, everything is a file, so operations are usually done through file descriptors. Zircon is more like Windows in that it treats kernel primitives as objects, much like NT's kernel does (in fact, some of the nomenclature of kernel primitives somewhat matches NT's: processes, jobs, etc.). The kernel exposes handles which are used to manipulate kernel objects. These handles have rights which are verified by the kernel to ensure that callers are sufficiently privileged to act on these objects. As we're speaking of a microkernel design here, drivers are user-mode code too, meaning that no ring0 code other than that of the kernel itself needs to act on the objects within kernel-mode. This solves a major issue in monolithic designs, in which developers do things like take references to internal representations of objects within the kernel, leading to all sorts of instability. Zircon forces use of the API and can manage its own internal object lifetimes as it sees fit. Another huge advantage is the limited set of syscalls that the microkernel needs to expose to user-mode. The syscalls exposed can best be thought of as a set of primitives necessary to enable user-mode to perform the heavy lifting.
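
To make the handle-plus-rights idea concrete, here is a toy model in C. It is not Zircon code: the type, the TOY_RIGHT_* constants and both function names are invented for illustration, and the real ZX_RIGHT_* values and kernel data structures differ. It only shows the shape of the check, namely that an operation goes ahead only if every right it needs is present on the handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights bits; Zircon's real ZX_RIGHT_* values differ. */
enum {
    TOY_RIGHT_READ    = 1u << 0,
    TOY_RIGHT_WRITE   = 1u << 1,
    TOY_RIGHT_EXECUTE = 1u << 2,
    TOY_RIGHT_MAP     = 1u << 3
};

/* A handle pairs a reference to a kernel object with a rights mask. */
typedef struct {
    uint64_t koid;    /* id of the kernel object the handle refers to */
    uint32_t rights;  /* what the holder may do with that object */
} toy_handle_t;

/* The check made before acting on an object: every right the
 * operation needs must be present on the handle. */
bool toy_handle_has_rights(const toy_handle_t *h, uint32_t needed) {
    return (h->rights & needed) == needed;
}

/* Derive a weaker handle to the same object: rights can be
 * dropped but never added. */
toy_handle_t toy_handle_restrict(const toy_handle_t *h, uint32_t keep) {
    toy_handle_t out;
    out.koid = h->koid;
    out.rights = h->rights & keep;
    return out;
}
```

A handle holding READ, WRITE and MAP would pass a check for READ | MAP and fail one for EXECUTE, and restricting it can only ever shrink the mask.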

My recommendation to all interested: read the official intro docs available here. There will be nothing in this post that you cannot yourself find by sifting through the docs/code.

As an introduction to how things are actually wired up, I thought I'd start by working through an example of how processes are created, starting at user-mode and working our way through to the kernel itself.

We'll start with an example program that will spawn a new process and run some boilerplate code. The first thing our app needs to do is request the creation of a process object from the kernel. This is done via the zx_process_create syscall.

The first parameter is a handle to a job object. The new process will be executed as a child of the provided job. A job can be thought of as a collection of N processes that can be managed together. For our example, we'll create the process under the default job for our currently running (parent) process. This can be obtained via zx_job_default() which is a wrapper around __zircon_job_default (which is populated by the libc process start bootstrap code).

The remaining input parameters allow specification of an arbitrary name for the process and what seems to be a placeholder for future options to this call (currently must be set to 0).

A successful call will return a handle to the new process object and a handle to the vmar (virtual memory address region) object describing the root of the process object's address space. The root vmar is simply a description of the address space (aspace) available to the process. No actual memory has been committed yet (other than storage for the kernel process object itself of course).

kernel perspective
zx_process_create will make its way to the object dispatcher factory for process objects, ProcessDispatcher::Create. This will allocate referenced storage for a new ProcessDispatcher object and add it as a child to the requested job object. An address space (aspace) object for the process object of type VmAspace::TYPE_USER will be created; describing regions that are hard-defined per architecture which form the root vmar. A VmAddressRegionDispatcher object is then created for this vmar and control returns to user-mode with the relevant output parameters, as above.

Next, we'll allocate some memory for the stack and executable code. We do this by using the zx_vmo_create syscall to ask the kernel to create a vmo (virtual memory object) to describe the actual physical memory pages that will hold the executable code and stack space. As usual, we'll get some handles back to allow us to interact with the vmos. The zx_vmo_write syscall is used to write our executable code to one of the vmos. The physical pages represented by the vmos are created lazily; i.e. they don't actually exist until they are populated. A further point to note is that even though we have a root vmar describing the entire aspace of the process and we have two vmos representing regions within the root vmar aspace - one of which being populated with the code that we'll eventually run - neither vmo has actually been mapped into our new process address space. We'll do this now with zx_vmar_map. The stack we map in RW and the code region, WX.

kernel perspective
Virtual memory management is complex and a full review is outside the scope of this post. However, briefly and somewhat inexactly, the following takes place: zx_vmo_create will create a pageable VmObject for a sufficiently-privileged caller and, keeping with the general pattern, a related VmObjectDispatcher for that VmObject. zx_vmo_write will copy in the relevant data from user-space into kernel-space and then attempt to write it to the relevant VmObject. Ranges are verified and then the target page(s) will be faulted-in and locked while the copy takes place. If a parent to the vmo holds a relevant page, a write-fault will trigger copy-on-write and the page will be cloned. If no page exists, one will be allocated from the free list. When a mapping of the vmo is requested into a user-mode process via zx_vmar_map, the kernel will verify whether the caller has the necessary rights to map the vmo. Page permissions afforded the mapping are a combination of the permissions specified by the vmo and the requested vmar mapping (e.g. one cannot request a ZX_RIGHT_EXECUTE mapping for a vmo that does not hold this right). The requested range (if present) is checked and a VmMapping object created. The mapping itself is triggered by VmMapping::MapRange by committing all requested pages, coalescing the pages to be mapped and then flushing them all at once to the MMU. (What the MMU subsequently does with this coalesced list of addresses is most certainly outside our scope here; in a later blog post perhaps.)
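
The lazy-commit behaviour described above can be sketched with a toy model in C. Everything here is hypothetical (the toy_vmo_t type, the function names, the fixed 16-page limit) and it ignores copy-on-write, locking and the free list entirely. It only illustrates that creating the object reserves no page memory and that a write validates its range and then commits just the pages it touches.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 16u

/* Hypothetical stand-in for a vmo: a size plus one slot per page.
 * A NULL slot means that page has not been committed yet. */
typedef struct {
    size_t size;
    uint8_t *pages[TOY_MAX_PAGES];
} toy_vmo_t;

/* Creating the object reserves no page memory at all. */
int toy_vmo_create(toy_vmo_t *vmo, size_t size) {
    if (size == 0 || size > (size_t)TOY_MAX_PAGES * TOY_PAGE_SIZE) return -1;
    memset(vmo, 0, sizeof(*vmo));
    vmo->size = size;
    return 0;
}

/* Writing checks the range, then commits each touched page on demand. */
int toy_vmo_write(toy_vmo_t *vmo, const void *data, size_t offset, size_t len) {
    const uint8_t *src = data;
    if (offset > vmo->size || len > vmo->size - offset) return -1;
    while (len > 0) {
        size_t idx = offset / TOY_PAGE_SIZE;
        size_t in_page = offset % TOY_PAGE_SIZE;
        size_t chunk = TOY_PAGE_SIZE - in_page;
        if (chunk > len) chunk = len;
        if (vmo->pages[idx] == NULL) {
            /* "Fault in" a zeroed page the first time it is touched. */
            vmo->pages[idx] = calloc(1, TOY_PAGE_SIZE);
            if (vmo->pages[idx] == NULL) return -1;
        }
        memcpy(vmo->pages[idx] + in_page, src, chunk);
        src += chunk;
        offset += chunk;
        len -= chunk;
    }
    return 0;
}

/* Count how many pages currently have backing memory. */
size_t toy_vmo_committed_pages(const toy_vmo_t *vmo) {
    size_t n = 0;
    for (size_t i = 0; i < TOY_MAX_PAGES; i++) n += (vmo->pages[i] != NULL);
    return n;
}
```

In this model a freshly created two-page object reports zero committed pages, and a small write into the second page commits only that page.
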

Instances of executable code run in threads because, well, computer science, so we'll need to employ zx_thread_create to create one of those to host our execution context. Again, a thread is an object, so we get a handle. We can individually name threads as well.

All is in place to start execution. Even though threads are generally started via zx_thread_start, the initial thread in a process must be started with zx_process_start, a special wrapper for zx_thread_start which will transfer ownership of a handle from the parent process to the child. We pass zx_process_start the handle to the newly created process object, the handle to the newly created initial thread, the start address for the thread (pointing at the code in our vmo), the aforementioned object handle for transfer (in our case we just create an event object which we don't use) and a suitable initial stack pointer address (architecture/ABI-dependent). The Zircon code-base provides a helper function, compute_initial_stack_pointer, which handily computes this for us for supported architectures (currently x86-64 and arm).
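
For a feel of what "a suitable initial stack pointer" means, here is a sketch of the kind of calculation such a helper performs. It is not a copy of compute_initial_stack_pointer: the function name and the is_x86_64 parameter are made up here (the parameter exists only so both cases can be tried on any host), and the arithmetic assumes a downward-growing stack with the 16-byte alignment that the x86-64 System V and AArch64 calling conventions call for.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: derive an initial stack pointer from the base
 * address and size of the mapped stack region. */
uint64_t sketch_initial_sp(uint64_t base, size_t size, int is_x86_64) {
    uint64_t sp = base + size;   /* stacks grow down: start at the top */
    sp &= ~(uint64_t)15;         /* round down to 16-byte alignment */
    if (is_x86_64) {
        /* On x86-64, code at a function entry point expects
         * rsp % 16 == 8, as if a call had just pushed a return address. */
        sp -= 8;
    }
    return sp;
}
```

With the stack mapping from the first test run below (base 0x72df27263000, size 4096), this gives 0x72df27263ff8 in the x86-64 case and 0x72df27264000 otherwise.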

And we're off to the races.

kernel perspective
zx_process_start is responsible for launching the execution of the initial process thread; described by the thread object created with zx_thread_create. It can only be called once per process and has the unique ability to allow a single handle to be fully transferred from the parent to the child process (it is removed from the parent process dispatcher handle list and appended to that of the child). The parent process can therefore no longer act upon that handle after it has been transferred and the child is responsible for its lifetime. Initial thread execution begins by the thread dispatcher for the initial thread object marking its state as State::RUNNING and resuming the thread (it was initially created suspended). This allows the scheduler to schedule the thread for execution.
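
The move semantics of that handle transfer can be pictured with another toy model. Again, none of this is kernel code: toy_proc_t, the fixed-size handle list and the function names are hypothetical, and a handle is reduced to the koid of the object it refers to. The point is only that the handle leaves the parent's list and lands in the child's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_HANDLES 8

/* Hypothetical per-process handle list. */
typedef struct {
    uint64_t koids[TOY_MAX_HANDLES];  /* objects this process holds handles to */
    size_t count;
} toy_proc_t;

bool toy_proc_holds(const toy_proc_t *p, uint64_t koid) {
    for (size_t i = 0; i < p->count; i++) {
        if (p->koids[i] == koid) return true;
    }
    return false;
}

/* Move one handle from parent to child: it is removed from one list and
 * appended to the other. On failure neither list is modified. */
bool toy_transfer_handle(toy_proc_t *parent, toy_proc_t *child, uint64_t koid) {
    if (child->count == TOY_MAX_HANDLES) return false;
    for (size_t i = 0; i < parent->count; i++) {
        if (parent->koids[i] != koid) continue;
        parent->count--;
        parent->koids[i] = parent->koids[parent->count];  /* unordered remove */
        child->koids[child->count] = koid;
        child->count++;
        return true;
    }
    return false;  /* the parent never held this handle */
}
```

After a successful transfer the parent no longer holds the handle, so a second attempt to transfer it fails.
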

The code we're running is somewhat useless; it pegs a core by looping infinitely. Important to note though: thread termination must be done either from the thread itself (difficult, because at this stage we have no access to syscalls), from the parent job or due to the parent job having been terminated. We'll call zx_task_kill on the process handle now to terminate the runnable tasks (which in this case includes the child thread) and close all the handles that we've opened over the course of the exercise.

The full test code with error checking and debug spew is available from the links at the bottom. It'll spew something like:

Zircon Process Loader Test ONE - @depletionmode

[+] Created process object (koid: 3018)
[i] Root vmar (koid: 3019) - base: 0x1000000, len: 0x7ffffefff000
[+] Created stack region vmo (koid: 3020) - size: 4096
[+] Created code region vmo (koid: 3021) - size: 4096
[+] Wrote code to code vmo
[+] Mapped stack vmar - address: 0x72df27263000
[+] Mapped code vmar - address: 0x4273d98d0000
[+] Created thread object (koid: 3022)
[+] Started child process
[i] Thread state: 1
[*] Thread in running state - process creation successful!

(Note: A kernel object is identified by its koid - kernel object id.)

But wait! What's that you say?? No access to syscalls?? Well, that's kind of disappointing! Sure we haven't linked in libc or whatever, but can't we just have our boilerplate code call SYSENTER/SYSCALL directly?

Well no; and this brings us to one of the coolest Zircon features yet (and one of my favorites): the vdso (virtual dynamic shared object).

The vdso is a special vmo created by the kernel itself which consists of the necessary code to call syscalls from user-mode. Calling the syscalls directly will not work as the kernel syscall dispatcher will verify that the PC of the caller is situated within the vmar that describes the mapping of this special vmo into the user-mode process. So for user-mode processes to perform syscalls, control must be passed to the vdso code and have that code call the syscalls.
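
As a rough picture of that check, the sketch below models the range test the paragraph describes. It is a simplification under stated assumptions: the struct and function names are invented, and the real kernel-side validation may well be stricter than a simple "is the PC somewhere inside the vdso mapping" test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of where the vdso text is mapped in a process. */
typedef struct {
    uint64_t base;  /* start address of the mapping */
    uint64_t len;   /* length of the mapping in bytes */
} toy_vdso_mapping_t;

/* Accept a syscall only if the calling PC lies inside the vdso mapping.
 * Written as a subtraction so that base + len cannot overflow. */
bool toy_syscall_pc_ok(const toy_vdso_mapping_t *m, uint64_t pc) {
    return pc >= m->base && pc - m->base < m->len;
}
```

A PC inside our own code vmo would fail this test, which is why a bare SYSCALL instruction in the boilerplate code gets us nowhere.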

Why have this headache? Why not just let the C library or some other lower-level user-mode systems library perform this work? Well, one great reason is security! For (shell)code to invoke syscalls, it needs to use the vdso code itself. As the kernel is responsible for this vmo (and for the permissions governing how it is mapped), the kernel could choose to map an entirely different vdso into a particular process, thereby restricting the syscalls that that user-mode code is able to call. Think syscall filtering on steroids. It's possible to do things like forbidding drivers from creating processes, or processes from mapping in arbitrary pieces of mmio space, etc. How's that for attack surface reduction?! Oh yeah, and you can't map the vdso vmo multiple times or fool around with the page permissions.

(Note: Zircon support for multiple vdso variants seems to be under active dev at the moment. The current vmo for the default vdso is labeled vdso/full.)

So now that we know that we need to map in the vdso in order to have any hope of calling syscalls; let's look at how that could be done.

First things first: finding a handle to the vdso vmo for our child process. There's another wonderful helper function, zx_take_startup_handle, that allows a caller to destructively grab a handle to the vdso vmo (this too is populated in the libc bootstrap code that our parent process executed when it ran - being the normal citizen that it is and not the Frankenstein our child process has turned out to be). For simplicity, we'll use this helper in our code. The vdso format is a subset of ELF, so a regular ELF parser can be used to parse out the sections needed to map this guy (using the usual vmar syscalls) into our process. In the Zircon codebase test code (which this blog post is somewhat similar to), they just use elf_load_prepare/elf_load_read_phdrs/elf_load_map_segments to do the mapping. But the format, being a subset of ELF by design, is simple enough to hack something together ourselves.

We basically need two mappings: one for the first segment (1 page in size), which contains the ELF headers, dyn linking info and constant data, and another for the executable text segment. The rodata (first page) must be mapped RO and the text section RX. The kernel will fail any other mapping permissions.
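
Hacking something together mostly means walking the ELF program headers and picking out the loadable segments. The sketch below does that over an in-memory copy of an image. The ELF64 header layouts, PT_LOAD and the PF_* flag values follow the ELF format, but the sk_ names and the sk_segment_t type are made up for this example, a little-endian host is assumed, and the step of actually mapping each segment with the vmar syscalls is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal ELF64 layouts, declared locally so the sketch is self-contained. */
typedef struct {
    unsigned char e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version;
    uint64_t e_entry, e_phoff, e_shoff;
    uint32_t e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx;
} sk_elf64_ehdr_t;

typedef struct {
    uint32_t p_type, p_flags;
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align;
} sk_elf64_phdr_t;

#define SK_PT_LOAD 1u   /* loadable segment */
#define SK_PF_X    1u   /* execute */
#define SK_PF_W    2u   /* write */
#define SK_PF_R    4u   /* read */

/* Hypothetical summary of one segment that needs mapping. */
typedef struct {
    uint64_t offset, vaddr, memsz;
    uint32_t flags;
} sk_segment_t;

/* Collect the PT_LOAD segments of an in-memory ELF64 image.
 * Returns the number found, or -1 on a malformed image or full output. */
int sk_find_load_segments(const uint8_t *image, size_t image_len,
                          sk_segment_t *out, int max_out) {
    sk_elf64_ehdr_t eh;
    int n = 0;
    if (image_len < sizeof(eh)) return -1;
    memcpy(&eh, image, sizeof(eh));
    if (memcmp(eh.e_ident, "\177ELF", 4) != 0) return -1;  /* magic */
    if (eh.e_ident[4] != 2) return -1;                      /* ELFCLASS64 */
    if (eh.e_phentsize != sizeof(sk_elf64_phdr_t)) return -1;
    if (eh.e_phoff > image_len) return -1;
    if ((uint64_t)eh.e_phnum * sizeof(sk_elf64_phdr_t) > image_len - eh.e_phoff)
        return -1;
    for (uint16_t i = 0; i < eh.e_phnum; i++) {
        sk_elf64_phdr_t ph;
        memcpy(&ph, image + eh.e_phoff + (size_t)i * sizeof(ph), sizeof(ph));
        if (ph.p_type != SK_PT_LOAD) continue;  /* skip PT_DYNAMIC and friends */
        if (n == max_out) return -1;
        out[n].offset = ph.p_offset;
        out[n].vaddr = ph.p_vaddr;
        out[n].memsz = ph.p_memsz;
        out[n].flags = ph.p_flags;
        n++;
    }
    return n;
}
```

For the vdso one would expect this to report two segments, a read-only one and a read-plus-execute one, each of which then becomes one mapping at its offset within the vdso vmo.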

Ok, so we've got the vdso vmo mapped into our child process. If our child process wants to actually call syscalls, it needs to know where the vdso is situated.

In the implementation of the libc loader (which we're not using), a bootstrap message is passed to the child process which includes numerous program arguments - such as the vdso base address, for example. The bootstrap code (__libc_start_main) running in the context of the newly created process's initial thread will parse out the ELF sections from the vdso and set up everything necessary for the child process to immediately start calling syscalls.

We've got none of that here, so we manually provide the example program code with the address of a syscall. If you recall, in our first iteration above, we simply ran code that looped forever - until we externally terminated it with zx_task_kill. Here we'll call zx_thread_exit from within the child thread itself and cause the thread to terminate prior to the infinite loop. We'll pass in the function pointer of zx_thread_exit as Arg2 of zx_process_start. This value will be passed by the kernel to our thread code in the rsi register.
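
For illustration, child code of this shape can be as small as four bytes on x86-64. The array below is a hand-assembled sketch and not necessarily the exact code used in the linked test program; the names child_code and child_code_len are made up here. It calls whatever address arrives in rsi and falls into the infinite loop only if that call ever returns.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical x86-64 payload for the child thread:
 *   call rsi   ; rsi carries Arg2, the address of zx_thread_exit in the vdso
 *   jmp  .     ; infinite loop, reached only if the call returns
 */
const uint8_t child_code[] = {
    0xff, 0xd6,  /* call rsi */
    0xeb, 0xfe   /* jmp short, two bytes back onto itself */
};
const size_t child_code_len = sizeof(child_code);
```

These are the bytes that would be written into the code vmo with zx_vmo_write. A stricter version would subtract 8 from rsp before the call so that the callee sees the stack alignment the ABI expects at function entry.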

Some more spew:

Zircon Process Loader Test TWO - @depletionmode

[+] Created process object (koid: 3016)
[i] Root vmar (koid: 3017) - base: 0x1000000, len: 0x7ffffefff000
[i] Vdso vmo (koid: 1033) - size: 28672
[+] Mapped vdso RO vmar - address: 0x12b86280a000
[+] Mapped vdso RX vmar - address: 0x12b862810000
[+] Created stack region vmo (koid: 3018) - size: 4096
[+] Created code region vmo (koid: 3019) - size: 4096
[i] zx_thread_exit located @ 0x12b862810822
[+] Wrote code to code vmo
[+] Mapped stack vmar - address: 0x1e1a34637000
[+] Mapped code vmar - address: 0x1d886bc4f000
[+] Created thread object (koid: 3020)
[+] Started child process
[i] Thread state: 5
[*] Thread not in running state - process creation & thread exit successful!

So to recap: to create a process capable of calling syscalls, we:
  1. Created a process object
  2. Created virtual memory objects (vmos) for stack, code
  3. Mapped those vmos into the root virtual memory address region (vmar) address space (aspace)
  4. Mapped in the virtual dynamic shared object (vdso) vmo into two vmars
  5. Created a thread object
  6. Started the process
  7. Cleaned up handles

I hope you've enjoyed this little foray into process creation in Zircon. Next up I'll be looking at the driver model and how a simple device driver can be written. Stay tuned for more.