This is very exciting research. There doesn't seem to be much new information on the page (it's been posted a few times before). In this recent presentation[1] (PDF), MSR's Galen Hunt gives a nice high-level overview of Drawbridge, possible applications, and some more details on their current progress.
The Graphene Library OS[2] is a similar implementation for Linux and was released a few months ago. In particular the Graphene Host ABI[3] is adapted mostly from Drawbridge.
The "picoprocess" here seems very similar to Linux's "seccomp" mechanism for restricting the kernel API surface. Current sandboxing mechanisms on Linux (such as Chrome's various sandboxes) use seccomp to restrict almost all syscalls, and their APIs come from IPC to more privileged processes.
Two notable differences:
On the one hand, seccomp provides much more flexibility in choosing the subset of the kernel API offered to the process, rather than just saying "here are 45 syscalls".
On the other hand, Drawbridge claims to run unmodified Windows applications; they may have an efficient mechanism for trapping NT "syscalls" and redirecting them to their "user-mode NT kernel", ntoskrnl.dll. However, this might just mean that they run unmodified applications making Win32 library calls, with the libraries themselves having been modified, in which case programs making NT kernel syscalls directly would not run unmodified.
I'd really like to see a standard mechanism on Linux, similar to the "personality" mechanism, that augments seccomp with an efficient way of defining a new "syscall" layer. That would make sandboxing much simpler and more efficient.
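You can approximate that layering with seccomp today, just not efficiently: a BPF filter traps every syscall outside a small allowlist and bounces it to a SIGSYS handler, which plays the role of the user-mode "syscall" layer. A minimal sketch, assuming x86-64 Linux and glibc (the allowlist and the -ENOSYS behaviour are placeholders, and error handling is elided):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <signal.h>
    #include <stddef.h>
    #include <unistd.h>
    #include <ucontext.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    /* User-space "syscall layer": every trapped syscall lands here as SIGSYS. */
    static void sigsys_handler(int sig, siginfo_t *info, void *vctx)
    {
        ucontext_t *ctx = vctx;
        long nr = info->si_syscall;   /* the trapped syscall number */
        (void)sig; (void)nr;          /* a library OS would dispatch on nr here */
        /* Write the "result" back into RAX (x86-64 specific); we just fail. */
        ctx->uc_mcontext.gregs[REG_RAX] = -ENOSYS;
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = sigsys_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSYS, &sa, NULL);

        /* BPF filter: allow write/exit_group/rt_sigreturn, trap everything else. */
        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write,        3, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group,   2, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_rt_sigreturn, 1, 0),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

        getpid();                      /* trapped, "handled" in user space */
        write(1, "still here\n", 11);  /* allowed through to the host */
        return 0;
    }

The inefficiency is that every trapped call costs a signal delivery and return, which is exactly what a first-class mechanism for defining a syscall layer could avoid.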
Strategically this could allow Microsoft to drop lots of backwards compatibility cruft in their mainstream host OS and vastly reduce development cost and complexity.
This seems to me more along the lines of exokernel work: most of the "kernel" runs as a library in user mode.[1] The application still has a full range of functionality available. The "45 API calls" are not what the user process can access, but the interface between the untrusted user-mode kernel and the secure kernel-mode kernel.
> they may have an efficient mechanism for trapping NT "syscalls"
I believe this is not really necessary. The syscall ABI is not stable from one Windows version to the next; ABI stability is instead provided by the userspace DLLs (kernel32, user32, etc.), which are the official API applications are expected to use.
Some processes might invoke the syscalls directly, but this is a narrow use case (e.g. security software or copy-protection wrappers might use raw syscalls to bypass userspace API hooks).
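For illustration, a hedged sketch of the two routes (the file name is a placeholder; NtClose is a real ntdll export, resolved here at runtime purely for demonstration):

    #include <windows.h>
    #include <winternl.h>

    /* Signature of the ntdll stub we resolve at runtime. */
    typedef NTSTATUS (NTAPI *NtClose_t)(HANDLE);

    int main(void)
    {
        HANDLE h = CreateFileA("test.txt", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* The documented, version-stable route would be CloseHandle(h).
         * Below is the "direct" route: call the Nt* stub in ntdll, skipping
         * kernel32. Note that even this still goes through ntdll's stub;
         * only hardcoding the raw syscall number (which changes between
         * Windows builds) truly bypasses it, which is why doing so is
         * fragile. */
        NtClose_t pNtClose = (NtClose_t)
            GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtClose");
        if (pNtClose)
            pNtClose(h);
        else
            CloseHandle(h);
        return 0;
    }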
This seems to map more to Chrome's Native Client/PPAPI than to anything like container virtualization: a reduced set of pretend syscalls that actually go to an interop library that talks to the host OS. It's just missing the "static analysis to ensure it only uses those syscalls" step.
Docker isn't anything like this: Docker is a wrapper for cgroups and namespaces, which are also what LXC uses.
In fact, Docker started as a wrapper around LXC. It literally configured and shelled out to lxc-start in order to orchestrate containers.
This, however, is very different from cgroups/namespaces.
What the Drawbridge paper describes is a full user-mode kernel. If you want the analogous implementation on Linux, look at User Mode Linux, or the Graphene stuff that has already been linked to.
> "why not make an operating system that has only 45 syscalls?"
Because much of Microsoft's licensing revenue is contingent upon continuing to support the edge cases that are inevitably not part of the set of programs that can be dropped into such a sandbox without problems.
You missed the joke. NT was originally a pretty hardcore microkernel, with even things like graphics drivers being in userspace. However, in the name of performance, more and more stuff was brought into kernel space.
NT was never a true microkernel (much less a hardcore one), and was never intended to be. It was never designed to have any protection between its internal "subsystems" (file system I/O, security, HAL, drivers, etc.) — everything runs in kernel mode, in the same address space, and communicates using direct calls, not IPC.
NT is sometimes mistaken for a microkernel partly because of the graphics driver problem, and partly because it contains a module which Microsoft actually refers to as "the microkernel". This part consists mostly of the scheduler.
What is true about the NT kernel is that it's modular, with strict internal API separation between each subsystem, and that kind of design was (as far as I know) inspired by actual microkernels, as was the idea of hiding the kernel behind "OS personalities" such as Win32.
IRPs are the kernel "IPC" mechanism used by NT, and for whole classes of drivers/subsystems IRPs are the core processing mechanism. The I/O manager's calls to process IRPs could just as well be traditional microkernel message-passing APIs, complete with task switching and message copies.
So while NT is not a microkernel, at the kernel API level it basically looks like one; the only thing missing is that the pieces are not actually isolated from each other.
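To make that concrete, here is a hedged sketch of a WDM-style dispatch routine (a bare skeleton; it would need the WDK to build): the I/O manager hands the driver an IRP that looks a lot like a microkernel message, except that it arrives by direct call in the same address space.

    #include <ntddk.h>

    /* A read request arrives as an IRP: a self-describing "message" whose
     * parameters live in the current stack location. The driver receives
     * it, acts on it, and "replies" by completing it. */
    NTSTATUS DispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PIO_STACK_LOCATION sl = IoGetCurrentIrpStackLocation(Irp);
        ULONG requested = sl->Parameters.Read.Length;   /* message payload */

        UNREFERENCED_PARAMETER(DeviceObject);
        UNREFERENCED_PARAMETER(requested);

        Irp->IoStatus.Status = STATUS_SUCCESS;
        Irp->IoStatus.Information = 0;            /* bytes actually "read" */
        IoCompleteRequest(Irp, IO_NO_INCREMENT);  /* the "reply" */
        return STATUS_SUCCESS;
    }

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(RegistryPath);
        /* Register the "message handler" for read requests. */
        DriverObject->MajorFunction[IRP_MJ_READ] = DispatchRead;
        return STATUS_SUCCESS;
    }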
There's a leak[1] talking about Windows 9 possibly being partly free, with some features subscription-based. That would give them an incentive to solve the security issues.
It is important to note how this could shake up the current state of the application virtualization market.
There is no Docker-like solution for Windows. All the big players (VMware, Microsoft, Symantec, etc.) do tricks to isolate applications: instrumenting API calls and adding filter drivers. With these solutions less than 70% of applications can be virtualized, and the process can be really difficult.
Huh? Can you elaborate on what you mean by these (App-V and other Windows sandbox/packaging tools) not being as complete as Docker?
Because from what I've seen of the current state of Docker, non-trivial Linux applications seem to have issues in Docker as well, because they depend on specific things that are not namespaced well (/sys manipulations, ioctls, or filesystem-specific APIs, for example).
Docker seems to work well as long as one stays close to web-server functionality (i.e. LAMP-like stacks, which tend to only manipulate network sockets and ordinary files).
Yes, it is very simple. Docker.io uses LXC, where the virtualization layer lives in kernel space, while applications such as VMware ThinApp work at the user level, intercepting Windows APIs at a higher level than the kernel. App-V and SWV add a filter driver as a way to sandbox the registry and filesystem.
One difference in approach is that, for example, with Docker.io you can have your own isolated network interface, while with the current Windows approaches this is not possible.
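That difference is easy to demonstrate, because the isolation is a kernel primitive rather than API interception. A minimal sketch, assuming Linux with CLONE_NEWNET support and root privileges: the child is cloned into a fresh network namespace and sees only its own loopback device, something no user-level interception layer can fake.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Child runs in its own network namespace: `ip link` shows only an
     * isolated loopback device; the host's interfaces are invisible. */
    static int child(void *arg)
    {
        (void)arg;
        execlp("ip", "ip", "link", (char *)NULL);
        perror("execlp");
        return 1;
    }

    int main(void)
    {
        char *stack = malloc(1024 * 1024);
        if (!stack)
            return 1;
        /* CLONE_NEWNET gives the child a fresh network namespace. */
        pid_t pid = clone(child, stack + 1024 * 1024,
                          CLONE_NEWNET | SIGCHLD, NULL);
        if (pid < 0) { perror("clone (need root?)"); return 1; }
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }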
Not quite. Bromium's solutions make use of VT-x/VT-d/EPT, whereas this is a purely para-virtualised approach. At least from what I can see; it's possible it uses VT-x/VT-d/EPT to implement the process isolation, it's just somewhat unlikely given how it's presented.
So, is this a means to bridge the gap between Hyper-V, Hyper-V app streaming, and Docker on the Windows side? I'm kinda confused about what the use case is compared to other existing product offerings.
I suspect that would run into copyright problems, as the Win32 replacement is still owned by Microsoft. You could re-implement the 40-some-odd kernel calls, but the real magic here is the full Win32 replacement, with its 800+ calls.
At this point I doubt Wine would be interested, unless somehow Microsoft released most of it as OSS.
I don't think so. The idea is to provide for Windows the same kind of lightweight containerization that Docker/LXC provide on Linux. So if it were to work the same way, you'd install IIS into your container and run the container on a host. Having read through the post now (it doesn't have a lot of detail), it seems to me that the picoprocess and library OS concepts are a consequence of not having true kernel-level support for namespaces and something like cgroups. Docker and LXC containers don't host a minimal OS; they share the existing kernel in well-defined ways.
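To illustrate "sharing the existing kernel in well-defined ways": a cgroup is just a directory in a kernel-provided filesystem. A hedged sketch, assuming a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory and root privileges (the "demo" group name is made up, and on cgroup v2 the paths and file names differ):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(val, f);
        fclose(f);
    }

    int main(void)
    {
        /* Create the group, cap its memory, and move ourselves into it. */
        mkdir("/sys/fs/cgroup/memory/demo", 0755);
        write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
                   "67108864");                      /* 64 MB cap */
        char pid[32];
        snprintf(pid, sizeof pid, "%d", getpid());
        write_file("/sys/fs/cgroup/memory/demo/tasks", pid);
        /* From here on this process (and its children) still share the
         * host kernel, but are accounted against the 64 MB limit. */
        return 0;
    }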
[1] http://vee2014.cs.technion.ac.il/docs/VEE14-present601.pdf
[2] https://github.com/oscarlab/graphene
[3] https://github.com/oscarlab/graphene/wiki/Graphene-Host-ABI