SeL4 is old news - not a criticism, but has anyone added another formally proven layer or component? (Edit: I mean new components beyond the microkernel, not improvements to the microkernel.)
Also, I suspect some people - maybe some on HN :) - get emotional overload when they see the word 'proof' and their intellectual functions stop. It's not a panacea for the infinite problem of secure IT; it isn't a way to create or discover a perfect and flawless diamond of software. IIUC it means it's proven to meet specific requirements in specific conditions, and those requirements and conditions can be quite narrow; and it says nothing about other functions and conditions that are out of spec. Is that roughly correct?
What does it mean in practical terms? What does a security professional see when they see 'formally proven software'?
What are the specs that SeL4 meets (no, I haven't looked at the OP in a long time)? Isn't that the essential information here?
Being "formally proven" to be free of various flaws did not make seL4 immune to memory corruption flaws. Despite the formal proof, a memory corruption flaw was found a few years ago. The PRs for the commits fixing it and amending seL4's proof are public:
Interestingly, the PR fixing "register clobbering" in memory was not labelled bug, so it is not listed when you filter by "bug".
I used to think seL4 was immune to these issues, given that its proof allegedly says it is, but upon seeing this, I have come to think that the proof is not as comprehensive as the wider community has been led to believe. That said, seL4 is still a very impressive piece of software.
Finally, to answer your question. The specification that seL4 meets is published on github:
I do not see how that is a kernel memory corruption. I do not understand the code in context, but it appears to just be a case where the kernel incorrectly clobbered a userspace register, which could result in incorrect userspace code execution. An error in that class would not inherently violate the system safety properties of the kernel itself.
If so, that would only fall under a functional specification problem in the userspace API, making it hard to use that API in the intended/desired way. That would make it hard to use the kernel and verify your system using the kernel, but it would not constitute a kernel memory corruption, which is what safety specifications and proofs of the kernel itself are concerned with.
I said that a memory corruption flaw was found in seL4. This is a fact. If you read the patch, you will see that a message that had just been created was being overwritten (which is explained in a code comment). This occurs inside the seL4 kernel code, and the proof previously permitted this behavior (according to one of the developers’ comments on the PR). The proof had to be amended to disallow this (which is the second PR). You should see this in the links.
You can not just assert it is a memory corruption without explaining how it is a memory corruption.
It appears to be a case where a wrong, but legal output is being stored. That does not, inherently, constitute a memory corruption.
For example, suppose I have a sorting function. I pass it a vector of integers by reference and I intend it to modify it into a sorted vector of integers. Instead, that function zeros every element of my vector. That is a sorted vector, so it technically meets the specification. That function “corrupted” the contents of my vector. That is not a “memory corruption” bug. That is just a plain bug where I output the wrong thing into memory.
A memory corruption bug needs to violate the explicit or implicit invariants of the memory being “corrupted”. Out-of-bounds write, use-after-free, bypassing class interface, unguarded shared parallel access, etc. Writing incoherent, but legal values through provided pointers to uniquely owned memory is wrong, but not “memory corruption”.
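To make that distinction concrete, here is a minimal C sketch (purely illustrative, not seL4 code): the first function stores a wrong but legal value through memory it legitimately owns for the duration of the call, which is a functional bug; the second performs an out-of-bounds write, which is the kind of error the enumerated memory-safety properties are about.

    #include <stddef.h>
    #include <string.h>

    /* Wrong output, but every store stays inside the caller-provided
     * buffer: a functional bug, not a memory-safety violation. */
    void wrong_sort(int *v, size_t n)
    {
        memset(v, 0, n * sizeof *v);  /* an all-zero vector is "sorted", yet wrong */
    }

    /* Memory corruption: the last iteration writes one element past the
     * end of the buffer (an out-of-bounds write). */
    void oob_write(int *v, size_t n)
    {
        for (size_t i = 0; i <= n; i++)  /* off-by-one: i == n is out of bounds */
            v[i] = 0;
    }

    int main(void)
    {
        int a[4] = {3, 1, 2, 4};
        wrong_sort(a, 4);      /* legal but wrong result */
        /* oob_write(a, 4);       would clobber adjacent stack memory */
        return 0;
    }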
Furthermore, this is not a definition game where we are disagreeing on the definition of “memory corruption”. They do not claim blanket safety from “memory corruption” which may be a class up to interpretation. They claim safety from [1]: buffer overflows, null pointer dereferences, pointer type confusion, memory leaks, and other specifically named and defined forms of memory corruption errors. Please identify which specific enumerated memory safety property was violated.
Again, I am not asserting it is not a memory corruption bug of some class. But it is very unclear from the patch and comments that the stated bug was actually a memory corruption without significantly more contextual detail. I literally do kernel development professionally and even I can not tell from the patch out-of-context; there is no way almost anybody else would be able to tell without more detail.
Most who read that would expect the bug fixed in the linked GitHub PR to be impossible. The loophole was that the specification had a flaw, which was only mentioned as a vague hint about “unexpected features”.
You keep dancing around the question of it being a memory corruption despite confidently asserting it as a clear, indisputable fact. Please support your claim by identifying the specific memory safety property that was violated and how it was violated.
I provided citations. Use them and do your own research. I think you will find the answers in what the seL4 developers published, but I have no interest in spending days or even weeks pointing to things until you are satisfied.
You made a strong claim which you have failed to support adequately and yet have continued to double down on.
I did my research and presented my own interpretation, which fundamentally disagrees with yours. You have refused to address it, instead continuing to assert your interpretation without further detail. That does not prove my interpretation is correct, but it does indicate that the citations and evidence you have provided do not clearly lead to your conclusion. You have neither identified the errors in my interpretation nor directly demonstrated the correctness of yours, which could be done with a simple counter-example showing what memory is corrupted.
You claim it is indisputable fact, yet you claim it would take "days or even weeks" to present a clear example of the corruption. I do not see how you can claim the bug is clearly and indisputably a memory corruption bug that anybody can see from the citations you provided when it would take even you so long to provide a clear example.
I will say this. The seL4 developers made the strong claim. Evidence that it is not as strong as people who heard about it thought is not a strong claim; it is just being realistic. The seL4 developers themselves say that the properties they list are things that come out of the proof, so looking only for properties that they explicitly promised is a backward way of going about things, since they did not claim to exhaustively list all properties.
If you want to understand things to your satisfaction, read the proof and figure it out yourself. I have already understood things to my satisfaction, and the double digit number of upvotes that I have received suggests to me that my explanation was to others’ satisfaction as well. You likely have many days or weeks of work ahead of you to answer a passing curiosity of yours whose precise answer is not going to change anything for anyone. Have fun.
writing into memory that doesn't belong to you is memory corruption.
in this case: kernel (induced) memory corruption.
and it's really a bad thing:
functional safety technically boils down to the job of the kernel to provide "spatial and temporal freedom from interference", meaning all subsystems are isolated from each other. no "process" can write to another process, no process can starve another process. in this perspective, the kernel is another (special and privileged) process on the system.
a kernel that randomly writes to user space processes is inherently unsuitable for functional safety.
it's a huge no no. a failure in the goals and purpose of the kernel.
and finding this here specifically heavily shook confidence in formal specifications and formal proofs of compliance to these specs.
Again, the nature of the bug has not been appropriately established. It appears to me to be a bug where an output register that the kernel is expected to write is written with an incorrect, but legal, value in certain cases. If my interpretation is correct, which it might not be, that is just a wrong output, not a "memory corruption". The claim of "memory corruption" needs a clear, detailed explanation of what memory safety property is violated and how it violates that.
Second, no. The spatial separation guarantees you want from a kernel are that a domain can not induce spatial interference in a different domain that is isolated from the former. Internal memory corruption of a user domain, even uncontrolled asynchronous corruption by the kernel, makes it not very useful, but does not violate the protection property. This is a case where it is important to distinguish between "safety" and "usability". It is "safe", but not very "usable".
If your memory corruption only occurs during specified system calls, then that problem is further reduced, making it easier to "use" the system. If the memory corruption is highly constrained, say it only corrupts in-bounds of passed in pointers, then it is easier to "use" once again. If there is no memory corruption, but it instead sometimes outputs invalid garbage, that is easier to "use" once again. If it outputs legal, but incorrect values, that is easier to "use" once again. None of these affect "system safety", but all are important bugs to fix to enable the creation of functional, usable systems.
where I come from, the kernel must not, ever, write to unwanted locations, even if in theory it could.
wrong data, as you point out, is a completely different point of discussion. that could be a bug in the function, which may or may not be safety critical, but hey, depending on the function you call, a wrong value could also come from some other process and the kernel would just be the messenger of the false data, not its origin. no, "wrong value" is not what I mean.
and a bug in the user process, writing data to the wrong place, also is not what I'm talking about.
what I consider a violation of the functional safety requirement of spatial freedom from interference on the kernel is an unexpected "wrong location": the kernel "surprise surprise" writing to some address or some register (!) the kernel was not supposed to modify --- these _both_ are a violation of spatial freedom from interference, in the functional safety sense, to my limited understanding. "your dance area, my dance area" type of violations.
this reading comes from discussions about how, if at all, it might be possible to achieve this with a monolithic FOSS Linux kernel. There the page table can be set up to give quite some protection, even protecting user space from the kernel, but then who ensures the page tables aren't corrupted "somehow" by some rando kernel module? And similar discussions exist about how to reliably save and restore user space registers on context switches into the kernel...
but that's not what I wanted to get into. my point rather is: user space must be safe not only from other processes but also from the kernel. and that's something a microkernel can prove much more easily than a monolith.
It is a nuanced point. Certainly, you want to have an API that conforms to specifications and does not ruin your desired program invariants. But what if it does? What if you make a specification error? How do you modularize your specification so errors in one part of the specification do not invalidate the rest?
For instance, suppose I have a kernel with spatial safety. I then add the system call, “Corrupt random memory in my own process at a random time”. Is that kernel still safe? All that function can do is destroy the invariants of the calling program. It can not turn an innocent isolated program malicious no matter how malicious another process is or becomes. What if nobody calls that function?
It would certainly be hard to use such a function for any purpose. It may allow a malicious input crafted by a communicating malicious process to cause a problem in an innocent, but incorrect, program. But a correct, isolated program, using APIs correctly, would be unaffected no matter what any program does, even with access to that new “memory corruption on demand” API.
So, in this case, it would be fair and reasonable to classify the kernel as safe despite having an absurd system call in its specification. And, as basically any other API with a memory corruption is going to be less absurd, it only makes sense to classify those as “safe”, though very likely “unusable” or “hard to use”.
The distinction then becomes being able to cause memory corruption in isolated, correct programs. In that case, a program is no longer safe even if it proves itself correct. You must prove the entire system is correct and all other programs correct to establish any guarantees about any individual program. You can no longer “isolate” or “separate” your analysis which makes the problem even more intractable. That is the first problem you want to solve which is what this definition is intended to achieve.
That resulted in a memory corruption flaw in the C code. If you build a bridge to a specification and the bridge collapses, does saying that the specification was wrong in any way lessen the reality that the bridge failed?
What's odd about this argument is that such bugs would be at risk of reintroduction in any other piece of software. Indeed, each release is expected to introduce new bugs and vendors typically sell service contracts based on the perpetual brokenness of existing software engineering processes. Whereas formally verified software typically has single digit numbers of bugs ... ever.
Sure, you get to score rhetorical points about the proof being incorrect once. But this bridge is now fixed and won't fail in that same way again.
Cartels and governments regularly hack phones and spy on and even kill activists. The current processes you are defending get people killed and are undermining civil society. SeL4 is a big deal and definitely worth the investments needed to make such reliability the norm instead of accepting the continual bridge collapses that occur in Linux land.
Tell me how process isolation prevents SQL injection. I could ask about any of a great variety of security issues, but I think SQL injection is a great choice for debunking “process isolation is the bedrock of all other security guarantees”. “We do not have to worry because we have process isolation” is basically this:
I suspect you did not mean to say that, but your comments can be interpreted to mean that, especially when you suggest people will die if seL4 is not used as if nothing else matters, and inject that into a discussion of whether seL4 is free of flaws (it is not).
Using seL4 is like flattening the ground on which a building foundation will sit. Just because the ground was flattened to within microns of flatness does not mean there is nothing else wrong with the foundation and building that are built on it.
Perhaps I am being pedantic, but you deserve this after suggesting the mere act of giving an opinion on seL4 made me an accessory to murder.
"Tell me how process isolation prevents SQL injection."
You need process isolation to make the system secure enough to even require SQL injection. Otherwise, they can send packets to attack anything else on the system, get full privileges, and swap your SQL-enabled application with their own. While taking your database.
Further, techniques to block SQL injection work even better when they're in an isolated partition the attacker can't escape from. Even better if data can only flow in a specific way. Even better if it can be reliably cut off and recovered upon detection of a failure. That's why high-assurance guards used to sit between sensitive systems and untrusted networks. They did all of these things.
Now, for application-level issues. The Separation Kernel Protection Profile, which inspired designs like INTEGRITY-178B and seL4, called for secure middleware to protect communications between partitions or systems. That included message routers that enforced the security policy within the system. That would include protocol-level defenses, like for SQL requests, to protect communications outside the box.
seL4's isolation was, by itself, never intended to stop unrelated attacks. The vision, described in NICTA's Trustworthy Embedded Systems program, was to combine many components to do that. I'll add that the seL4 team did a great job documenting the many points where they may fail on their Assumptions page. They were much more open about it than most competing projects.
The systems we're talking about don't use SQL to communicate. They use data structures in a wire format of some sort because they are embedded systems. Don't pretend like we're talking about web development here, because we're not. SeL4 is designed for hard real time embedded systems and uses formal methods, which have and continue to be a mainstay approach in places which value security and reliability. And they really mean _reliability_ as in reliably wrong or reliably right.
So being able to adjust the spec and fix the system is actually a feature because it means your prover acts as a high-quality test.
As always, with systems like this if you use a formally-proven operating system kernel you will probably assess which other code paths in your application code are on the critical path and use formal methods on those components. And you will also be doing other less formal work like FMEA to complement. So you're not reliant on any one single method of assessing whether the system works, whether it meets the specification, and whether it can be relied upon to do the right thing in a bad situation.
The systems we're talking about don't use SQL to communicate. They use data structures in a wire format of some sort because they are embedded systems. Don't pretend like we're talking about web development here, because we're not.
The other guy mentioned phones, which run UNIX clones. It is fairly obvious that we were talking about computing in general, which includes all of the things you claim to be out of scope.
SeL4 is designed for hard real time embedded systems and uses formal methods, which have and continue to be a mainstay approach in places which value security and reliability.
Most embedded systems (e.g. the RP2040, the MSP430, etcetera) do not even have virtual memory protection, which makes seL4 a non option for them. When seL4 is an option, it is a tiny fraction of the code needed for the machine. All of that other code matters, especially since it is where the attack surface area is. seL4 on such machines is like the admin account in this picture; compromising what actually matters does not require compromising seL4:
While seL4 is good at what it does, what seL4 does is extremely minimal and it is increasingly obvious that a vocal minority overrates it. Using seL4 is not going to stop vulnerabilities in userspace or even in drivers that were previously in kernel space, or in VMs running on top of it, but people who claim systems are insecure without seL4 insinuate that it will. They are like the Apple Store employee who once suggested to me that if I run Windows on Apple hardware, it will have none of the problems Windows normally has.
"process isolation is the bedrock of all other security guarantees" means process isolation is a necessary (in practice) condition, not a sufficient condition.
"Tell me how process isolation prevents SQL injection." is a non-sequitur because that is only coherent if they claimed process isolation is a sufficient condition.
To disagree with their position, you need to demonstrate a system where there is no process isolation and where maliciously crafted inputs intended to cause SQL injection can not result in undesired database access. So, one process gets to arbitrarily modify the contents of the SQL server process including changing the code and you need to prevent it from wrecking your database.
"process isolation is the bedrock of all other security guarantees" means process isolation is a necessary (in practice) condition, not a sufficient condition
The attack surface of networked machines is unaffected by process isolation. It is only when you are already inside that process isolation has any use. Thus the idea that it is the bedrock is wrong.
"Tell me how process isolation prevents SQL injection." is a non-sequitur because that is only coherent if they claimed process isolation is a sufficient condition.
It is a counter example that shows that process isolation is not the bedrock that it was claimed to be. Id sequitur.
To disagree with their position, you need to demonstrate a system where there is no process isolation and where maliciously crafted inputs intended to cause SQL injection can not result in undesired database access. So, one process gets to arbitrarily modify the contents of the SQL server process including changing the code and you need to prevent it from wrecking your database.
Non sequitur applies here. The protection against SQL injection and process isolation have no link whatsoever. The fact that there is no link was my entire point. The example did not need to be SQL injection either. Telling me that I need to prove a link that is the opposite of what I was saying does not follow at all.
As I previously said, I could have cited just about any common vulnerability and process isolation would have done nothing about it. For example:
I only cited SQL injection because I could not think of any claim more absurd than the claim that process isolation is the bedrock of guarantees against SQL injection, but if process isolation were the bedrock of all other security guarantees, then it would have some relationship to protection against SQL injection.
You do not appear to understand the difference between "necessary" and "sufficient".
Process isolation is not sufficient (i.e. adequate on its own) to make a SQL server secure against maliciously crafted inputs intended to cause SQL injection for the purpose of undesired database access because it does not sprinkle magical security pixie dust on it.
Process isolation is necessary (in practice) to make a SQL server secure against maliciously crafted inputs intended to cause SQL injection for the purpose of undesired database access because otherwise an attacker can just modify the code of the SQL server to directly do the undesired database accesses. Without process isolation your SQL server security/design can be bypassed.
The amazing thick walls of your castle do not matter if you built it on sand. The amazing solid bedrock is not sufficient to claim your castle is defensible if your castle is a wooden shack, but it is necessary for building a defensible castle.
A lack of process isolation does not allow your SQL server security to be bypassed when the adversary is not on the machine in the first place. The adversary would usually fill out a field on a form, which is where SQL injection attacks usually occur:
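To make the "no link" point concrete, here is a minimal sketch of the application-level defense being contrasted with process isolation (SQLite's C API is used purely for illustration; the table and column names are made up): the injection defense lives entirely in how the query is constructed, and it works, or fails, identically regardless of what the kernel does about isolation.

    #include <sqlite3.h>
    #include <stdio.h>

    /* Vulnerable: user input is pasted into the SQL text, so input like
     * "x' OR '1'='1" changes the meaning of the query. */
    int lookup_unsafe(sqlite3 *db, const char *user_input)
    {
        char sql[256];
        snprintf(sql, sizeof sql,
                 "SELECT secret FROM users WHERE name = '%s';", user_input);
        return sqlite3_exec(db, sql, NULL, NULL, NULL);
    }

    /* Safe: a parameterized query treats the input as data, never as SQL.
     * Nothing here depends on the kernel or on process isolation. */
    int lookup_safe(sqlite3 *db, const char *user_input)
    {
        sqlite3_stmt *stmt;
        int rc = sqlite3_prepare_v2(db,
            "SELECT secret FROM users WHERE name = ?;", -1, &stmt, NULL);
        if (rc != SQLITE_OK)
            return rc;
        sqlite3_bind_text(stmt, 1, user_input, -1, SQLITE_TRANSIENT);
        rc = sqlite3_step(stmt);
        sqlite3_finalize(stmt);
        return rc;
    }

The vulnerable version is broken even on a perfectly isolated system; the parameterized version holds up even without isolation, which is exactly the sense in which the two concerns are independent.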
What? You'd put more faith in a kernel that has no formal verification of its conformance to a secure spec and has orders of magnitude more code in it that can go wrong? What is the basis for this confidence?
Does it even matter when the attack surface is in userspace? Formally verifying seL4 against a potentially buggy specification does not provide any assurances about the user space code that must run on top to produce a working system.
If you subscribe to the idea that the mere use of seL4 makes things more secure, you might use seL4 as a hypervisor beneath Linux on machines that would previously have run Linux on bare metal and then you would be using seL4, but you would not have made anything more secure. The increase in the TCB would suggest such a setup is less secure.
The security of a machine using seL4 is dependent on what runs on top of it because seL4 is not the boundary between attackers and their targets. The only exception would be if you are a virtual machine provider and then seL4 would only be part of the boundary, rather than the whole boundary.
> but you would not have made anything more secure.
This does not follow. Why do you think a hypervisor with no verification to a spec is necessarily more secure than a hypervisor with a verification to a spec? Can you somehow quantify the latent bugs in the non-verified code?
The “that would previously have run Linux on bare metal” part of that sentence was important. Try reading it again. It never compared seL4 as a hypervisor to another hypervisor. It compared using seL4 as a hypervisor to not using a hypervisor at all on a machine that does not need a hypervisor. The addition of seL4 there is not only pointless, but possibly harmful if you consider the increased TCB size.
By the way, seL4 as a hypervisor uses components that are not verified to a specification, so saying that it is verified to a specification is not quite correct. They also seem to be proof of concept examples rather than a production solution when I look at their documentation:
Why is it that everyone on the “seL4 makes everything secure” bandwagon seems to have a tunnel vision that ignores that seL4 is useless without a large amount of unverified software that runs on top and that security will typically depend on that software, rather than seL4?
What's odd about this argument is that such bugs would be at risk of reintroduction in any other piece of software.
They can be reintroduced into seL4 too. An amendment to the specification that adds a new feature can say the bug is okay without anyone realizing it.
Sure, you get to score rhetorical points about the proof being incorrect once. But this bridge is now fixed and won't fail in that same way again.
The proof itself proved something other than what was advertised, which allowed for this issue to exist. If you look at the bug tracker, it appears that this has happened more than one time.
While seL4 is an amazing achievement, it is clear that it is not the bulletproof software that people thought it was. There is no proof that the specification is problem free after all.
Cartels and governments regularly hack phones and spy on and even kill activists. The current processes you are defending get people killed and are undermining civil society.
This is a gross misrepresentation of what I said. Having an objective evidence based opinion that seL4 is not as good as advertised has nothing to do with people dying.
SeL4 is a big deal and definitely worth the investments needed to make such reliability the norm instead of accepting the continual bridge collapses that occur in Linux land.
seL4 does not achieve this. It is the equivalent of less than 0.001% of all of the code run on machines. It makes no guarantees that the >99.999% that people use is okay.
> If you build a bridge to a specification and the bridge collapses, does saying that the specification was wrong in any way lessen the reality that the bridge failed?
You framed this as a loss of confidence in the techniques that prove the absence of various flaws. This bug should not undermine such confidence as the bug was always in the place (the spec) that would allow such bugs.
When seL4’s proof was published, people regarded seL4 to be flawless software and the fact that a problem in the specification would lead to issues was largely disregarded.
Now that it is clear that the specification has had problems, seL4 should not be considered issue free, since there is no proof that its specification is issue free and there is proof that its authors made a nonzero number of mistakes.
Anyone minimizing the impact of this flaw by saying “the specification was at fault here” completely misunderstands why the rest of the industry thought highly about these techniques in the first place. They had promised to produce flaw free software. Now that it is clear that is not true, the level of confidence the industry places in these techniques naturally needs to be adjusted. We cannot ignore the massive caveat of the specification needing to be flaw free anymore.
> When seL4’s proof was published, people regarded seL4 to be flawless software
No, that's not true. The framing was always that it surpassed an incredibly high safety bar that no other kernel had surmounted, far beyond even security certifications needed for protecting national secrets. This is still true.
> the level of confidence the industry places in these techniques naturally needs to be adjusted
No it doesn't, people using formal methods understand perfectly well what confidence verification should bring and the caveats that are necessary. If by "industry", you mean Joe-know-nothing-about-verification-shmoe, then I'm not sure why his opinion was particularly important to begin with.
No, that's not true. The framing was always that it surpassed an incredibly high safety bar that no other kernel had surmounted, far beyond even security certifications needed for protecting national secrets. This is still true.
A number of people who read that interpreted it to mean seL4 is flawless. Now that flaws have been found, those people have had to re-evaluate their views. Those who understood seL4’s guarantees correctly are an extreme minority.
No it doesn't, people using formal methods understand perfectly well what confidence verification should bring and the caveats that are necessary. If by "industry", you mean Joe-know-nothing-about-verification-shmoe, then I'm not sure why his opinion was particularly important to begin with.
>99.99% of the industry does not use formal methods, so those that understood the guarantees were weak enough to allow “impossible bugs” in the real world were an extreme minority. The rest of us had the view that I described.
You can deride people who do not do formal verification, but I guarantee everything you use has unverified code written by people who do not do formal verification. If we restricted computers to formally verified code, we would not have working computers, embedded or otherwise.
the point is the people not using formal methods yet. automotive, for example, continues its amazing ASPICE / ISO 26262 rain dances to tell itself systems are safe, and does not allow replacing that with formal specs and proofs of compliance. that's just complementary; the "real safety work" has to be done with paperwork and V-model tracing.
Many bridges had to collapse until we learned how to build them.
As for seL4, the proofs (which are actually the state of the art in that field, yet can definitely be further perfected) are just one of the pillars supporting it; e.g. the size of the TCB is another.
We do not give up on the goal because there was a mistake once. Nor is the goal not worthy; this is the absolute best the world has atm.
I believe seL4 is verified only for certain versions and on certain architectures. There are no bugs found at the C source code level for these builds of seL4.
This issue appears to have affected all architectures. This issue was present in the specification against which seL4 was verified. You can say that there are no bugs by virtue of it following the specification, but if the specification was wrong and in this case it was, then were there really no bugs?
Even if it's not exactly seL4, there's good value in taking inspiration for design elements. It would still be a lot more robust than commodity operating systems.
We should look at formal verification like everything else: in terms of statistical effectiveness. It's not really important whether or not there are _no_ bugs. What's important is how many bugs there are for each unit of "effort."
For example, if this were hypothetically the only bug that was ever found, then that would be a pretty damn good argument that formal verification is effective. Because the bug rate of other non-verified operating systems is much higher.
Has anyone added another formally proven layer or component? Yes, they're being added all the time. Recently added features:
- Support for a bunch of new architectures, including RISC-V.
- Mixed criticality scheduling, which provides capability-based access to CPU time, mechanisms to limit the upper bound of execution of a thread, ensuring that high-criticality tasks have priority and access to necessary resources, and allowing "passive servers" which run on scheduling time donated by the caller.
- Microkit, a new verified abstraction layer which made it much much easier to build actual systems using seL4.
- Very recently, the Device Driver Framework, a device-driver template, control and data plane implementation, and tools for writing device drivers and providing device virtualisation for high-performance I/O on seL4.
Formal verification can guarantee that specific requirements hold in specific conditions. It's true that in general such requirements and conditions can be quite narrow. But seL4 itself has a whole lot of proofs about it, covering a wide range of properties that one would want in a kernel, and these guarantees hold under very weak assumptions. Even the correctness of the C compiler is not assumed, they have a separate tool which looks at the compiler output and proves that the compiled binary behaves correctly with respect to the required C semantics.
The requirements that seL4 meets include the following: the binary code of the seL4 kernel correctly implements the behaviour described in the abstract specification and nothing more. There are no buffer overflows, no memory leaks, no pointer errors or null pointer dereferences, no undefined behavior in the C code, no termination of the kernel apart from the explicit ways enumerated in the spec, etc. The specification and therefore the seL4 binary satisfy the security properties of integrity and confidentiality: the former means that there is absolutely no way for a process to change data that the process has no explicit permission to change, and the latter means that a process cannot, in any way, read data it has no explicit permission to read. It even shows that a process without permission cannot indirectly infer the data through certain side channels. Beyond security, the expected worst-case execution-time guarantees and scheduling properties are also met.
The developers of seL4 have been in funding hell for years. Most of their work was DARPA research for remotely controlled drones. The US military would very much like drones that can't be hacked.
Their current work is on LionsOS, which is aimed more at broader adoption: https://lionsos.org/
For example: No buffer overflows, null pointer exceptions, use-after-free, etc. On ARM and RISCV64 not even the C compiler has to be trusted, because functional correctness has been proven for the binary. And there are more proofs besides functional correctness.
https://docs.sel4.systems/projects/sel4/frequently-asked-que...
Lots of type-level programming for tracking resources, hardware access, and capabilities at compile-time and trying to bring some of the underlying kernel guarantees up to the compiler because finding out about and debugging issues at runtime was just the absolute worst.
In high-assurance, formal methods were one of many techniques used for the system. It was reduced complexity, precisely-specified behavior, likewise for security policy, proofs that the design/code fulfilled it, exhaustive testing, leak analysis, pen testing, and the ability to recheck all this.
That's so hard it was mostly done for only kernels commercially. From there, your real-world use will involve user-level components, protocols with interactions, and so on. Add one thing to the proven TCB and maybe your proof disappears. So, every trusted component has to be rigorously designed. Then, all the interactions for a holistic argument.
Most don't do that. For example, seL4 hosting a web browser won't help you if it's a web application. IBOS and a WAF on seL4, all proven, might cover that. Harder argument. Most of these projects never go beyond the kernel or TLS stack or whatever.
You can do stuff with that. Look up Nizza Secure Systems Architecture for a paper with many examples. Yet, they usually don't cut it. They're not compatible with popular apps either. So, most people just harden Linux or BSD or Windows.
If you mean web application firewall, that seems too complex to be proven? With infinite possible inputs, inspecting it would seem to require too much code to prove - unless it merely verifies traffic matches a short whitelist.
Why is high assurance talked about in the past tense? Is the field moribund? (I don't mean to point out a typo, if that's what it is.)
It could potentially be proven. We think it's unlikely due to cost or complexity. So, the field's default position was that most things would be low to medium assurance. Only select components would be high assurance. I hope that answers your broader question about adoption of seL4-like techniques for larger systems.
Far as past tense, I left high-assurance security when I found out Jesus died for my sins and rose again. Had to learn an entirely new life with a focus on human beings instead of tech. I'm recently considering going back into security since I spend up to 12 hours a day at work anyway (two jobs).
I should also note another reason I say past tense. Many people I learned from have died, retired, or switched focuses. The reason for the latter is that there's no real market for high-assurance security. Most won't spend the money or make the tradeoffs. So, security-enhancing popular platforms while assuming they'll get hacked (and responding correctly) seemed like a better use of time. I mean, I could probably count all the high-security vendors on my hands...
Thank you! And, yes, security can help people when your job allows you to do that. I was talking about high-assurance security specifically. That's products built with techniques like B3/A1 or EAL6/EAL7 required. Examples:
The market for this is tiny. NSA once estimated the labor force for developing them was under 200 people worldwide. I know maybe a dozen or so companies even selling such things. Given tough tradeoffs (e.g. fewer features), the buyer market is tiny even though development cost is higher.
Medium assurance is the best route commercially. That really just means a solid development process with selective use of assurance-improving techniques. People writing key parts of their code in SPARK Ada or Rust with careful design reviews is an example. Certain products, like HAProxy or OpenBSD, are coded so well they rarely have problems. Some web frameworks have security baked into them. Using such approaches is the most likely to succeed commercially. Others use architecture or hardware to great advantage, like Hydra Firewall or Secure64's SourceT.
The field's last idea for commercial high-assurance was "selfless acts of security" per Bell. He thought that, one component or system type at a time, companies with the money should sponsor experts to build exemplars. They get maintained over time at hopefully lower cost. They're designed to be integrated into low-assurance systems to boost them. CompCert, miTLS, and Muen SK are examples of this, built on hybrid funding.
I just rarely see this happen. Both companies and FOSS favor ease of development or velocity over assurance of artifacts. The market usually requires integration with insecure systems whose integration itself might be impossible to secure. So, I have no optimism about high assurance but selective improvements are doable.
A lot has happened in the last few years, such as the founding of seL4 foundation, the maturation of mixed criticality scheduling, the celebration of multiple seL4 summits, and the deployment in practical use by multiple companies and countries, in several scenarios.
Notably, Apple joined the seL4 foundation, as they use it in several of their products: https://sel4.systems/Foundation/Membership/ (Not sure if they've stated publicly which, but it's been pretty well known for a while now).
I’m a fan of microkernel hosts for guest monoliths, thus our servers are running seL4 as a safety layer and backup for FreeBSD VMs with jails for a renderfarm, BEAM clusters and Jenkins.
All I’m missing is an ARM port of DragonflyBSD for its threading and in-process kernels (hybrid kernel design). My dream is to run it on a 128-core Ampere Altra to run OpenMoonRay more effectively.
Yes, commercial production servers. Similar to some other systems and certain consoles, the microkernel is used as a hypervisor that protects the bare-metal hardware from potential hazards when a guest monolith like the mentioned FreeBSD becomes compromised. Guest components run our software and services inside jails on two 128-core, two 64-core and one 32-core Ampere processors, where the first two machines run heavy workloads like offline rendering and Erlang VM clusters, while the others are used to provide a few VPS servers and internal services.
I think at this point the argument for or against microkernels is becoming moot. The only way to provide fast, efficient and safe access to privileged services is via hardware mitigations. Software can only do so much.
It's like the difference between a 80286 and a 80386: The latter added the hardware support for true multitasking that the former lacked. Since then there's been an ever increasing number of hardware-level protection mechanisms added, like those which enabled hypervisors.
Apple in particular has been adding a bunch of stuff to their SOCs to protect the kernel, drivers and components at the chip level, as well as enforce privileges on running threads and when using pointers. [1] This doesn't mean the OS is impenetrable, but it's a lot more effective than a software only strategy of managing privileges.
Seems to me that utilizing this stuff (or similar), the architecture of the kernel really isn't that important any more.
Yes. There is still plenty of work to be done in the OS research space. There have to be software interfaces and APIs to all this new hardware.
I also think there is a lot to be learned from using micro/hybrid systems which are much more composable. E.g. Plan 9 is a great example of a hybrid system which makes use of a singular protocol, 9P, to serve all objects in the system to userspace. It is hybrid because some parts are in-kernel to avoid the overhead of system calls, such as IP and TLS. Another interesting design aspect is how the in-kernel drivers are minimal, mostly acting as a 9P interface to the hardware logic. This way you A. turn a machine object like a pointer or record into a seekable file, B. secure that file using standard unix permissions, and C. can easily distribute the components across machines via a network. Now you can securely push the driver logic into a userspace program. And 9P is network and architecture transparent, meaning you can work across multiple machines running on Arm, x86, mips, etc. All this comes out of the box.
When I go back to Linux/Unix or Windows from Plan 9 it's a sad and frustrating time. They're about as flexible as igneous rock. All the features are ad-hoc bolted on in mutually incompatible ways using a myriad of protocols that all do the same thing - serve files/objects. Ugh.
The utility of microkernels is orthogonal to hardware/software co-design.
From a practical engineering perspective, monolithic kernels were faster/easier/had more resources behind them and security is as robust as can be had with C (i.e. best effort and full of bugs). Lots of hardware has been introduced to try and mitigate that mess. But you theoretically wouldn't need a security co-processor with SeL4 because you have a very high degree of confidence that processes are isolated from each other and there are no root level exploits. So yes, hardware/software codesign is important!
That being said, the team behind SeL4 has had to spend a lot of engineering resources on eliminating side channels from the hardware. Hardware is also faulty as the real world doesn't care about your physics simulation.
The benefits of a microkernel here is that it is small enough for formal verification to be tractable. The proof itself is 10x the size of the kernel. Context switching on SeL4 is an order of magnitude faster than Linux, so the performance impact should be negligible. But if we magically could verify the millions of lines of a monolithic kernel, then it would still be faster to not do context switching. Indeed, the SeL4 team tried to move the scheduler to userspace but it cost too much in terms of performance. So it was kept in and added to the proof burden.
> It's like the difference between a 80286 and a 80386: The latter added the hardware support for true multitasking that the former lacked.
I'm not sure this is a good comparison. The 286 did support true multitasking in protected mode, and it was used in many non-DOS operating systems. What the 386 added (among other things) is the virtual 8086 mode which allowed it to multitask existing real mode DOS applications that accessed hardware directly.
That doesn't seem right. Even with strong hardware protections, how is Linux's TCB comparable to a microkernel? Unless it just recreates the exact same protection domains, there will be more vulnerability in Linux. Rather, the thing about hardware is primarily that it makes things more efficient. For example, microkernels nowadays are already quite robust because they make good use of certain hardware: the MMU. Then the small TCB of a microkernel gives the kernel credibility, so the kernel and hardware together make for a solid infrastructure. So really it's a matter of how much "cheating" we're allowed to do with hardware, but microkernels overall utilize the protection better. Or see exokernels.
I gave a presentation on SeL4 for my local OWASP chapter. I'll have to see if I can dig it up.
The project truly is a nice piece of kit, but I would hesitate to consider it as a replacement for Linux, especially for general purpose computing. Though that isn't to say microkernels are terrible for it in general. RedoxOS seems to be making some inroads lately and it uses a microkernel written in Rust.
The question is always "how big of a replacement are we talking?" Redox seems to be committed to interoperate well with POSIX, which naturally informs its design decisions. And there's a huge difference between having technical capabilities and succeeding. Although if Redox does succeed, it would already be a good step. seL4 is even more extreme in these qualities: its technical advantages are superb, but so far (and I think this will continue) it does not have what it takes, whatever that encompasses, to be the Next Big Thing. Ignoring political considerations, I think microkernels will be successful, and rightfully so.
>I would hesitate to consider it as a replacement for Linux
It would depend on the scenario. Of course, Linux is easier to work with but OTOH there are requirements which only seL4 can satisfy.
There's a lot required on top of seL4 for it to be actually useful. Fortunately, there has been a lot of Open Source work there as well. This is where seL4 is in a much better position than just a few years ago.
For static scenarios, there's LionsOS[0], quite usable already.
For dynamic scenarios, there's the Provably Secure, General-Purpose Operating System[1], which is still early stages.
Both can be found in the Projects page[2] at trustworthy systems, which is linked in the seL4 website.
The guarantees offered by the kernel cannot be subverted by unprivileged processes running on it.
Of course, the kernel is not very useful on its own, thus the design of drivers, filesystem servers and other services running on top of the kernel is still relevant.
Note that, unlike most other systems (including Linux) which are flawed at a fundamental level, seL4 actually enables the construction of a secure and reliable system.
No, the advantage is that the kernel/processes don't need to be trusted, since your kernel guarantees the isolation.
So you can have a Linux kernel running next to some high security process with the guarantee that they will be isolated (with the exception of allowed IPC).
I want the magical IOMMUs that are maturely secure like MMUs are now. For now, I think various efforts in verifying/restricting/generating drivers are far better, although they fall particularly flat for proprietary drivers.
>I want the magical IOMMUs that are maturely secure like MMUs are now.
There's nothing magical about IOMMUs. They weren't invented last week either.
Driver and hardware talk to each other using virtual memory instead of physical memory, preventing the scenario where a bug causes DMA to shit all over somebody else's memory.
What holds is that systems that run drivers in supervisor mode have not been able to leverage an iommu to its full extent.
My (admittedly limited) understanding is that IOMMUs still have practical roadblocks to being a solidly established part of the security of the computer. Of course they aren't bad in principle. Perhaps it's just that we aren't willing to eat the performance cost of making them more robust, but then performance is a tortured debate.
L4 was popular at my university (Karlsruhe). While I never really looked into it in any detail, it always appeared to me like a project that is primarily interested in testing some theoretical ideas, but not in building anything that would be practically useful.
That was 20 years ago. As far as I can tell, this has not changed. (Quick googling tells me there appear to be some efforts to build an OS on it, but they all look more like proof of concepts, not like something with real-world use.)
> OKL4 shipments exceeded 1.5 billion in early 2012,[4] mostly on Qualcomm wireless modem chips. Other deployments include automotive infotainment systems.[13]
> Apple A series processors beginning with the A7 contain a Secure Enclave coprocessor running an L4 operating system[14] called sepOS (Secure Enclave Processor OS) based on the L4-embedded kernel developed at NICTA in 2006.[15] As a result, L4 ships on all modern Apple devices including Macs with Apple silicon.
I feel like the two of you are talking past each other as the other poster is talking about time sharing systems like UNIX while you are pointing at embedded applications that are OS optional and if there is any semblance of an OS, it is really minimal. People deploying L4 there is not very surprising and does not apply to his remarks. Seeing this is like watching one person discuss an absence of apples and seeing another reply saying that there are oranges.
Ok, admittedly, that's more than I expected. And hey, if they found some use cases in the embedded world, fine.
Nevertheless, it's not exactly how it was pitched. This Microkernel thing always had a "Linux is doing it wrong, we're doing it better" story behind it. They wanted to be a replacement for a general-purpose OS. That certainly hasn't happened.
The only still living variant is AIX. Some of the code is in the XNU kernel, although that is a monolithic kernel rather than a microkernel. Apple converted it to a monolithic design for performance reasons.
That said, this article suggests that AIX is not using a microkernel either:
The thing you have to understand about “AIX” is that historically it was not a single codebase, it was just a common brand for multiple unrelated IBM Unix offerings. (DB2 is the same story.) One of those offerings was AIX/ESA for IBM mainframes - and that was indeed based on the OSF/1 microkernel. But AIX/ESA died in the early 1990s; it was replaced by turning MVS aka OS/390 into a certified Unix, a role carried on by z/OS today, and later on by z/Linux as well. Whereas the only AIX that survives to this day is the RS/6000 branch, which isn’t based on OSF/1 and has a classic monolithic kernel design. Its distant ancestor, AIX version 1 for the RT PC, did use a microkernel (VRM), but that microkernel was very different from Mach and OSF/1 (it was written in a PL/I dialect, PL.8, that’s how alien it was).
Your OSF/1 link makes a lot of references to Mach 2.5 which still required the Unix portions to be in kernel memory space so it was still a monolithic kernel. This was the same as NeXTSTEP which is the basis for the current Mac OS X. It looks like OSF/1 did move to Mach 3.0 which was a microkernel but I'm not sure if the Digital Unix / Tru64 moved to that or not.
As for AIX, I used it on an RT around 1990 and then POWER based RS/6000 machines in the 1990's. It was a monolithic kernel.
Your own link mentions the failed Mach based Workplace OS and all the different "personalities."
From your link:
Internal discussion at IBM focused on AIX. Finally, at Comdex in 1993, IBM Chairman Louis Gerstner announced that the microkernel would not replace AIX. IBM realized that many AIX users would not accept performance penalties associated with microkernel systems. IBM was also concerned with the microkernel presenting a competitive impediment against high performance HP or Sun Unix systems that ran directly on the hardware. Instead, Gerstner told AIX customers that they would be able to migrate to Workplace OS later, if they were interested.
> With the introduction of the ESA/390 architecture, AIX/370 was replaced by AIX/ESA[28] in 1991, which was based on OSF/1, and also ran on the System/390 platform.
I think another way to look at it is that "AIX" is a larger term for a family of operating systems. The one that I used and probably the most popular was the monolithic kernel based on System V Unix that ran on the RT and POWER architectures.
The AIX version that would run on Mach for workstations was never commercially released. The AIX version for mainframes that was based on Mach was AIX/ESA. You can see other versions like AIX/386.
The entire "unix" part was updated with OSF/1 source code, with patches here and there of Free and NetBSD codebase.
But especially when compiling code that was written in C for "modern BSD" (Net/Free/Open), you can easily hit the part where the userland etc. is OSF/1, branched off BSD 4.3/4.4, and simply does not include things people "assume" are present on BSD (like the more advanced linked list and tree macros).
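For readers who have not hit this: the "linked list and tree macros" presumably refers to the sys/queue.h (and sys/tree.h) families that modern Free/Net/OpenBSD code takes for granted but an OSF/1-era userland may not ship. A minimal TAILQ example of the kind that fails to compile there:

    #include <sys/queue.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int value;
        TAILQ_ENTRY(node) link;   /* linkage embedded in the element itself */
    };

    TAILQ_HEAD(node_list, node);  /* declares struct node_list */

    int main(void)
    {
        struct node_list list;
        struct node *n;

        TAILQ_INIT(&list);
        for (int i = 0; i < 3; i++) {
            n = malloc(sizeof *n);
            n->value = i;
            TAILQ_INSERT_TAIL(&list, n, link);
        }

        TAILQ_FOREACH(n, &list, link)
            printf("%d\n", n->value);

        while (!TAILQ_EMPTY(&list)) {
            n = TAILQ_FIRST(&list);
            TAILQ_REMOVE(&list, n, link);
            free(n);
        }
        return 0;
    }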
Jochen Liedtke became a professor in Karlsruhe in 1999; sadly, he passed away only shortly afterwards, in 2001. I don't know if his successor Bellosa still does research on L4. There was the L4Ka project, which appears to be completed. In Bellosa's bachelor lecture on OS, it's not part of the curriculum.
Rittinghaus, an alumnus of Bellosa's group, is involved with Unikraft [0], which was featured a couple of times on hn and uses unikernel technology.
After Liedtke's passing, L4 research was continued in groups at UNSW (Gernot Heiser, formal verification -> SeL4) and TU Dresden (Hermann Haertig, Fiasco/L4Re, focusing on real-time and secure systems).
Genode (already mentioned in another comment) [1] came out of the TU Dresden group with some nice ideas around managing compartmentalized systems. Kernkonzept [2] is a startup commercializing the L4Re microkernel.
I love the work (and the direction) the Karlsruhe team did on L4Ka, especially with Pistachio - the design was clean, simple, easy to grasp.
I did a Pistachio-based OS for my diploma thesis (not at Karlsruhe). I always thought that if I'd been studying there, I'd probably go into OS research.
I also had ideas for an operating system design, and the capabilities I had considered use the same features of interposition and delegation as seL4 does. There are other advantages than just what they list there; for example, you can apply a filter to audio, or you can use proxy capabilities for implementing network transparency.
The real-time stuff is something I had considered allowing as an optional implementation; my idea was for a specification rather than only a single implementation.
Another feature I wanted, though, is that all programs operate deterministically except for I/O. Without I/O, you can't determine the date/time or how long a program takes, can't check for processor features (if you use something that the hardware does not support, though, the operating system may emulate it), etc.
I had thought to use a mixture of hardware support and software support to implement it, though. (There is a note in the document about attacks with capability implemented in hardware, but I do not have the referenced document and I don't know if that attack applies to what I am thinking of doing.)
The point about security is that it seems to present the same failure as KVM does for the Linux kernel. If the hypervisor is in ring 0, you have the risk of VM escape from one guest to another, or to the host itself.
SeL4 is proof that microkernels are safe, efficient and scalable, yet we are stuck with big honking Linux kernels in 2025. That said, more and more drivers are moving to usermode anyway, so it's a wash in the end.
I will caution that IPC microbenchmarks should not be taken as confirmation that the "academic microkernel" is viable: OS services all in userspace and with fine granularity as is appropriate. Often microkernel-like designs like VMMs/hypervisors and exokernels make use of unikernels/library OSes on top, which reduces the burden of fast IPC somewhat. Or developers intentionally lump protection domains to reduce IPC burden. Of particular note: even seL4 did not evict its scheduler to userspace. Since the scheduler is the basis behind time management, it's quite a blow to performance if the scheduler eats up time constantly. My own thoughts there are that, with care, a userspace scheduler can efficiently communicate with the kernel with a shared memory scheme, but that is not ideal. But for desktop and mobile, a microkernel design would be delightful and the performance impact is negligible. We need far more investment on QoS there.
Edit: That being said, we should be building microkernel-based OSes, and if for some cases performance really is a restricting factor, they will be exceptions. The security, robustness, flexibility, etc. of microkernels is not to be understated.
They verified that the scheduler doesn't interfere with the integrity, confidentiality, and authenticity requirements of the kernel so it's a moot point.
Rather, although I believe the seL4 scheduler is sufficiently general, I want a userspace scheduler to minimize policy. The seL4 team recognizes that a kernelspace scheduler violates Liedtke's minimality principle for microkernels, since the only motivating reason is performance. If an efficient userspace scheduler implementation exists, the minimality principle dictates no kernelspace scheduler. Otherwise there's pointless performance overhead and possibly policy inflexibility.
I understand the minimality principle but one should not be afraid to violate layers of abstraction when justified. The whole point of the minimality principle is to improve modularity and remove the need to worry about the correctness of various OS components.
What would you be able to do in a userspace scheduler that couldn't be done safely in an in-kernel scheduler? Why couldn't it be configured or given a safe API to interact with userspace at runtime? But I guess your shared memory bit is an API to a userspace scheduler?
Well, one thing about adapting seL4 is that we shouldn't assume derivatives will be formally verified to the same degree. Ideally we would, but I don't expect this to happen in practice. And having a kernelspace scheduler that is minimal and requires a userspace scheduler to achieve full control of scheduling, where a userspace scheduler would do, is redundant overhead. If the kernelspace scheduler wasn't minimal, it would both be hard to verify and extend. This is all in line with seL4's design. So you're concerned about power where the question was always about performance. Hydra had userspace schedulers way back when, and it wasn't followed because it was prohibitive then. The minimality principle might be given various explanations, but I think the end result of what a system following it achieves speaks for itself.
> I understand the minimality principle but one should not be afraid to violate layers of abstraction when justified.
Rather, in a well-designed system, the justification necessarily is performance. Abstractions exist for the ease of the developer to write code that is correct. Performance (runtime, memory usage, etc.) is the concrete impact of code. For concerns like debugging or flexibility, those arise if the abstractions are lacking. Performance concerns arise because even good abstractions impose overhead. Note that conventional layered architectures are usually poor abstractions for implementation. Good for design but not coding.
> But I guess your shared memory bit is an API to a userspace scheduler?
Indeed. I would rather not, but shared memory probably is necessary to some degree for a userspace scheduler.
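To make the shared-memory idea concrete, here is a rough sketch of one possible kernel/userspace scheduler handoff. It is purely hypothetical (not seL4's API); the names and layout are invented for illustration:

    /* Hypothetical layout: the userspace policy ranks runnable threads into a
     * ring in a page shared with the kernel; the in-kernel dispatcher only pops
     * the next entry on a reschedule. Not seL4's actual interface. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define RQ_SLOTS 256

    struct shared_runqueue {
        _Atomic uint32_t head;                /* consumed by the kernel dispatcher    */
        _Atomic uint32_t tail;                /* produced by the userspace scheduler  */
        uint64_t thread_cap[RQ_SLOTS];        /* capabilities naming runnable threads */
    };

    /* Userspace side: publish the next thread to run. */
    static int rq_push(struct shared_runqueue *rq, uint64_t tcap) {
        uint32_t t = atomic_load_explicit(&rq->tail, memory_order_relaxed);
        if (t - atomic_load_explicit(&rq->head, memory_order_acquire) == RQ_SLOTS)
            return -1;                        /* full: fall back to a syscall/notification */
        rq->thread_cap[t % RQ_SLOTS] = tcap;
        atomic_store_explicit(&rq->tail, t + 1, memory_order_release);
        return 0;
    }

The kernel would still need a fallback policy for an empty or malformed ring, which is part of why a pure userspace scheduler is hard to get right.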
Drivers in userspace is not particularly microkernelly - most of the major monolithic kernels have supported this to some degree or another for years (it is easy in principle, just transmit I/O requests to userspace servers over some channel) while many historic microkernels (see e.g. Mach 3) did not do it. It hardly changes the architecture of the system at all.
It is moving the higher-level things into userland that is the harder problem, and the one that has been challenging for microkernels to do well.
People should stop bringing up Mach so much. It should never have been the poster child for microkernels. It's poisoned the discourse when there are plenty of alternative examples. Granted, Mach also did some good work in the space, but its shortcomings are emphasized as if they reflect the whole field.
More to the point, drivers in userspace is an important distinction between "pure" monolithic kernels and microkernels: the former is optimizing for performance and the latter is optimizing for robustness. It's not about ease of implementation for either. It's quite meaningful to shift on the axis nowadays: it represents a critical pragmatic decision (notice purity is irrelevant). You're right that "higher-level things" such as the networking stack or filesystem are also crucial to the discussion. I think here, too, ease of implementation is not relevant, though.
Isn't it common to run OSes in desktop/server environments inside hypervisors? That means the OS itself can be transparently virtualized or preempted, and access to physical hardware can be transparently passed to the virtualized OSes. This can be accomplished today with minimal impact on performance or user experience.
The fact that this can be done with OS code not explicitly designed for this signals to me that there are no roadblocks to having a high-performing general purpose microkernel running our computers.
This leads to big units of code, namely multiple OSes, whereas the ideal is being able to use as finely granular units as developers are able to stomach. For example, Xen can work around the issue of device drivers by hosting a minimal OS instance that has drivers, but it is better to be able to run drivers as individual processes. This reduces code duplication and performance overhead.
You're absolutely right, but I was making the point that there are demonstrably no significant downsides to running driver/kernel code in a less privileged mode, rather than Ring 0.
To the best of my knowledge, seL4 is not AVX-512 aware. The AVX-512 state is not saved or restored on a context switch, which is clearly going to impact efficiency.
At present there are 16x64 bits of register state saved (128 B), but if we were to have full support for the vectors, you potentially need to add 32x512 bits to the state (plus another 16 GP registers when APX arrives). The total state that needs moving from registers to memory/cache would jump to 2304 B - an 18x increase.
Given that memory is still the bottleneck, and cache is of limited size, a full context switch is going to have a big hit - and the main issue with microkernels is that a service you want to communicate with lives in another thread and address space. You have app->kernel->service->kernel->app for a round-trip. If both app and service use the AVX-512 registers then you're going to have to save/restore more than half a page of CPU state up to 4 times, compared with up to 2 on a monolithic kernel which just does app->kernel->app.
The amount of CPU state only seems to be growing, and microkernels pay twice the cost.
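For a rough sense of scale, here is the register-state accounting in C-struct form (sizes only; these are illustrative declarations, not seL4's actual context structures, and the APX figure is an assumption about future hardware):

    #include <stdint.h>

    /* Baseline integer state on x86-64. */
    struct gp_state     { uint64_t reg[16]; };     /* 16 x 64-bit GPRs  = 128 B  */
    /* Additional state with APX and full vector support. */
    struct apx_gp_state { uint64_t reg[16]; };     /* +16 GPRs with APX = 128 B  */
    struct avx512_state { uint8_t zmm[32][64]; };  /* 32 x 512-bit ZMMs = 2048 B */

    /* 128 + 128 + 2048 = 2304 B per thread, versus the 128 B baseline. */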
The cost of moving 4 KB is minuscule. Assume an anemic basic desktop with a 2 GHz clock and 2 instructions per clock. You would be able to issue 4 billion 8-byte stores per second, resulting in 32 GB/s, or 32 bytes per nanosecond. Memory bandwidth is going to be on the order of 40-60 GB/s for a basic desktop, so you will not run into memory bandwidth bottlenecks with the aforementioned instruction sequence. So, 4 KB of extra stores is a grand total of 128 additional nanoseconds.
In comparison, the average context switch time of Linux is already on the order of ~1,000 nanoseconds [1]. We can also see additional confirmation of the store overhead I computed, as that page measures ~3 us for a 64 KB memcpy (load + store), which would be ~187 nanoseconds for 4 KB (load + store), whereas the context saving operation is a 2 KB register -> store and a 2 KB load -> register.
So, your system call overhead only increases by roughly 10 percent. Assuming you have a reasonably designed system that does not kill itself with system call overhead, spending the majority of the time actually handling the service request, it constitutes a minuscule performance cost. For example, if you spend 90% of the time executing the service code, with only 10% in the actual overhead, then you only incur a 1% performance hit.
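A quick sanity check of those numbers (the figures are the assumed ballparks from this thread, not measurements):

    #include <stdio.h>

    int main(void) {
        double store_bw_Bps = 2e9 * 2 * 8;                 /* 2 GHz x 2 stores/cycle x 8 B = 32 GB/s */
        double extra_ns     = 4096 / (store_bw_Bps / 1e9); /* ~4 KB of extra state -> 128 ns         */
        double switch_ns    = 1000.0;                      /* ballpark Linux context switch          */
        printf("extra %.0f ns, ~%.0f%% more per switch\n",
               extra_ns, 100.0 * extra_ns / switch_ns);
        return 0;
    }

That prints roughly "extra 128 ns, ~13% more per switch", which shrinks to about 1% end-to-end if only a tenth of the total time is spent in switch overhead.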
I agree that a well-designed system for most use cases won't have performance issues, since we should not just be optimizing context switches but also things like kernel bypass mechanisms, mechanisms like io_uring, and various application-guided policies that will reduce context switches. Context switches are always a problem (the essential complexity of having granular protection), and moving an extra 4KB is not negligible depending on the workload, but we are not out of options. It will take more programmer effort, is all.
I appreciate your criticism; it is reasonable and relevant in the present and the future. I wrote a reply nearby in the thread[0]. I think it's mostly solvable, but I agree it's not trivial.
High-assurance security requires kernels to clear or overwrite all shared state. Shared state can become a covert storage channel if one partition can write it and another can read it; if so, it must be overwritten on a partition switch.
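As a minimal illustration of what that scrubbing looks like (the names are hypothetical; a real separation kernel also has to cover FPU/vector state, caches, and other microarchitectural residue):

    #include <string.h>

    struct percpu_scratch { unsigned char buf[512]; };  /* state reused across partitions */

    /* Called on every partition switch so the outgoing partition cannot leave
     * data behind for the incoming one to read as a covert storage channel. */
    static void scrub_on_partition_switch(struct percpu_scratch *s) {
        memset(s, 0, sizeof *s);
    }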
I think NUMA management is high level enough that in a microkernel it would be comfortably managed in userspace, unlike things relevant to performance-critical context switches. And seL4 is currently intended only for individual cores anyway.
seL4 having a proof of correctness does not mean all microkernels do. In fact, seL4 is the only microkernel that has a proof of correctness. If you build on top of it in the microkernel way, you quickly find that it is not performant. That is why NT and XNU both abandoned their microkernel origins in favor of becoming monolithic kernels.
I’ve seen this argument play out many times. I believe the next line is: “QNX proved that micro kernels can be fast given clever message passing syscall design.”
“I remember running the QNX demo disc: an entire graphical operating system on a single 1.44MB floppy! Whatever happened to them?”
“They got bought by blackberry, which ended as you’d expect. QNX had a lot of success in automotive though.”
“Nowadays Linux and Android are dominant in new cars, though, proving once and for all that worse is better.”
Also an illustration how an open-source solution, even if technically inferior, would displace a closed-source solution, even if technically superior, unless there is a huge moat. And huge moats usually exist only in relatively narrow niches.
It turns out that success is composed of 90% luck, 10% marketing, and 5% talent/technical advantage. A rhetorical question: how do you entice people to turn a movement into a revolution when it isn't likely the movement will succeed?
Another rhetorical question: out of luck/marketing/technical advantage, which one is contributing the most to the extra 5% out of 105% of all the components success can be attributed to?
NT and XNU never had microkernel origins - Cutler explicitly refuted this at a conference in the early 1990s (if anyone remembers which, kindly share) and NeXTSTEP forked Mach at version 2.5, which was not a microkernel (see https://cseweb.ucsd.edu/classes/wi11/cse221/papers/accetta86...). XNU retained this monolithic architecture (as did Tru64 and other systems that trace their heritage to Mach prior to its version 3).
Various sources state that they rebased to OSF Mach Kernel 7.3, which was based on Mach 3 and parts of Mach 4. The OSF MK ancestry of macOS XNU can still be seen in paths:
This is quite a ship-of-Theseus problem. Steve Jobs hired Avi Tevanian to work at NeXT on the Mach kernel, and later brought Tevanian to Apple; he was still working on the Mach/Darwin kernel as late as the first iPhone release. There is a reasonable argument to be made that the OSF kernel is actually the derivative, by default.
As far as I can tell (this isn't gospel, just something I've inferred as the only reasonable explanation from comparing the codebases), Apple reconstructed a Mach 2.5 style kernel, like NeXTSTEP had, from fresh codebases (OSF Mach and 4.4BSD-Lite2), perhaps because 4.3BSD was still encumbered at the time.
The Darwin 0.1 and 0.3 releases contain the old kernel, derived directly from NeXTSTEP, and that's the direct derivative of Mach 2.5. The later XNU appears to be a reconstruction of that kernel with unencumbered code and that's also when IOKit replaced DriverKit.
In more recent interviews Cutler has been firm that the NT kernel was designed pragmatically, a view that Tevanian also evidently later adopted, as is evident from the great wheel of incarnation that Darwin went through on the road to XNU, although I'm not sure Tevanian ever stated his perspective on this matter publicly. Neither of these systems was ever a true monolithic kernel in the Linux or Unix sense (at all times they both had some measure of sandboxing between e.g. driver code and the scheduler), rather sitting somewhere in between.
True microkernels are, alas, more of an ideology than a practical reality, as the long-suffering GNU/HURD team discovered; Tanenbaum has been clear that the MINIX/NetBSD experiment was more about principles than performance. That said, certainly many hypervisors have attained success with configurations that coincidentally happen to be the same footprint as a microkernel.
From talking with some people who were witnesses to the earliest days of Hurd (but not involved in the project), the real problems of Hurd were social, not technical - if anything, OSF/1 and NeXTSTEP succeeding from a similar code base (for some value of "similar") suggests that it was possible to do it.
The story goes that the early days were dominated by some ridiculously bad project management that makes the "Cathedral" of the Cathedral-and-the-Bazaar era look like a nice, well-maintained project.
Most importantly, GNU Hurd lacked the "ecological niche" to drive work towards it because it couldn't deliver enough, just like the lack of information about 386BSD led to Linux (Linus explicitly said that if he had known about 386BSD, he would have worked on it and maybe forked it).
The hypervisor/microkernel boundary has blurred, and commercial success for microkernels is more or less as hypervisors, e.g. OKL4 is widely used in mobile devices[0]. And Tanenbaum's Minix is used for the Intel Management Engine. I don't know the details with HURD, but I don't think it is mainly suffering from "inherent" or "nearly-inherent" (strongly associated) problems of microkernels. The main cost of microkernels is developer effort and likely performance overhead.
> NT and XNU never had microkernel origins - Cutler explicitly refuted this at a conference in the early 1990s
Interestingly, the Tanenbaum–Torvalds debate had Tanenbaum claim otherwise at the very start:
> The alternative is a microkernel-based system, in which most of the OS runs as separate processes, mostly outside the kernel. They communicate by message passing. The kernel's job is to handle the message passing, interrupt handling, low-level process management, and possibly the I/O. Examples of this design are the RC4000, Amoeba, Chorus, Mach, and the not-yet-released Windows/NT.
NT incorporates a client/server model, like a microkernel, but isn't a "pure" microkernel like Mach. This gives rise to the 'hybrid kernel' term, which Linus dislikes.
Wikipedia has a decent depiction of the hybrid structure, but quoting from the NT Design Workbook:
> This specification describes the kernel layer of the NT OS/2 operating system. The kernel is responsible for thread dispatching, multiprocessor synchronization, hardware exception handling, and the implementation of low-level machine dependent functions.
> The kernel is used by the executive layer of the system to synchronize its activities and to implement the higher levels of abstraction that are exported in user-level API's.
NT in Windows 11 has very little to do with the monolithic kernel story that keeps being repeated.
Not only did the graphics stack move back into userspace, there is now a complete userspace driver stack, and VBS (Virtualization-based Security) for several kernel components that run on their own mini Hyper-V island, talking via IPC with the rest of the kernel.
Likewise in XNU land, Apple started a crusade a few years ago to move all kexts into userspace, with no more drivers in the kernel beyond the ones Apple considers critical.
In both cases, they never were a pure monolithic kernel, due to the way walls were introduced with kernel level IPC to talk across modules, instead of straight function calls.
> seL4 is the only microkernel that has a proof of correctness
ProvenCore (https://provenrun.com/provencore/) has a proof that covers correctness (memory safety, and more generally absence of UB, termination, etc.), functional properties (e.g. the fork() system call is a refinement of an abstract "clone" operation of abstract machines), and security properties (memory isolation).
I heard a joke somewhere that sel4 is even more successful than Linux because it is running below ring 0 in every Intel chip shipped in the past N decades, plus probably many others.
Intel used a version of Minix rather than seL4 for its Intel Management Engine. [1] There was some controversy about this because they didn't give Andrew Tanenbaum proper credit. [2]
Such a dumb technical choice driven by stupid managerial considerations. They do this ring -1 shit because hardware companies view this as a cheap way to add value. But they don't open source it or contribute back because they view it as secret sauce. Minix as a result didn't get the investments that GPL software receives. Now the project is in hard legacy mode.
Not at all? If anything they’re filling the opposite role. Microkernels are about building interfaces which sandbox parts of the kernel. Namespaces are about giving sandboxed userlands full access to kernel interfaces.
Additionally, a Linux kernel that exists for the sole purpose of keeping KVM running, while everything that powers a cloud workload runs in Kubernetes pods, is nothing more than a very fat microkernel in terms of usefulness.
The dose makes the poison; we're still a long way from fully embracing microkernels and capabilities. Security is a holistic property and encompasses finer details too. I want a small TCB. I want capabilities pervasively. And in pursuit of modularity and abstraction, I want to be able to choose the components I want and take those burdens myself. It's a bit silly seeing the nth SIGOPS-SOSP paper on how Linux can be improved by integrating userspace scheduling.
It is the same with safer systems programming languages: we have had the concepts since 1961, but apparently getting the industry to take the right decisions is a tenuous path until something finally makes good ideas stick and gain adoption.
Microkernel does not mean it uses capabilities. And "very fat microkernel" is an oxymoron. The definition of a microkernel is that they do as little as possible in the kernel.
It’s not “some of the concepts” nor minutae, though, it’s literally the difference between a microkernel and a monolithic kernel.
Presenting capabilities / namespaces to userland is a completely different and in the case of Linux, orthogonal thing to presenting capabilities/namespaces to kernel services. I guess you could argue that the concept of capabilities came from microkernels, but when it’s applied to only user space, it’s just not really related to a microkernel anymore at all.
That’s basically the whole problem with capabilities and especially their application in namespaces from a security standpoint in Linux: they try to firewall these little boxes from each other but the kernel they’re all talking to is still one big blob. And this difference is meaningful in a security sense, not just some theory hand waving. https://www.crowdstrike.com/en-us/blog/cve-2022-0185-kuberne... is just one good example, but entire classes of mitigations are rendered meaningless by the ability to unshare into a box that lets an attacker touch exploitable kernel surface area which is not further isolated.
A microkernel, by the book, is just a bunch of processes doing stuff that would otherwise be in the kernel, if we are then discussing minutiae.
Nothing else: the critical set of OS services is no longer hosted in a single process space, but instead requires a distributed system of processes running on a single computer node.
The moment these processes are able to be hosted across a set of computing nodes, we enter the realm of distributed OSes, which is another discussion.
The number of services that remain in the microkernel, versus what is hosted in each OS process running outside the kernel, depends on each microkernel; there is no rule for this division of labour, and each one that was ever designed has chosen a different approach.
If you cannot see the parallel between this and a swarm of containers running on a node, doing the actual workload and only requiring the kernel services due to the way KVM is implemented, and rather focus on a security mechanism that is a detail of how Linux works, well, it isn't on me to further break it down.
OK, I get what you are saying now, and I agree if we are talking about KVM.
That's not how Kubernetes works by default, so I thought we were talking about container runtimes. I still disagree strongly from both a practical security and theory standpoint if we are talking about container runtimes implemented using namespaces and cgroups.
I don't think containers, namespaces, and the like failing to provide the same benefits as a true microkernel negates the OP's point. They are ways of segmenting userspace in a more finely grained manner, and they do make attacks harder.
Linux security being a shit show and undermining these efforts is kinda beside the point: they are still attempts to provide a runtime closer to what microkernels would naturally provide, in a backwards-compatible way. Indeed, these containers could be turned into fully fledged VMs if there were the resources to make it happen.
I don't really get this argument: "you're saying that one thing, namespaces, isn't implemented in any way resembling a microkernel, but what if we replaced it with another completely different thing, a hypervisor? Then it would be similar!" Yes? Sure?
To me the word "microkernel" expresses how the kernel is structured, not what userspace interface it presents. A microkernel is built by separating kernel services into discrete processes which communicate using a defined IPC mechanism. Ideally, a microkernel offers memory boundary guarantees for each service, either by using hardware memory protection/MMU and running each service as a true "process" with its own address space, or by proving the memory safety of each kernel service using some form of ahead-of-time guarantee.
Of course, doing this lends itself to also segmenting user-space processes by offering a unique set of kernel service processes for each user-space segment (jail, namespace, etc.), but there's no reason this needs to be the case, and it's by and large orthogonal.
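As a sketch of the "kernel service as a process behind an IPC endpoint" structure, here is roughly what a server loop looks like with the classic (non-MCS) seL4 API; exact signatures vary across API versions, and handle_request() is a placeholder for the service's own logic:

    #include <sel4/sel4.h>

    /* Hypothetical request handler supplied by the service. */
    extern seL4_Word handle_request(seL4_Word badge, seL4_Word op, seL4_Word arg);

    void serve(seL4_CPtr endpoint) {
        seL4_Word badge;
        seL4_MessageInfo_t msg = seL4_Recv(endpoint, &badge);
        for (;;) {
            seL4_Word op  = seL4_MessageInfo_get_label(msg);  /* which operation */
            seL4_Word arg = seL4_GetMR(0);                    /* first argument  */
            seL4_SetMR(0, handle_request(badge, op, arg));
            /* reply to the caller and wait for the next message in one syscall */
            msg = seL4_ReplyRecv(endpoint, seL4_MessageInfo_new(0, 0, 0, 1), &badge);
        }
    }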
I do agree with what I eventually understand the grandparent poster was trying to express, which is that running a bunch of KVMs looks like a microkernel. Because then, you've moved the kernel services into a protected boundary and made them communicate across a common interface (hypercalls).
But that's not how Kubernetes works by default and in the case of containers and namespaces, I think this is entirely false and a dangerous thing to believe from a security standpoint.
> They are ways of segmenting userspace in a more finely grained manner and they do make attacks harder.
From a _kernel_ security standpoint (because we are talking about micro_kernels_ here), I actually think namespaces make attacks much easier and the surface area much greater. Which is basically the entire point I was trying to make: rather than exposing fragile kernel interfaces to exclusively system services with CAP_SYS_ADMIN, you now have provided an ability (unshare) for less-trusted runtimes to touch parts of the host kernel (firewall, filesystem drivers, etc.) which they would normally not have access to, and you have to go back and use fiddly rules engines (seccomp, apparmor, selinux) to fix the problem you created with namespaces.
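A minimal demonstration of that "un-boundary" (standard Linux APIs; whether it succeeds depends on distro sysctls and seccomp policy):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* An unprivileged process asks for fresh user + network namespaces. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0) {
            perror("unshare");   /* blocked by policy, or kernel built without userns */
            return 1;
        }
        /* Inside the new namespaces the process holds CAP_NET_ADMIN and friends,
         * so formerly root-only kernel code paths (netfilter, routing, some
         * filesystem mounts) become reachable attack surface. */
        puts("now 'root' inside the new user+net namespaces");
        return 0;
    }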
To be clear, I think from a big picture standpoint, it's a tradeoff, and I'm nowhere near as anti-container/anti-namespace as it may seem. I just get annoyed when I see people express namespaces as a kernel security boundary when they are basically the exact opposite: they are a kernel security un-boundary, and Linux's monolithic nature makes this a problem.
I suppose I should ask the other side too, though I am biased to favor microkernels, and am better read on them, but how so?
"Safe enough" measured by the standards of this upside-down industry...I'll let everyone decide that for themselves.
"More efficient": while monolithic kernels have a higher ceiling, currently there's plenty of OS research papers and industry work demonstrating that more integration with userspace brings performance benefits, such as in the scheduler, storage, or memory management. Microkernels encourage/enforce userspace integration.
"More scalable": I think this has less to do with kernelspace scale and more with how nodes are connected. See Barrelfish[0] for eschewing shared memory in favor of message passing entirely, running separate kernels on each core. Meanwhile Linux has gradually discovered a "big lock" approach is not scalable and reduced coarse-grained locks, added RCU, etc.. So I think we will go moreso towards Barrelfish. But on a single core, for what goes into the kernel, that's covered by everything but scalability.
SeL4 is old news - not a criticism, but has anyone added another formally proven layer or component? (Edit: I mean new components beyond the microkernel, not improvements to the microkernel.)
Also, I suspect some people - maybe some on HN :) - get emotional overload when they see the word 'proof' and their intellectual functions stop. It's not a panacea for the infinite problem of secure IT; it isn't a way to create or discover a perfect and flawless diamond of software. IIUC it means it's proven to meet specific requirements in specific conditions, and those requirements and conditions can be quite narrow; and it says nothing about other functions and conditions that are out of spec. Is that roughly correct?
What does it mean in practical terms? What does a security professional see when they see 'formally proven software'?
What are the specs that SeL4 meet (no, I haven't looked at the OP in a long time)? Isn't that the essential information here?
Being "formally proven" to be free of various flaws did not make seL4 immune to memory corruption flaws. Despite the formal proof, a memory corruption flaw was found a few years ago. The PRs for the commits fixing it and amending seL4's proof are public:
https://github.com/seL4/seL4/pull/243
https://github.com/seL4/l4v/pull/453
You will find a number of other memory related bugs in its issue tracker:
https://github.com/seL4/seL4/issues?q=is%3Aissue%20label%3Ab...
Interestingly, the PR fixing "register clobbering" in memory was not labelled bug, so it is not listed when you filter by "bug".
I used to think seL4 was immune to these issues given its proof allegedly saying it is, but upon seeing this, I have come to think that the proof is not as comprehensive as the wider community has been lead to believe. That said, seL4 is still a very impressive piece of software.
Finally, to answer your question. The specification that seL4 meets is published on github:
https://github.com/seL4/l4v
I do not see how that is a kernel memory corruption. I do not understand the code in context, but it appears to just be a case where the kernel incorrectly clobbered a userspace register which could result in incorrect userspace code execution. A error in that class would not inherently violate the system safety properties of the kernel itself.
If so, that would only fall under a functional specification problem in the userspace API making it hard to use the userspace API in the intended/desired way. That would make it hard to use the kernel and verify your system using the kernel, but it would not constitute a kernel memory corruption which is what safety specifications and proofs of the kernel itself would be concerned with.
I said that a memory corruption flaw was found in seL4. This is a fact. If you read the patch, you will see that a message that was just created had been being overwritten (which is explained in a code comment). This occurs inside the seL4 kernel code, and the proof previously permitted this behavior (according to one of the developers’ comments on the PR). The proof had to be amended to disallow this (which is the second PR). You should see this in the links.
You can not just assert it is a memory corruption without explaining how it is a memory corruption.
It appears to be a case where a wrong, but legal output is being stored. That does not, inherently, constitute a memory corruption.
For example, suppose I have a sorting function. I pass it a vector of integers by reference and I intend it modify it into a sorted vector of integers. Instead, that function zeros every element of my vector. That is a sorted vector, so it technically meets the specification. That function “corrupted” the contents of my vector. That is not a “memory corruption” bug. That is just a plain bug where I output the wrong thing into memory.
A memory corruption bug needs to violate the explicit or implicit invariants of the memory being “corrupted”. Out-of-bounds write, use-after-free, bypassing class interface, unguarded shared parallel access, etc. Writing incoherent, but legal values through provided pointers to uniquely owned memory is wrong, but not “memory corruption”.
Furthermore, this is not a definition game where we are disagreeing on the definition of “memory corruption”. They do not claim blanket safety from “memory corruption” which may be a class up to interpretation. They claim safety from [1]: buffer overflows, null pointer dereferences, pointer type confusion, memory leaks, and other specifically named and defined forms of memory corruption errors. Please identify which specific enumerated memory safety property was violated.
Again, I am not asserting it is not a memory corruption bug of some class. But it is very unclear from the patch and comments that the stated bug was actually a memory corruption without significantly more contextual detail. I literally do kernel development professionally and even I can not tell from the patch out-of-context; there is no way almost anybody else would be able to tell without more detail.
[1] https://sel4.systems/Info/FAQ/proof.html
See the FAQ entry on seL4 having zero bugs:
https://docs.sel4.systems/projects/sel4/frequently-asked-que...
Most who read that would expect the bug fixed in the linked GitHub PR to be impossible. The loophole was that the specification had a flaw, which was only mentioned as a vague hint about “unexpected features”.
You keep dancing around the question of it being a memory corruption despite confidently asserting it as a clear, indisputable fact. Please support your claim by identifying the specific memory safety property that was violated and how it was violated.
I provided citations. Use them and do your own research. I think you will find the answers in what the seL4 developers published, but I have no interest in spending days or even weeks pointing to things until you are satisfied.
You made a strong claim which you have failed to support adequately and yet have continued to double down on.
I did my research and presented my own interpretation, which fundamentally disagrees with your interpretation. You have refused to address it, instead continuing to assert your interpretation without further detail. That does not prove my interpretation is correct, but it does indicate that the citations and evidence you have provided do not clearly lead to your conclusion, and you neither identify the errors in my interpretation nor directly demonstrate the correctness of yours, which could be done with a simple counter-example showing what memory is corrupted.
You claim it is indisputable fact, yet you claim it would take "days or even weeks" to present a clear example of the corruption. I do not see how you can claim the bug is clearly and indisputably a memory corruption bug that anybody can see from the citations you provided when it would take even you so long to provide a clear example.
I will say this. The seL4 developers made the strong claim. Evidence that it is not as strong as people who heard about it thought is not a strong claim. It is just being realistic. The seL4 developers themselves say that the properties they list are things that come out of the proof and your approach of looking for properties that they promise is a backward way of going about things, since they did not claim to exhaustively list all properties.
If you want to understand things to your satisfaction, read the proof and figure it out yourself. I have already understood things to my satisfaction, and the double digit number of upvotes that I have received suggests to me that my explanation was to others’ satisfaction as well. You likely have many days or weeks of work ahead of you to answer a passing curiosity of yours whose precise answer is not going to change anything for anyone. Have fun.
that's not how that works.
writing into memory that doesn't belong to you is memory corruption. in this case: kernel (induced) memory corruption.
and it's really a bad thing:
functional safety technically boils down to the job of the kernel to provide "spatial and temporal freedom from interference", meaning all subsystems are isolated from each other. no "process" can write to another process, no process can starve another process. in this perspective, the kernel is another (special and privileged) process on the system.
a kernel that randomly writes to user space processes is inherently unsuitable for functional safety.
it's a huge no no. a failure in the goals and purpose of the kernel.
and finding this here specifically was heavily shaking up confidence in formal specifications and formal proofs of compliance to these specs.
Again, the nature of the bug has not been appropriately established. It appears to me to be a bug where an output register that the kernel is expected to write is written with an incorrect, but legal, value in certain cases. If my interpretation is correct, which it might not be, that is just a wrong output, not a "memory corruption". The claim of "memory corruption" needs a clear, detailed explanation of what memory safety property is violated and how it violates that.
Second, no. The spatial separation guarantees you want from a kernel are that a domain can not induce spatial interference in a different domain that is isolated from the former. Internal memory corruption of a user domain, even uncontrolled asynchronous corruption by the kernel, makes it not very useful, but does not violate the protection property. This is a case where it is important to distinguish between "safety" and "usability". It is "safe", but not very "usable".
If your memory corruption only occurs during specified system calls, then that problem is further reduced making it easier to "use" the system. If the memory corruption is highly constrained, say it only corrupts in-bounds of passed in pointers, then it is easier to "use" once again. If there is no memory corruption, but it instead sometimes outputs invalid garbage, that is easier to "use" once again. If it outputs legal, but incorrect values, that is easier to "use" once again. None of these affect "system safety", but all are important bugs to fix to enable to creation of functional, usable systems.
fascinating. maybe terminology issues?
where I come from, the kernel must not, ever, write to unwanted locations, even if in theory it could.
wrong data as you point out is a completely different point of discussion, that could be a bug in the function which may or may not be safety critical, but hey, depending on the function you call, a wrong value could also come from some other process and the kernel is just the messenger of false data, not its origin. no, "wrong value" is not what I mean.
and a bug in the user process, writing data to the wrong place, also is not what I'm talking about.
what I consider a violation of the functional safety requirement of spatial freedom from interference on the kernel is an unexpected "wrong location": the kernel "surprise surprise" writing to some address or some register (!) the kernel was not supposed to modify --- these _both_ are a violation of spatial freedom from interference, in the functional safety sense, to my limited understanding. "your dance area, my dance area" type of violations.
this reading comes from discussions about how if at all it might be possible to achieve this with a monolithic FOSS Linux kernel. There the page table can be set up to give quite some protections, even protecting user space from the kernel, but then who ensures the page tables aren't corrupted "somehow" by some rando kernel module? And similar discussions exist to reliably save and restore user space registers on context switches to the kernel...
but that's not what I wanted to get into, my point rather is: user space must be safe not only from other processes but also from the kernel. and that's something a microkernel can prove much easier than a monolith.
It is a nuanced point. Certainly, you want to have a API that conforms to specifications and does not ruin your desired program invariants. But what if it does? What if you make a specification error? How do you modularize your specification so errors in one part of the specification do not invalidate the rest?
For instance, suppose I have a kernel with spatial safety. I then add the system call, “Corrupt random memory in my own process at a random time”. Is that kernel still safe? All that function can do is destroy the invariants of the calling program. It can not turn an innocent, isolated program malicious no matter how malicious another process is or became. What if nobody calls that function?
It would certainly be hard to use such a function for any purpose. It may allow a malicious input crafted by a communicating malicious process to cause a problem in an innocent, but incorrect, program. But a correct, isolated program, using APIs correctly, would be unaffected no matter what any program does, even with access to that new “memory corruption on demand” API.
So, in this case, it would be fair and reasonable to classify the kernel as safe despite having an absurd system call in its specification. And, as basically any other API with a memory corruption is going to be less absurd, it only makes sense to classify those as “safe”, though very likely “unusable” or “hard to use”.
The distinction then becomes being able to cause memory corruption in isolated, correct programs. In that case, a program is no longer safe even if it proves itself correct. You must prove the entire system is correct and all other programs correct to establish any guarantees about any individual program. You can no longer “isolate” or “separate” your analysis which makes the problem even more intractable. That is the first problem you want to solve which is what this definition is intended to achieve.
a nuanced point indeed.
and we agree a kernel executing write_rando_stuff_to_rando_places out of the blue would be unsafe?
do we also agree registers to be such random places?
then we're good
The PRs make it clear that this was a bug in the spec, and the implementation behaved according to spec.
That resulted in a memory corruption flaw in the C code. If you build a bridge to a specification and the bridge collapses, does saying that the specification was wrong in any way lessen the reality that the bridge failed?
What's odd about this argument is that such bugs would be at risk of reintroduction in any other piece of software. Indeed, each release is expected to introduce new bugs and vendors typically sell service contracts based on the perpetual brokenness of existing software engineering processes. Whereas formally verified software typically has single digit numbers of bugs ... ever.
Sure, you get to score rhetorical points about the proof being incorrect once. But this bridge is now fixed and won't fail in the same way again.
Cartels and governments regularly hack phones and spy on and even kill activists. The current processes you are defending get people killed and are undermining civil society. SeL4 is a big deal and definitely worth the investments needed to make such reliability the norm instead of accepting the continual bridge collapses that occur in Linux land.
> Cartels and governments regularly hack phones and spy on and even kill activists.
Your comment implies SeL4 will prevent these things from happening. Isn't that going way too far?
It's a nice start, but we're nowhere near securing things on that level. Without looking into it, I'd put more faith in iPhone's lockdown mode.
We have a long way to crawl, but process isolation is the bedrock of all other security guarantees.
Tell me how process isolation prevents SQL injection. I could ask about any of a great variety of security issues, but I think SQL injection is a great choice for debunking “process isolation is the bedrock of all other security guarantees”. “We do not have to worry because we have process isolation” is basically this:
https://web.archive.org/web/20240123122515if_/https://www.sy...
I suspect you did not mean to say that, but your comments can be interpreted to mean that, especially when you suggest people will die if seL4 is not used, as if nothing else matters, and inject that into a discussion of whether seL4 is free of flaws (it is not).
Using seL4 is like flattening the ground on which a building foundation will sit. Just because the ground was flattened to within microns of flatness does not mean there is nothing else wrong with the foundation and building that are built on it.
Perhaps I am being pedantic, but you deserve this after suggesting the mere act of giving an opinion on seL4 made me an accessory to murder.
"Tell me how process isolation prevents SQL injection."
You need process isolation to make the system secure enough to even require SQL injection. Otherwise, they can send packets to attack anything else on the system, get full privileges, and swap your SQL-enabled application with their own. While taking your database.
Further, techniques to block SQL injection work even better when they're in an isolated partition the attacker can't escape from. Even better if data can only flow in a specific way. Even better if it can be reliably cut off and recovered upon detection of a failure. That's why high-assurance guards used to sit between sensitive systems and untrusted networks. They did all of these things.
Now, for application-level issues. The Separation Kernel Protection Profile, which inspired designs like INTEGRITY-178B and seL4, called for secure middleware to protect communications between partitions or systems. That included message routers that enforced the security policy within the system. That would include protocol-level defenses, like for SQL requests, to protect communications outside the box.
seL4's isolation was, by itself, never intended to stop unrelated attacks. The vision, described in NICTA's Trustworthy Embedded Systems program, was to combine many components to do that. I'll add that the seL4 team did a great job documenting the many points at which they may fail in their Assumptions page. They were much more open about it than most competing projects.
The systems we're talking about don't use SQL to communicate. They use data structures in a wire format of some sort because they are embedded systems. Don't pretend like we're talking about web development here, because we're not. SeL4 is designed for hard real time embedded systems and uses formal methods, which have and continue to be a mainstay approach in places which value security and reliability. And they really mean _reliability_ as in reliably wrong or reliably right.
So being able to adjust the spec and fix the system is actually a feature because it means your prover acts as a high-quality test.
As always, with systems like this if you use a formally-proven operating system kernel you will probably assess which other code paths in your application code are on the critical path and use formal methods on those components. And you will also be doing other less formal work like FMEA to complement. So you're not reliant on any one single method of assessing whether the system works, whether it meets the specification, and whether it can be relied upon to do the right thing in a bad situation.
https://xkcd.com/1200/
While seL4 is good at what it does, what seL4 does is extremely minimal and it is increasingly obvious that a vocal minority overrates it. Using seL4 is not going to stop vulnerabilities in userspace or even in drivers that were previously in kernel space, or in VMs running on top of it, but people who claim systems are insecure without seL4 insinuate that it will. They are like the Apple Store employee who once suggested to me that if I run Windows on Apple hardware, it will have none of the problems Windows normally has.
"process isolation is the bedrock of all other security guarantees" means process isolation is a necessary (in practice) condition, not a sufficient condition.
"Tell me how process isolation prevents SQL injection." is a non-sequitur because that is only coherent if they claimed process isolation is a sufficient condition.
To disagree with their position, you need to demonstrate a system where there is no process isolation and where maliciously crafted inputs intended to cause SQL injection can not result in undesired database access. So, one process gets to arbitrarily modify the contents of the SQL server process including changing the code and you need to prevent it from wrecking your database.
As I previously said, I could have cited just about any common vulnerability and process isolation would have done nothing about it. For example:
https://owasp.org/www-project-top-ten/
I only cited SQL injection because I could not think of any claim more absurd than the claim that process isolation is the bedrock of guarantees against SQL injection, but if process isolation were the bedrock of all other security guarantees, then it would have some relationship to protection against SQL injection.
You do not appear to understand the difference between "necessary" and "sufficient".
Process isolation is not sufficient (i.e. adequate on its own) to make a SQL server secure against maliciously crafted inputs intended to cause SQL injection for the purpose of undesired database access because it does not sprinkle magical security pixie dust on it.
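For contrast, the thing that actually stops SQL injection lives entirely in the application's query layer, independent of how well the process is isolated. A sketch using SQLite's C API, purely as a familiar example:

    #include <sqlite3.h>

    int lookup_user(sqlite3 *db, const char *untrusted_name) {
        sqlite3_stmt *stmt;
        if (sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?;",
                               -1, &stmt, NULL) != SQLITE_OK)
            return -1;
        /* the attacker-controlled input is bound as data, never spliced into SQL text */
        sqlite3_bind_text(stmt, 1, untrusted_name, -1, SQLITE_TRANSIENT);
        int id = (sqlite3_step(stmt) == SQLITE_ROW) ? sqlite3_column_int(stmt, 0) : -1;
        sqlite3_finalize(stmt);
        return id;
    }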
Process isolation is necessary (in practice) to make a SQL server secure against maliciously crafted inputs intended to cause SQL injection for the purpose of undesired database access because otherwise an attacker can just modify the code of the SQL server to directly do the undesired database accesses. Without process isolation, your SQL server security/design can be bypassed.
The amazing thick walls of your castle do not matter if you built it on sand. The amazing solid bedrock is not sufficient to claim your castle is defensible if your castle is a wooden shack, but it is necessary for building a defensible castle.
A lack of process isolation does not allow your SQL server security to be bypassed when the adversary is not on the machine in the first place. The adversary would usually fill out a field on a form, which is where SQL injection attacks usually occur:
https://xkcd.com/327/
OSv for example does not have process isolation, yet applications on it are not considered insecure.
I read "bedrock" as 'necessary to all other security'. But arguing about word choice isn't useful; the question is what indolering actually meant.
What? You'd put more faith in a kernel that has no formal verification of its conformance to a secure spec and has orders of magnitude more code in it that can go wrong? What is the basis for this confidence?
Does it even matter when the attack surface is in userspace? Formally verifying seL4 against a potentially buggy specification does not provide any assurances about the user space code that must run on top to produce a working system.
If you subscribe to the idea that the mere use of seL4 makes things more secure, you might use seL4 as a hypervisor beneath Linux on machines that would previously have run Linux on bare metal and then you would be using seL4, but you would not have made anything more secure. The increase in the TCB would suggest such a setup is less secure.
The security of a machine using seL4 is dependent on what runs on top of it because seL4 is not the boundary between attackers and their targets. The only exception would be if you are a virtual machine provider and then seL4 would only be part of the boundary, rather than the whole boundary.
> but you would not have made anything more secure.
This does not follow. Why do you think a hypervisor with no verification to a spec is necessarily more secure than a hypervisor with a verification to a spec? Can you somehow quantify the latent bugs in the non-verified code?
The “that would previously have run Linux on bare metal” part of that sentence was important. Try reading it again. It never compared seL4 as a hypervisor to another hypervisor. It compared using seL4 as a hypervisor to not using a hypervisor at all on a machine that does not need a hypervisor. The addition of seL4 there is not only pointless, but possibly harmful if you consider the increased TCB size.
By the way, seL4 as a hypervisor uses components that are not verified to a specification, so saying that it is verified to a specification is not quite correct. They also seem to be proof of concept examples rather than a production solution when I look at their documentation:
https://docs.sel4.systems/projects/camkes-vm/
https://docs.sel4.systems/projects/camkes-vm/centos
Why is it that everyone on the “seL4 makes everything secure” bandwagon seems to have a tunnel vision that ignores that seL4 is useless without a large amount of unverified software that runs on top and that security will typically depend on that software, rather than seL4?
While seL4 is an amazing achievement, it is clear that it is not the bulletproof software that people thought it was. There is no proof that the specification is problem free after all.
This is a gross misrepresentation of what I said. Having an objective evidence based opinion that seL4 is not as good as advertised has nothing to do with people dying. seL4 does not achieve this. It is the equivalent of less than 0.001% of all of the code run on machines. It makes no guarantees that the >99.999% that people use is okay.
> If you build a bridge to a specification and the bridge collapses, does saying that the specification was wrong in any way lessen the reality that the bridge failed?
You framed this as a loss of confidence in the techniques that prove the absence of various flaws. This bug should not undermine such confidence as the bug was always in the place (the spec) that would allow such bugs.
When seL4’s proof was published, people regarded seL4 to be flawless software and the fact that a problem in the specification would lead to issues was largely disregarded.
Now that it is clear that the specification has had problems, seL4 should not be considered issue free, since there is no proof that its specification is issue free and there is proof that its authors made a nonzero number of mistakes.
Anyone minimizing the impact of this flaw by saying “the specification was at fault here” completely misunderstands why the rest of the industry thought highly about these techniques in the first place. They had promised to produce flaw free software. Now that it is clear that is not true, the level of confidence the industry places in these techniques naturally needs to be adjusted. We cannot ignore the massive caveat of the specification needing to be flaw free anymore.
> When seL4’s proof was published, people regarded seL4 to be flawless software
No, that's not true. The framing was always that it surpassed an incredibly high safety bar that no other kernel had surmounted, far beyond even security certifications needed for protecting national secrets. This is still true.
> the level of confidence the industry places in these techniques naturally needs to be adjusted
No it doesn't, people using formal methods understand perfectly well what confidence verification should bring and the caveats that are necessary. If by "industry", you mean Joe-know-nothing-about-verification-shmoe, then I'm not sure why his opinion was particularly important to begin with.
You can deride people who do not do formal verification, but I guarantee everything you use has unverified code written by people who do not do formal verification. If we restricted computers to formally verified code, we would not have working computers, embedded or otherwise.
the point is people not using formal methods yet. automotive for example continues its amazing ASPICE / ISO 26262 rain dances to tell themselves systems are safe. and does not allow replacing that with formal specs and proofs of compliance. that's just complementary; the "real safety work" has to be done with paperwork and V-model tracing.
Many bridges had to collapse until we learned how to build them.
As for seL4, the proofs (which are actually the state of the art in that field, yet definitely can be further perfected) are just one of the pillars supporting it e.g. the size of the TCB is another.
We do not give up on the goal because there was a mistake once. Nor is the goal not worthy; this is the absolute best the world has atm.
I did not suggest giving up, but I did suggest that seL4 is not immune to the bugs it’s advertising claims to avoid.
I believe seL4 is verified only for certain versions and on certain architectures. There are no bugs found at the C source code level for those builds of seL4.
This issue appears to have affected all architectures. This issue was present in the specification against which seL4 was verified. You can say that there are no bugs by virtue of it following the specification, but if the specification was wrong and in this case it was, then were there really no bugs?
There is always a specification, the question is "was this issue found against a version of seL4 that had been fully specified or not?".
I worked at the lab, I wasn't aware of any bug/issue on the fully specified kernel, that is why I am unsure if this counts or not.
I would need to have a look at the source code and proofs to confirm.
How does “This issue was present in the specification against which seL4 was verified” not imply yes to that question?
Even if it's not exactly seL4, there's good value in taking inspiration for design elements. It would still be a lot more robust than commodity operating systems.
We should look at formal verification like everything else: in terms of statistical effectiveness. It's not really important whether or not there are _no_ bugs. What's important is how many bugs there are for each unit of "effort."
For example, if this were hypothetically the only bug that was ever found, then that would be a pretty damn good argument that formal verification is effective. Because the bug rate of other non-verified operating systems is much higher.
seL4 in Rust?
Has anyone added another formally proven layer or component? Yes, they're being added all the time. Recently added features:
- Support for a bunch of new architectures, including RISC-V.
- Mixed criticality scheduling, which provides capability-based access to CPU time, mechanisms to limit the upper bound of execution of a thread, ensuring that high-criticality tasks have priority and access to necessary resources, and allowing "passive servers" which run on scheduling time donated by the caller.
- Microkit, a new verified abstraction layer which made it much, much easier to build actual systems using seL4.
- Very recently, the Device Driver Framework: a device-driver template, control and data plane implementation, and tools for writing device drivers and providing device virtualisation for high-performance I/O on seL4.
Formal verification can guarantee that specific requirements hold in specific conditions. It's true that in general such requirements and conditions can be quite narrow. But seL4 itself has a whole lot of proofs about it, covering a wide range of properties that one would want in a kernel, and these guarantees hold under very weak assumptions. Even the correctness of the C compiler is not assumed, they have a separate tool which looks at the compiler output and proves that the compiled binary behaves correctly with respect to the required C semantics.
The requirements that seL4 meets include the following: the binary code of the seL4 kernel correctly implements the behaviour described in the abstract specification and nothing more. There are no buffer overflows, no memory leaks, no pointer errors or null pointer dereferences, no undefined behavior in the C code, no termination of the kernel apart from the explicit ways enumerated in the spec, etc. The specification and therefore the seL4 binary satisfy the security properties of integrity and confidentiality: the former means that there is absolutely no way for a process to change data that the process has no explicit permission to change, and the latter means that a process cannot, in any way, read data it has no explicit permission to read. It even shows that a process without permission cannot indirectly infer the data through certain side channels. Beyond security, the expected worst-case execution-time guarantees and scheduling properties are also met.
The developers of seL4 have been in funding hell for years. Most of their work was darpa research for remotely controlled drones. The US Military would very much like drones that can't be hacked.
Their current work is on LionsOS, which is aimed at broader adoption: https://lionsos.org/
>The developers of seL4 have been in funding hell for years.
The actual story re: happenings in CSIRO is detailed in Gernot's blog[0], which is also quite interesting in its more technical posts.
0. https://microkerneldude.org/
For example: No buffer overflows, null pointer exceptions, use-after-free, etc. On ARM and RISCV64 not even the C compiler has to be trusted, because functional correctness has been proven for the binary. And there are more proofs besides functional correctness. https://docs.sel4.systems/projects/sel4/frequently-asked-que...
https://github.com/auxoncorp/ferros
Lots of type-level programming for tracking resources, hardware access, and capabilities at compile-time and trying to bring some of the underlying kernel guarantees up to the compiler because finding out about and debugging issues at runtime was just the absolute worst.
In high-assurance, formal methods were one of many techniques used for the system. There was reduced complexity, precisely-specified behavior (likewise for the security policy), proofs that the design/code fulfilled it, exhaustive testing, leak analysis, pentesting, and the ability to recheck all of this.
That's so hard it was mostly done for only kernels commercially. From there, your real-world use will involve user-level components, protocols with interactions, and so on. Add one thing to the proven TCB and maybe your proof disappears. So, every trusted component has to be rigorously designed. Then, all the interactions for a holistic argument.
Most don't do that. For example, seL4 hosting a web browser won't help you if it's a web application. IBOS and a WAF on seL4, all proven, might cover that. Harder argument. Most of these projects never go beyond the kernel or TLS stack or whatever.
You can do stuff with that. Look up Nizza Secure Systems Architecture for a paper with many examples. Yet, they usually don't cut it. They're not compatible with popular apps either. So, most people just harden Linux or BSD or Windows.
> IBOS and a WAF on seL4, all proven
If you mean web application firewall, that seems too complex to be proven? With infinite possible inputs, inspecting it would seem to require too much code to prove - unless it merely verifies traffic matches a short whitelist.
Why talk about high assurance in the past tense? Is the field moribund? (I don't mean to point out a typo, if that's what it is.)
It could potentially be proven. We think it's unlikely due to cost or complexity. So, the field's default position was that most things would be low to medium assurance. Only select components would be high assurance. I hope that answers your broader question about adoption of seL4-like techniques for larger systems.
Far as past tense, I left high-assurance security when I found out Jesus died for my sins and rose again. Had to learn an entirely new life with a focus on human beings instead of tech. I'm recently considering going back into security since I spend up to 12 hours a day at work anyway (two jobs).
I should also note another reason I say past tense. Many people I learned from have died, retired, or switched focuses. The reason for the latter is that there's no real market for high-assurance security. Most won't spend the money or make the tradeoffs. So, security-enhancing popular platforms while assuming they'll get hacked (and responding correctly) seemed like a better use of time. I mean, I could probably count all the high-security vendors on my hands...
Congratulations on your big change, and on your courage. Security can help people, of course; it just depends on who you work for?
Thank you! And, yes, security can help people when your job allows you to do that. I was talking about high-assurance security specifically. That's products built with techniques like B3/A1 or EAL6/EAL7 required. Examples:
https://www.aesec.com/classic/DesigningTheGemsosSecurityKern...
https://lukemuehlhauser.com/wp-content/uploads/Karger-et-al-...
https://apps.dtic.mil/sti/pdfs/AD1014758.pdf
https://ghs.com/products/safety_critical/integrity_178_secur...
The market for this is tiny. NSA once estimated the labor force for developing them was under 200 people worldwide. I know maybe a dozen or so companies even selling such things. Given the tough tradeoffs (e.g. fewer features), the buyer market is tiny even though the development cost is higher.
Medium assurance is the best route commercially. That really just means a solid development process with selective use of assurance-improving techniques. People writing key parts of their code in SPARK Ada or Rust with careful design reviews is an example. Certain products, like HAProxy or OpenBSD, are coded so well they rarely have problems. Some web frameworks have security baked into them. Using such approaches is the most likely to succeed commercially. Others use architecture or hardware to great advantage, like Hydra Firewall or Secure64's SourceT.
The field's last idea for commercial high-assurance was "selfless acts of security" per Bell. He thought that, one component or system type at a time, companies with the money should sponsor experts to build exemplars. They get maintained over time at hopefully lower cost. They're designed to be integrated into low-assurance systems to boost them. CompCert, miTLS, and Muen SK are examples of this, albeit with hybrid funding.
I just rarely see this happen. Both companies and FOSS favor ease of development or velocity over assurance of artifacts. The market usually requires integration with insecure systems whose integration itself might be impossible to secure. So, I have no optimism about high assurance but selective improvements are doable.
The document was updated in January.
A lot has happened in the last few years, such as the founding of seL4 foundation, the maturation of mixed criticality scheduling, the celebration of multiple seL4 summits, and the deployment in practical use by multiple companies and countries, in several scenarios.
Notably, Apple joined the seL4 foundation, as they use it in several of their products: https://sel4.systems/Foundation/Membership/ (Not sure if they've stated publicly which, but it's been pretty well known for a while now).
I’m a fan of microkernel hosts for guest monoliths, thus our servers run seL4 as a safety layer and backup for FreeBSD VMs, with jails for a renderfarm, BEAM clusters and Jenkins.
All I’m missing is an ARM port of DragonflyBSD for its threading and in-process kernels (hybrid kernel design). My dream is to run it on a 128-core Ampere Altra to run OpenMoonRay more effectively.
Could you elaborate on how your servers use seL4? And are these production, commercial servers?
Yes, commercial production servers. Similar to some other systems and certain consoles, the microkernel is used as a hypervisor that protects the bare-metal hardware from potential hazards when a guest monolith like the aforementioned FreeBSD becomes compromised. Guest components run our software and services inside jails on two 128-core, two 64-core and one 32-core Ampere processors; the first two machines run heavy workloads like offline rendering and Erlang VM clusters, while the others provide a few VPS servers and internal services.
I thought support for multi-core was still lagging? Or are you just running the unverified version?
Thanks. That's similar to prior uses. Makes sense.
I think some readers next question might be: are you all hiring?
We're not hiring at the moment.
IMO, you would be crazy to pass up nickp as an employee.
Heh, I’d love to, but we’re a small Scandinavian business with not enough revenue to cover the salary of such a person yet (2-3m ISK/mo).
Thank you anyway.
Sounds like your setup would make for an interesting read in longer form
I think at this point the argument for or against microkernels is becoming moot. The only way to provide fast, efficient and safe access to privileged services is via hardware mitigations. Software can only do so much.
It's like the difference between a 80286 and a 80386: The latter added the hardware support for true multitasking that the former lacked. Since then there's been an ever increasing number of hardware-level protection mechanisms added, like those which enabled hypervisors.
Apple in particular has been adding a bunch of stuff to their SOCs to protect the kernel, drivers and components at the chip level, as well as enforce privileges on running threads and when using pointers. [1] This doesn't mean the OS is impenetrable, but it's a lot more effective than a software only strategy of managing privileges.
Seems to me that utilizing this stuff (or similar), the architecture of the kernel really isn't that important any more.
Am I off base?
1. https://support.apple.com/guide/security/operating-system-in...
> Am I off base?
Yes. There is still plenty of work to be done in the OS research space. There has to be software interfaces and API's to all this new hardware.
I also think there is a lot to be learned from using micro/hybrid systems, which are much more composable. e.g. Plan 9 is a great example of a hybrid system which makes use of a singular protocol, 9P, to serve all objects in the system to userspace. It is hybrid because some parts are in-kernel to avoid the overhead of system calls, such as IP and TLS. Another interesting design aspect is how the in-kernel drivers are minimal, mostly acting as a 9P interface to the hardware logic. This way you (a) turn a machine object like a pointer or record into a seekable file, (b) secure that file using standard Unix permissions, and (c) can easily distribute the components across machines via a network. Now you can securely push the driver logic into a userspace program. And 9P is network- and architecture-transparent, meaning you can work across multiple machines running on ARM, x86, MIPS, etc. All of this comes out of the box.
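To make the file-interface point concrete, here is a rough POSIX-flavoured sketch (the control-file path and the command string are made up for illustration; real Plan 9 paths and command vocabularies differ): configuring a device is just opening a file and writing text to it, and whether that file is backed by a local driver or mounted over 9P from another machine is invisible to the program.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical control file exposed by a driver or userspace file server.
 * On a real Plan 9 system the path and command vocabulary differ; the point
 * is that "configure the device" is just open/write/close on a file. */
#define CTL_PATH "/n/remotebox/dev/ether0/ctl"

int main(void)
{
    const char *cmd = "promiscuous on\n";   /* illustrative command string */

    int fd = open(CTL_PATH, O_WRONLY);
    if (fd < 0) {
        perror("open ctl");
        return 1;
    }
    /* Whether CTL_PATH is backed by a local driver or mounted over 9P
     * from another machine is invisible to this program. */
    if (write(fd, cmd, strlen(cmd)) < 0) {
        perror("write ctl");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```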
When I go back to Linux/Unix or Windows from Plan 9, it's a sad and frustrating time. They're about as flexible as igneous rock. All the features are ad-hoc bolted on in mutually incompatible ways using a myriad of protocols that all do the same thing: serve files/objects. Ugh.
The utility of microkernels is orthogonal to hardware/software co-design.
From a practical engineering perspective, monolithic kernels were faster/easier/had more resources behind them and security is as robust as can be had with C (i.e. best effort and full of bugs). Lots of hardware has been introduced to try and mitigate that mess. But you theoretically wouldn't need a security co-processor with SeL4 because you have a very high degree of confidence that processes are isolated from each other and there are no root level exploits. So yes, hardware/software codesign is important!
That being said, the team behind SeL4 has had to spend a lot of engineering resources on eliminating side channels from the hardware. Hardware is also faulty as the real world doesn't care about your physics simulation.
The benefits of a microkernel here is that it is small enough for formal verification to be tractable. The proof itself is 10x the size of the kernel. Context switching on SeL4 is an order of magnitude faster than Linux, so the performance impact should be negligible. But if we magically could verify the millions of lines of a monolithic kernel, then it would still be faster to not do context switching. Indeed, the SeL4 team tried to move the scheduler to userspace but it cost too much in terms of performance. So it was kept in and added to the proof burden.
> It's like the difference between a 80286 and a 80386: The latter added the hardware support for true multitasking that the former lacked.
I'm not sure this is a good comparison. The 286 did support true multitasking in protected mode, and it was used in many non-DOS operating systems. What the 386 added (among other things) is the virtual 8086 mode which allowed it to multitask existing real mode DOS applications that accessed hardware directly.
Sorry, I'm a bit late responding. You're right! I should have used 8086 and 80286 as the example. 386 added virtual memory and some other stuff.
Thanks
That doesn't seem right. Even with strong hardware protections, how is Linux's TCB comparable to a microkernel's? Unless it just recreates the exact same protection domains, there will be more vulnerability in Linux. Rather, the thing about hardware is primarily that it makes things more efficient. For example, microkernels nowadays are already quite robust because they make good use of certain hardware: the MMU. Then the small TCB of a microkernel gives the kernel credibility, so the kernel and hardware together make for a solid infrastructure. So really it's a matter of how much "cheating" we're allowed to do with hardware, but microkernels overall utilize the protection better. Or see exokernels.
https://genode.org/index
An operating system with seL4 support
Are there notable uses of Genode?
I gave a presentation on SeL4 for my local OWASP chapter. I'll have to see if I can dig it up.
The project truly is a nice piece of kit, but I would hesitate to consider it as a replacement for Linux, especially for general purpose computing. Though that isn't to say microkernels are terrible for it in general. RedoxOS seems to be making some inroads lately and it uses a microkernel written in Rust.
The question is always "how big of a replacement are we talking?" Redox seems to be committed to interoperate well with POSIX, which naturally informs its design decisions. And there's a huge difference between having technical capabilities and succeeding. Although if Redox does succeed, it would already be a good step. seL4 is even more extreme in these qualities: its technical advantages are superb, but so far (and I think this will continue) it does not have what it takes, whatever that encompasses, to be the Next Big Thing. Ignoring political considerations, I think microkernels will be successful, and rightfully so.
>I would hesitate to consider it as a replacement for Linux
It would depend on the scenario. Of course, Linux is easier to work with but OTOH there are requirements which only seL4 can satisfy.
There's a lot required on top of seL4 for it to be actually useful. Fortunately, there has been a lot of Open Source work there as well. This is where seL4 is in a much better position than just a few years ago.
For static scenarios, there's LionsOS[0], quite usable already.
For dynamic scenarios, there's the Provably Secure, General-Purpose Operating System[1], which is still early stages.
Both can be found in the Projects page[2] at trustworthy systems, which is linked in the seL4 website.
0. https://trustworthy.systems/projects/LionsOS/
1. https://trustworthy.systems/projects/smos/
2. https://trustworthy.systems/projects/
Thank you kindly for providing these links. I do hope that microkernels take off in a big way for general purpose computing someday.
Does the OS that lies on top of this kernel need to be formally verified as well for the security guarantees to hold?
The guarantees offered by the kernel cannot be subverted by unprivileged processes running on it.
Of course, the kernel is not very useful on its own, thus the design of drivers, filesystem servers and other services running on top of the kernel is still relevant.
Note that, unlike most other systems (including Linux) which are flawed at a fundamental level, seL4 actually enables the construction of a secure and reliable system.
Well, as long as the hardware underneath it also enables the construction of a secure system.
I don't think we have any such option right now.
An example of this is timing side channels.
Originating in an effort within seL4[0], there's ongoing work[1] in RISC-V to resolve this.
0. https://sel4.systems//Foundation/Summit/2022/slides/d1_11_fe...
1. https://lf-riscv.atlassian.net/browse/RVS-3569
No, that's the advantage: the guest kernel/processes don't need to be trusted, since your (micro)kernel guarantees the isolation. So you can have a Linux kernel running next to some high-security process with the guarantee that they will be isolated (with the exception of allowed IPC).
No.
But there are limitations: DMA has to be off, and only formally verified drivers can be used.
It's also important to note that the seL4 multicore kernel is not yet verified.
An IOMMU can help significantly with the driver problem, preventing a properly initialized driver from misbehaving and compromising the system.
I want the magical IOMMUs that are maturely secure like MMUs are now. For now, I think various efforts in verifying/restricting/generating drivers are far better, although they fall particularly flat for proprietary drivers.
>I want the magical IOMMUs that are maturely secure like MMUs are now.
There's nothing magical about IOMMUs. They weren't invented last week either.
Driver and hardware talk to each other using virtual memory instead of physical memory, preventing the scenario where a bug causes DMA to shit all over somebody else's memory.
What holds is that systems that run drivers in supervisor mode have not been able to leverage an IOMMU to its full extent.
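As a rough mental model of what the IOMMU adds (a toy simulation only, not any real hardware or kernel API): the device issues DMA against I/O-virtual addresses, and the access only completes if the driver's domain has a mapping for that address, so a stray pointer in a driver cannot land in someone else's memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of an IOMMU translation table for one device/domain.
 * Real IOMMUs use multi-level page tables programmed by the kernel;
 * this just illustrates the "no mapping, no DMA" rule. */
#define IOMMU_SLOTS 4
#define PAGE_SIZE   4096u

struct iommu_entry {
    uint64_t iova;   /* I/O-virtual page address the device uses */
    uint64_t pa;     /* physical page it is allowed to reach     */
    bool     valid;
};

static struct iommu_entry table[IOMMU_SLOTS];

static void iommu_map(int slot, uint64_t iova, uint64_t pa)
{
    table[slot] = (struct iommu_entry){ iova, pa, true };
}

/* Consulted on every DMA access in this toy model. */
static bool iommu_translate(uint64_t iova, uint64_t *pa_out)
{
    for (int i = 0; i < IOMMU_SLOTS; i++) {
        if (table[i].valid &&
            table[i].iova == (iova & ~(uint64_t)(PAGE_SIZE - 1))) {
            *pa_out = table[i].pa + (iova & (PAGE_SIZE - 1));
            return true;   /* DMA proceeds, but only to the mapped page */
        }
    }
    return false;          /* fault: the device cannot touch unmapped memory */
}

int main(void)
{
    uint64_t pa;
    iommu_map(0, 0x10000, 0x7f000);        /* the kernel grants one buffer */

    printf("DMA to 0x10010: %s\n",
           iommu_translate(0x10010, &pa) ? "allowed" : "blocked");
    printf("DMA to 0x20000: %s\n",         /* stray pointer in the driver */
           iommu_translate(0x20000, &pa) ? "allowed" : "blocked");
    return 0;
}
```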
My (admittedly limited) understanding is that IOMMUs still have practical roadblocks to being a solidly established part of the security of the computer. Of course they aren't bad in principle. Perhaps it's just that we aren't willing to eat the performance cost of making them more robust, but then performance is a tortured debate.
The primary roadblock is cost and complexity. The technology itself is sound and doesn't have major performance problems.
I want capability-based addressing like in the Plessy 250...
https://en.wikipedia.org/wiki/Plessey_System_250
In absolute terms, sure. At the practical level, you can find a partial answer in section 7.2 of the paper.
Also worth checking out Drew DeVault's Helios Microkernel, which is based on SeL4.
https://ares-os.org/docs/helios/
I think there's a meaningful difference between "based on" and "inspired by", Helios being the latter.
L4 was popular at my university (Karlsruhe). While I never really looked into it in any detail, it always appeared to me like a project that is primarily interested in testing some theoretical ideas, but not in building anything that would be practically useful.
That was 20 years ago. As far as I can tell, this has not changed. (Quick googling tells me there appear to be some efforts to build an OS on it, but they all look more like proof of concepts, not like something with real-world use.)
https://en.wikipedia.org/wiki/L4_microkernel_family seems to describe it having been used all over the place, though mostly in embedded contexts.
> OKL4 shipments exceeded 1.5 billion in early 2012,[4] mostly on Qualcomm wireless modem chips. Other deployments include automotive infotainment systems.[13]
> Apple A series processors beginning with the A7 contain a Secure Enclave coprocessor running an L4 operating system[14] called sepOS (Secure Enclave Processor OS) based on the L4-embedded kernel developed at NICTA in 2006.[15] As a result, L4 ships on all modern Apple devices including Macs with Apple silicon.
I feel like the two of you are talking past each other as the other poster is talking about time sharing systems like UNIX while you are pointing at embedded applications that are OS optional and if there is any semblance of an OS, it is really minimal. People deploying L4 there is not very surprising and does not apply to his remarks. Seeing this is like watching one person discuss an absence of apples and seeing another reply saying that there are oranges.
Ok, admittedly, that's more than I expected. And hey, if they found some use cases in the embedded world, fine.
Nevertheless, it's not exactly how it was pitched. This microkernel thing always had a "Linux is doing it wrong, we're doing it better" story behind it. They wanted to be a replacement for a general-purpose OS. That certainly hasn't happened.
OSF/1 is the closest that microkernels ever came to being used in general purpose operating systems:
https://en.wikipedia.org/wiki/OSF/1
The only still living variant is AIX. Some of the code is in the XNU kernel, although that is a monolithic kernel rather than a microkernel. Apple converted it to a monolithic design for performance reasons.
That said, this article suggests that AIX is not using a microkernel either:
https://tedium.co/2019/02/28/ibm-workplace-os-taligent-histo...
> The only still living variant is AIX.
This is incorrect.
The thing you have to understand about “AIX” is that historically it was not a single codebase; it was just a common brand for multiple unrelated IBM Unix offerings. (DB2 is the same story.) One of those offerings was AIX/ESA for IBM mainframes, and that was indeed based on the OSF/1 microkernel. But AIX/ESA died in the early 1990s; it was replaced by turning MVS aka OS/390 into a certified Unix, a role carried on by z/OS today, and later on by z/Linux as well. Whereas the only AIX that survives to this day is the RS/6000 branch, which isn’t based on OSF/1 and has a classic monolithic kernel design. Its distant ancestor, AIX version 1 for the RT PC, did use a microkernel (VRM), but that microkernel was very different from Mach and OSF/1 (it was written in a PL/I dialect, PL.8, that’s how alien it was).
Additionally, it is the only UNIX that follows a Windows-like approach to shared libraries, which might confuse folks coming from POSIX land.
What specifically makes AIX's approach to shared libraries "Windows like"?
Using COFF instead of ELF, nowadays XCOFF.
https://www.ibm.com/docs/en/aix/7.3?topic=formats-xcoff-obje...
Having its own version of dependency files (export files, in AIX speak):
https://learn.microsoft.com/en-us/troubleshoot/windows-clien...
https://www.ibm.com/docs/en/xl-c-aix/13.1.3?topic=library-co...
https://download.boulder.ibm.com/ibmdl/pub/software/dw/aix/e...
Just like with VC++, lazy loading of libraries on symbol first use, done by the compiler.
https://learn.microsoft.com/en-us/cpp/build/reference/linker...
https://www.ibm.com/docs/en/aix/7.3?topic=memory-shared-libr...
I had some fun with AIX 5 back in the 2000's.
Thanks for the clarification.
Your OSF/1 link makes a lot of references to Mach 2.5 which still required the Unix portions to be in kernel memory space so it was still a monolithic kernel. This was the same as NeXTSTEP which is the basis for the current Mac OS X. It looks like OSF/1 did move to Mach 3.0 which was a microkernel but I'm not sure if the Digital Unix / Tru64 moved to that or not.
As for AIX, I used it on an RT around 1990 and then POWER based RS/6000 machines in the 1990's. It was a monolithic kernel.
Your own link mentions the failed Mach based Workplace OS and all the different "personalities."
From your link:
Internal discussion at IBM focused on AIX. Finally, at Comdex in 1993, IBM Chairman Louis Gerstner announced that the microkernel would not replace AIX. IBM realized that many AIX users would not accept performance penalties associated with microkernel systems. IBM was also concerned with the microkernel presenting a competitive impediment against high performance HP or Sun Unix systems that ran directly on the hardware. Instead, Gerstner told AIX customers that they would be able to migrate to Workplace OS later if they were interested.
https://en.wikipedia.org/wiki/IBM_AIX#IBM_mainframes
> With the introduction of the ESA/390 architecture, AIX/370 was replaced by AIX/ESA[28] in 1991, which was based on OSF/1, and also ran on the System/390 platform.
I think another way to look at it is that "AIX" is a larger term for a family of operating systems. The one that I used and probably the most popular was the monolithic kernel based on System V Unix that ran on the RT and POWER architectures.
The AIX version that would run on Mach for workstations was never commercially released. The AIX version for mainframes that was based on Mach was AIX/ESA. You can see other versions like AIX/386.
OSF/1 1.3 switched to Mach 3.0.
You are right. I didn't realize that OSF/1 was used by anyone other than DEC which rebranded their version as Digital Unix and then Tru64 Unix.
The wikipedia page has a link to this Intel Paragon supercomputer which specifically says OSF/1 and Mach 3.0 microkernel.
https://archive.org/details/bitsavers_intelsupergonXPSXPEBro...
Genode is an actively developed, general purpose desktop operating system that works today, and can use seL4, among other kernels.
The last surviving OSF/1 based OS is macOS, not AIX
I thought OSF/1 was more of a kernel organ donor to macOS rather than the basis for it.
The entire "unix" part was updated with OSF/1 source code, with patches here and there from the FreeBSD and NetBSD codebases.
But especially when compiling code that was written in C for "modern BSD" (Net/Free/Open), you can easily hit the parts where the userland etc. is OSF/1 branched off BSD 4.3/4.4 and simply does not include things people "assume" are present on BSD (like the more advanced linked-list and tree macros).
Jochen Liedtke became a professor in Karlsruhe in 1999; sadly, he passed away only shortly after, in 2001. I don't know if his successor Bellosa still does research on L4. There was the L4Ka project, which appears to be completed. In his bachelor-level lecture on operating systems it's not part of the curriculum.
Rittinghaus, an alumnus of Bellosa's group, is involved with Unikraft [0], which was featured a couple of times on HN and uses unikernel technology.
[0] https://unikraft.org/
After Liedtke's passing, L4 research was continued in groups at UNSW (Gernot Heiser, formal verification -> SeL4) and TU Dresden (Hermann Haertig, Fiasco/L4Re, focusing on real-time and secure systems).
Genode (already mentioned in another comment) [1] came out of the TU Dresden group with some nice ideas around managing compartmentalized systems. Kernkonzept [2] is a startup commercializing the L4Re microkernel.
[1] https://genode.org/ [2] https://www.kernkonzept.com/
Well, an L4 variant is used in iPhones: "The Secure Enclave Processor runs an Apple-customized version of the L4 microkernel." https://support.apple.com/de-at/guide/security/sec59b0b31ff/...
the L4Re derivative (open source) is running in every single id.X Volkswagen car in the central "icas1" ECU, carrying Linux and other guests.
https://www.kernkonzept.com/kk_events/elektrobit-advances-au...
afaics the L4Re kernel also is part of Elektrobit Safe Linux.
I love the work (and the direction) the Karlsruhe team did on L4Ka, especially with Pistachio - the design was clean, simple, easy to grasp.
I did a Pistachio-based OS for my diploma thesis (not at Karlsruhe). I always thought that if I'd been studying there, I'd probably go into OS research.
L4, L4Ka developed in Karlsruhe is not the same as seL4.
I also had ideas for an operating system design, and the capability system I had considered uses the same features of interposition and delegation as seL4 does. There are other advantages beyond what they list there; for example, you can apply a filter to audio, or you can use proxy capabilities for implementing network transparency.
The real-time stuff is something I had considered allowing as an optional implementation; my idea was for a specification rather than only a single implementation.
Another feature I wanted, though, is that all programs operate deterministically except for I/O. Without I/O, you can't determine the date/time or how long a program takes, can't check for processor features (if you use something that the hardware does not support, though, the operating system may emulate it), etc.
I had thought to use a mixture of hardware support and software support to implement it, though. (There is a note in the document about attacks with capability implemented in hardware, but I do not have the referenced document and I don't know if that attack applies to what I am thinking of doing.)
The point about security is that it seems to present the same failure mode as KVM does for the Linux kernel. If the hypervisor is in ring 0, you have the risk of VM escape from one VM to another, or to the host itself.
How do you mitigate that risk?
In seL4's virtualization support, VM exceptions are turned into messages and handled by a VMM, a task running in unprivileged mode.
The VMM has no more capabilities than the VM itself; thus a VM escape would be, outside of academics, of no value.
Refer to pages 8 to 10 in the OP PDF.
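Very roughly, the VMM is just an ordinary unprivileged seL4 server sitting on an endpoint: the kernel turns VM exits into IPC, the VMM handles them with only the capabilities it was explicitly given, and replies to resume the guest. A minimal sketch, assuming libsel4 on a non-MCS configuration; vm_fault_ep and emulate_mmio_or_inject are hypothetical placeholders, endpoint setup and the actual emulation are elided, and fault label names vary by architecture and kernel config.

```c
#include <sel4/sel4.h>

/* Hypothetical pieces provided elsewhere by the VMM. */
extern seL4_CPtr vm_fault_ep;   /* endpoint on which guest faults arrive */
extern void emulate_mmio_or_inject(seL4_Word badge, seL4_MessageInfo_t info);

/* Sketch of a VMM event loop: each VM exit arrives as an IPC message on
 * vm_fault_ep. The VMM runs unprivileged and holds no capabilities beyond
 * those delegated to it, so "escaping" into the VMM gains an attacker
 * nothing beyond what the VM could already reach. */
void vmm_run(void)
{
    for (;;) {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(vm_fault_ep, &badge);

        switch (seL4_MessageInfo_get_label(info)) {
        case seL4_Fault_VMFault:
            /* guest touched unmapped or emulated memory */
            emulate_mmio_or_inject(badge, info);
            break;
        default:
            /* other exit reasons (unknown syscalls, virtual IRQs, ...)
             * would be decoded and handled here */
            break;
        }

        /* Resume the guest vCPU that faulted. */
        seL4_Reply(seL4_MessageInfo_new(0, 0, 0, 0));
    }
}
```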
SeL4 is proof that microkernels are safe, efficient and scalable, yet we are stuck with big honking Linux kernels in 2025. That said, more and more drivers are moving to usermode anyway, so it's a wash in the end.
I will caution that IPC microbenchmarks should not be taken as confirmation that the "academic microkernel" is viable: OS services all in userspace and with fine granularity as is appropriate. Often microkernel-like designs like VMMs/hypervisors and exokernels make use of unikernels/library OSes on top, which reduces the burden of fast IPC somewhat. Or developers intentionally lump protection domains to reduce IPC burden. Of particular note: even seL4 did not evict its scheduler to userspace. Since the scheduler is the basis behind time management, it's quite a blow to performance if the scheduler eats up time constantly. My own thoughts there are that, with care, a userspace scheduler can efficiently communicate with the kernel with a shared memory scheme, but that is not ideal. But for desktop and mobile, a microkernel design would be delightful and the performance impact is negligible. We need far more investment on QoS there.
Edit: That being said, we should be building microkernel-based OSes, and if for some cases performance really is a restricting factor, they will be exceptions. The security, robustness, flexibility, etc. of microkernels is not to be understated.
They verified that the scheduler doesn't interfere with the integrity, confidentiality, and authenticity requirements of the kernel so it's a moot point.
Rather, although I believe the seL4 scheduler is sufficiently general, I want a userspace scheduler to minimize policy. The seL4 team recognizes that a kernelspace scheduler violates Liedtke's minimality principle for microkernels, since the only motivating reason is performance. If an efficient userspace scheduler implementation exists, the minimality principle dictates no kernelspace scheduler. Otherwise there's pointless performance overhead and possibly policy inflexibility.
I understand the minimality principle but one should not be afraid to violate layers of abstraction when justified. The whole point of the minimality principle is to improve modularity and remove the need to worry about the correctness of various OS components.
What would you be able to do in a userspace scheduler that couldn't be done safely in an in-kernel scheduler? Why couldn't it be configured or given a safe API to interact with userspace at runtime? But I guess your shared memory bit is an API to a userspace scheduler?
Well, one thing about adapting seL4 is that we shouldn't assume derivatives will be formally verified to the same degree. Ideally we would, but I don't expect this to happen in practice. And having a kernelspace scheduler that is minimal and requires a userspace scheduler to achieve full control of scheduling, where a userspace scheduler would do, is redundant overhead. If the kernelspace scheduler wasn't minimal, it would both be hard to verify and extend. This is all in line with seL4's design. So you're concerned about power where the question was always about performance. Hydra had userspace schedulers way back when, and it wasn't followed because it was prohibitive then. The minimality principle might be given various explanations, but I think the end result of what a system following it achieves speaks for itself.
> I understand the minimality principle but one should not be afraid to violate layers of abstraction when justified.
Rather, in a well-designed system, the justification necessarily is performance. Abstractions exist for the ease of the developer to write code that is correct. Performance (runtime, memory usage, etc.) is the concrete impact of code. For concerns like debugging or flexibility, those arise if the abstractions are lacking. Performance concerns arise because even good abstractions impose overhead. Note that conventional layered architectures are usually poor abstractions for implementation. Good for design but not coding.
> But I guess your shared memory bit is an API to a userspace scheduler?
Indeed. I would rather not, but shared memory probably is necessary to some degree for a userspace scheduler.
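For what it's worth, the shape I have in mind is something like the toy layout below. It is entirely hypothetical (not an existing seL4 or Linux interface): the userspace scheduler publishes its decision into a shared page and the kernel's dispatch path just reads it, so policy lives in userspace while the hot path stays cheap.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of a page shared between a userspace scheduler and
 * the kernel's dispatch path. Not a real seL4 or Linux interface; just the
 * shape a "policy in userspace, mechanism in kernel" split could take. */
struct sched_shared {
    _Atomic uint64_t next_thread;   /* ID/cap index of the thread to run next    */
    _Atomic uint64_t timeslice_us;  /* budget the kernel should enforce for it   */
    _Atomic uint64_t generation;    /* bumped by the scheduler after each update */
};

/* Userspace side: publish a decision. */
static inline void publish_decision(struct sched_shared *s,
                                    uint64_t thread, uint64_t budget_us)
{
    atomic_store_explicit(&s->next_thread, thread, memory_order_relaxed);
    atomic_store_explicit(&s->timeslice_us, budget_us, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->generation, 1, memory_order_release);
}

/* Kernel side (sketch): on a preemption or yield, read the latest decision.
 * If userspace has not produced a new one, fall back to a trivial built-in
 * policy so the kernel never blocks waiting on userspace. */
static inline uint64_t pick_next(struct sched_shared *s, uint64_t fallback,
                                 uint64_t *last_gen)
{
    uint64_t gen = atomic_load_explicit(&s->generation, memory_order_acquire);
    if (gen == *last_gen)
        return fallback;
    *last_gen = gen;
    return atomic_load_explicit(&s->next_thread, memory_order_relaxed);
}

int main(void)
{
    static struct sched_shared shared;  /* stands in for the mapped shared page */
    uint64_t last_gen = 0;

    publish_decision(&shared, /*thread=*/7, /*budget_us=*/500);
    printf("kernel would dispatch thread %llu\n",
           (unsigned long long)pick_next(&shared, /*fallback=*/0, &last_gen));
    return 0;
}
```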
Drivers in userspace is not particularly microkernelly - most of the major monolithic kernels have supported this to some degree or another for years (it is easy in principle, just transmit I/O requests to userspace servers over some channel) while many historic microkernels (see e.g. Mach 3) did not do it. It hardly changes the architecture of the system at all.
It is moving the higher-level things into userland that is the harder problem, and the one that has been challenging for microkernels to do well.
People should stop bringing up Mach so much. It should never have been the poster child for microkernels. It's poisoned the discourse when there are plenty of alternative examples. Granted, Mach also did some good work in the space, but its shortcomings are emphasized as if they reflect the whole field.
More to the point, drivers in userspace is an important distinction between "pure" monolithic kernels and microkernels: the former is optimizing for performance and the latter is optimizing for robustness. It's not about ease of implementation for either. It's quite meaningful to shift on the axis nowadays: it represents a critical pragmatic decision (notice purity is irrelevant). You're right that "higher-level things" such as the networking stack or filesystem are also crucial to the discussion. I think here, too, ease of implementation is not relevant, though.
Isn't it common to run OSes in desktop/server environments inside hypervisors? That means the OS itself can be transparently virtualized or preempted, and access to physical hardware can be transparently passed to the virtualized OSes. This can be accomplished today with minimal impact on performance or user experience.
The fact that this can be done with OS code not explicitly designed for this signals to me that there are no roadblocks to having a high-performing general purpose microkernel running our computers.
This leads to big units of code, namely multiple OSes, whereas the ideal is being able to use as finely granular units as developers are able to stomach. For example, Xen can work around the issue of device drivers by hosting a minimal OS instance that has drivers, but it is better to be able to run drivers as individual processes. This reduces code duplication and performance overhead.
You're absolutely right, but I was making the point that there are demonstrably no significant downsides to running driver/kernel code in a less privileged mode, rather than ring 0.
To the best of my knowledge, seL4 is not AVX-512 aware. The AVX-512 state is not saved or restored on a context switch, which is clearly going to impact efficiency.
At present there's 16x64 bits of register state saved (128B), but if we were to have full support for the vectors, you potentially need to add 32x512 bits to the state (plus another 16 GP registers when APX arrives). The total state that needs moving from registers to memory/cache would jump to 2304B, an 18-fold increase.
Given that memory is still the bottleneck, and cache is of limited size, a full context switch is going to take a big hit - and the main issue with microkernels is that a service you want to communicate with lives in another thread and address space. You have app->kernel->service->kernel->app for a round trip. If both app and service use the AVX-512 registers, then you're going to have to save/restore more than half a page of CPU state up to 4 times, compared with up to 2 on the monolithic kernel, which just does app->kernel->app.
The amount of CPU state only seems to be growing, and microkernels pay twice the cost.
The cost of moving 4 KB is minuscule. Assume an anemic basic desktop with a 2 GHz clock and 2 instructions per clock. You would be able to issue 4 billion 8-byte stores per second, resulting in 32 GB/s or 32 bytes per nanosecond. Memory bandwidth is going to be on the order of 40-60 GB/s for a basic desktop, so you will not run into memory bandwidth bottlenecks with the aforementioned instruction sequence. So, 4 KB of extra stores is a grand total of 128 additional nanoseconds.
In comparison, the average context switch time of Linux is already on the order of ~1,000 nanoseconds [1]. We can also see additional confirmation of the store overhead I computed, as that page measures ~3 us for a 64 KB memcpy (load + store), which would be ~187 nanoseconds for 4 KB (load + store), whereas the context-saving operation is a 2 KB register -> store and a 2 KB load -> register.
So, your system call overhead only increases by roughly 10 percent. Assuming you have a reasonably designed system that does not kill itself with system call overhead, spending the majority of the time actually handling the service request, then it constitutes a minuscule performance cost. For example, if you spend 90% of the time executing the service code, with only 10% in the actual overhead, then you only incur a 1% performance hit.
[1] https://eli.thegreenplace.net/2018/measuring-context-switchi...
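For anyone who wants to poke at these numbers, here is the back-of-envelope as a tiny program, using the same assumptions as above (~32 B/ns of effective store bandwidth, ~1,000 ns per bare context switch, 4 vs 2 switches per round trip); adjust the constants to taste.

```c
#include <stdio.h>

int main(void)
{
    /* Register state that must be spilled/reloaded per context switch. */
    const double gp_bytes  = 16 * 8;    /* 16 x 64-bit GPRs              */
    const double avx_bytes = 32 * 64;   /* 32 x 512-bit ZMM registers    */
    const double apx_bytes = 16 * 8;    /* 16 extra GPRs when APX lands  */
    const double state     = gp_bytes + avx_bytes + apx_bytes;   /* 2304 B */

    /* Assumptions from the thread: ~32 B/ns effective store bandwidth,
     * ~1000 ns for a bare Linux-style context switch. */
    const double bytes_per_ns = 32.0;
    const double base_switch  = 1000.0;

    /* Save plus restore the extra state: roughly 2x the state per switch. */
    const double extra_ns = 2 * state / bytes_per_ns;

    printf("state per switch: %.0f B (%.0fx the 128 B GPR-only case)\n",
           state, state / gp_bytes);
    printf("extra save/restore per switch: ~%.0f ns on top of ~%.0f ns\n",
           extra_ns, base_switch);
    /* Microkernel round trip app->kernel->service->kernel->app is up to 4
     * switches vs up to 2 for a monolithic kernel's app->kernel->app. */
    printf("per round trip: ~%.0f ns extra (microkernel, 4 switches) vs "
           "~%.0f ns (monolithic, 2 switches)\n", 4 * extra_ns, 2 * extra_ns);
    return 0;
}
```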
I agree that a well-designed system for most use cases won't have performance issues, since we should not just be optimizing context switches but also things like kernel bypass mechanisms, mechanisms like io_uring, and various application-guided policies that will reduce context switches. Context switches are always a problem (the essential complexity of having granular protection), and moving an extra 4KB is not negligible depending on the workload, but we are not out of options. It will take more programmer effort, is all.
I appreciate your criticism; it is reasonable and relevant in the present and the future. I wrote a reply nearby in the thread[0]. I think it's mostly solvable, but I agree it's not trivial.
[0] https://news.ycombinator.com/item?id=43456628
>AVX-512 and other x86 warts
While seL4 runs on x86, it is not recommended.
Effort seems to be focused on ARM and RISC-V, particularly the latter.
An argument could be made that x86's baggage makes it a particularly bad fit for offering the sort of guarantees seL4 does.
High-assurance security requires that kernels clear or overwrite all shared state. That state could become a covert storage channel if one partition can write it and another can read it; if so, it must be overwritten.
I’m not sure that seL4 even supports NUMA, so there are other tradeoffs to consider.
I think NUMA management is high level enough that in a microkernel it would be comfortably managed in userspace, unlike things relevant to performance-critical context switches. And seL4 is currently intended only for individual cores anyways.
seL4 having a proof of correctness does not mean all microkernels do. In fact, seL4 is the only microkernel that has a proof of correctness. If you build on top of it in the microkernel way, you quickly find that it is not performant. That is why NT and XNU both abandoned their microkernel origins in favor of becoming monolithic kernels.
I’ve seen this argument play out many times. I believe the next line is: “QNX proved that micro kernels can be fast given clever message passing syscall design.”
“I remember running the QNX demo disc: an entire graphical operating system on a single 1.44MB floppy! Whatever happened to them?”
“They got bought by blackberry, which ended as you’d expect. QNX had a lot of success in automotive though.”
“Nowadays Linux and Android are dominant in new cars, though, proving once and for all that worse is better.”
exeunt. End Scene
Also an illustration how an open-source solution, even if technically inferior, would displace a closed-source solution, even if technically superior, unless there is a huge moat. And huge moats usually exist only in relatively narrow niches.
It turns out that success is composed of 90% luck, 10% marketing, and 5% talent/technical advantage. A rhetorical question: how do you entice people to turn a movement into a revolution when it isn't likely the movement will succeed?
Another rhetorical question: out of luck/marketing/technical advantage, which one is contributing the most to the extra 5% out of 105% of all the components success can be attributed to?
Nice use of Latin. mihi placet.
NT and XNU never had microkernel origins - Cutler explicitly refuted this at a conference in the early 1990s (if anyone remembers which, kindly share) and NeXTSTEP forked Mach at version 2.5, which was not a microkernel (see https://cseweb.ucsd.edu/classes/wi11/cse221/papers/accetta86...). XNU retained this monolithic architecture (as did Tru64 and other systems that trace their heritage to Mach prior to its version 3).
NeXTSTEP forked Mach at version 2.5
Various sources state that they rebased to OSF Mach Kernel 7.3, which was based on Mach 3 and parts of Mach 4. The OSF MK ancestry of macOS XNU can still be seen in paths:
https://github.com/apple-oss-distributions/xnu/tree/main/osf...
This is quite a ship-of-Theseus problem. Steve Jobs hired Avi Tevanian to work at NeXT on the Mach kernel, and later brought Tevanian to Apple; he was still working on the Mach/Darwin kernel as late as the first iPhone release. There is a reasonable argument to be made that the OSF kernel is actually the derivative, by default.
There are a few references to the OSF Mach kernel from '98 in a few files, such as https://github.com/apple-oss-distributions/xnu/blob/main/osf...
EDIT: Apple calls out Mach 3.0.
https://developer.apple.com/library/archive/documentation/Da...
https://developer.apple.com/library/archive/documentation/Da...
Apple took Mach 3.0 and collapsed it into a single address space for performance purposes. Certainly reasonable.
As far as I can tell (this isn't gospel, just something I've inferred as the only reasonable explanation from comparing the codebases), Apple reconstructed a Mach 2.5 style kernel, like NeXTSTEP had, from fresh codebases (OSF Mach and 4.4BSD-Lite2), perhaps because 4.3BSD was still encumbered at the time.
The Darwin 0.1 and 0.3 releases contain the old kernel, derived directly from NeXTSTEP, and that's the direct derivative of Mach 2.5. The later XNU appears to be a reconstruction of that kernel with unencumbered code and that's also when IOKit replaced DriverKit.
In more recent interviews Cutler has been firm that the NT kernel was designed pragmatically, a view that Tevanian also evidently later adopted, as is evident from the great wheel of incarnation that Darwin went through on the road to XNU, although I'm not sure Tevanian ever stated his perspective on this matter publicly. Neither of these systems was ever a true monolithic kernel in the Linux or Unix sense (at all times they both had some measure of sandboxing between, e.g., driver code and the scheduler), rather sitting somewhere in between.
True microkernels are, alas, more of an ideology than a practical reality, as the long-suffering GNU/HURD team discovered; Tanenbaum has been clear that the MINIX/NetBSD experiment was more about principles than performance. That said, certainly many hypervisors have attained success with configurations that coincidentally happen to be the same footprint as a microkernel.
From talking with some people who witnessed the earliest days of Hurd (but were not involved in the project), the real problems of Hurd were social, not technical; if anything, OSF/1 and NeXTSTEP succeeding from a similar code base (for some value of "similar") suggests that it was possible to do it.
The story goes that the early days were dominated by some ridiculously bad project management, the kind that makes the "Cathedral" of the Cathedral-and-the-Bazaar era look like a nice and well-maintained project.
Most importantly, GNU Hurd lacked the "ecological niche" to drive work towards it because it couldn't deliver enough, just like lack of information about 386BSD led to Linux (Linus explicitly said that if he had known about 386BSD, he would have worked on it and maybe forked it)
>the long-suffering GNU/HURD team discovered;
HURD is Mach-based; it is stuck in a pre-L4 world and thus irrelevant to this conversation.
Very much 40-year-old technology. Still better than ~60-year-old UNIX, although they're ultimately trying to be UNIX-like.
It is sad Mach and HURD still keep coming up in microkernel discussion.
Maybe not that irrelevant, since another desktop and mobile operating system used around the world is also based on Mach.
There's some hope now that Apple is a seL4 foundation member.
A Mach that is neutered WRT its microkernel architecture.
The hypervisor/microkernel boundary has blurred, and commercial success for microkernels is more or less as hypervisors, e.g. OKL4 is widely used in mobile devices[0]. And Tanenbaum's Minix is used for the Intel Management Engine. I don't know the details with HURD, but I don't think it is mainly suffering from "inherent" or "nearly-inherent" (strongly associated) problems of microkernels. The main cost of microkernels is developer effort and likely performance overhead.
[0] https://vita.militaryembedded.com/5159-open-virtualization-a...
That is why I thought Windows NT had originally been intended to be a microkernel.
NT incorporates a client/server model, like a microkernel, but isn't a "pure" microkernel like Mach. This gives rise to the 'hybrid kernel' term, which Linus dislikes.
Wikipedia has a decent depiction of the hybrid structure, but quoting from the NT Design Workbook:
> This specification describes the kernel layer of the NT OS/2 operating system. The kernel is responsible for thread dispatching, multiprocessor synchronization, hardware exception handling, and the implementation of low-level machine dependent functions.
> The kernel is used by the executive layer of the system to synchronize its activities and to implement the higher levels of abstraction that are exported in user-level API's.
https://en.wikipedia.org/wiki/Windows_NT_3.1#Architecture
NT in Windows 11 has very little to do with the monolithic kernel story that keeps being repeated.
Not only did the graphics stack move back into userspace again, there is now a complete userspace driver stack, plus VBS (Virtualization-based Security) for several kernel components that run on their own mini Hyper-V island, talking via IPC with the rest of the kernel.
Likewise on the XNU side, Apple started a crusade a few years ago to move all kexts into userspace: no more drivers in the kernel beyond the ones Apple considers critical.
In both cases, they never were a pure monolithic kernel, due to the way walls were introduced with kernel level IPC to talk across modules, instead of straight function calls.
> seL4 is the only microkernel that has a proof of correctness
ProvenCore (https://provenrun.com/provencore/) has a proof that covers correctness (memory safety, and more generally absence of UB, termination, etc.), functional properties (e.g. the fork() system call is a refinement of an abstract "clone" operation of abstract machines), and security properties (memory isolation).
I heard a joke somewhere that sel4 is even more successful than Linux because it is running below ring 0 in every Intel chip shipped in the past N decades, plus probably many others.
Intel used a version of Minix rather than seL4 for its Intel Management Engine. [1] There was some controversy about this because they didn't give Andrew Tanenbaum proper credit. [2]
[1] https://www.zdnet.com/article/minix-intels-hidden-in-chip-op...
[2] https://www.cs.vu.nl/~ast/intel/
Such a dumb technical choice driven by stupid managerial considerations. They do this ring -1 shit because hardware companies view this as a cheap way to add value. But they don't open source it or contribute back because they view it as secret sauce. Minix as a result didn't get the investments that GPL software receives. Now the project is in hard legacy mode.
Intel Management Engine is a separate microcontroller integrated into the chipsets. Recent ones are Intel Quark x86 CPU and Minix 3 OS.
That is Minix, not SeL4.
seL4 is used in a lot of cellular phone firmwares I believe.
I feel like containers and Kubernetes are the microkernel's revenge.
They are for all practical purposes fulfilling the same role.
Not at all? If anything they’re filling the opposite role. Microkernels are about building interfaces which sandbox parts of the kernel. Namespaces are about giving sandboxed userlands full access to kernel interfaces.
Namespaces are one form of capabilities.
Additionally, a Linux kernel that exists for the sole purpose of keeping KVM running, while everything that powers a cloud workload runs in Kubernetes pods, is nothing more than a very fat microkernel in terms of usefulness.
The dose makes the poison; we're still a long way from fully embracing microkernels and capabilities. Security is a holistic property and encompasses finer details too. I want a small TCB. I want capabilities pervasively. And in pursuit of modularity and abstraction, I want to be able to choose the components I want and take those burdens myself. It's a bit silly seeing the nth SIGOPS/SOSP paper on how Linux can be improved by integrating userspace scheduling.
It is the same with safer systems programming languages: we have had the concept since 1961, but apparently getting the industry to take the right decisions is a tenuous path until something finally makes good ideas stick and gain adoption.
Microkernel does not mean it uses capabilities. And "very fat microkernel" is an oxymoron. The definition of a microkernel is that they do as little as possible in the kernel.
Of course it doesn't.
The point is how Linux is being tamed to provide some of the concepts, in spite of its monolithic design.
But naturally we can discuss minutiae instead.
It’s not “some of the concepts” nor minutae, though, it’s literally the difference between a microkernel and a monolithic kernel.
Presenting capabilities / namespaces to userland is a completely different and in the case of Linux, orthogonal thing to presenting capabilities/namespaces to kernel services. I guess you could argue that the concept of capabilities came from microkernels, but when it’s applied to only user space, it’s just not really related to a microkernel anymore at all.
That’s basically the whole problem with capabilities and especially their application in namespaces from a security standpoint in Linux: they try to firewall these little boxes from each other but the kernel they’re all talking to is still one big blob. And this difference is meaningful in a security sense, not just some theory hand waving. https://www.crowdstrike.com/en-us/blog/cve-2022-0185-kuberne... is just one good example, but entire classes of mitigations are rendered meaningless by the ability to unshare into a box that lets an attacker touch exploitable kernel surface area which is not further isolated.
A microkernel, by the book, is just a bunch of processes doing stuff that would otherwise be in the kernel, if we are then discussing minutiae.
Nothing else: the critical set of OS services is no longer hosted in a single process space, but rather requires a distributed system of processes running on a single computer node.
The moment these processes are able to be hosted on a set of computing nodes, we enter the realm of distributed OSes, which is another discussion.
The number of services that remain in the microkernel, versus what is hosted in each OS process running outside of the kernel, depends on each microkernel; there is no rule for this division of labour, and every microkernel ever designed has chosen a different approach.
If you cannot see the parallel between this and a swarm of containers running on a node, doing the actual workload and only requiring the kernel services because of the way KVM is implemented, and would rather focus on a security mechanism that is a detail of how Linux works, well, it isn't on me to further break it down.
OK, I get what you are saying now, and I agree if we are talking about KVM.
That's not how Kubernetes works by default, so I thought we were talking about container runtimes. I still disagree strongly from both a practical security and theory standpoint if we are talking about container runtimes implemented using namespaces and cgroups.
I don't think containers, namespaces, and the like failing to provide the same benefits as a true microkernel negates the OP's point. They are ways of segmenting userspace in a more finely grained manner, and they do make attacks harder.
Linux security being a shit show and undermining these efforts is kinda beside the point: they are still attempts to provide a runtime closer to what microkernels would naturally provide, in a backwards-compatible way. Indeed, these containers could be turned into fully fledged VMs if there were the resources to make it happen.
I don't really get this argument: "you're saying that one thing, namespaces, isn't implemented in any way resembling a microkernel, but what if we replaced it with another completely different thing, a hypervisor? Then it would be similar!" Yes? Sure?
To me the word "microkernel" expresses how the kernel is structured, not what userspace interface it presents. A microkernel is built by separating kernel services into discrete processes which communicate using a defined IPC mechanism. Ideally, a microkernel offers memory boundary guarantees for each service, either by using hardware memory protection/MMU and running each service as a true "process" with its own address space, or by proving the memory safety of each kernel service using some form of ahead-of-time guarantee.
Of course, doing this lends itself to also segmenting user-space processes by offering a unique set of kernel service processes for each user-space segment (jail, namespace, etc.), but there's no reason this needs to be the case, and it's by and large orthogonal.
I do agree with what I eventually understand the grandparent poster was trying to express, which is that running a bunch of KVMs looks like a microkernel. Because then, you've moved the kernel services into a protected boundary and made them communicate across a common interface (hypercalls).
But that's not how Kubernetes works by default and in the case of containers and namespaces, I think this is entirely false and a dangerous thing to believe from a security standpoint.
> They are ways of segmenting userspace in a more finely grained manner and they do make attacks harder.
From a _kernel_ security standpoint (because we are talking about micro_kernels_ here), I actually think namespaces make attacks much easier and the surface area much greater. Which is basically the entire point I was trying to make: rather than exposing fragile kernel interfaces to exclusively system services with CAP_SYS_ADMIN, you now have provided an ability (unshare) for less-trusted runtimes to touch parts of the host kernel (firewall, filesystem drivers, etc.) which they would normally not have access to, and you have to go back and use fiddly rules engines (seccomp, apparmor, selinux) to fix the problem you created with namespaces.
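To make that concrete, here is a minimal sketch (assuming a Linux kernel with unprivileged user namespaces enabled; error handling trimmed): an ordinary unprivileged process can hand itself a private user-plus-network namespace, and inside it the namespaced "root" can now drive kernel code paths (netfilter, routing, certain filesystem mounts, etc.) that the host would otherwise gate behind CAP_SYS_ADMIN. The CVE-2022-0185 write-up linked above is exactly this pattern.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* As an unprivileged user, create a new user namespace plus a new
     * network namespace owned by it. No setuid helper and no host root
     * needed on kernels where unprivileged user namespaces are enabled. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0) {
        perror("unshare");
        return 1;
    }

    /* Inside the new user namespace this process can map itself to uid 0
     * (via /proc/self/uid_map) and then hold capabilities such as
     * CAP_NET_ADMIN for this namespace, which is exactly what lets it
     * reach in-kernel surface area (netfilter, routing tables, etc.)
     * that the host would normally never expose to it. */
    printf("now inside a private user+net namespace (getuid() reports %d "
           "until uid_map is written)\n", (int)getuid());

    /* A real exploit chain would proceed to poke the newly reachable
     * kernel interfaces; see the CVE-2022-0185 write-up linked above. */
    return 0;
}
```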
To be clear, I think from a big picture standpoint, it's a tradeoff, and I'm nowhere near as anti-container/anti-namespace as it may seem. I just get annoyed when I see people express namespaces as a kernel security boundary when they are basically the exact opposite: they are a kernel security un-boundary, and Linux's monolithic nature makes this a problem.
That's fair!
Imo the microness is not about size but about the architecture of running drivers/services in fault-resistant separation from the kernel
Yes and there's a very good reason: Linux is safe enough, more efficient, and more scalable.
I suppose I should ask the other side too, though I am biased to favor microkernels, and am better read on them, but how so?
"Safe enough" measured by the standards of this upside-down industry...I'll let everyone decide that for themselves.
"More efficient": while monolithic kernels have a higher ceiling, currently there's plenty of OS research papers and industry work demonstrating that more integration with userspace brings performance benefits, such as in the scheduler, storage, or memory management. Microkernels encourage/enforce userspace integration.
"More scalable": I think this has less to do with kernelspace scale and more with how nodes are connected. See Barrelfish[0] for eschewing shared memory in favor of message passing entirely, running separate kernels on each core. Meanwhile Linux has gradually discovered a "big lock" approach is not scalable and reduced coarse-grained locks, added RCU, etc.. So I think we will go moreso towards Barrelfish. But on a single core, for what goes into the kernel, that's covered by everything but scalability.
[0] https://people.inf.ethz.ch/troscoe/pubs/sosp09-barrelfish.pd...
"Safe enough" means "actually unsafe".
Yes, and bank vaults have doors.