We now use the regular "user" physical pages for on-demand page table
allocations. This was by far the biggest source of super physical page
exhaustion, so that bug should be a thing of the past now. :^)
We still have super pages, but they are barely used. They remain useful
for code that requires memory with a low physical address.
Fixes #1000.
After MemoryManager initialization, we now only leave the lowest 1MB
of memory identity-mapped. The very first (null) page is not present.
All other pages are RW but not X. Supervisor only.
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.
The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!
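As a rough sketch of what the low-memory setup amounts to (ensure_pte() and
the PageTableEntry accessors below are illustrative helpers, not the actual
MemoryManager code):
    // Sketch: identity-map the lowest 1MB, leaving page 0 unmapped so that
    // null dereferences fault. All helper names here are assumptions.
    for (size_t addr = PAGE_SIZE; addr < 0x100000 /* 1MB */; addr += PAGE_SIZE) {
        auto& pte = ensure_pte(kernel_page_directory(), VirtualAddress(addr));
        pte.set_physical_page_base(addr); // identity: virtual == physical
        pte.set_present(true);
        pte.set_writable(true);           // RW
        pte.set_user_allowed(false);      // supervisor only
        pte.set_execute_disabled(true);   // not X
    }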
Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)
Fixes #734.
We can now create a cacheable Region: when map() is called on a cacheable
Region, the virtual memory being mapped for it is left with caching enabled
(i.e. the cache-disable bit is not set).
In addition to that, OS components can create a Region that will be
mapped to a specific physical address by using the appropriate helper
method.
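For illustration only, creating such a Region might look like the sketch
below; the helper name and its signature are assumptions, not the real API:
    // Sketch: map a device's MMIO range at a fixed physical address with
    // caching disabled. Helper name/signature are hypothetical.
    auto framebuffer_region = MM.allocate_kernel_region_with_physical_address(
        PhysicalAddress(0xe0000000), framebuffer_size_in_bytes,
        "Framebuffer", Region::Access::Read | Region::Access::Write,
        /* cacheable */ false);
    framebuffer_region->map(kernel_page_directory());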
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.
Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.
There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.
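A minimal sketch of what such a guard boils down to (the real class may also
save and restore the flags register):
    // Sketch of an RAII guard around stac()/clac(); details may differ from
    // the actual SmapDisabler.
    class SmapDisabler {
    public:
        SmapDisabler() { stac(); }   // allow kernel access to user memory
        ~SmapDisabler() { clac(); }  // re-enable SMAP protection on scope exit
    };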
This patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward, all kernel code
should be moved to using these, and all uses of SmapDisabler are to be
considered FIXMEs.
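Conceptually, these helpers wrap each access in a short stac()/clac() window,
roughly like this simplified sketch (the validation helper and the exact
signature are assumptions):
    // Rough sketch of the copy_from_user() pattern.
    bool copy_from_user(void* dest, const void* user_src, size_t size)
    {
        if (!validate_user_read(user_src, size)) // hypothetical validation helper
            return false;
        stac();                       // briefly allow access to user memory
        memcpy(dest, user_src, size);
        clac();                       // immediately restore SMAP protection
        return true;
    }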
Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.
This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.
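In practice that means walking the range one page at a time instead of only
probing its endpoints; a simplified sketch (region_containing() stands in for
the real region lookup):
    // Every page in [base, base + size) must fall inside a process-owned,
    // readable region, not just the first and last byte of the range.
    bool validate_read_range(const Process& process, uintptr_t base, size_t size)
    {
        if (size == 0)
            return true;
        uintptr_t first_page = base & ~(uintptr_t)(PAGE_SIZE - 1);
        uintptr_t last_page = (base + size - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
        for (uintptr_t page = first_page; page <= last_page; page += PAGE_SIZE) {
            auto* region = process.region_containing(page);
            if (!region || !region->is_readable())
                return false;
        }
        return true;
    }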
I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)
Thanks to mozjag for pointing out that this code was still lacking!
Incidentally this also makes backtraces work again.
Fixes #989.
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.
Also, let's only enable SSE and PGE if the CPU advertises support for them.
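Roughly, the checks come down to testing the standard CPUID feature bits in
leaf 1, EDX: PAE is bit 6, PGE is bit 13, and SSE is bit 25. A sketch
(cpuid(), hang() and the enable_*() helpers are illustrative):
    u32 eax, ebx, ecx, edx;
    cpuid(1, eax, ebx, ecx, edx);
    if (!(edx & (1 << 6)))   // no PAE
        hang();              // refuse to boot
    if (edx & (1 << 13))
        enable_pge();        // set CR4.PGE
    if (edx & (1 << 25))
        enable_sse();        // set the CR0/CR4 SSE bits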
At the moment, addresses below 8MB and above 3GB are never accessible
to userspace, so just reject them without even looking at the current
process's memory regions.
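So the fast path is just a bounds check before any region lookup, along the
lines of:
    // Sketch: reject anything outside the window userspace can ever see.
    if (address < 0x800000 /* 8MB */ || address >= 0xc0000000 /* 3GB */)
        return false;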
We were happily allowing syscalls with pointers into kernel-only
regions (virtual address >= 0xc0000000).
This patch fixes that by only considering user regions in the current
process, and also double-checking the Region::is_user_accessible() flag
before approving an access.
Thanks to Fire30 for finding the bug! :^)
Instead of panicking right away when we run out of physical pages,
we now try to find a PurgeableVMObject with some volatile pages in it.
If we find one, we purge that entire object and steal one of its pages.
This makes it possible for the kernel to keep going instead of dying.
Very cool. :^)
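The allocation path now does something along these lines (all names are
approximate):
    // Sketch of the out-of-pages fallback.
    auto page = find_free_user_physical_page();
    if (!page) {
        // Out of pages: purge the first purgeable object that has volatile
        // pages, then retry the allocation using the memory it gave back.
        for (auto& vmobject : all_purgeable_vmobjects()) {
            if (vmobject.purge() > 0) {
                page = find_free_user_physical_page();
                break;
            }
        }
    }
    if (!page)
        PANIC("Out of physical pages"); // only if purging freed nothing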
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.
Now we check for support via CPUID.80000001[20].
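That check reads the extended feature flags, roughly like this (cpuid() and
enable_efer_nxe() are illustrative helpers):
    u32 eax, ebx, ecx, edx;
    cpuid(0x80000000, eax, ebx, ecx, edx);   // max extended leaf
    bool has_nx = false;
    if (eax >= 0x80000001) {
        cpuid(0x80000001, eax, ebx, ecx, edx);
        has_nx = edx & (1 << 20);            // NX feature bit
    }
    if (has_nx)
        enable_efer_nxe();                   // set IA32_EFER.NXE (MSR 0xC0000080, bit 11)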
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.
A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.
This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
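Schematically, each PageDirectory now looks something like this simplified
sketch (the real class also manages the backing physical pages):
    // One PDPT with 4 entries, each pointing at a page directory that
    // governs 1 GB of virtual address space.
    struct PageDirectoryPointerTable {
        u64 raw[4]; // physical addresses of the four page directories
    };
    class PageDirectory {
        // Four PageDirectoryEntry arrays, one per 1 GB slot. Slot 3
        // (0xc0000000 and up) can point at the same physical page in every
        // process, since kernel mappings are shared.
        PageDirectoryEntry* m_entries[4];
    };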
I'm not sure how I managed to misread the location of this bit twice.
But I did! Here is finally the correct value, according to Intel:
"Page Global Enable (bit 7 of CR4)"
Jeez! :^)
Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.
Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
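The bit in question is the x86 write-protect bit, CR0.WP (bit 16 of CR0);
setting it is a one-liner, roughly:
    // Set CR0.WP so that ring 0 writes to read-only pages also fault.
    asm volatile(
        "mov %%cr0, %%eax\n"
        "or $0x10000, %%eax\n"   // bit 16 = WP
        "mov %%eax, %%cr0" ::: "eax");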
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
The fault was happening when retrieving a current backtrace for the
SystemServer process.
To generate a backtrace, we go into the paging scope of the process,
meaning we temporarily switch to using its page directory as our own.
Because kernel VM is allocated on demand, it's possible for a process's
mappings above the 3GB mark to be out-of-date. Normally this just gets
fixed up transparently by the page fault handler (which simply copies
the PDE from the canonical MM.kernel_page_directory() into the current
process.)
However, if the current kernel *stack* is in a piece of memory that
the backtraced process lacks up-to-date PDEs for, we still get a page
fault, but are unable to handle it, since the CPU wants to push to the
stack as part of calling the page fault handler. So we're screwed and
it's a triple-fault.
Fix this by always updating the kernel VM mappings before switching
into a paging scope. In practical terms, this is a 1KB memcpy() that
happens when generating a backtrace, or doing exec().
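In other words, entering a paging scope now starts with something like this
sketch (sizes assume the non-PAE layout, where the kernel half is the top
quarter of a 1024-entry page directory):
    // Bring the target page directory's kernel-half PDEs up to date with the
    // canonical kernel page directory: 256 entries x 4 bytes = 1KB, covering
    // 0xc0000000..0xffffffff.
    void update_kernel_mappings(PageDirectory& pd)
    {
        memcpy(&pd.entries()[768],
               &MM.kernel_page_directory().entries()[768],
               256 * sizeof(PageDirectoryEntry));
    }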
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however, is still
identity-mapped to provide an easy way for the kernel to get
physical pages for things such as DMA. These could later be
mapped to the higher address too, but I'm not sure how to go
about doing that elegantly without a lot of address subtractions.
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
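From userspace, allocating a custom stack (e.g. for a new thread) now looks
roughly like:
    #include <stdio.h>
    #include <sys/mman.h>
    // Sketch: allocate a 64 KiB stack. Only MAP_STACK regions may hold the
    // stack pointer; anything else gets the process killed with SIGSTKFLT.
    void* stack = mmap(nullptr, 64 * 1024, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        perror("mmap");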
Now the userspace page allocator will search through physical regions
and stop the search as soon as it finds an available page.
Also remove an "address of" sign since we don't need it when
counting the size of physical regions.
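Sketch of that search loop (the PhysicalRegion interface shown here is
illustrative):
    // Walk the user physical regions until one hands us a free page.
    for (auto& region : m_user_physical_regions) {
        if (auto page = region.take_free_page())
            return page;   // stop at the first available page
    }
    return nullptr;        // no free pages anywhere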
Move the kernel image to the 1 MB physical mark. This prevents it from
colliding with stuff like the VGA memory. This was causing us to end
up with the BIOS screen contents sneaking into kernel memory sometimes.
This patch also bumps the kmalloc heap size from 1 MB to 3 MB. It's not
the perfect permanent solution (obviously) but it should get the OOM
monkey off our backs for a while.
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
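The division of labor now looks roughly like this simplified sketch:
    // MemoryManager finds the region; the region does the actual work.
    PageFaultResponse MemoryManager::handle_page_fault(const PageFault& fault)
    {
        auto* region = region_from_vaddr(fault.vaddr()); // illustrative lookup
        if (!region)
            return PageFaultResponse::ShouldCrash;
        return region->handle_fault(fault); // zero-fill, CoW or inode-backed
    }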
Since the kernel page tables are shared between all processes, there's
no need to (implicitly) flush the TLB for them on every context switch.
Setting the G bit on kernel page tables allows the CPU to keep the
translation caches around.
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:
Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.
Since a Region is merely a "window" onto a VMObject, it can both begin
and end at a distance from the VMObject's boundaries.
Therefore, we should always be computing indices into a VMObject's
physical page array by adding the Region's "first_page_index()".
There was a whole bunch of code that forgot to do that. This fixes
many wrong behaviors for Regions that start part-way into a VMObject.
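Concretely, the correct index computation is the one in this sketch:
    // Translate a page index within the Region into an index into the
    // underlying VMObject's physical page array.
    auto& physical_page_slot(Region& region, size_t page_index_in_region)
    {
        // Wrong: region.vmobject().physical_pages()[page_index_in_region]
        // Right: offset by where the Region's window begins in the VMObject.
        return region.vmobject().physical_pages()[region.first_page_index() + page_index_in_region];
    }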