Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.
Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)
Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.
This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
This patch makes it possible to make memory regions non-readable.
This is enforced using the "present" bit in the page tables.
A process that hits an not-present page fault in a non-readable
region will be crashed.
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:
Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.
Since a Region is merely a "window" onto a VMObject, it can both begin
and end at a distance from the VMObject's boundaries.
Therefore, we should always be computing indices into a VMObject's
physical page array by adding the Region's "first_page_index()".
There was a whole bunch of code that forgot to do that. This fixes
many wrong behaviors for Regions that start part-way into a VMObject.
Instead of allocating and populating a Copy-on-Write bitmap for each
Region up front, wait until we actually clone the Region for sharing
with another process.
In most cases, we never need any CoW bits and we save ourselves a lot
of kmalloc() memory and time.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
We were doing this for the initial kernel-spawned userspace process(es)
to work around instability in the page fault handler. Now that the page
fault handler is more robust, we can stop worrying about this.
Specifically, the page fault handler was previous not able to handle
getting a page fault in anything but the currently executing task's
page directory.
InodeVMObject is a VMObject with an underlying Inode in the filesystem.
AnonymousVMObject has no Inode.
I'm happy that InodeVMObject::inode() can now return Inode& instead of
VMObject::inode() return Inode*. :^)
The VMObject name was always either the owning region's name, or the
absolute path of the underlying inode.
We can reconstitute this information if wanted, no need to keep copies
of these strings around.
Region now has is_user_accessible(), which informs the memory manager how
to map these pages. Previously, we were just passing a "bool user_allowed"
to various functions and I'm not at all sure that any of that was correct.
All the Region constructors are now hidden, and you must go through one of
these helpers to construct a region:
- Region::create_user_accessible(...)
- Region::create_kernel_only(...)
That ensures that we don't accidentally create a Region without specifying
user accessibility. :^)
This is obviously more readable. If we ever run into a situation where
ref count churn is actually causing trouble in the future, we can deal with
it then. For now, let's keep it simple. :^)
This significantly reduces the pressure on the kernel heap when
allocating a lot of pages.
Previously at about 250MB allocated, the free page list would outgrow
the kernel's heap. Given that there is no longer a page list, this does
not happen.
The next barrier will be the kernel memory used by the page records for
in-use memory. This kicks in at about 1GB.
String&& is just not very practical. Also return const String& when the
returned string is a member variable. The call site is free to make a copy
if he wants, but otherwise we can avoid the retain count churn.