Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
Increase type-safety moving the MemoryManager APIs which take a
Region::Access to actually use that type instead of a `u8`.
Eventually the actually m_access can be moved there as well, but
I hit some weird bug where it wasn't using the correct operators
in `set_access_bit(..)` even though it's declared (and tested).
Something to fix-up later.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
There's no real system here, I just added it to various functions
that I don't believe we ever want to call after initialization
has finished.
With these changes, we're able to unmap 60 KiB of kernel text
after init. :^)
You can now declare functions with UNMAP_AFTER_INIT and they'll get
segregated into a separate kernel section that gets completely
unmapped at the end of initialization.
This can be used for anything we don't need to call once we've booted
into userspace.
There are two nice things about this mechanism:
- It allows us to free up entire pages of memory for other use.
(Note that this patch does not actually make use of the freed
pages yet, but in the future we totally could!)
- It allows us to get rid of obviously dangerous gadgets like
write-to-CR0 and write-to-CR4 which are very useful for an attacker
trying to disable SMAP/SMEP/etc.
I've also made sure to include a helpful panic message in case you
hit a kernel crash because of this protection. :^)
You can now use the READONLY_AFTER_INIT macro when declaring a variable
and we will put it in a special ".ro_after_init" section in the kernel.
Data in that section remains writable during the boot and init process,
and is then marked read-only just before launching the SystemServer.
This is based on an idea from the Linux kernel. :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.
This patch adds Space, a class representing a process's address space.
- Each Process has a Space.
- The Space owns the PageDirectory and all Regions in the Process.
This allows us to reorganize sys$execve() so that it constructs and
populates a new Space fully before committing to it.
Previously, we would construct the new address space while still
running in the old one, and encountering an error meant we had to do
tedious and error-prone rollback.
Those problems are now gone, replaced by what's hopefully a set of much
smaller problems and missing cleanups. :^)
We need to make sure other processors can grab the MM lock while we
wait, so release it when we might block. Reading the page from
disk may also block, so release it during that time as well.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but also slightly more light weight.
This was exploitable since the shared flag determines whether inode
permission checks are applied in sys$mprotect().
The bug was pretty hard to spot due to default arguments being used
instead. This patch removes the default arguments to make explicit
at each call site what's being done.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
Problem:
- Many constructors are defined as `{}` rather than using the ` =
default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
If a TLB flush request is broadcast to other processors and the addresses
to flush are user mode addresses, we can ignore such a request on the
target processor if the page directory currently in use doesn't match
the addresses to be flushed. We still need to broadcast to all processors
in that case because the other processors may switch to that same page
directory at any time.
By designating a committed page pool we can guarantee to have physical
pages available for lazy allocation in mappings. However, when forking
we will overcommit. The assumption is that worst-case it's better for
the fork to die due to insufficient physical memory on COW access than
the parent that created the region. If a fork wants to ensure that all
memory is available (trigger a commit) then it can use madvise.
This also means that fork now can gracefully fail if we don't have
enough physical pages available.
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
Notice that we ensured that the size is a multiple of the page size and
that there is at least one page there, otherwise, this change would be
invalid.
We create an empty region and then expand it:
// First iteration.
m_user_physical_regions.append(PhysicalRegion::create(addr, addr));
// Following iterations.
region->expand(region->lower(), addr);
So if the memory region only has one page, we would end up with an empty
region. Thus we need to do one more iteration.
Since the CPU already does almost all necessary validation steps
for us, we don't really need to attempt to do this. Doing it
ourselves doesn't really work very reliably, because we'd have to
account for other processors modifying virtual memory, and we'd
have to account for e.g. pages not being able to be allocated
due to insufficient resources.
So change the copy_to/from_user (and associated helper functions)
to use the new safe_memcpy, which will return whether it succeeded
or not. The only manual validation step needed (which the CPU
can't perform for us) is making sure the pointers provided by user
mode aren't pointing to kernel mappings.
To make it easier to read/write from/to either kernel or user mode
data add the UserOrKernelBuffer helper class, which will internally
either use copy_from/to_user or directly memcpy, or pass the data
through directly using a temporary buffer on the stack.
Last but not least we need to keep syscall params trivial as we
need to copy them from/to user mode using copy_from/to_user.
Sometimes a physical underlying page may be there, but we may be
unable to allocate a page table that may be needed to map it. Bubble
up such mapping errors so that they can be handled more appropriately.
If allocating a page table triggers purging memory, we need to call
quickmap_pd again to make sure the underlying physical page is
remapped to the correct one. This is needed because purging itself
may trigger calls to ensure_pte as well.
Fixes#3370
Add an ExpandableHeap and switch kmalloc to use it, which allows
for the kmalloc heap to grow as needed.
In order to make heap expansion to work, we keep around a 1 MiB backup
memory region, because creating a region would require space in the
same heap. This means, the heap will grow as soon as the reported
utilization is less than 1 MiB. It will also return memory if an entire
subheap is no longer needed, although that is rarely possible.