If the data passed to sys$write() is backed by a not-yet-paged-in inode
mapping, we could end up in a situation where we get a page fault when
trying to copy data from userspace.
If that page fault handler tried reading from an inode that someone else
had locked while waiting for the disk cache lock, we'd deadlock.
This patch fixes the issue by copying the userspace data into a local
buffer before acquiring the disk cache lock. This is not ideal since it
incurs an extra copy, and I'm sure we can think of a better solution
eventually.
This was a frequent cause of startup deadlocks on x86_64 for me. :^)
Previously, the heap expansion logic could end up calling kmalloc
recursively, which was quite messy and hard to reason about.
This patch redesigns heap expansion so that it's kmalloc-free:
- We make a single large virtual range allocation at startup
- When expanding, we bump allocate VM from that region
- When expanding, we populate page tables directly ourselves,
instead of going via MemoryManager.
This makes heap expansion a great deal simpler. However, do note that it
introduces two new flaws that we'll need to deal with eventually:
- The single virtual range allocation is limited to 64 MiB and once
exhausted, kmalloc() will fail. (Actually, it will PANIC for now..)
- The kmalloc heap can no longer shrink once expanded. Subheaps stay
in place once constructed.
This was used to return a pre-locked UDPSocket in one place, but there
was really no need for that mechanism in the first place since the
caller ends up locking the socket anyway.
The function to protect ksyms after initialization, is only used during
boot of the system, so it can be UNMAP_AFTER_INIT as well.
This requires we switch the order of the init sequence, so we now call
`MM.protect_ksyms_after_init()` before `MM.unmap_text_after_init()`.
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.
Fixes#11297.
Most other syscalls pass address arguments as `void*` instead of
`uintptr_t`, so let's do that here too. Besides improving consistency,
this commit makes `strace` correctly pretty-print these arguments in
hex.
This feature was introduced in version 4.17 of the Linux kernel, and
while it's not specified by POSIX, I think it will be a nice addition to
our system.
MAP_FIXED_NOREPLACE provides a less error-prone alternative to
MAP_FIXED: while regular fixed mappings would cause any intersecting
ranges to be unmapped, MAP_FIXED_NOREPLACE returns EEXIST instead. This
ensures that we don't corrupt our process's address space if something
is already at the requested address.
Note that the more portable way to do this is to use regular
MAP_ANONYMOUS, and check afterwards whether the returned address matches
what we wanted. This, however, has a large performance impact on
programs like Wine which try to reserve large portions of the address
space at once, as the non-matching addresses have to be unmapped
separately.
This error only ever gets propagated to the userspace if
MAP_FIXED_NOREPLACE is requested, as MAP_FIXED unmaps intersecting
ranges beforehand, and non-fixed mmap() calls will just fall back to
allocating anywhere.
Linux specifies MAP_FIXED_NOREPLACE to return EEXIST when it can't
allocate, we now match that behavior.
Previously we were assigning to Process::m_space before actually
entering the new address space (assigning it to CR3.)
If a thread was preempted by the scheduler while destroying the old
address space, we'd then attempt to resume the thread with CR3 pointing
at a partially destroyed address space.
We could then crash immediately in write_cr3(), right after assigning
the new value to CR3. I am hopeful that this may have been the bug
haunting our CI for months. :^)
Instead, wait until we transition back to userspace. This stops
userspace from being able to suspend a thread indefinitely while it's
running in kernelspace (potentially holding some blocking mutex.)
Add the `posix_madvise(..)` LibC implementation that just forwards
to the normal `madvise(..)` implementation.
Also define a few POSIX_MADV_DONTNEED and POSIX_MADV_NORMAL as they
are part of the POSIX API for `posix_madvise(..)`.
This is needed by the `fio` port.
These 2 members are required by POSIX and are also used by some ports.
Zero is a valid value for both of these, so no further work to support
them is required.
The Prekernel's memory is only accessed until MemoryManager has been
initialized. Keeping them around afterwards is both unnecessary and bad,
as it prevents the userland from using the 0x100000-0x155000 virtual
address range.
Co-authored-by: Idan Horowitz <idan.horowitz@gmail.com>
As PROT_NONE regions can't be accessed by processes, and their only real
use is for reserving ranges of virtual memory, there's no point in
including them in coredumps.
Since this range is mapped in already in the kernel page directory, we
can initialize it before jumping into the first kernel process which
lets us avoid mapping in the range into init_stage2's address space.
This brings us half-way to removing the shared bottom 2 MiB mapping in
every process, leaving only the Prekernel.
The sa_family field in SIOCGIFHWADDR specifies the underlying network
interface's device type, this is hardcoded to generic "Ethernet" right
now, as we don't have a nice way to query it.
In order to reduce our reliance on __builtin_{ffs, clz, ctz, popcount},
this commit removes all calls to these functions and replaces them with
the equivalent functions in AK/BuiltinWrappers.h.
Not much to say here, this is an implementation of this call that
accesses the actual limit constant that's used by the VirtualFileSystem
class.
As a side note, this is required for my eventual Qt port.
As pointed out by BertalanD on Discord, POSIX specifies that
_SC_SYMLOOP_MAX (implemented in the following commit) always needs to be
equal or more than _POSIX_SYMLOOP_MAX (8, defined in
LibC/bits/posix1_lim.h), hence I've increased it to that value to
comply with the standard.
The move to header is required for the following commit - to make this
constant accessible outside of the VFS class, namely in sysconf.
For setreuid and setresuid syscalls, -1 means to set the current
uid/euid/gid/egid value, to be more convenient for programming.
However, for other syscalls where we pass only one argument, there's no
justification to specify -1.
This behavior is identical to how Linux handles the value -1, and is
influenced by the fact that the manual pages for the group of one
argument syscalls that handle ID operations is ambiguous about this
topic.
Unsurprisingly, the /proc/PID/stacks/TID stack walk had the same
arbitrary memory read problem as the perf event stack walk.
It would be nice if the kernel had a single stack walk implementation,
but that's outside the scope of this commit.
Since perfcore files can be generated during process finalization,
we can't just allow them to contain sensitive kernel information
if they're gonna be owned by the process's own UID+GID.
So instead, perfcores are now owned by 0:0. This is not the most
ergonomic solution, but I'm not sure what we could do to make it nicer.
We'll have to think more about that. In the meantime, this patches up
a kernel info leak. :^)
When walking the stack to generate a perf_event sample, we now check
if a userspace stack frame points back into kernel memory.
It was possible to use this as an arbitrary kernel memory read. :^)
We can leave the .ksyms section mapped-but-read-only and then have the
symbols index simply point into it.
Note that we manually insert null-terminators into the symbols section
while parsing it.
This gets rid of ~950 KiB of kmalloc_eternal() at startup. :^)