While setting up the main thread stack for a new process, we'd incur
some zero-fill page faults. This was to be expected, since we allocate
a huge stack but lazily populate it with physical pages.
The problem is that page fault handlers may enable interrupts in order
to grab a VMObject lock (or to page in from an inode.)
During exec(), a process is reorganizing itself and will be in a very
unrunnable state if the scheduler should interrupt it and then later
ask it to run again. Which is exactly what happens if the process gets
pre-empted while the new stack's zero-fill page fault grabs the lock.
This patch fixes the issue by creating new main thread stacks before
disabling interrupts and going into the critical part of exec().
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
This patch adds a single "kernel info page" that is mappable read-only
by any process and contains the current time of day.
This is then used to implement a version of gettimeofday() that doesn't
have to make a syscall.
To protect against race condition issues, the info page also has a
serial number which is incremented whenever the kernel updates the
contents of the page. Make sure to verify that the serial number is the
same before and after reading the information you want from the page.
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)
Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.
This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
Currently only Ext2FS and TmpFS supports InodeWatchers. We now fail
with ENOTSUPP if watch_file() is called on e.g ProcFS.
This fixes an issue with FileManager chewing up all the CPU when /proc
was opened. Watchers don't keep the watched Inode open, and when they
close, the watcher FD will EOF.
Since nothing else kept /proc open in FileManager, the watchers created
for it would EOF immediately, causing a refresh over and over.
Fixes#879.
The kernel now supports basic profiling of all the threads in a process
by calling profiling_enable(pid_t). You finish the profiling by calling
profiling_disable(pid_t).
This all works by recording thread stacks when the timer interrupt
fires and the current thread is in a process being profiled.
Note that symbolication is deferred until profiling_disable() to avoid
adding more noise than necessary to the profile.
A simple "/bin/profile" command is included here that can be used to
start/stop profiling like so:
$ profile 10 on
... wait ...
$ profile 10 off
After a profile has been recorded, it can be fetched in /proc/profile
There are various limits (or "bugs") on this mechanism at the moment:
- Only one process can be profiled at a time.
- We allocate 8MB for the samples, if you use more space, things will
not work, and probably break a bit.
- Things will probably fall apart if the profiled process dies during
profiling, or while extracing /proc/profile
This patch makes SharedBuffer use a PurgeableVMObject as its underlying
memory object.
A new syscall is added to control the volatile flag of a SharedBuffer.
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
Using int was a mistake. This patch changes String, StringImpl,
StringView and StringBuilder to use size_t instead of int for lengths.
Obviously a lot of code needs to change as a result of this.
The main thread of each kernel/user process will take the name of
the process. Extra threads will get a fancy new name
"ProcessName[<tid>]".
Thread backtraces now list the thread name in addtion to tid.
Add the thread name to /proc/all (should it get its own proc
file?).
Add two new syscalls, set_thread_name and get_thread_name.
This patch makes it possible to make memory regions non-readable.
This is enforced using the "present" bit in the page tables.
A process that hits an not-present page fault in a non-readable
region will be crashed.
This patch introduces code generation for the WindowServer IPC with
its clients. The client/server endpoints are defined by the two .ipc
files in Servers/WindowServer/: WindowServer.ipc and WindowClient.ipc
It now becomes significantly easier to add features and capabilities
to WindowServer since you don't have to know nearly as much about all
the intricate paths that IPC messages take between LibGUI and WSWindow.
The new system also uses significantly less IPC bandwidth since we're
now doing packed serialization instead of passing fixed-sized structs
of ~600 bytes for each message.
Some repaint coalescing optimizations are lost in this conversion and
we'll need to look at how to implement those in the new world.
The old CoreIPC::Client::Connection and CoreIPC::Server::Connection
classes are removed by this patch and replaced by use of ConnectionNG,
which will be renamed eventually.
Goodbye, old WindowServer IPC. You served us well :^)
Kernel modules can now be unloaded via a syscall. They get a chance to
run some code of course. Before deallocating them, we call their
"module_fini" symbol.
It's now possible to load a .o file into the kernel via a syscall.
The kernel will perform all the necessary ELF relocations, and then
call the "module_init" symbol in the loaded module.
Then only allow regions with that bit to be manipulated via munmap()
and mprotect(). This prevents messing with non-mmap()ed regions in
a process's address space (stacks, shared buffers, ...)
Remove explicit checking for pending signals from writing code paths,
since this is handled automatically when blocking, and should not
happen if the write() call is "short", i.e. doesn't block. All the
other syscalls already work like this.
Fixes https://github.com/SerenityOS/serenity/issues/797
Add an initial implementation of pthread attributes for:
* detach state (joinable, detached)
* schedule params (just priority)
* guard page size (as skeleton) (requires kernel support maybe?)
* stack size and user-provided stack location (4 or 8 MB only, must be aligned)
Add some tests too, to the thread test program.
Also, LibC: Move pthread declarations to sys/types.h, where they belong.
This can be implemented entirely in userspace by calling tcgetattr().
To avoid screwing up the syscall indexes, this patch also adds a
mechanism for removing a syscall without shifting the index of other
syscalls.
Note that ports will still have to be rebuilt after this change,
as their LibC code will try to make the isatty() syscall on startup.
Have pthread_create() allocate a stack and passing it to the kernel
instead of this work happening in the kernel. The more of this we can
do in userspace, the better.
This patch also unexposes the raw create_thread() and exit_thread()
syscalls since they are now only used by LibPthread anyway.
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
It's now possible to block until another thread in the same process has
exited. We can also retrieve its exit value, which is whatever value it
passed to pthread_exit(). :^)
While executing in the kernel, a thread can acquire various resources
that need cleanup, such as locks and references to RefCounted objects.
This cleanup normally happens on the exit path, such as in destructors
for various RAII guards. But we weren't calling those exit paths when
killing threads that have been executing in the kernel, such as threads
blocked on reading or sleeping, thus causing leaks.
This commit changes how killing threads works. Now, instead of killing
a thread directly, one is supposed to call thread->set_should_die(),
which will unblock it and make it unwind the stack if it is blocked
in the kernel. Then, just before returning to the userspace, the thread
will automatically die.
This patch adds pthread_create() and pthread_exit(), which currently
simply wrap our existing create_thread() and exit_thread() syscalls.
LibThread is also ported to using LibPthread.
Some syscalls have to pass parameters through a struct, since we can
only fit 3 parameters with our calling convention.
This patch makes use of C++ structured binding to clean up the places
where we expand those parameters structs into local variables.
POSIX's openat() is very similar to open(), except you also provide a
file descriptor referring to a directory from which relative paths
should be resolved.
Passing it the magical fd number AT_FDCWD means "resolve from current
directory" (which is indeed also what open() normally does.)
This fixes libarchive's bsdtar, since it was trying to do something
extremely wrong in the absence of openat() support. The issue has
recently been fixed upstream in libarchive:
https://github.com/libarchive/libarchive/issues/1239
However, we should have openat() support anyway, so I went ahead and
implemented it. :^)
Fixes#748.
Don't wait for someone to wait() on a dead process before releasing its
TTY object. This fixes the child process death detection used by the
Terminal application, which relies on getting an EOF on the master PTY
in order to know it's time to wait() on the child process. :^)
Instead of the big ugly switch statement, build a lookup table using
the syscall enumeration macro.
This greatly simplifies the syscall implementation. :^)
Scheduling priority is now set at the thread level instead of at the
process level.
This is a step towards allowing processes to set different priorities
for threads. There's no userspace API for that yet, since only the main
thread's priority is affected by sched_setparam().
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:
Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.