Process teardown is divided into two main stages: finalize and reap.
Finalization happens in the "Finalizer" kernel and runs with interrupts
enabled, allowing destructors to take locks, etc.
Reaping happens either in sys$waitid() or in the scheduler for orphans.
The more work we can do in finalization, the better, since it's fully
pre-emptible and reduces the amount of time the system runs without
interrupts enabled.
Calling shutdown prevents further reads and/or writes on a socket.
We should do a few more things based on the type of socket, but this
initial implementation just puts the basic mechanism in place.
Work towards #428.
If there's not enough space in the output buffer for the whole sockaddr
we now simply truncate the address instead of returning EINVAL.
This patch also makes getpeername() actually return the peer address
rather than the local address.. :^)
According to POSIX, waitid() should fill si_signo and si_pid members
with zeroes if there are no children that have already changed their
state by the time of the call. Let's just fill the whole structure
with zeroes to avoid leaking kernel memory.
sys$waitid() takes an explicit description of whether it's waiting for a single
process with the given PID, all of the children, a group, etc., and returns its
info as a siginfo_t.
It also doesn't automatically imply WEXITED, which clears up the confusion in
the kernel.
This patch introduces sys$perf_event() with two event types:
- PERF_EVENT_MALLOC
- PERF_EVENT_FREE
After the first call to sys$perf_event(), a process will begin keeping
these events in a buffer. When the process dies, that buffer will be
written out to "perfcore" in the current directory unless that filename
is already taken.
This is probably not the best way to do this, but it's a start and will
make it possible to start doing memory allocation profiling. :^)
Before putting itself back on the wait queue, the finalizer task will
now check if there's more work to do, and if so, do it first. :^)
This patch also puts a bunch of process/thread debug logging behind
PROCESS_DEBUG and THREAD_DEBUG since it was unbearable to debug this
stuff with all the spam.
If the waitee process is dead, we don't need to inspect the thread.
This fixes an issue with sys$waitpid() failing before reap() since
dead processes will have no remaining threads alive.
There was a race window in a bunch of syscalls between calling
Thread::from_tid() and checking if the found thread was in the same
process as the calling thread.
If the found thread object was destroyed at that point, there was a
use-after-free that could be exploited by filling the kernel heap with
something that looked like a thread object.
Memory validation is used to verify that user syscalls are allowed to
access a given memory range. Ring 0 threads never make syscalls, and
so will never end up in validation anyway.
The reason we were allowing kmalloc memory accesses is because kernel
thread stacks used to be allocated in kmalloc memory. Since that's no
longer the case, we can stop making exceptions for kmalloc in the
validation code.
Move timeout management to the ReadBlocker and WriteBlocker classes.
Also get rid of the specialized ReceiveBlocker since it no longer does
anything that ReadBlocker can't do.
Vector::ensure_capacity() makes sure the underlying vector buffer can
contain all the data, but it doesn't update the Vector::size().
As a result, writev() would simply collect all the buffers to write,
and then do nothing.
Move all the fork-specific inheritance logic to sys$fork(), and all the
stuff for setting up stdio for non-fork ring 3 processes moves to
Process::create_user_process().
Also: we were setting up the PGID, SID and umask twice. Also the code
for copying the open file descriptors was overly complicated. Now it's
just a simple Vector copy assignment. :^)
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.
This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
Sergey suggested that having a non-zero O_RDONLY would make some things
less confusing, and it seems like he's right about that.
We can now easily check read/write permissions separately instead of
dancing around with the bits.
This patch also fixes unveil() validation for O_RDWR which previously
forgot to check for "r" permission.