This is similar to 28e1da344d
and 4dd4dd2f3c.
The crux is that wait verifies that the outvalue (siginfo* infop)
is writable *before* waiting, and writes to it *after* waiting.
In the meantime, a concurrent thread can make the output region
unwritable, e.g. by deallocating it.
This is similar to 28e1da344d
and 4dd4dd2f3c.
The crux is that select verifies that the filedescriptor sets
are writable *before* blocking, and writes to them *after* blocking.
In the meantime, a concurrent thread can make the output buffer
unwritable, e.g. by deallocating it.
This is a complete fix of clock_nanosleep, because the thread holds the
process lock again when returning from sleep()/sleep_until().
Therefore, no further concurrent invalidation can occur.
Also, duplicate data in dbg() and klog() calls were removed.
In addition, leakage of virtual address to kernel log is prevented.
This is done by replacing kprintf() calls to dbg() calls with the
leaked data instead.
Also, other kprintf() calls were replaced with klog().
This was only used by the mechanism for mapping executables into each
process's own address space. Now that we remap executables on demand
when needed for symbolication, this can go away.
Previously we would map the entire executable of a program in its own
address space (but make it unavailable to userspace code.)
This patch removes that and changes the symbolication code to remap
the executable on demand (and into the kernel's own address space
instead of the process address space.)
This opens up a couple of further simplifications that will follow.
I had the wrong idea about this. Thanks to Sergey for pointing it out!
Here's what he says (reproduced for posterity):
> Private mappings protect the underlying file from the changes made by
> you, not the other way around. To quote POSIX, "If MAP_PRIVATE is
> specified, modifications to the mapped data by the calling process
> shall be visible only to the calling process and shall not change the
> underlying object. It is unspecified whether modifications to the
> underlying object done after the MAP_PRIVATE mapping is established
> are visible through the MAP_PRIVATE mapping." In practice that means
> that the pages that were already paged in don't get updated when the
> underlying file changes, and the pages that weren't paged in yet will
> load the latest data at that moment.
> The only thing MAP_FILE | MAP_PRIVATE is really useful for is mapping
> a library and performing relocations; it's definitely useless (and
> actively harmful for the system memory usage) if you only read from
> the file.
This effectively reverts e2697c2ddd.
This will be a memory usage pessimization until we actually implement
CoW sharing of the memory pages with SharedInodeVMObject.
However, it's a huge architectural improvement, so let's take it and
improve on this incrementally.
fork() should still be neutral, since all private mappings are CoW'ed.
It's now up to the caller to provide a VMObject when constructing a new
Region object. This will make it easier to handle things going wrong,
like allocation failures, etc.
If we wrote anything we should just inform userspace that we did,
and not worry about the error code. Userspace can call us again if
it wants, and we'll give them the error then.
We don't have to log the process name/PID/TID, dbg() automatically adds
that as a prefix to every line.
Also we don't have to do .characters() on Strings passed to dbg() :^)
You can now mmap a file as private and writable, and the changes you
make will only be visible to you.
This works because internally a MAP_PRIVATE region is backed by a
unique PrivateInodeVMObject instead of using the globally shared
SharedInodeVMObject like we always did before. :^)
Fixes#1045.
We now have PrivateInodeVMObject and SharedInodeVMObject, corresponding
to MAP_PRIVATE and MAP_SHARED respectively.
Note that PrivateInodeVMObject is not used yet.
Add an extra out-parameter to shbuf_get() that receives the size of the
shared buffer. That way we don't need to make a separate syscall to
get the size, which we always did immediately after.
This feels a lot more consistent and Unixy:
create_shared_buffer() => shbuf_create()
share_buffer_with() => shbuf_allow_pid()
share_buffer_globally() => shbuf_allow_all()
get_shared_buffer() => shbuf_get()
release_shared_buffer() => shbuf_release()
seal_shared_buffer() => shbuf_seal()
get_shared_buffer_size() => shbuf_get_size()
Also, "shared_buffer_id" is shortened to "shbuf_id" all around.
set_interrupted_by_death was never called whenever a thread that had
a joiner died, so the joiner remained with the joinee pointer there,
resulting in an assertion fail in JoinBlocker: m_joinee pointed to
a freed task, filled with garbage.
Thread::current->m_joinee may not be valid after the unblock
Properly return the joinee exit value to the joiner thread.
On 32-bit platforms, INT32_MIN == -INT32_MIN, so we can't expect this
to always work:
if (pid < 0)
positive_pid = -pid; // may still be negative!
This happens because the -INT32_MIN expression becomes a long and is
then truncated back to an int.
Fixes#1312.
This allows a process wich has more than 1 thread to call exec, even
from a thread. This kills all the other threads, but it won't wait for
them to finish, just makes sure that they are not in a running/runable
state.
In the case where a thread does exec, the new program PID will be the
thread TID, to keep the PID == TID in the new process.
This introduces a new function inside the Process class,
kill_threads_except_self which is called on exit() too (exit with
multiple threads wasn't properly working either).
Inside the Lock class, there is the need for a new function,
clear_waiters, which removes all the waiters from the
Process::big_lock. This is needed since after a exit/exec, there should
be no other threads waiting for this lock, the threads should be simply
killed. Only queued threads should wait for this lock at this point,
since blocked threads are handled in set_should_die.
Each process has a 1-level lookup cache for fast repeated lookups of
the same VM region (which tends to be the majority of lookups.)
The cache is used by the following syscalls: munmap, madvise, mprotect
and set_mmap_name.
After a succesful exec(), there could be a stale Region* in the lookup
cache, and the new executable was able to manipulate it using a number
of use-after-free code paths.
When committing to a new executable, disown any shared buffers that the
process was previously co-owning.
Otherwise accessing the same shared buffer ID from the new program
would cause the kernel to find a cached (and stale!) reference to the
previous program's VM region corresponding to that shared buffer,
leading to a Region* use-after-free.
Fixes#1270.
Since we're gonna throw away these stacks at the end of exec anyway,
we might as well disable profiling before starting to mess with the
process page tables. One less weird situation to worry about in the
sampling code.