It looks like they're considered a bad idea, so let's not add
them before we need them. I figured it's good to have them in
git history if we ever do need them though, hence the add/remove
dance.
Add seteuid()/setegid() under _POSIX_SAVED_IDS semantics,
which also requires adding suid and sgid to Process, and
changing setuid()/setgid() to honor these semantics.
The exact semantics aren't specified by POSIX and differ
between different Unix implementations. This patch makes
serenity follow FreeBSD. The 2002 USENIX paper
"Setuid Demystified" explains the differences well.
In addition to seteuid() and setegid() this also adds
setreuid()/setregid() and setresuid()/setresgid(), and
the accessors getresuid()/getresgid().
Also reorder uid/euid functions so that they are the
same order everywhere (namely, the order that
geteuid()/getuid() already have).
That's not how readlink() is supposed to work: it should copy as many bytes
as fit into the buffer, and return the number of bytes copied. So do that,
but add a twist: make sys$readlink() actually return the whole size, not
the number of bytes copied. We fix up this return value in userspace, to make
LibC's readlink() behave as expected, but this will also allow other code
to allocate a buffer of just the right size.
Also, avoid an extra copy of the link target.
Since we're not keeping compatibility with OpenBSD about what promises are
required for which syscalls, tighten things up so that they make more sense.
This adds support for MS_RDONLY, a mount flag that tells the kernel to disallow
any attempts to write to the newly mounted filesystem. As this flag is
per-mount, and different mounts of the same filesystems (such as in case of bind
mounts) can have different mutability settings, you have to go though a custody
to find out if the filesystem is mounted read-only, instead of just asking the
filesystem itself whether it's inherently read-only.
This also adds a lot of checks we were previously missing; and moves some of
them to happen after more specific checks (such as regular permission checks).
One outstanding hole in this system is sys$mprotect(PROT_WRITE), as there's no
way we can know if the original file description this region has been mounted
from had been opened through a readonly mount point. Currently, we always allow
such sys$mprotect() calls to succeed, which effectively allows anyone to
circumvent the effect of MS_RDONLY. We should solve this one way or another.
If we fail to exec() the target executable, don't leak the thread (this actually
triggers an assertion when destructing the process), and print an error message.
When mounting Ext2FS, we don't care if the file has a custody (it doesn't if
it's a device, which is a common case). When doing a bind-mount, we do need a
custody; if none is provided, let's return an error instead of crashing.
And move canonicalized_path() to a static method on LexicalPath.
This is to make it clear that FileSystemPath/canonicalized_path() only
perform *lexical* canonicalization.
You now have to pledge "sigaction" to change signal handlers/dispositions. This
is to prevent malicious code from messing with assertions (and segmentation
faults), which are normally expected to instantly terminate the process but can
do other things if you change signal disposition for them.
The is_error() check on the KResultOr returned when reading the link
target had a stray ! operator which causes link resolution to crash the
kernel with an assertion error.
Allow file system implementation to return meaningful error codes to
callers of the FileDescription::read_entire_file(). This allows both
Process::sys$readlink() and Process::sys$module_load() to return more
detailed errors to the user.
This was a holdover from the old times when each Process had a special
main thread with TID 0. Using it was a total crapshoot since it would
just return whichever thread was first on the process's thread list.
Now that I've removed all uses of it, we don't need it anymore. :^)
Instead of falling back to the suspicious "any_thread()" mechanism,
just fail with ESRCH if you try to kill() a PID that doesn't have a
corresponding TID.
This was supposed to be the foundation for some kind of pre-kernel
environment, but nobody is working on it right now, so let's move
everything back into the kernel and remove all the confusion.
We stopped using gettimeofday() in Core::EventLoop a while back,
in favor of clock_gettime() for monotonic time.
Maintaining an optimization for a syscall we're not using doesn't make
a lot of sense, so let's go back to the old-style sys$gettimeofday().
You can still open files that have sockets attached to them from inside
the kernel via VFS::open() (and in fact, that is what LocalSocket itslef uses),
but trying to do that from userspace using open() will now fail with ENXIO.
Ultimately we should not panic just because we can't fully commit a VM
region (by populating it with physical pages.)
This patch handles some of the situations where commit() can fail.
This patch adds PageFaultResponse::OutOfMemory which informs the fault
handler that we were unable to allocate a necessary physical page and
cannot continue.
In response to this, the kernel will crash the current process. Because
we are OOM, we can't symbolicate the crash like we normally would
(since the ELF symbolication code needs to allocate), so we also
communicate to Process::crash() that we're out of memory.
Now we can survive "allocate 300 MB" (only the allocate process dies.)
This is definitely not perfect and can easily end up killing a random
innocent other process who happened to allocate one page at the wrong
time, but it's a *lot* better than panicking on OOM. :^)
Utilize the new Thread::wait_on timeout parameter to implement
timeout support for FUTEX_WAIT.
As we compute the relative time from the user specified absolute
time, we try to delay that computation as long as possible before
we call into Thread::wait_on(..). To enable this a small bit of
refactoring was done pull futex_queue fetching out and timeout fetch
and calculation separation.
This is a special case that was previously not implemented.
The idea is that you can dispatch a signal to all other processes
the calling process has access to.
There was some minor refactoring to make the self signal logic
into a function so it could easily be easily re-used from do_killall.
Previously, when returning from a pthread's start_routine, we would
segfault. Now we instead implicitly call pthread_exit as specified in
the standard.
pthread_create now creates a thread running the new
pthread_create_helper, which properly manages the calling and exiting
of the start_routine supplied to pthread_create. To accomplish this,
the thread's stack initialization has been moved out of
sys$create_thread and into the userspace function create_thread.
POSIX says, "Conforming applications should not assume that the returned
contents of the symbolic link are null-terminated."
If we do include the null terminator into the returning string, Python
believes it to actually be a part of the returned name, and gets unhappy
about that later. This suggests other systems Python runs in don't include
it, so let's do that too.
Also, make our userspace support non-null-terminated realpath().