This post is about sleep quality: not yours, but your program’s.
Throughout this post, implementation details of MOS are used as examples, but the concepts should be applicable to other OS kernels (e.g. Linux).
It is always said…
Upon a signal being delivered to a process, the kernel interrupts the process and invokes the signal handler. If the handler returns, the kernel resumes the process from the instruction where it was interrupted.
…but how?
Signals
A signal is like a notification sent to a process to inform it that something
interesting has occurred.
Applications can decide either to ignore signals or to handle them by installing
signal handlers using sigaction(), or the outdated API, signal().
When are Signals Delivered?
According to the Linux signal(7) manual page:
Execution of signal handlers:
Whenever there is a transition from kernel-mode to user-mode execution, e.g., on:
- return from a system call, or
- scheduling of a thread onto the CPU.
The kernel checks whether there is a pending unblocked signal for which the process has established a signal handler. If there is such a pending signal, the following steps occur…
Delivering signals is straightforward in the second case, as shown in the code below:
more about unreachable().
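The check performed when a thread is scheduled can be sketched as a small user-space simulation. Everything here is hypothetical (`thread_t`, `next_deliverable_signal`, `deliver_before_user_mode`, `record_signal`) and is not MOS's actual code; a real kernel would redirect the saved user context to the handler rather than call it directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified thread descriptor. */
typedef struct {
    uint64_t pending;          /* bit n set: signal n is pending */
    uint64_t blocked;          /* bit n set: signal n is blocked */
    void (*handlers[64])(int); /* NULL: no handler installed */
} thread_t;

/* Example handler used to observe delivery. */
int last_handled = 0;
void record_signal(int signo) { last_handled = signo; }

/* Lowest-numbered pending, unblocked signal that has a handler, or 0. */
int next_deliverable_signal(const thread_t *t)
{
    uint64_t ready = t->pending & ~t->blocked;
    for (int signo = 1; signo < 64; signo++)
        if ((ready & (UINT64_C(1) << signo)) && t->handlers[signo])
            return signo;
    return 0;
}

/* Called right before switching to user mode in thread t.
 * Returns the delivered signal number, or 0 if nothing was delivered. */
int deliver_before_user_mode(thread_t *t)
{
    int signo = next_deliverable_signal(t);
    if (signo == 0)
        return 0;
    t->pending &= ~(UINT64_C(1) << signo); /* no longer pending */
    t->handlers[signo](signo);             /* simulation: call the handler directly */
    return signo;
}
```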
How about the first one?
The first bullet point effectively implies that a signal can only be delivered after a system call returns.
- What if that system call lasts a long time?
- What if, in the worst case, the system call never returns?
In those cases, how can the kernel deliver signals in time? Before moving on, let’s focus on how a system call is executed, and how long it can last.
How Long Can a System Call Last?
A system call can last for a long time. For example, read() can block
forever if the file is never written to (imagine a named pipe), or if the
user never presses any key on the keyboard.
To make fair and efficient use of the CPU, read() and similar syscalls are
not implemented as a busy-waiting loop, which would just waste CPU cycles.
Instead, they let other threads run and come back only when the required resources
become available.
In MOS, a waitlist is used to keep track of threads that are waiting for resources.
What Wakes You Up?
When the resource becomes available, the resource owner iterates through its
waitlist and sets each waiting thread’s state to READY. The thread then enters the
scheduling candidate queue and is ready to run again in the next scheduling round.
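A waitlist and its wakeup path could look roughly like the sketch below. The types and function names (`waitlist_t`, `waitlist_append`, `waitlist_wake_all`) are assumptions made for this example and do not mirror MOS's real definitions.

```c
#include <stddef.h>

typedef enum { THREAD_READY, THREAD_RUNNING, THREAD_BLOCKED } thread_state_t;

/* Hypothetical thread descriptor with an intrusive waitlist link. */
typedef struct thread {
    int tid;
    thread_state_t state;
    struct thread *next_waiter;
} thread_t;

/* One waitlist per resource (pipe, keyboard, ...). */
typedef struct {
    thread_t *head;
} waitlist_t;

/* Put t to sleep on wl. A real kernel would then call the scheduler
 * so that another thread runs in the meantime. */
void waitlist_append(waitlist_t *wl, thread_t *t)
{
    t->state = THREAD_BLOCKED;
    t->next_waiter = wl->head;
    wl->head = t;
}

/* The resource became available: mark every waiter READY and empty the list.
 * Returns the number of threads woken. */
size_t waitlist_wake_all(waitlist_t *wl)
{
    size_t woken = 0;
    while (wl->head) {
        thread_t *t = wl->head;
        wl->head = t->next_waiter;
        t->next_waiter = NULL;
        t->state = THREAD_READY; /* picked up in the next scheduling round */
        woken++;
    }
    return woken;
}
```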
There seems to be nothing notable. When the thread wakes up, it simply continues to execute the syscall with its desired resource available.
Signals Ruin Everything
What Happens When a Signal is Delivered?
‘signal handler is invoked…’ — POSIX
In MOS, whenever a signal is delivered to a thread, it is stored in a per-thread
linked list called sigpending, and the thread’s state is set to READY.
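A minimal sketch of this bookkeeping follows. Only the name sigpending comes from MOS; the node layout and the helpers `signal_send` and `signal_pop` are invented for illustration.

```c
#include <stdlib.h>

typedef enum { THREAD_READY, THREAD_BLOCKED } thread_state_t;

/* One entry of the per-thread pending-signal list. */
typedef struct sigpending_node {
    int signo;
    struct sigpending_node *next;
} sigpending_node_t;

/* Hypothetical thread descriptor. */
typedef struct {
    thread_state_t state;
    sigpending_node_t *sigpending;
} thread_t;

/* Queue signo on target and make the thread runnable.
 * Returns 0 on success, -1 if out of memory. */
int signal_send(thread_t *target, int signo)
{
    sigpending_node_t *node = malloc(sizeof(*node));
    if (!node)
        return -1;
    node->signo = signo;
    node->next = NULL;

    /* Append at the tail so signals are handled in arrival order. */
    sigpending_node_t **pp = &target->sigpending;
    while (*pp)
        pp = &(*pp)->next;
    *pp = node;

    /* Wake the thread, even if it is blocked inside a syscall. */
    target->state = THREAD_READY;
    return 0;
}

/* Remove and return the oldest pending signal, or 0 if there is none. */
int signal_pop(thread_t *t)
{
    sigpending_node_t *node = t->sigpending;
    if (!node)
        return 0;
    int signo = node->signo;
    t->sigpending = node->next;
    free(node);
    return signo;
}
```

Note that the wakeup is indistinguishable from a resource wakeup at this point: the thread simply becomes READY, so the syscall it was sleeping in has to find out why it was woken.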
To get signals handled as soon as possible, the thread should leave the syscall and handle the signal first.
So…
A check-and-return is added in the syscall to handle wakeups like this. If the thread is interrupted by a signal, it should return immediately, without executing further code in the syscall.
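The check-and-return can be sketched as the part of a read-like syscall that runs right after the thread wakes up. The types and the function `pipe_read_after_wakeup` are hypothetical; the sketch assumes that a wakeup with no data and a pending signal means the signal caused the wakeup.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical thread descriptor: only the bit we need here. */
typedef struct {
    bool signal_pending;
} thread_t;

/* Hypothetical pipe: unread bytes and their count. */
typedef struct {
    const char *data;
    size_t len;
} pipe_t;

/* Code that runs after the blocked reader has been woken up. */
long pipe_read_after_wakeup(const thread_t *current, pipe_t *p, char *buf, size_t size)
{
    /* Woken by a signal rather than by data: leave before touching buf. */
    if (current->signal_pending && p->len == 0)
        return -EINTR;

    /* Woken because data arrived: copy out as much as fits. */
    size_t n = size < p->len ? size : p->len;
    memcpy(buf, p->data, n);
    p->data += n;
    p->len -= n;
    return (long) n;
}
```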
The signal handler is then invoked as normal, followed by a sigreturn trampoline
to return to a ‘pre-signal context’.
Where are we returning to?
Tl;dr: The instruction after the syscall instruction, with -EINTR as the return
value.
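The entire flow is like this:
- the thread enters a syscall and blocks, waiting for a resource
- a signal arrives, and the thread is woken up
- the syscall notices the pending signal and returns -EINTR
- on the way back to user mode, the kernel invokes the signal handler
- the handler returns through the sigreturn trampoline
- the thread resumes at the instruction after the syscall instruction, with -EINTR as the return value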
The program will get the return value of the syscall, but the return value is -EINTR,
which means the syscall was interrupted by a signal.
This return value is rarely the desired one. In this case, the program should check the return value and decide what to do next. If it wants to retry the syscall, it should issue the syscall again with the same arguments.
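In user space, such a retry typically looks like the sketch below. The wrapper name `read_retry` is made up; note that libc reports the kernel's -EINTR as a return value of -1 with errno set to EINTR.

```c
#include <errno.h>
#include <unistd.h>

/* read() that retries when a signal interrupted it before any data arrived. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR); /* interrupted: same arguments, try again */
    return n;
}
```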
The code above looks like boilerplate, and it is. It is also error-prone, as the programmer may forget to check the return value and retry the syscall.
Can we do better?
Given that some signals are only ‘informative’ and have no side effects on either the program or the kernel, it is safe to restart the interrupted syscall automatically after the signal handler returns.
Examples of these signals are SIGCHLD and SIGWINCH.
Is there a way to restart such syscalls automatically, so that programmers don’t
have to deal with -EINTR return values?
Short answer: Yes.
SA_RESTART and siginterrupt(3p)
In POSIX, one can tell the kernel to restart the interrupted syscall automatically
upon receiving a signal by calling siginterrupt(3p), or by setting the SA_RESTART
flag in sigaction(2).
In MOS (specifically, in mlibc), the first one is a compatibility wrapper around
the latter.
What does the kernel need to know, to restart the syscall?
The kernel needs to know at least these things to restart the syscall:
- The syscall is interrupted by a signal.
  This one is easy: upon receiving a signal, syscall handlers can return -EINTR to indicate that the syscall is interrupted by a signal.
- The user wants to restart an interrupted syscall for this signal.
  This is also trivial: the kernel can simply check the SA_RESTART flag in the sigaction struct.
- The exact syscall number and arguments to restart.
  This is kinda easy: they are already in the registers when the syscall is interrupted.
- The syscall is ‘restartable’, i.e. it makes sense for the syscall to be restarted.
  This is the most difficult one. A special return value is needed to distinguish whether the syscall wants itself to be restarted or not. -EINTR is not a good choice, so we need a new return value.
-ERESTARTSYS Comes to the Rescue
In Linux, a syscall implementation returns -ERESTARTSYS to indicate that the syscall
was interrupted by a signal and that it is restartable. This is also the case
in MOS.
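The choice between the two return values can be sketched as follows. The value 512 is Linux's kernel-internal definition of ERESTARTSYS; the thread type and both functions are hypothetical stand-ins for the post-wakeup part of a restartable and a non-restartable syscall.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define ERESTARTSYS 512 /* kernel-internal in Linux, never meant for user space */

/* Hypothetical thread descriptor. */
typedef struct {
    bool signal_pending;
} thread_t;

/* A read-like syscall after wakeup: restartable as long as nothing was read. */
long read_wakeup_result(const thread_t *current, size_t bytes_ready)
{
    if (bytes_ready > 0)
        return (long) bytes_ready; /* progress was made: report it */
    if (current->signal_pending)
        return -ERESTARTSYS;       /* nothing done yet: safe to run again */
    return 0;
}

/* A sleep-like syscall after wakeup: running it again with the same
 * arguments would sleep too long, so it asks not to be restarted. */
long sleep_wakeup_result(const thread_t *current, bool deadline_reached)
{
    if (deadline_reached)
        return 0;
    if (current->signal_pending)
        return -EINTR;
    return 0;
}
```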
The signal checker is also modified to check if -ERESTARTSYS is returned and if
SA_RESTART is set.
Note that -ERESTARTSYS is not leaked to userspace; it is used solely by the
kernel to indicate that the syscall is interrupted by a signal and is restartable.
What happens here is:
- after a syscall function detects that the syscall is interrupted by a signal
  - it returns -ERESTARTSYS or -EINTR, based on whether it is restartable or not
- before returning to userspace, signal handling code checks:
  - if a signal is pending, yes in this case
  - if SA_RESTART for this signal is set
    - if yes, modify the context so that the syscall can be restarted automatically (see below)
    - if no, return -EINTR to the userspace program (instead of -ERESTARTSYS), because the user does not want to restart the syscall
- invoke the signal handler
- if the signal handler returns by sigreturn()
  - restore pre-signal context
  - return to userspace
  - the syscall is restarted (automatically) if SA_RESTART is set
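The decision made before returning to userspace can be condensed into a small pure function. This is only a sketch under assumed names (`syscall_exit_t`, `signal_check_on_syscall_exit`); the real checker also has to pick the signal and set up the handler frame.

```c
#include <errno.h>
#include <stdbool.h>

#define ERESTARTSYS 512 /* Linux's kernel-internal value */

/* What the signal checker decides for the interrupted syscall. */
typedef struct {
    long syscall_ret;     /* value user space will see if there is no restart */
    bool restart_syscall; /* true: rewind the context and run the syscall again */
} syscall_exit_t;

/* syscall_ret:    what the syscall function returned
 * signal_pending: a signal is about to be handled
 * sa_restart:     that signal's handler was installed with SA_RESTART */
syscall_exit_t signal_check_on_syscall_exit(long syscall_ret, bool signal_pending, bool sa_restart)
{
    syscall_exit_t out = { syscall_ret, false };
    if (!signal_pending)
        return out;

    if (syscall_ret == -ERESTARTSYS) {
        if (sa_restart)
            out.restart_syscall = true; /* restart after the handler returns */
        else
            out.syscall_ret = -EINTR;   /* never leak -ERESTARTSYS to user space */
    }
    return out;
}
```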
How Does This Work?
The set_syscall_restart() function modifies the context of the thread
(i.e. registers) so that the syscall can be restarted automatically.
It achieves this by playing with the instruction pointer.
It places the syscall number in rax, the register that stores the syscall number
in MOS, and decrements the instruction pointer by 2.
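A sketch of such a function is shown below. The name set_syscall_restart() is taken from MOS, but the signature and the `user_regs_t` struct (a two-register subset of the saved user context) are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical subset of the saved user-mode register context. */
typedef struct {
    uint64_t rax; /* syscall number on entry, return value on exit */
    uint64_t rip; /* user instruction pointer */
} user_regs_t;

/* Both `syscall` and `int 0x88` are 2 bytes long on x86-64. */
enum { SYSCALL_INSN_LEN = 2 };

/* Rewind the saved context so that returning to user mode
 * executes the syscall instruction once more. */
void set_syscall_restart(user_regs_t *regs, uint64_t syscall_nr)
{
    regs->rax = syscall_nr;        /* rax currently holds the return value */
    regs->rip -= SYSCALL_INSN_LEN; /* point back at the syscall instruction */
}
```

The argument registers are left alone here, since the post relies on them still holding the original arguments when the syscall is interrupted.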
2 in this case is the length of the syscall instruction (or int 0x88) when encoded
in x86-64: syscall encodes as 0F 05, and int 0x88 as CD 88. Whether it is a
coincidence or not, the lengths of these two instructions are the same.
After the context has been modified, when the thread returns to userspace, instead
of going to the next instruction, it will execute the syscall (or int 0x88)
instruction again. From the outside, it looks like the syscall is restarted
automatically.
[Diagram: the overall flow of automatic syscall restart]
Why mangle the instruction pointer?
Several other approaches exist, for example:
- store the syscall number and arguments somewhere in the thread’s context, and call the syscall in kernel mode from sigreturn()
  - it requires more space to store all arguments (10+ registers * 8 bytes)
  - it is also unnecessary to store them somewhere else, given they are already in the registers when the syscall is interrupted.
  - the syscall entry point is sometimes a tracepoint, which helps the kernel trace syscalls’ entry and exit, and it’s unsuitable to be invoked from kernel-mode code.
- place the -EINTR return value checker in libc and restart the syscall from there
  - it requires libc to be aware of the syscall restart mechanism
  - it is not a good idea to put syscall restart logic in libc, as it is a userspace library, and it is not the only one. Caller libraries may also want to handle -EINTR return values differently.
Concerns and Limitations
- Imagine a syscall that takes a pointer to a buffer as its argument, and the buffer is modified by the syscall. If the syscall is interrupted by a signal and then restarted automatically, the buffer will be modified twice.
  - This scenario just doesn’t exist. -EINTR and -ERESTARTSYS are only returned when the syscall is interrupted before it has done anything. An example is read(2): if any data has been read, the syscall returns immediately with the number of bytes read instead of -EINTR or -ERESTARTSYS.
- What if the syscall is restarted automatically, and the signal interrupts it again?
  The syscall function doesn’t know whether it is restarted automatically or it’s the first time it is called. Interrupting it again will result in the same behaviour as interrupting it the first time.
- Certain system calls are not restartable, for example nanosleep(2). nanosleep(2) takes a struct timespec as its argument, so restarting it with the original arguments would sleep for the full duration again. The kernel would need to calculate the remaining time to sleep and restart the syscall with the remaining time.
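From user space, the remaining time is available through the second argument of nanosleep(2), so a caller that wants the full sleep can retry by hand. The helper name `sleep_fully` is made up for this sketch.

```c
#include <errno.h>
#include <time.h>

/* Sleep for the whole duration, even if signals interrupt nanosleep().
 * Returns 0 on success, -1 on any error other than EINTR. */
int sleep_fully(struct timespec duration)
{
    struct timespec remaining;
    while (nanosleep(&duration, &remaining) == -1) {
        if (errno != EINTR)
            return -1;
        duration = remaining; /* continue with what is left, not the original value */
    }
    return 0;
}
```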
Conclusion
This post discussed signals, how they interrupt syscalls, and how signal-interrupted syscalls can be restarted automatically.
The MOS kernel implements automatic restart of signal-interrupted syscalls in commit 5ba8cc0b5935c608cf4490cc6035ab30649b8db3.
References
- MOS Source Code
- Linux Source Code
- POSIX.1-2017
- signal(7)