[WIP] [LibOS] Add support for timerfd system calls
This commit adds support for system calls that create and operate on a
timer that delivers timer expiration notifications via a file
descriptor, specifically: `timerfd_create()`, `timerfd_settime()` and
`timerfd_gettime()`. The timerfd object is associated with a dummy
eventfd created on the host to trigger notifications (e.g., in epoll).
The object is created inside Gramine, with all its operations resolved
entirely inside Gramine.

The emulation is currently implemented at the level of a single process.
However, it may sometimes work for multi-process applications, e.g.,
if the child process inherits the timerfd object but doesn't use it; to
support these cases, we introduce the
`sys.experimental__allow_timerfd_fork` manifest option.

LibOS regression tests are also added.

Signed-off-by: Kailun Qin <[email protected]>
kailun-qin committed Jan 26, 2024
1 parent a933017 commit 56310a1
Showing 33 changed files with 1,008 additions and 46 deletions.
28 changes: 20 additions & 8 deletions Documentation/devel/features.md
Original file line number Diff line number Diff line change
Expand Up @@ -1036,7 +1036,7 @@ The below list is generated from the [syscall table of Linux
-`signalfd()`
<sup>[7](#signals-and-process-state-changes)</sup>

- `timerfd_create()`
- `timerfd_create()`
<sup>[20](#sleeps-timers-and-alarms)</sup>

-`eventfd()`
Expand All @@ -1045,10 +1045,10 @@ The below list is generated from the [syscall table of Linux
-`fallocate()`
<sup>[9a](#file-system-operations)</sup>

- `timerfd_settime()`
- `timerfd_settime()`
<sup>[20](#sleeps-timers-and-alarms)</sup>

- `timerfd_gettime()`
- `timerfd_gettime()`
<sup>[20](#sleeps-timers-and-alarms)</sup>

-`accept4()`
Expand Down Expand Up @@ -2862,9 +2862,21 @@ Gramine implements getting and setting the interval timer: `getitimer()` and `se

Gramine implements alarm clocks via `alarm()`.

Gramine implements timers that notify via file descriptors: `timerfd_create()`, `timerfd_settime()`
and `timerfd_gettime()`. The timerfd object is created inside Gramine, and all operations are
resolved entirely inside Gramine. Each timerfd object is associated with a dummy eventfd created on
the host. This is purely for triggering read/write notifications (e.g., in epoll); timerfd data is
verified inside Gramine and is never exposed to the host. Since the host is used purely for
notifications, a malicious host can only induce Denial of Service (DoS) attacks.

The emulation is currently implemented at the level of a single process. The emulation may work for
multi-process applications, e.g., if the child process inherits the timerfd object but doesn't use
it. However, multi-process support is brittle and thus disabled by default (Gramine will issue a
warning). To enable it still, set the [`sys.experimental__allow_timerfd_fork` manifest
option](../manifest-syntax.html#allowing-timerfd-in-multi-process-applications).

Gramine does *not* currently implement the POSIX per-process timer: `timer_create()`, etc. Gramine
also does not currently implement timers that notify via file descriptors. Gramine could implement
these timers in the future, if need arises.
could implement it in the future, if need arises.

<details><summary>Related system calls</summary>

Expand All @@ -2880,9 +2892,9 @@ these timers in the future, if need arises.
-`timer_getoverrun()`: may be implemented in the future
-`timer_delete()`: may be implemented in the future

- `timerfd_create()`: may be implemented in the future
- `timerfd_settime()`: may be implemented in the future
- `timerfd_gettime()`: may be implemented in the future
- `timerfd_create()`: see notes above
- `timerfd_settime()`: see notes above
- `timerfd_gettime()`: see notes above

</details><br />

16 changes: 16 additions & 0 deletions Documentation/manifest-syntax.rst
Expand Up @@ -364,6 +364,22 @@ Python). Could be useful in SGX environments: child processes consume
to achieve this, you need to run the whole Gramine inside a proper security
sandbox.
.. _timerfd-in-multi-process:

Allowing timerfd in multi-process applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

sys.experimental__allow_timerfd_fork = [true|false]
(Default: false)

Gramine implements timerfd in a secure way, but this implementation works only
in single-process applications. If you have a multi-process application and you
are sure that the parent process and its child processes never use the same
timerfd object across process boundaries, you can set the
``sys.experimental__allow_timerfd_fork`` manifest option to ``true``.

Root FS mount point
^^^^^^^^^^^^^^^^^^^

5 changes: 5 additions & 0 deletions libos/include/libos_fs.h
Expand Up @@ -182,6 +182,10 @@ struct libos_fs_ops {
/* Poll a single handle. Must not block. */
int (*poll)(struct libos_handle* hdl, int in_events, int* out_events);

/* Verify a single handle after poll. Must update `pal_ret_events` in-place with only allowed
* ones. Used in e.g. secure timerfd FS. */
void (*post_poll)(struct libos_handle* hdl, pal_wait_flags_t* pal_ret_events);

/* checkpoint/migrate the file system */
ssize_t (*checkpoint)(void** checkpoint, void* mount_data);
int (*migrate)(void* checkpoint, void** mount_data);
Expand Down Expand Up @@ -930,6 +934,7 @@ extern struct libos_fs eventfd_builtin_fs;
extern struct libos_fs synthetic_builtin_fs;
extern struct libos_fs path_builtin_fs;
extern struct libos_fs shm_builtin_fs;
extern struct libos_fs timerfd_builtin_fs;

struct libos_fs* find_fs(const char* name);

12 changes: 12 additions & 0 deletions libos/include/libos_handle.h
Expand Up @@ -46,6 +46,7 @@ enum libos_handle_type {
/* Special handles: */
TYPE_EPOLL, /* epoll handles, see `libos_epoll.c` */
TYPE_EVENTFD, /* eventfd handles, used by `eventfd` filesystem */
TYPE_TIMERFD, /* timerfd handles, used by `timerfd` filesystem */
};

struct libos_pipe_handle {
Expand Down Expand Up @@ -134,6 +135,16 @@ struct libos_epoll_handle {
size_t last_returned_index;
};

struct libos_timerfd_handle {
spinlock_t expiration_lock; /* protecting below fields */
uint64_t num_expirations;
uint64_t dummy_host_val;

spinlock_t timer_lock;
uint64_t timeout;
uint64_t reset;
};

struct libos_handle {
enum libos_handle_type type;
bool is_dir;
Expand Down Expand Up @@ -204,6 +215,7 @@ struct libos_handle {

struct libos_epoll_handle epoll; /* TYPE_EPOLL */
struct { bool is_semaphore; } eventfd; /* TYPE_EVENTFD */
struct libos_timerfd_handle timerfd; /* TYPE_TIMERFD */
} info;

struct libos_dir_handle dir_info;
4 changes: 4 additions & 0 deletions libos/include/libos_internal.h
Expand Up @@ -262,3 +262,7 @@ int init_stack(const char* const* argv, const char* const* envp, char*** out_arg
* The implementation of this function depends on the used architecture.
*/
noreturn void call_elf_entry(elf_addr_t entry, void* argp);

extern bool g_timerfd_allow_fork;
extern uint32_t g_timerfd_cnt;
int init_timerfd(void);
4 changes: 4 additions & 0 deletions libos/include/libos_table.h
Expand Up @@ -206,3 +206,7 @@ long libos_syscall_getcpu(unsigned* cpu, unsigned* node, void* unused_cache);
long libos_syscall_getrandom(char* buf, size_t count, unsigned int flags);
long libos_syscall_mlock2(unsigned long start, size_t len, int flags);
long libos_syscall_sysinfo(struct sysinfo* info);
long libos_syscall_timerfd_create(int clockid, int flags);
long libos_syscall_timerfd_settime(int fd, int flags, const struct __kernel_itimerspec* value,
struct __kernel_itimerspec* ovalue);
long libos_syscall_timerfd_gettime(int fd, struct __kernel_itimerspec* value);
2 changes: 1 addition & 1 deletion libos/include/libos_utils.h
Expand Up @@ -53,7 +53,7 @@ int create_pipe(char* name, char* uri, size_t size, PAL_HANDLE* hdl, bool use_vm

/* Asynchronous event support */
int init_async_worker(void);
int64_t install_async_event(PAL_HANDLE object, unsigned long time,
int64_t install_async_event(PAL_HANDLE object, unsigned long time, bool absolute_time,
void (*callback)(IDTYPE caller, void* arg), void* arg);
struct libos_thread* terminate_async_worker(void);

17 changes: 17 additions & 0 deletions libos/include/linux_abi/timerfd.h
@@ -0,0 +1,17 @@
/* SPDX-License-Identifier: LGPL-3.0-or-later */
/* Copyright (C) 2024 Intel Corporation
* Kailun Qin <[email protected]>
*/

#pragma once

/* Types and structures used by various Linux ABIs (e.g. syscalls). */
/* These need to be binary-identical with the ones used by Linux. */

#include <linux/timerfd.h>

#define TFD_SHARED_FCNTL_FLAGS (TFD_CLOEXEC | TFD_NONBLOCK)
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET)
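
The two masks above lend themselves to a simple "reject any unknown bit" check at syscall entry. The following is only a sketch under that assumption: the syscall implementation file is not part of this excerpt, and the function names `check_create_flags` and `check_settime_flags` are hypothetical. The masks are re-declared under `#ifndef` guards so the snippet compiles on its own against `<sys/timerfd.h>`.

```c
#include <errno.h>
#include <sys/timerfd.h>

/* Older libc headers may lack this flag; the value matches the Linux UAPI header. */
#ifndef TFD_TIMER_CANCEL_ON_SET
#define TFD_TIMER_CANCEL_ON_SET (1 << 1)
#endif

/* Same masks as in the header above, guarded so the sketch is self-contained. */
#ifndef TFD_CREATE_FLAGS
#define TFD_CREATE_FLAGS (TFD_CLOEXEC | TFD_NONBLOCK)
#endif
#ifndef TFD_SETTIME_FLAGS
#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET)
#endif

/* Hypothetical helper: returns 0 if `flags` has only bits valid for timerfd_create(),
 * otherwise -EINVAL. */
int check_create_flags(int flags) {
    return (flags & ~TFD_CREATE_FLAGS) ? -EINVAL : 0;
}

/* Hypothetical helper: returns 0 if `flags` has only bits valid for timerfd_settime(),
 * otherwise -EINVAL. */
int check_settime_flags(int flags) {
    return (flags & ~TFD_SETTIME_FLAGS) ? -EINVAL : 0;
}
```

Keeping the create-time and settime-time masks separate matters because the two flag namespaces overlap numerically with unrelated bits: `TFD_TIMER_ABSTIME` is bit 0, which is not a valid `timerfd_create()` flag.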
6 changes: 3 additions & 3 deletions libos/src/arch/x86_64/libos_table.c
Expand Up @@ -297,11 +297,11 @@ libos_syscall_t libos_syscall_table[LIBOS_SYSCALL_BOUND] = {
[__NR_utimensat] = (libos_syscall_t)0, // libos_syscall_utimensat
[__NR_epoll_pwait] = (libos_syscall_t)libos_syscall_epoll_pwait,
[__NR_signalfd] = (libos_syscall_t)0, // libos_syscall_signalfd
[__NR_timerfd_create] = (libos_syscall_t)0, // libos_syscall_timerfd_create
[__NR_timerfd_create] = (libos_syscall_t)libos_syscall_timerfd_create,
[__NR_eventfd] = (libos_syscall_t)libos_syscall_eventfd,
[__NR_fallocate] = (libos_syscall_t)libos_syscall_fallocate,
[__NR_timerfd_settime] = (libos_syscall_t)0, // libos_syscall_timerfd_settime
[__NR_timerfd_gettime] = (libos_syscall_t)0, // libos_syscall_timerfd_gettime
[__NR_timerfd_settime] = (libos_syscall_t)libos_syscall_timerfd_settime,
[__NR_timerfd_gettime] = (libos_syscall_t)libos_syscall_timerfd_gettime,
[__NR_accept4] = (libos_syscall_t)libos_syscall_accept4,
[__NR_signalfd4] = (libos_syscall_t)0, // libos_syscall_signalfd4
[__NR_eventfd2] = (libos_syscall_t)libos_syscall_eventfd2,
1 change: 1 addition & 0 deletions libos/src/fs/libos_fs.c
Expand Up @@ -33,6 +33,7 @@ static struct libos_fs* g_builtin_fs[] = {
&synthetic_builtin_fs,
&path_builtin_fs,
&shm_builtin_fs,
&timerfd_builtin_fs,
};

static struct libos_lock g_mount_mgr_lock;
1 change: 1 addition & 0 deletions libos/src/fs/proc/thread.c
Expand Up @@ -287,6 +287,7 @@ static char* describe_handle(struct libos_handle* hdl) {
case TYPE_EPOLL: str = "epoll:[?]"; break;
case TYPE_EVENTFD: str = "eventfd:[?]"; break;
case TYPE_SHM: str = "shm:[?]"; break;
case TYPE_TIMERFD: str = "timerfd:[?]"; break;
default: str = "unknown:[?]"; break;
}
return strdup(str);
124 changes: 124 additions & 0 deletions libos/src/fs/timerfd/fs.c
@@ -0,0 +1,124 @@
/* SPDX-License-Identifier: LGPL-3.0-or-later */
/* Copyright (C) 2024 Intel Corporation
* Kailun Qin <[email protected]>
*/

/*
* This file contains code for implementation of 'timerfd' filesystem.
*/

#include "libos_fs.h"
#include "libos_handle.h"
#include "libos_internal.h"
#include "libos_lock.h"
#include "linux_abi/errors.h"
#include "pal.h"

static void timerfd_dummy_host_read(struct libos_handle* hdl, uint64_t* out_host_val) {
uint64_t buf_dummy_host_val = 0;
size_t dummy_host_val_count = sizeof(buf_dummy_host_val);

int ret = PalStreamRead(hdl->pal_handle, /*offset=*/0, &dummy_host_val_count,
&buf_dummy_host_val);
if (ret < 0 || dummy_host_val_count != sizeof(buf_dummy_host_val)) {
/* should not happen in the benign case, but can happen under racing, e.g. threads may race
* on the same eventfd event: one of them wins and updates `dummy_host_val`, and the other
* one loses and gets an unexpected `dummy_host_val` */
log_warning("timerfd dummy host read failed or got unexpected value");
return;
}

if (out_host_val)
*out_host_val = buf_dummy_host_val;
}

static ssize_t timerfd_read(struct libos_handle* hdl, void* buf, size_t count, file_off_t* pos) {
__UNUSED(pos);
assert(hdl->type == TYPE_TIMERFD);

if (count < sizeof(uint64_t))
return -EINVAL;

int ret;
spinlock_lock(&hdl->info.timerfd.expiration_lock);

while (!hdl->info.timerfd.num_expirations) {
if (hdl->flags & O_NONBLOCK) {
ret = -EAGAIN;
goto out;
}
/* must block -- use the host's blocking read() on a dummy eventfd */
if (hdl->info.timerfd.dummy_host_val) {
/* value on host is non-zero, must perform a read to make it zero (and thus the next
* read will become blocking) */
uint64_t host_val = 0;
timerfd_dummy_host_read(hdl, &host_val);
if (host_val != hdl->info.timerfd.dummy_host_val)
BUG();
hdl->info.timerfd.dummy_host_val = 0;
}

spinlock_unlock(&hdl->info.timerfd.expiration_lock);
/* blocking read to wait for some value (we don't care which value) */
timerfd_dummy_host_read(hdl, /*out_host_val=*/NULL);
spinlock_lock(&hdl->info.timerfd.expiration_lock);
hdl->info.timerfd.dummy_host_val = 0;
}

memcpy(buf, &hdl->info.timerfd.num_expirations, sizeof(uint64_t));
hdl->info.timerfd.num_expirations = 0;

/* perform a read (not supposed to block) to clear the event from writing/polling threads */
if (hdl->info.timerfd.dummy_host_val) {
timerfd_dummy_host_read(hdl, /*out_host_val=*/NULL);
hdl->info.timerfd.dummy_host_val = 0;
}

ret = (ssize_t)count;
out:
spinlock_unlock(&hdl->info.timerfd.expiration_lock);
maybe_epoll_et_trigger(hdl, ret, /*in=*/true, /*unused was_partial=*/false);
return ret;
}

static void timerfd_post_poll(struct libos_handle* hdl, pal_wait_flags_t* pal_ret_events) {
assert(hdl->type == TYPE_TIMERFD);

if (*pal_ret_events & (PAL_WAIT_ERROR | PAL_WAIT_HANG_UP)) {
/* impossible: we control eventfd inside the LibOS, and we never raise such conditions */
BUG();
}

spinlock_lock(&hdl->info.timerfd.expiration_lock);
if (*pal_ret_events & PAL_WAIT_READ) {
/* there is data to read: verify if counter has value greater than zero */
if (!hdl->info.timerfd.num_expirations) {
/* spurious or malicious notification -- for now we don't BUG but ignore it */
*pal_ret_events &= ~PAL_WAIT_READ;
}
}
if (*pal_ret_events & PAL_WAIT_WRITE) {
/* spurious or malicious notification */
BUG();
}
spinlock_unlock(&hdl->info.timerfd.expiration_lock);
}

static int timerfd_close(struct libos_handle* hdl) {
__UNUSED(hdl);

/* see `libos_timerfd.c` for the handle-open counterpart */
(void)__atomic_sub_fetch(&g_timerfd_cnt, 1, __ATOMIC_ACQ_REL);
return 0;
}

struct libos_fs_ops timerfd_fs_ops = {
.read = &timerfd_read,
.close = &timerfd_close,
.post_poll = &timerfd_post_poll,
};

struct libos_fs timerfd_builtin_fs = {
.name = "timerfd",
.fs_ops = &timerfd_fs_ops,
};
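
The trust decision made in `timerfd_post_poll()` can be studied in isolation. The sketch below is a simplified standalone model, not Gramine code: the `SKETCH_*` constants and the `filter_host_events` function are hypothetical stand-ins for the PAL wait flags and the handle state, and the "fatal" verdict stands in for `BUG()`.

```c
#include <stdint.h>

/* Hypothetical stand-ins for PAL_WAIT_READ, PAL_WAIT_WRITE, PAL_WAIT_ERROR and
 * PAL_WAIT_HANG_UP. */
enum {
    SKETCH_WAIT_READ    = 1 << 0,
    SKETCH_WAIT_WRITE   = 1 << 1,
    SKETCH_WAIT_ERROR   = 1 << 2,
    SKETCH_WAIT_HANG_UP = 1 << 3,
};

typedef enum { SKETCH_VERDICT_OK, SKETCH_VERDICT_FATAL } sketch_verdict;

/* Filters host-reported events against the trusted in-enclave expiration counter.
 * `events` is updated in place so that only events backed by trusted state remain. */
sketch_verdict filter_host_events(uint64_t num_expirations, uint32_t* events) {
    /* the dummy eventfd is never written by the application and never closed early, so
     * these events cannot occur in a benign run (the original code calls BUG() here) */
    if (*events & (SKETCH_WAIT_ERROR | SKETCH_WAIT_HANG_UP | SKETCH_WAIT_WRITE))
        return SKETCH_VERDICT_FATAL;

    /* a "readable" notification is honored only if the trusted counter is non-zero;
     * otherwise it is treated as spurious (or malicious) and silently dropped */
    if ((*events & SKETCH_WAIT_READ) && num_expirations == 0)
        *events &= ~(uint32_t)SKETCH_WAIT_READ;

    return SKETCH_VERDICT_OK;
}
```

The point of the design is that the host can only suggest that something happened; whether the application sees a readable timerfd is decided by `num_expirations`, which the host cannot modify. A host that withholds notifications can still delay the application, which is the Denial of Service case mentioned in the documentation change.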