Dump and Reset stats data on demand using SIGUSR1 signal #1857

Closed
18 commits
5fcbfed
Add a support for debug-stats dumping and resetting on demand
TejaswineeL Apr 23, 2024
14a8be3
fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL Apr 29, 2024
dfb0310
fixup! fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL Apr 29, 2024
b31b7d7
fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL May 6, 2024
c876141
fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL May 10, 2024
337d385
Merge branch 'master' into debug_stats_dump_reset
TejaswineeL May 13, 2024
c8317ba
fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL May 14, 2024
003f356
fixup! Dump and Reset stats data on demand using SIGUSR1 signal
TejaswineeL May 16, 2024
3cb3bff
Add a support for debug-stats dumping and resetting on demand
TejaswineeL Jul 1, 2024
6da0092
fixup! Add a support for debug-stats dumping and resetting on demand
TejaswineeL Jul 2, 2024
7cdaacf
fixup! fixup! Add a support for debug-stats dumping and resetting on …
TejaswineeL Jul 3, 2024
26c4684
fixup! Add support for dumping and resetting debug stats on demand
TejaswineeL Jul 12, 2024
68f4f3a
fixup! Add a support for debug-stats dumping and resetting on demand
sreeharikax Jul 30, 2024
f09d9ec
fixup! Add a support for debug-stats dumping and resetting on demand
sreeharikax Jul 30, 2024
2966f33
fixup! Add a support for debug-stats dumping and resetting on demand
TejaswineeL Aug 14, 2024
5e4aaa1
fixup! Add a support for debug-stats dumping and resetting on demand
TejaswineeL Aug 14, 2024
b5fcb60
fixup! Add a support for debug-stats dumping and resetting on demand
Aug 19, 2024
8daa0cb
fixup! Add a support for debug-stats dumping and resetting on demand
Aug 19, 2024
4 changes: 4 additions & 0 deletions Documentation/manifest-syntax.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1213,6 +1213,10 @@ This syntax specifies whether to enable SGX enclave-specific statistics:
includes creating the enclave, adding enclave pages, measuring them and
initializing the enclave.

For this option to take effect, Gramine must be compiled with
``--buildtype=debug`` or ``--buildtype=debugoptimized``. Otherwise (if built in
release mode), Gramine will exit with an error.

.. warning::
This option is insecure and cannot be used with production enclaves
(``sgx.debug = false``). If a production enclave is started with this option
18 changes: 13 additions & 5 deletions Documentation/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,13 @@ Enabling per-thread and process-wide SGX stats

See also :ref:`perf` below for installing ``perf``.

-Enable statistics using ``sgx.enable_stats = true`` manifest option. Now your
-graminized application correctly reports performance counters. This is useful
-when using e.g. ``perf stat`` to collect performance statistics. This manifest
-option also forces Gramine to dump SGX-related information on each
-thread/process exit. Here is an example:
+Enable statistics using ``sgx.enable_stats = true`` manifest option (note that
+Gramine must be compiled with ``--buildtype=debug`` or
+``--buildtype=debugoptimized`` for this option to work). Now your graminized
+application correctly reports performance counters. This is useful when using
+e.g. ``perf stat`` to collect performance statistics. This manifest option also
+forces Gramine to dump SGX-related information on each thread/process exit. Here
+is an example:

::

Expand Down Expand Up @@ -103,6 +105,12 @@ How to read this output:
counters should be compared against "golden runs" to deduce any interesting
trends.

It is also possible to dump and reset SGX-related statistics on demand by sending
the ``SIGUSR1`` signal. This makes it possible to collect statistics only for a
particular period, e.g. skipping Gramine startup and application initialization
and concentrating only on the actual application processing. To send ``SIGUSR1``
to all processes in the process group, run ``kill -SIGUSR1 -<PGID>``.
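
The shell command in this paragraph corresponds to the POSIX `kill()` call with a negated process-group ID. The following is a minimal sketch of that call from C, not code from this PR: the helper name `signal_process_group` is hypothetical, and the guard against `pgid <= 1` is an added assumption to avoid the special meanings of `kill(0, ...)` and `kill(-1, ...)`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a Gramine API): deliver `sig` to every process in
 * the process group `pgid`, like `kill -<sig> -<PGID>` in a shell. */
int signal_process_group(pid_t pgid, int sig) {
    if (pgid <= 1) {
        /* kill(0, sig) targets the caller's own group and kill(-1, sig)
         * targets every process the caller may signal; refuse both, as well
         * as negative input, so the helper only ever hits one named group */
        errno = EINVAL;
        return -1;
    }
    /* a negative first argument means "process group", not "process" */
    return kill(-pgid, sig);
}
```

Calling `signal_process_group(pgid, SIGUSR1)` with the PGID of the running Gramine instance would then request a dump-and-reset round. Passing signal number 0 performs only the existence and permission check without delivering anything.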

Effects of system calls / ocalls
--------------------------------

3 changes: 3 additions & 0 deletions libos/test/regression/fork_and_access_file.manifest.template
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,9 @@ sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '16' }}
sgx.debug = true
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}

# this is only to test that `sgx.enable_stats` works (it can only be specified for debug-mode tests)
sgx.enable_stats = true

sgx.trusted_files = [
"file:{{ gramine.libos }}",
"file:{{ gramine.runtimedir(libc) }}/",
1 change: 0 additions & 1 deletion libos/test/regression/multi_pthread.manifest.template
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,6 @@ sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '8' }}

sgx.debug = true
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.enable_stats = true

sgx.trusted_files = [
"file:{{ gramine.libos }}",
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,6 @@ sgx.insecure__rpc_thread_num = 8

sgx.debug = true
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.enable_stats = true

sgx.trusted_files = [
"file:{{ gramine.libos }}",
1 change: 0 additions & 1 deletion pal/regression/Thread2.manifest.template
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
loader.entrypoint = "file:{{ binary_dir }}/{{ entrypoint }}"

sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '2' }}
sgx.enable_stats = true
sgx.debug = true
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}

1 change: 0 additions & 1 deletion pal/regression/Thread2_exitless.manifest.template
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@ loader.entrypoint = "file:{{ binary_dir }}/{{ entrypoint }}"

sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '2' }}
sgx.insecure__rpc_thread_num = 2
sgx.enable_stats = true
sgx.debug = true
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}

3 changes: 2 additions & 1 deletion pal/src/host/linux-sgx/enclave_ocalls.c
Original file line number Diff line number Diff line change
Expand Up @@ -136,8 +136,9 @@ static long sgx_exitless_ocall(uint64_t code, void* ocall_args) {
}
}

+long result = COPY_UNTRUSTED_VALUE(&req->result);
 sgx_reset_ustack(old_ustack);
-return COPY_UNTRUSTED_VALUE(&req->result);
+return result;
}

__attribute_no_sanitize_address
5 changes: 5 additions & 0 deletions pal/src/host/linux-sgx/host_entry.S
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

.extern tcs_base
.extern g_in_aex_profiling
.extern maybe_dump_and_reset_stats

.global sgx_ecall
.type sgx_ecall, @function
Expand Down Expand Up @@ -90,6 +91,8 @@ async_exit_pointer:
movq %rbx, %rdi
call sgx_profile_sample_aex

call maybe_dump_and_reset_stats

# Restore stack
movq %rbp, %rsp
.cfi_def_cfa_register %rsp
Expand Down Expand Up @@ -157,6 +160,8 @@ sgx_raise:
movq %rdx, %rdi
call sgx_profile_sample_ocall_inner

call maybe_dump_and_reset_stats

# Restore RDI
movq -8(%rbp), %rdi
#else
141 changes: 140 additions & 1 deletion pal/src/host/linux-sgx/host_exception.c
Original file line number Diff line number Diff line change
Expand Up @@ -15,20 +15,73 @@
* __sigset_t uc_sigmask;
*/


#include <linux/signal.h>
#include <stdbool.h>

#include "api.h"
#include "assert.h"
#include "cpu.h"
#include "debug_map.h"
#include "host_internal.h"
#include "host_syscall.h"
#include "pal_rpc_queue.h"
#include "pal_tcb.h"
#include "sigreturn.h"
#include "ucontext.h"

static const int ASYNC_SIGNALS[] = {SIGTERM, SIGCONT};

#ifdef DEBUG
/*
* If no SGX-stats reset is in flight, `g_stats_reset_leader_tid` is zero.
*
* Upon user-induced SIGUSR1 on some thread (below happens in signal handling context):
* 1. If `g_stats_reset_leader_tid == 0`, then it is set to the TID of this thread -- this thread
* is designated to be the "leader" of SGX-stats reset flow, and it will broadcast SIGUSR1 to
* all other threads on the first AEX or right-before executing the next OCALL in untrusted
* runtime (since we can't do any complex logic in signal handling context, we postpone it to
* the normal context).
* 2. If `g_stats_reset_leader_tid != 0`, then it means that an SGX-stats reset flow is in flight.
* Two cases are possible:
* a. If PID of sending process is the current PID, then the signal was sent by the "leader"
* and this thread is a "follower" -- it sets `reset_stats = true` in its TCB, so that
* this thread's statistics are dumped and reset on the next AEX or right-before
* executing the next OCALL in untrusted runtime.
* b. If PID of sending process is not the current PID, then the signal was sent by the user
* and this is a new "SGX-stats reset" event from the user. Since the previous flow is
* still in flight, the thread must ignore this signal.
*
* On each AEX and on each OCALL execution, each thread checks (below happens in normal context):
* 1. If `g_stats_reset_leader_tid == 0`, do nothing (no SGX-stats reset is in flight).
* 2. If `g_stats_reset_leader_tid == gettid()`, then this is the "leader" thread and it must
* broadcast SIGUSR1 to all other threads and wait until they perform their SGX-stats resets.
* After all threads are done, the "leader" resets `g_stats_reset_leader_tid` to zero.
* 3. Else, this is the "follower" thread and it must perform its SGX-stats reset.
*
* Thread IDs on Linux are never 0, so using zero as the "no reset in flight" value is safe.
*/
static int g_stats_reset_leader_tid = 0;

/*
* Each "SGX stats reset" is supposed to be executed in one epoch. Epoch is changed (i.e.
* `g_stats_reset_epoch` is atomically incremented) when any thread exits. If an "SGX stats reset"
* round detects that the epoch has changed before the leader thread got responses from all follower
* threads, this "SGX stats reset" round is aborted, see while loop in maybe_dump_and_reset_stats().
*
* This epoch mechanism is required to avoid data races:
* - If the leader thread is exited in the meantime (upon e.g. SIGTERM), the
* `g_stats_reset_leader_tid` variable would never be reset and future rounds would become
* impossible (all threads would think that some previous round is still in flight).
* - If some follower thread is exited in the meantime, the wait-for-all-followers loop in
* maybe_dump_and_reset_stats() would never break.
*
* Note that the epoch is *not* changed when a new thread is spawned, i.e. the "SGX stats reset"
* would successfully finish but without taking into account the newly spawned thread. This is a
* benign scenario, though the quality of SGX-stats reporting will be lower in this case.
*/
static uint32_t g_stats_reset_epoch = 0;
#endif /* DEBUG */

static int block_signal(int sig, bool block) {
int how = block ? SIG_BLOCK : SIG_UNBLOCK;

Expand Down Expand Up @@ -194,6 +247,27 @@ static void handle_sigusr1(int signum, siginfo_t* info, struct ucontext* uc) {
__UNUSED(info);
__UNUSED(uc);

if (g_sgx_enable_stats) {
int expected_tid = 0;
if (__atomic_compare_exchange_n(&g_stats_reset_leader_tid, &expected_tid,
DO_SYSCALL(gettid), /*weak=*/false,
__ATOMIC_ACQ_REL, __ATOMIC_RELAXED) == true) {
/* first thread that gets SIGUSR1, the CAS above designated it as the "leader" */
PAL_HOST_TCB* tcb = pal_get_host_tcb();
tcb->reset_stats = true;
} else {
/* thread gets SIGUSR1, check if this is a signal from the "leader" */
if (info->si_pid == g_host_pid) {
PAL_HOST_TCB* tcb = pal_get_host_tcb();
assert(!tcb->reset_stats);
tcb->reset_stats = true;
} else {
log_warning("Received SIGUSR1 from user, but there is another SGX-stats reset "
"in flight; ignoring it");
}
}
}

if (g_pal_enclave.profile_enable) {
__atomic_store_n(&g_trigger_profile_reinit, true, __ATOMIC_RELEASE);
}
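
The decision that the signal handler makes (leader, follower, or ignored) can be isolated into a small pure function. The sketch below is illustrative only and is not the PR's code: `classify_sigusr1`, `finish_round`, the `sigusr1_role` enum and `g_leader_tid` are hypothetical names, and the TIDs and PIDs are passed in as plain integers instead of being read from `gettid` and `siginfo_t`. It uses the same GCC/Clang `__atomic_compare_exchange_n` builtin as the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not part of Gramine. */
enum sigusr1_role { ROLE_LEADER, ROLE_FOLLOWER, ROLE_IGNORED };

static int g_leader_tid = 0; /* 0 means "no reset round in flight" */

/* Decide what a thread should do on SIGUSR1.
 * my_tid:     TID of the thread that received the signal
 * sender_pid: PID of the process that sent the signal (si_pid in the diff)
 * my_pid:     PID of the receiving process */
enum sigusr1_role classify_sigusr1(int my_tid, int sender_pid, int my_pid) {
    int expected = 0;
    /* only one thread can swap 0 -> its TID; that thread becomes the leader */
    if (__atomic_compare_exchange_n(&g_leader_tid, &expected, my_tid, /*weak=*/false,
                                    __ATOMIC_ACQ_REL, __ATOMIC_RELAXED)) {
        return ROLE_LEADER;
    }
    /* a round is already in flight: a signal from our own process is the
     * leader's broadcast, anything else is a new user request to be ignored */
    if (sender_pid == my_pid)
        return ROLE_FOLLOWER;
    return ROLE_IGNORED;
}

/* Called by the leader once the round is over, so new rounds can start. */
void finish_round(void) {
    __atomic_store_n(&g_leader_tid, 0, __ATOMIC_RELEASE);
}
```

Keeping this decision free of side effects other than the compare-and-swap mirrors the constraint stated in the comment block: the handler only records a role, and the actual dumping is deferred to normal context.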
Expand Down Expand Up @@ -274,3 +348,68 @@ void pal_describe_location(uintptr_t addr, char* buf, size_t buf_size) {
#endif
default_describe_location(addr, buf, buf_size);
}

#ifdef DEBUG
/* called on each AEX and OCALL (in normal context), see host_entry.S */
void maybe_dump_and_reset_stats(void) {
static size_t followers_visited_num = 0; /* note `static`, it is a global var */

if (!g_sgx_enable_stats)
return;

int leader_tid = __atomic_load_n(&g_stats_reset_leader_tid, __ATOMIC_ACQUIRE);
if (!leader_tid)
return;

PAL_HOST_TCB* tcb = pal_get_host_tcb();
if (!tcb->reset_stats)
return;

if (DO_SYSCALL(gettid) == leader_tid) {
log_always("----- DUMPING and RESETTING SGX STATS -----");
uint32_t epoch = __atomic_load_n(&g_stats_reset_epoch, __ATOMIC_ACQUIRE);

size_t followers_num = broadcast_signal_to_threads(SIGUSR1, /*exclude_tid=*/leader_tid);

while ((__atomic_load_n(&followers_visited_num, __ATOMIC_ACQUIRE)) < followers_num) {
DO_SYSCALL(sched_yield);
if (__atomic_load_n(&g_stats_reset_epoch, __ATOMIC_ACQUIRE) != epoch) {
log_warning("SGX stats reset (started due to SIGUSR1) was interrupted because at "
"least one thread exited in the meantime; stats may be incomplete");
break;
}
}

update_print_and_reset_stats(/*process_wide=*/true);
__atomic_store_n(&followers_visited_num, 0, __ATOMIC_RELEASE);

__atomic_store_n(&g_stats_reset_leader_tid, 0, __ATOMIC_RELEASE);
} else {
update_print_and_reset_stats(/*process_wide=*/false);
__atomic_fetch_add(&followers_visited_num, 1, __ATOMIC_ACQ_REL);
}
}

/* called when some thread exits -- a possible "SGX stats reset" round must be aborted, see above */
void abort_current_reset_stats(int exiting_tid) {
if (!g_sgx_enable_stats)
return;

/* make sure that an exiting thread does not receive SIGUSR1; this prevents a data race when
* this thread receives SIGUSR1, initiates a new "SGX stats reset" round and immediately exits,
* leaving `g_stats_reset_leader_tid` set to a dangling-tid value */
block_signal(SIGUSR1, /*block=*/true);

/* unconditionally increment the "SGX stats reset" epoch, reacting to every thread exit */
__atomic_fetch_add(&g_stats_reset_epoch, 1, __ATOMIC_ACQ_REL);

int leader_tid = __atomic_load_n(&g_stats_reset_leader_tid, __ATOMIC_ACQUIRE);
if (leader_tid == exiting_tid) {
/* unset leader, otherwise no other thread would be able to initiate "SGX stats reset"
* rounds in the future */
__atomic_store_n(&g_stats_reset_leader_tid, 0, __ATOMIC_RELEASE);
log_warning("SGX stats reset (started due to SIGUSR1) was aborted because initiating "
"thread exited; stats may be incomplete");
}
}
#endif /* DEBUG */
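
The epoch check that lets the leader give up on a round can be sketched in isolation. This is a simplified model under stated assumptions, not the PR's implementation: all names (`g_epoch`, `g_followers_done`, `epoch_snapshot`, `on_thread_exit`, `on_follower_done`, `leader_wait`) are hypothetical, thread exits and follower completions are simulated by plain function calls, and the epoch is checked before yielding rather than after.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the counters described in the diff. */
static uint32_t g_epoch = 0;        /* bumped on every (simulated) thread exit */
static size_t g_followers_done = 0; /* followers that finished their reset */

uint32_t epoch_snapshot(void) {
    return __atomic_load_n(&g_epoch, __ATOMIC_ACQUIRE);
}

void on_thread_exit(void) {
    __atomic_fetch_add(&g_epoch, 1, __ATOMIC_ACQ_REL);
}

void on_follower_done(void) {
    __atomic_fetch_add(&g_followers_done, 1, __ATOMIC_ACQ_REL);
}

/* Leader side: wait until `followers_num` followers have reported, or abort
 * as soon as the epoch differs from `start_epoch`. Returns true if the round
 * completed, false if it was aborted. */
bool leader_wait(size_t followers_num, uint32_t start_epoch) {
    bool complete = true;
    while (__atomic_load_n(&g_followers_done, __ATOMIC_ACQUIRE) < followers_num) {
        if (__atomic_load_n(&g_epoch, __ATOMIC_ACQUIRE) != start_epoch) {
            complete = false; /* some thread exited; its report may never come */
            break;
        }
        sched_yield(); /* let follower threads run */
    }
    /* reset the counter so the next round starts from zero */
    __atomic_store_n(&g_followers_done, 0, __ATOMIC_RELEASE);
    return complete;
}
```

The leader takes `epoch_snapshot()` before broadcasting and passes it to `leader_wait()`. Without the epoch comparison, a follower that exits before reporting would leave the loop spinning forever, which is the hang the comment in the diff describes.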
1 change: 1 addition & 0 deletions pal/src/host/linux-sgx/host_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,7 @@ void thread_exit(int status);

int sgx_signal_setup(void);
int block_async_signals(bool block);
size_t broadcast_signal_to_threads(int sig, int exclude_tid);

int set_tcs_debug_flag_if_debugging(void* tcs_addrs[], size_t count);

8 changes: 8 additions & 0 deletions pal/src/host/linux-sgx/host_main.c
Original file line number Diff line number Diff line change
Expand Up @@ -761,6 +761,14 @@ static int parse_loader_config(char* manifest, struct pal_enclave* enclave_info,
goto out;
}

#ifndef DEBUG
if (g_sgx_enable_stats) {
log_error("'sgx.enable_stats = true' is specified in non-debug mode, this is disallowed");
ret = -EINVAL;
goto out;
}
#endif /* !DEBUG */

ret = toml_string_in(manifest_root, "sgx.sigfile", &dummy_sigfile_str);
if (ret < 0 || dummy_sigfile_str) {
log_error("sgx.sigfile is not supported anymore. Please update your manifest according to "
4 changes: 2 additions & 2 deletions pal/src/host/linux-sgx/host_ocalls.c
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,8 @@ static long sgx_ocall_exit(void* args) {

/* exit the whole process if exit_group() */
if (ocall_exit_args->is_exitgroup) {
-update_and_print_stats(/*process_wide=*/true);
 #ifdef DEBUG
+update_print_and_reset_stats(/*process_wide=*/true);
sgx_profile_finish();
#endif

Expand All @@ -64,8 +64,8 @@ static long sgx_ocall_exit(void* args) {

if (!current_enclave_thread_cnt()) {
/* no enclave threads left, kill the whole process */
-update_and_print_stats(/*process_wide=*/true);
 #ifdef DEBUG
+update_print_and_reset_stats(/*process_wide=*/true);
sgx_profile_finish();
#endif
#ifdef SGX_VTUNE_PROFILE