Remove barriers and spinlock in epoch_enter and epoch_exit (#2796)
* Remove barriers in enter and exit

Signed-off-by: Alan Jowett <[email protected]>

* PR feedback

Signed-off-by: Alan Jowett <[email protected]>

* PR feedback

Signed-off-by: Alan Jowett <[email protected]>

* Apply suggestions from code review

Co-authored-by: Dave Thaler <[email protected]>

* PR feedback

Signed-off-by: Alan Jowett <[email protected]>

* Apply suggestions from code review

Co-authored-by: Dave Thaler <[email protected]>

* PR feedback

Signed-off-by: Alan Jowett <[email protected]>

---------

Signed-off-by: Alan Jowett <[email protected]>
Co-authored-by: Alan Jowett <[email protected]>
Co-authored-by: Dave Thaler <[email protected]>
3 people authored Sep 14, 2023
1 parent 01bc766 commit 4c3d2cd
Showing 10 changed files with 799 additions and 596 deletions.
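
The core of the change, visible throughout the diffs below, is that ebpf_epoch_enter no longer hands out a pointer to a spinlock-protected per-CPU state; callers now supply their own state. A condensed before/after sketch of the calling pattern (variable names taken from the diffs):

```
// Before: the epoch module assigns a state from a per-CPU table
// guarded by a spinlock, and the caller passes it back on exit.
ebpf_epoch_state_t* epoch_state = ebpf_epoch_enter();
ebpf_epoch_exit(epoch_state);

// After: the caller owns the state, removing the barriers and
// spinlock from the enter/exit hot path.
ebpf_epoch_state_t epoch_state = {0};
ebpf_epoch_enter(&epoch_state);
ebpf_epoch_exit(&epoch_state);
```
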
90 changes: 3 additions & 87 deletions docs/EpochBasedMemoryManagement.md
@@ -42,87 +42,8 @@
can only be returned to the OS once no active execution context could be
using that memory (i.e., when memory timestamp <=
_ebpf_release_epoch).

## Implementation details

Each execution context maintains its own state in the form of:

```
typedef struct _ebpf_epoch_state
{
    int64_t epoch; // The highest epoch seen by this epoch state.
} ebpf_epoch_state_t;
```

Each execution context must first call ebpf_epoch_enter prior to accessing any
memory that is under epoch protection and must call ebpf_epoch_exit once it is
done. The call to ebpf_epoch_enter returns a pointer to an ebpf_epoch_state_t
object that must be passed to ebpf_epoch_exit. The epoch module maintains a
table of per-CPU epoch states; an epoch state is assigned to an execution
context on ebpf_epoch_enter and returned on the matching ebpf_epoch_exit.
Threads running at passive IRQL block if no epoch states are available, while
threads running at dispatch IRQL use a reserved epoch state.
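
A minimal sketch of this pattern as described above (the protected read and its
helpers are illustrative, not part of the module's API):

```
// Enter the epoch, touch protected memory, then exit. Only the
// enter/exit calls come from the text above; the rest is assumed.
ebpf_epoch_state_t* epoch_state = ebpf_epoch_enter();
my_object_t* object = read_protected_pointer(); // hypothetical epoch-protected access
use_object(object);                             // safe while the epoch is held
ebpf_epoch_exit(epoch_state);
```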

Memory is allocated via calls to ebpf_epoch_allocate, which returns memory
prefixed with a private header, and is freed via calls to ebpf_epoch_free.
The private header records when the memory was freed and provides the list
links used to queue it. On free, the memory is stamped with the current epoch
and the current epoch is then atomically incremented. This ensures that the
freed memory always carries the correct epoch value. The memory is then
enqueued on a per-CPU free list. On epoch exit, the free list is scanned to
locate entries whose timestamp is older than the release epoch, and those
entries are returned to the OS.

Note:
A per-CPU free list is not necessary, but is instead an optimization to reduce
cross-CPU contention.

```
// There are two possible actions that can be taken at the end of an epoch.
// 1. Return a block of memory to the memory pool.
// 2. Invoke a work item, which is used to free custom allocations.
typedef enum _ebpf_epoch_allocation_type
{
    EBPF_EPOCH_ALLOCATION_MEMORY,
    EBPF_EPOCH_ALLOCATION_WORK_ITEM,
} ebpf_epoch_allocation_type_t;

typedef struct _ebpf_epoch_allocation_header
{
    ebpf_list_entry_t list_entry;
    int64_t freed_epoch;
    ebpf_epoch_allocation_type_t entry_type;
} ebpf_epoch_allocation_header_t;
```
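
A sketch of the free path described above, using the header just shown. The
global epoch variable is an assumption modeled on the text, and the list
helper is assumed to follow the platform's list API:

```
#include <stdatomic.h>

static _Atomic int64_t _ebpf_current_epoch = 1; // assumed global epoch clock

// Stamp the block with the current epoch, atomically advance the clock,
// then enqueue the block on this CPU's free list.
static void
epoch_free_sketch(ebpf_epoch_allocation_header_t* header, ebpf_list_entry_t* per_cpu_free_list)
{
    header->entry_type = EBPF_EPOCH_ALLOCATION_MEMORY;
    // atomic_fetch_add returns the prior value, so the stamp is the epoch
    // in effect at the moment of the free.
    header->freed_epoch = atomic_fetch_add(&_ebpf_current_epoch, 1);
    // Per-CPU list; avoids cross-CPU contention on the free path.
    ebpf_list_insert_tail(per_cpu_free_list, &header->list_entry);
}
```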

Determining the release epoch is necessarily an expensive operation, as it
requires scanning the epoch of every active execution context, with each
execution context protected by a spinlock. To limit the impact, the epoch
module uses a one-shot timer to schedule a DPC that computes the release
epoch as the minimum of all execution contexts' epochs. The timer is re-armed
when an execution context calls ebpf_epoch_exit. As a result, if no execution
contexts are active, the timer expires and is not re-armed.
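
The DPC's work reduces to a minimum scan over the per-CPU states. A sketch,
assuming an idle state is marked with an epoch of zero (the sentinel is an
assumption):

```
// Compute the release epoch as the minimum epoch across active execution
// contexts; in the real module each state is read under its spinlock.
static int64_t
compute_release_epoch_sketch(const ebpf_epoch_state_t* states, size_t count, int64_t current_epoch)
{
    int64_t release_epoch = current_epoch; // nothing active => everything is releasable
    for (size_t i = 0; i < count; i++) {
        if (states[i].epoch != 0 && states[i].epoch < release_epoch) {
            release_epoch = states[i].epoch;
        }
    }
    return release_epoch;
}
```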

## Exceptional cases

There are a few exceptional cases handled in the epoch module.

### Stale free lists

Memory that has been enqueued to an execution context's free list can become
stale if the execution context calls ebpf_epoch_exit while the free list still
holds memory that hasn't reached the release epoch. If no further calls are
made to ebpf_epoch_enter/exit, that memory would never be freed. To address
this, each time the timer runs it sets a "stale" flag on any epoch state whose
free list is non-empty, and ebpf_epoch_exit clears the flag. If the timer
observes that an epoch state is still marked as stale (i.e., ebpf_epoch_exit
hasn't been called since the timer's last invocation), it schedules a one-off
DPC in that execution context to flush the free list. The flush performs an
ebpf_epoch_enter/exit pair, which permits any expired entries in the free list
to be freed.
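
A sketch of that timer protocol. The per-CPU fields, the list-empty helper,
and the DPC helper are assumptions; only the set-then-observe handshake comes
from the text:

```
// Per-CPU bookkeeping assumed for illustration.
typedef struct _epoch_cpu_sketch
{
    ebpf_list_entry_t free_list;
    bool stale; // set by the timer; cleared by ebpf_epoch_exit
} epoch_cpu_sketch_t;

static void schedule_flush_dpc_sketch(epoch_cpu_sketch_t* cpu); // hypothetical helper

// Run on each timer invocation for each CPU.
static void
timer_check_sketch(epoch_cpu_sketch_t* cpu)
{
    if (ebpf_list_is_empty(&cpu->free_list)) {
        return; // nothing pending on this CPU
    }
    if (cpu->stale) {
        // No ebpf_epoch_exit since the last run: flush via a one-off DPC
        // that performs an ebpf_epoch_enter/ebpf_epoch_exit pair.
        schedule_flush_dpc_sketch(cpu);
    } else {
        cpu->stale = true;
    }
}
```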

### Work items
## Work items

In some cases code that uses the epoch module requires more complex
behavior than simply freeing memory on epoch expiry. To permit this
@@ -133,15 +54,10 @@
epoch). This is implemented as a special entry in the free list that
causes a callback to be invoked instead of freeing the memory. The callback
can then perform additional cleanup of state as needed.
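
A sketch of how such a work item travels the free list, building on the
ebpf_epoch_allocation_header_t shown earlier (the callback layout and the free
helper are assumptions):

```
// A work item is a free-list entry whose type requests a callback
// instead of a plain free.
typedef struct _epoch_work_item_sketch
{
    ebpf_epoch_allocation_header_t header; // entry_type == EBPF_EPOCH_ALLOCATION_WORK_ITEM
    void (*callback)(void* context);
    void* context;
} epoch_work_item_sketch_t;

// Called for each expired entry when the free list is flushed.
static void
process_expired_entry_sketch(ebpf_epoch_allocation_header_t* entry)
{
    if (entry->entry_type == EBPF_EPOCH_ALLOCATION_WORK_ITEM) {
        epoch_work_item_sketch_t* item = (epoch_work_item_sketch_t*)entry;
        item->callback(item->context); // custom cleanup instead of freeing
    } else {
        ebpf_free(entry); // EBPF_EPOCH_ALLOCATION_MEMORY: return the block to the pool
    }
}
```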

### Future investigations
## Future investigations
The use of a common clock leads to contention when the memory state changes
(i.e., when memory is freed). One possible workaround might be to move from a
clock driven by state change to one derived from a hardware clock. Initial
prototyping seems to indicate that the use of "QueryPerformanceCounter" and its
kernel equivalent are more expensive than using a state driven clock, but more
investigation is probably warranted.

The per-CPU lock does raise the cost of every ebpf_epoch_enter/exit operation,
and it might be possible to implement a lock-free scheme for tracking epoch
state, but current attempts have resulted in various bugs where edge conditions
result in incorrect release epoch computations.
investigation is probably warranted.
2 changes: 1 addition & 1 deletion include/ebpf_extension.h
@@ -61,7 +61,7 @@ typedef struct _ebpf_attach_provider_data
*/
typedef struct _ebpf_execution_context_state
{
struct _ebpf_epoch_state* epoch_state;
uint64_t epoch_state[4];
union
{
uint64_t thread;
20 changes: 12 additions & 8 deletions libs/execution_context/ebpf_core.c
Expand Up @@ -2397,7 +2397,8 @@ ebpf_core_invoke_protocol_handler(
_In_opt_ void (*on_complete)(_Inout_ void*, size_t, ebpf_result_t))
{
ebpf_result_t retval;
ebpf_epoch_state_t* epoch_state = NULL;
ebpf_epoch_state_t epoch_state = {0};
bool in_epoch = false;
ebpf_protocol_handler_t* handler = &_ebpf_protocol_handlers[operation_id];
ebpf_operation_header_t* request = (ebpf_operation_header_t*)input_buffer;
ebpf_operation_header_t* reply = (ebpf_operation_header_t*)output_buffer;
@@ -2474,7 +2475,8 @@
goto Done;
}

epoch_state = ebpf_epoch_enter();
ebpf_epoch_enter(&epoch_state);
in_epoch = true;
retval = EBPF_SUCCESS;

switch (handler->call_type) {
@@ -2537,18 +2539,19 @@
}

Done:
if (epoch_state) {
ebpf_epoch_exit(epoch_state);
if (in_epoch) {
ebpf_epoch_exit(&epoch_state);
}
return retval;
}

bool
ebpf_core_cancel_protocol_handler(_Inout_ void* async_context)
{
ebpf_epoch_state_t* epoch_state = ebpf_epoch_enter();
ebpf_epoch_state_t epoch_state = {0};
ebpf_epoch_enter(&epoch_state);
bool return_value = ebpf_async_cancel(async_context);
ebpf_epoch_exit(epoch_state);
ebpf_epoch_exit(&epoch_state);
return return_value;
}

@@ -2559,12 +2562,13 @@ ebpf_core_close_context(_In_opt_ void* context)
return;
}

ebpf_epoch_state_t* epoch_state = ebpf_epoch_enter();
ebpf_epoch_state_t epoch_state = {0};
ebpf_epoch_enter(&epoch_state);

ebpf_core_object_t* object = (ebpf_core_object_t*)context;
EBPF_OBJECT_RELEASE_REFERENCE_INDIRECT((&object->base));

ebpf_epoch_exit(epoch_state);
ebpf_epoch_exit(&epoch_state);
}

_Must_inspect_result_ ebpf_result_t
8 changes: 4 additions & 4 deletions libs/execution_context/ebpf_link.c
@@ -474,11 +474,11 @@ static ebpf_result_t
_ebpf_link_instance_invoke_batch_begin(
_In_ const void* client_binding_context, size_t state_size, _Out_writes_(state_size) void* state)
{
ebpf_execution_context_state_t* execution_context_state = (ebpf_execution_context_state_t*)state;
bool epoch_entered = false;
bool provider_reference_held = false;
ebpf_result_t return_value;
ebpf_link_t* link = (ebpf_link_t*)client_binding_context;
ebpf_execution_context_state_t* execution_context_state = (ebpf_execution_context_state_t*)state;

if (state_size < sizeof(ebpf_execution_context_state_t)) {
return_value = EBPF_INVALID_ARGUMENT;
@@ -491,7 +491,7 @@ _ebpf_link_instance_invoke_batch_begin(
goto Done;
}

((ebpf_execution_context_state_t*)state)->epoch_state = ebpf_epoch_enter();
ebpf_epoch_enter((ebpf_epoch_state_t*)(execution_context_state->epoch_state));
epoch_entered = true;

return_value = ebpf_program_reference_providers(link->program);
@@ -511,7 +511,7 @@ _ebpf_link_instance_invoke_batch_begin(
}

if (return_value != EBPF_SUCCESS && epoch_entered) {
ebpf_epoch_exit(((ebpf_execution_context_state_t*)state)->epoch_state);
ebpf_epoch_exit((ebpf_epoch_state_t*)(execution_context_state->epoch_state));
}

return return_value;
@@ -524,7 +524,7 @@ _ebpf_link_instance_invoke_batch_end(_In_ const void* extension_client_binding_c
ebpf_link_t* link = (ebpf_link_t*)extension_client_binding_context;
ebpf_assert_success(ebpf_state_store(ebpf_program_get_state_index(), 0, execution_context_state));
ebpf_program_dereference_providers(link->program);
ebpf_epoch_exit(execution_context_state->epoch_state);
ebpf_epoch_exit((ebpf_epoch_state_t*)(execution_context_state->epoch_state));
return EBPF_SUCCESS;
}

14 changes: 8 additions & 6 deletions libs/execution_context/ebpf_program.c
@@ -2194,7 +2194,8 @@ _ebpf_program_test_run_work_item(_In_ cxplat_preemptible_work_item_t* work_item,
uintptr_t old_thread_affinity;
size_t batch_size = options->batch_size ? options->batch_size : 1024;
ebpf_execution_context_state_t execution_context_state = {0};
ebpf_epoch_state_t* epoch_state = NULL;
ebpf_epoch_state_t epoch_state = {0};
bool in_epoch = false;
bool irql_raised = false;
bool thread_affinity_set = false;
bool state_stored = false;
@@ -2208,7 +2209,8 @@ _ebpf_program_test_run_work_item(_In_ cxplat_preemptible_work_item_t* work_item,
old_irql = ebpf_raise_irql(context->required_irql);
irql_raised = true;

epoch_state = ebpf_epoch_enter();
ebpf_epoch_enter(&epoch_state);
in_epoch = true;

ebpf_get_execution_context_state(&execution_context_state);
return_value =
@@ -2228,7 +2230,7 @@ _ebpf_program_test_run_work_item(_In_ cxplat_preemptible_work_item_t* work_item,
// Start a new epoch every batch_size iterations.
if (!batch_counter) {
batch_counter = batch_size;
ebpf_epoch_exit(epoch_state);
ebpf_epoch_exit(&epoch_state);
if (ebpf_should_yield_processor()) {
// Compute the elapsed time since the last yield.
end_time = ebpf_query_time_since_boot(false);
@@ -2245,7 +2247,7 @@ _ebpf_program_test_run_work_item(_In_ cxplat_preemptible_work_item_t* work_item,
// Reset the start time.
start_time = ebpf_query_time_since_boot(false);
}
epoch_state = ebpf_epoch_enter(epoch_state);
ebpf_epoch_enter(&epoch_state);
}
ebpf_program_invoke(context->program, context->context, &return_value, &execution_context_state);
}
@@ -2262,8 +2264,8 @@ _ebpf_program_test_run_work_item(_In_ cxplat_preemptible_work_item_t* work_item,
ebpf_assert_success(ebpf_state_store(ebpf_program_get_state_index(), 0, &execution_context_state));
}

if (epoch_state) {
ebpf_epoch_exit(epoch_state);
if (in_epoch) {
ebpf_epoch_exit(&epoch_state);
}

if (irql_raised) {
