
Improve SDK concurrency and fault tolerance #132

Open
beligante opened this issue Sep 20, 2024 · 7 comments

@beligante (Contributor) commented Sep 20, 2024

Is your feature request related to a problem? Please describe.
We run the Erlang SDK at scale in our services, but recently our team started facing what we believe is an issue caused by the increased number of feature flag evaluations we perform at runtime. We have noticed occasional crashes in the SDK when some of its processes are too busy, either working through the messages in their mailboxes or communicating with the LaunchDarkly APIs.

Most of the problems start with this crash:
[screenshot of the crash]

What we observe here (using the module names for reference) is that ldclient_event_server attempts a synchronous call to ldclient_event_process_server, but that process is busy (our guess is that it is sending a batch of events to the LaunchDarkly server while a large backlog of messages sits in its mailbox), so the call times out and crashes the ldclient_event_server process. So what is the main problem here?

ldclient_event_server is a singleton process, which means that if it crashes, every other process that then tries to send it a message will also crash, because the process no longer exists.

This is the root cause of the problem we are trying to solve. We would like the SDK to handle these situations more gracefully, and beyond that we have a few additional suggestions.
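For reference, here is a minimal, self-contained sketch of this failure mode (hypothetical module and function names, not SDK code): a synchronous gen_server:call into a busy process exits the caller once the call timeout elapses.

-module(timeout_cascade_demo).
-behaviour(gen_server).
-export([start/0, slow_call/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start() ->
    gen_server:start({local, ?MODULE}, ?MODULE, [], []).

%% Simulates calling a busy process: the reply takes longer than the
%% caller's timeout, so the caller exits with {timeout, ...}.
slow_call() ->
    gen_server:call(?MODULE, slow, 1000).

init([]) ->
    {ok, #{}}.

handle_call(slow, _From, State) ->
    timer:sleep(5000), %% pretend we are busy flushing a large batch of events
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

If the caller is itself a registered singleton gen_server (as ldclient_event_server is), that exit takes it down, and later calls against its now-unregistered name exit with noproc, which is the cascading crash described above.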

Describe the solution you'd like

  • Make ldclient_event_server handle gen_server timeouts more gracefully, OR implement ldclient_event_process_server:get_last_server_time so that it is concurrently accessible without requiring message exchange between processes (see the sketch after this list). If that is not possible for some reason, perhaps add an abstraction on top so that we can choose the fidelity of the messages.
  • Make ldclient_event_process_server and ldclient_event_server pools of processes. Currently each is a single process responsible for providing and processing data for every feature flag evaluation. In a highly concurrent application these processes are certain bottlenecks, especially if one of them can be blocked while synchronizing data with the server.
  • If send_events is disabled, make sure that a call to ldclient.variation does not exchange messages with any process it does not need to; this only adds latency to the flag evaluation and reduces concurrency.
    • To be more explicit: when the function is called, a message is sent to ldclient_event_server, and that message is still processed by that gen_server.
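As a sketch of the first suggestion (hypothetical module, table, and key names; not a proposal for the SDK's exact implementation), the last-known server time could live in a public ETS table so readers never have to message the event process:

-module(last_server_time_sketch).
-export([init/0, put_time/1, get_time/0]).

%% Created once by the owning process, e.g. at SDK startup.
init() ->
    ets:new(last_server_time, [named_table, public, set, {read_concurrency, true}]).

%% Written only by the process that talks to the events endpoint.
put_time(MillisSinceEpoch) ->
    ets:insert(last_server_time, {server_time, MillisSinceEpoch}).

%% Read concurrently by any evaluating process; no message exchange needed.
get_time() ->
    case ets:lookup(last_server_time, server_time) of
        [{server_time, T}] -> T;
        [] -> 0
    end.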

Describe alternatives you've considered
Our current workaround in the production environment is to disable event synchronization with the server, which is not ideal because we would like the evaluation graphs to stay populated.

@kinyoklion (Member)

Hello @beligante,

Thank you for the feedback. Regarding get_last_server_time, I think we can make this concurrent without issue, and I will create a task to do that. (Filed internally as SDK-677)

There is the potential for ldclient_event_process_server to become a pool at some point, but it is unlikely that ldclient_event_server would be. Part of the purpose of this structure is managing the volume of events transmitted. ldclient_event_server only manages a queue with a maximum capacity, and batches are flushed asynchronously with respect to that process. More flushing processes would mean it is blocked less often, but we can remove the blocking behavior altogether by removing the relationship with get_last_server_time. If we make the other changes and there are still scaling issues, then we would consider adding more processes for flushing.
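A rough illustration of that shape (hypothetical names and capacity, not the SDK's actual code): the server only appends to a bounded queue, and the batch is handed off so the server itself does not wait on the HTTP round trip.

-module(bounded_event_queue_sketch).
-behaviour(gen_server).
-export([start_link/0, add_event/1, flush/0]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(MAX_EVENTS, 1000).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

add_event(Event) -> gen_server:cast(?MODULE, {add, Event}).
flush() -> gen_server:cast(?MODULE, flush).

init([]) -> {ok, #{events => [], count => 0}}.

handle_call(_Req, _From, State) -> {reply, ok, State}.

%% Capacity cap: drop events once the bounded queue is full.
handle_cast({add, _Event}, #{count := C} = State) when C >= ?MAX_EVENTS ->
    {noreply, State};
handle_cast({add, Event}, #{events := Es, count := C} = State) ->
    {noreply, State#{events := [Event | Es], count := C + 1}};
%% Hand the batch to a separate process so this server returns immediately.
handle_cast(flush, #{events := Es} = State) ->
    spawn(fun() -> deliver(lists:reverse(Es)) end),
    {noreply, State#{events := [], count := 0}}.

%% Placeholder for the HTTP POST to the events endpoint.
deliver(_Events) -> ok.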

We will also look into handling event disablement without requiring a message exchange. (Filed internally as SDK-678)

Thank you,
Ryan

@lucaspolonio

Hey @kinyoklion, do you have any updates on your internal tickets? Thanks!

@kinyoklion (Member)

Hello @lucaspolonio,

There are not any updates currently. We've not forgotten about this. We just have a number of projects to work through.

Thank you,
Ryan

@kinyoklion (Member)

@lucaspolonio I've put an initial version of the first change for this up in a PR. It uses ets for get_last_server_time to make it available concurrently.

@kinyoklion (Member) commented Oct 21, 2024

Hey @lucaspolonio,

I also have a question. For typical flag configurations we shouldn't be reading the last server time at all. Theoretically it should only happen when you are viewing full fidelity events using the event debugger, or potentially if you have used the API to enable debugging for a specific flag. Are you specifically enabling debug events for a large set of flags?

-spec should_add_debug_event(ldclient_event:event(), Tag :: atom()) -> boolean().
should_add_debug_event(#{data := #{debugEventsUntilDate := null}}, _Tag) -> false;
should_add_debug_event(#{data := #{debugEventsUntilDate := DebugDate}}, Tag) ->
    LastServerTime = ldclient_event_process_server:get_last_server_time(Tag),
    Now = erlang:system_time(milli_seconds),
    (DebugDate > Now) and (DebugDate >  LastServerTime).

So, potentially, there is something else going on that is causing this code path to be engaged more than desired. (From initial inspection and testing, there does not appear to be an issue here.)

Thank you,
Ryan

kinyoklion added a commit that referenced this issue Oct 22, 2024
@kinyoklion (Member)

The non-blocking changes for last-known-server-time have been released in 3.4.0.

Thank you,
Ryan

@beligante (Contributor, Author)

Hello @kinyoklion

Thanks for the quick support on the non-blocking changes; this definitely helped, though it was not enough to handle all the load we have for events. The bottleneck is now ldclient_event_server_default, which is reached through a synchronous call.

My original thought was to turn that process into a pool of processes that accept the events to be sent and reconcile them at some point, or to change the call to a cast to make it asynchronous.

[screenshot]
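A small sketch of that idea (hypothetical module name, worker names, and pool size; the workers themselves would be started elsewhere under a supervisor): casts fanned out over a fixed pool, so flag evaluation never waits on a single busy event server.

-module(event_pool_sketch).
-export([worker_name/1, add_event/1]).

-define(POOL_SIZE, 8).

%% Pick a worker deterministically from a key (here the caller's pid) so
%% load spreads across the pool without coordination between callers.
worker_name(Key) ->
    Index = erlang:phash2(Key, ?POOL_SIZE),
    list_to_atom("event_worker_" ++ integer_to_list(Index)).

%% Fire-and-forget: a cast never blocks the caller on a busy event server.
add_event(Event) ->
    gen_server:cast(worker_name(self()), {add_event, Event}).

The trade-off is weaker ordering and back-pressure, so events would need to be reconciled at flush time, as mentioned above.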
