In the current implementation, after all data in a batch is received, raft saves the log entries into the log store and performs pre-commit. The end_of_append_batch step then ensures that all data is written. However, if an IO error occurs between the save_log_entry and end_of_append_batch stages, HomeStore may get stuck (https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L454) or crash (https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L873).
For example:
t1: handle_raft_event passes (all data received).
t2: Append the log to the log store and add the rreq to the state machine.
t3: Pre-commit passes.
t4: The async write fails at on_push_data_received or on_fetch_data_received, which triggers handle_error and a crash.
t5: end_of_append_batch waits for all data to be written and gets stuck or crashes.
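To illustrate the t5 problem, here is a minimal sketch of a batch-end wait that is woken by failed writes as well as successful ones, so the waiter can react to the error instead of blocking forever. `BatchWriteTracker`, `on_write_done` and `wait_all` are hypothetical names for illustration only; they are not HomeStore APIs, and the real code path may need a different shape.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not part of HomeStore): tracks the data writes of one append batch.
class BatchWriteTracker {
public:
    explicit BatchWriteTracker(std::size_t pending) : pending_(pending) {}

    // Called from the write completion path, on success and on failure alike.
    void on_write_done(bool ok) {
        {
            std::lock_guard< std::mutex > lg(mtx_);
            assert(pending_ > 0); // every write must report exactly once
            if (!ok) { failed_ = true; }
            --pending_;
        }
        cv_.notify_all();
    }

    // Called at the end of the batch. Blocks until every write has reported,
    // then returns true only if all of them succeeded.
    bool wait_all() {
        std::unique_lock< std::mutex > lk(mtx_);
        cv_.wait(lk, [this] { return pending_ == 0; });
        return !failed_;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::size_t pending_;
    bool failed_{false};
};
```

The point of the sketch is only that the error path must also decrement the pending count; what the caller does when `wait_all()` returns false (reject the batch, retry, and so on) is left open here.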
Note that we cannot simply unlink in handle_error when an IO error occurs (as in the following case), so we need a solution that handles the error more gracefully. One potential approach is emergent gc.
t1: Handle the raft event and pass.
t2: Append log to the log store and save the LSN into the state machine.
t3: Fail to write due to an IO error, and then trigger handle error and remove it from the state machine.
t4: In the pre-commit phase, can't find the LSN, leading to a nullptr exception.
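The second timeline can be sketched as follows. This is a simplified model, not the HomeStore code: `ReplReq`, `LsnTable` and `pre_commit` are hypothetical stand-ins for the rreq, the state machine's LSN map and the pre-commit step. It only shows that pre-commit has to tolerate an LSN that the error path already removed, rather than dereferencing the lookup result unconditionally.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a replication request.
struct ReplReq {
    int64_t lsn{0};
    bool precommitted{false};
};

// Hypothetical stand-in for the state machine's LSN -> request map.
class LsnTable {
public:
    void add(std::shared_ptr< ReplReq > req) {
        assert(req != nullptr);
        const int64_t lsn = req->lsn;
        std::lock_guard< std::mutex > lg(mtx_);
        reqs_.insert_or_assign(lsn, std::move(req));
    }

    // What the error path at t3 does: drop the request for this LSN.
    void remove(int64_t lsn) {
        std::lock_guard< std::mutex > lg(mtx_);
        reqs_.erase(lsn);
    }

    // Returns nullptr if the LSN is not (or no longer) present.
    std::shared_ptr< ReplReq > find(int64_t lsn) const {
        std::lock_guard< std::mutex > lg(mtx_);
        auto it = reqs_.find(lsn);
        return it == reqs_.end() ? nullptr : it->second;
    }

private:
    mutable std::mutex mtx_;
    std::unordered_map< int64_t, std::shared_ptr< ReplReq > > reqs_;
};

// Pre-commit that reports a missing LSN instead of dereferencing a null pointer.
bool pre_commit(LsnTable& table, int64_t lsn) {
    auto req = table.find(lsn);
    if (!req) { return false; } // removed by the error path at t3
    req->precommitted = true;
    return true;
}
```

A null check alone does not settle what should happen to the log entry that was already appended at t2; it only turns the crash into a condition the caller can handle.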
We also need a flip to mock the IO error.
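As a rough idea of what such a hook could look like in a test, here is a generic fault-injection sketch. It deliberately does not use the real flip API: `FaultInjector`, `arm`, `should_fail`, `write_data` and the point name `data_write_error` are all hypothetical and only illustrate the pattern of failing a write a fixed number of times.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <system_error>
#include <unordered_map>

// Hypothetical fault injector (not the flip API): each named point fails a fixed number of times.
class FaultInjector {
public:
    // Make the next `count` checks of `name` report a failure.
    void arm(const std::string& name, int count) {
        assert(count >= 0);
        std::lock_guard< std::mutex > lg(mtx_);
        remaining_[name] = count;
    }

    // Returns true (and consumes one failure) if the point is still armed.
    bool should_fail(const std::string& name) {
        std::lock_guard< std::mutex > lg(mtx_);
        auto it = remaining_.find(name);
        if (it == remaining_.end() || it->second <= 0) { return false; }
        --it->second;
        return true;
    }

private:
    std::mutex mtx_;
    std::unordered_map< std::string, int > remaining_;
};

// Hypothetical write path: returns io_error when the fault point fires, success otherwise.
std::error_code write_data(FaultInjector& fi) {
    if (fi.should_fail("data_write_error")) { return std::make_error_code(std::errc::io_error); }
    return {}; // the real write would happen here
}
```

With something like this, a test can arm the point after save_log_entry and before end_of_append_batch to reproduce both timelines above.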