-
Notifications
You must be signed in to change notification settings - Fork 97
Synchronization Internals
Synchronization kicks in when a node joins or leaves the xenon node-group. Synchronization establishes the latest state and ownership of services in a node-group and replicates the state to all relevant peer nodes.
Synchronization only applies to Stateful services that use the REPLICATION ServiceOption. Xenon supports synchronization for both PERSISTED and NON-PERSISTED (in-memory) services.
This page covers the process and internals of Synchronization in-detail. It will be useful for readers to refresh themselves with the following prerequisites:
Synchronization gets kicked-off when a node joins or leaves the xenon node group (Figure A below). Technically speaking, synchronization is triggered whenever a node-group converges after going through a change. The process of node-group convergence is described in more detail here: Node-Group Convergence.
Each node in the node-group runs synchronization and goes through all factory services to determine if the node is the synchronization OWNER for that factory (Figure B below). Ownership here is determined by consistent-hashing through the factory self-link. After each node has established independently the factory services it owns, it tries to establish consensus by contacting other nodes in the node-group to ensure that everybody else also considers this node the Synchronization owner for the specific factory Service. This is done to avoid scenarios where two nodes may be running synchronization for the same Factory service. If however, two nodes conflict each other on ownership, this usually indicates that the node-group is going through flux and a later node-group convergence event will resolve the conflict and re-trigger synchronization.
Once ownership has been established, each node starts synchronizing child services for the factories it owns. Figure C below indicates that Node-C has been selected as the Synchronization owner for Examples Service i.e. /core/examples.
To synchronize the child services, each node kicks off an instance of the SynchronizationTaskService per factory. The Synchronization Task starts by making a broadcast query to compute the union of all documentSelfLinks for the child services of that factory (Figure D below). The broadcast query ensures that each child service known to any node in the node-group is considered during the synchronization process.
After computing the child service documentSelfLinks for the factory, the synchronization task starts making SYNCH-POST requests to owner nodes of these child services. SYNCH-POST is a HTTP POST request tagged with a specific pragma header to indicate to the child service owner node that it needs to synchronize this child service. Figure E indicates that Node "C" sent SYNCH-POST requests to A and E for child1 and child2, respectively.
The requests are forwarded to the owner nodes to ensure that the synchronization request goes through the same Operation processing pipeline that is used to process any other active updates to the child service. This increases reliability and reduces chances of conflict because of different nodes processing updates for the same service.
Also, notice that even though Node "E" just joined the node-group and will most likely not have any state for the child2, still it receives the SYNCH-POST request. This is because child service synchronization always considers the latest state of the child service from all peer nodes in the node-group. This will become more clear in the coming discussion.
Once the child service owner has received the SYNCH-POST request, it starts processing it by making a broadcast query to all peer nodes to get the latest version of the Service (Figure F). This is done through the NodeSelectorSynchronizationService
After the child service owner has received the latest state from it's peers, it will go over the results to compute the latest state. The process of computing latest state is critical for Xenon to recover service state reliably in the face of errors and network partitions. To do this, Xenon uses documentVersion and documentEpoch per Service Document. The documentVersion field in a monotonically increasing number indicating the number of updates the local service instance has processed since creation. The documentEpoch field is also a monotonically increasing number that gets incremented each time a child service owner changes. Given the above details, the algorithm to compute best state is as follows:
-
Select the documents from peer nodes that use the highest documentEpoch.
-
From the filtered documents, select the documents with the highest documentVersion.
-
If there are more than one documents using the highest documentEpoch and highest documentVersion, then xenon currently picks one of the documents randomly. Although, we could still have scenarios because of Network splits that two different documents may exist with the same epoch and document version, Xenon currently does not solve this problem.
Once the owner node has selected the latest state of the child service, it will broadcast the new state to all peer nodes in the node-group (Figure g). Also, if the latest state is different than the local state of the owner node, it will update it's state.
After each child-service has finished synchronizing, the SynchronizationTaskService will complete execution and mark the Factory service as AVAILABLE (Figure h).
The above approach for Synchronization provides the following benefits:
- Synchronization work-load gets uniformly distributed across all nodes in the node-group. This increases concurrency and reduces the total amount of time taken for sychronization.
- Each child service synchronization request goes through the owner nodes. This ensures that any in-flight update requests for the child services get serialized without running into state conflicts.
- The latest state of each child service gets computed and replicated to all peer nodes thus ensuring that the entire state of node-group becomes consistent.
The process of synchronization also gets invoked in scenarios other than node-group changes. These include:
-
Node Restarts: Generally speaking, for replicated services a node restart is no different than a node leaving and joining a Xenon node-group. However, for non-replicated services, xenon still needs to make sure that all locally stored child services get start. To handle this scenario, Xenon uses the same Synchronization process, with the caveat that the SYNCH-POST request in Figure d (above) always gets sent to the local node. Also the local node skips steps outlined in Figures f and g to avoid querying other nodes for the latest state and just considers the local state to start the service.
-
On-demand Synchronization: It is possible that while the synchronization process was running a child service that has not been synchronized yet, receives an update request (PATCH, PUT or DELETE). Instead of blocking the write request or failing it, Xenon detects that the local service needs to be synchronized at that time and kicks-off synchronization right away i.e. steps outlined by Figures f and g. After the service has been synchronized and the owner node has the latest state, xenon replays the original write request.
-
Synchronizing OnDemandLoad Services: OnDemandLoad Services in Xenon only get started as a result of user requests. If not accessed, Xenon stops these services to reduce the memory footprint of the Xenon host. Synchronization for these services is a little complicated such that these services never get synchronized as part of node-group changes. Instead, when an OnDemandLoad service is requested, at that time very similar to On-demand synchronization, Xenon performs synchronization for that child-service and then replays the original write request.
As discussed earlier, the SynchronizationTaskService is responsible for orchestrating synchronization of all child-services for a given FactoryService. The SynchronizationTaskService is a Stateful, in-memory service. An instance of the task is created per FactoryService. The FactoryService itself creates this instance when the FactoryService starts, usually during host start-up. The created task instance represents a never expiring service.
The SynchronizationTaskService represents a long-running preemptable task. Preemptability is important here because while the task is running, a new node-group change event may arrive that will require either the task to get cancelled or to get restarted. The task should be cancelled because the node is no longer the synchronization OWNER for the Factory Service. The task will get restarted if the node-group configuration changed but the node continues to be the owner of the FactoryService. The below diagram describes the state-machine of the SynchronizationTaskService.
a) The FactoryService at start-up time creates the synchronization-task in CREATED state.
b) The Synchronization-task receives a PUT request everytime a node-group change event occurs. This will move the task from CREATED to STARTED state and QUERY sub-stage
c) In the QUERY sub-stage, the Synchronization-task does a broadcast query to all peer nodes to determine the union of documentSelfLinks for that factory service. The query performed is paginated. If there are no results returned, the Synchronization-task jumps to the FINISHED state.
d) If we do find child-services, the Synchronization-task moves to the next sub-stage SYNCHRONIZE. In the SYNCHRONIZE sub-stage the task sends out SYNCH-POST requests to all OWNER nodes as discussed in the previous section.
e) After the task has sent SYNCH-POST requests for each child-service self-link in the current page, it checks if there are more pages to process for the broadcast query ran in step c.
f) If there are no more pages to process, the task jumps to the FINISHED state.
g) As discussed earlier, the synchronization-task is restartable for new node-group change events. If a new node-group change event occurs, the synch-task moves back to STARTED-QUERY stage.
h) It is possible that while the task is STARTED, a new node-group change event may occur and request synchronization. In that case, the PUT request resets the task back to RESTART sub-stage. The next time the task self-patches it detects that it was reset and the task ends up restarting it-self by going to the STARTED-QUERY stage.
While the node-group converges, it is possible that multiple node-group change events can occur, and sometimes even in the wrong order. So it is critical to only restart the Synchronization-task if the node-group change event it just received is actually newer. To detect this, the Synchronization-task uses the membershipUpdateTimeMicros property that is an increasing number set through NodeGroupService and acts as a version number of the node-group change event. The SynchronizationTaskService stores the membershipUpdateTimeMicros that caused the task to trigger. Only if the incoming request is for a higher membershipUpdateTimeMicros, the task will get reset to RESTART sub-stage.
While the synchronization-task is running on one node, a node-group change event can occur that ends up making a different node the owner for the FactoryService. When this happens there is a chance that the new owner and the old owner both start running the synchronization for the same FactoryService. To avoid the old owner to keep executing synchronization, the task re-checks ownership every time it starts processing a new broadcast query result page. If the task discovers that it is no longer the owner, the task cancels itself.
The above discussion covers the synchronization process and it's design. For developers who would like to further dig into the code, the below code-map has been added as a guide while browsing through the code.