Option to define rack topology aware configuration #1801

Open
andrey-dubnik opened this issue Mar 27, 2023 · 8 comments

@andrey-dubnik
Contributor

Hi,

Maybe this is a newbie question and there is already a good way to configure a rack-aware topology...

Currently I can't see a way to specify a rack topology for the cluster node deployment, which means that even with multiple data copies there is a risk of all copies being placed on nodes within the same availability zone (AZ). An affinity block can make sure pods are distributed across zones, but there is no zonal topology awareness other than node.name and cluster.name.

It would be great if the Infinispan Operator allowed nodes to be grouped into racks. The use case is running nodes across different failure domains within the same cluster, so that data is guaranteed to be replicated to a different availability zone.

Here is an example of how this is implemented in the k8ssandra operator, where k8s labels are associated with racks. This configuration results in nodes being provisioned with the rack attributes configured, and a generated required node affinity block makes sure the pods scheduled for a given rack are only placed on the nodes with the corresponding labels.

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
...
spec:
  cassandra:
    ...
    racks:
    - name: az-1
      nodeAffinityLabels:
        cassandra-rack: az1
    - name: az-2
      nodeAffinityLabels:
        cassandra-rack: az2
    - name: az-3
      nodeAffinityLabels:
        cassandra-rack: az3
@andrey-dubnik andrey-dubnik changed the title Option to define multiple rack configuration Option to define rack topology aware configuration Mar 27, 2023
@ryanemerson
Contributor

@andrey-dubnik How many Infinispan replicas do you want to reside in each availability zone? If a single replica is sufficient, then it should be possible to use the anti-affinity configuration to achieve this:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: infinispan-pod
            clusterName: <cluster_name>
            infinispan_cr: <cluster_name>
        topologyKey: "topology.kubernetes.io/zone"

https://infinispan.org/docs/infinispan-operator/main/operator.html#anti-affinity-strategies_availability

For multiple replicas in each of the availability zones, you could replicate the rack-aware topology you describe by having multiple Infinispan clusters. Each cluster would have its affinity settings configured so that scheduled pods reside in their given AZ, and its site backups would be the clusters in the other AZs.
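
For illustration, each per-AZ cluster could look something like the sketch below. This is only a rough outline: the names, namespace, zone values and the exposure type are placeholders, and the sites fields follow the cross-site replication docs for clusters running in the same Kubernetes cluster.

# Cluster pinned to AZ 1; repeat with adjusted names/zones for az-2 and az-3.
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan-az-1
spec:
  replicas: 3
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["az-1"]
  service:
    type: DataGrid
    sites:
      local:
        name: az-1
        expose:
          type: ClusterIP
      locations:
      - name: az-2
        clusterName: infinispan-az-2
        namespace: ispn
      - name: az-3
        clusterName: infinispan-az-3
        namespace: ispn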

@andrey-dubnik
Contributor Author

andrey-dubnik commented Mar 28, 2023

@ryanemerson indeed, if the total instance count is 3 then we are good with the affinity configuration; the problem surfaces when the cluster grows beyond 3 nodes in a distributed cache topology. Having multiple clusters can be an option, but it also introduces the overhead of managing cache configurations and Protobuf schemas across all the clusters participating in replication, handling the client connections, etc.

Another potential problem is that a replicated setup is exactly that: it does not distribute the data, and each replicated cluster contains a full data copy. That is not very optimal, as for a very large cluster we would replicate the memory footprint 3 times over.

If a rack option were available for Infinispan it would be beneficial to use, as it simplifies cluster management, reduces infrastructure cost and increases availability when working within k8s.

Regarding the data replica count, I was thinking of keeping 2-3 replicas to cover all AZs and adding more nodes when we need to increase cluster capacity.
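
For the caches themselves, something like the minimal sketch below is what I have in mind, e.g. two owners per segment so every entry has a copy in a second AZ. It uses the operator's Cache CR; the cache and cluster names are placeholders.

apiVersion: infinispan.org/v2alpha1
kind: Cache
metadata:
  name: example-cache
spec:
  clusterName: example-infinispan
  template: |
    distributedCache:
      mode: "SYNC"
      owners: "2"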

@ryanemerson
Contributor

ryanemerson commented Mar 28, 2023

@andrey-dubnik Infinispan already provides Server Hinting to ensure that data is replicated appropriately amongst the replicas; however, this isn't currently utilised by the Operator as there's a performance concern (ISPN-12505) that needs to be addressed. If implemented and combined with the appropriate affinity configuration, would this satisfy your requirements, or do you have a further need to explicitly state where individual pods should reside?

@andrey-dubnik
Contributor Author

@ryanemerson I was thinking of using TopologyAwareSyncConsistentHashFactory, based on my understanding of the documentation statement "When Infinispan distributes the copies of your data, it follows the order of precedence: site, rack, machine, and node" on how Infinispan would distribute the data. If the topology is rack aware then Infinispan will do its best to prioritise putting the copies into different racks, and in k8s the rack can be an availability zone, a node pool, etc.

When it comes to Server Hinting, my understanding is that hinting equals data pinning: if I specify a rack it will stick the data to that specific rack ID and won't distribute it across multiple racks. If my understanding is correct, this won't achieve the desired outcome of placing the data copies into different availability zones.

Is my understanding correct? If it is, then hinting likely won't help if a cluster node is not aware of the rack.

@ryanemerson
Contributor

When it comes to Server Hinting, my understanding is that hinting equals data pinning: if I specify a rack it will stick the data to that specific rack ID and won't distribute it across multiple racks. If my understanding is correct, this won't achieve the desired outcome of placing the data copies into different availability zones.

It's the other way around. With server hinting, if the "rack" field is specified on the individual pods and multiple racks exist, then the hash ensures that the primary and backup replica(s) for a given segment are distributed across distinct racks.
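
For reference, the hint in question is the "rack" attribute on the cache container transport. In the server's YAML configuration that would look roughly like the sketch below; I'm assuming the YAML keys mirror the XML <transport> attributes, and the rack value itself is a placeholder that the Operator would have to derive per pod.

infinispan:
  cacheContainer:
    transport:
      cluster: "${infinispan.cluster.name}"
      stack: "kubernetes"
      rack: "az-1"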

@andrey-dubnik
Contributor Author

andrey-dubnik commented Mar 28, 2023

@ryanemerson alright, so Server Hinting actually describes the node topology, and if each node has a different cache-container transport hint, Infinispan will account for that when distributing the data. Will there be an option to source the Server Hinting data from the k8s labels?

I would like to achieve the following outcome:

  • be able to schedule cluster pods across multiple node pools, as we found cloud providers do not always schedule workloads evenly across AZs, so we have opted to create a node pool per AZ to get 100% consistency in the server configuration. In the case of k8ssandra, the combination of racks and node count provides the desired outcome, as nodes are assigned to racks in a cycle, e.g. 4 nodes would be placed into az-1, 2, 3, 1 (see the spread-constraint sketch after this list)
  • be able to make the cluster aware of the node topology, in this case the rack name
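
For the even-spread part, this is roughly what I have in mind at the plain pod-spec level: a standard topologySpreadConstraints block reusing the operator's pod labels from the earlier anti-affinity example. Whether and how the Infinispan CR would expose this is exactly the open question.

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: infinispan-pod
        clusterName: <cluster_name>
        infinispan_cr: <cluster_name>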

@ryanemerson
Contributor

I think we'll need to provide two things to satisfy your requirements:

  1. Allow named "racks" to be defined on the Infinispan CR itself, with pods being assigned to them cyclically, as you describe for Cassandra. On creation, each individual pod then has a corresponding infinispan.org/rack=... label, and this value is used to configure the "rack" attribute of the JGroups transport.

  2. Allow users to define the topology spread constraints (topologySpreadConstraints) applied by the StatefulSet. This would allow users to reference the "rack" label defined in 1 and ensure that pods are scheduled as required by their use case; a sketch of what this might look like follows below.
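
Purely to illustrate, the combined CR might end up looking something like the sketch below. Neither field exists today, so treat every name here as illustrative only.

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan
spec:
  replicas: 6
  # Hypothetical field from point 1: pods are assigned to racks cyclically,
  # each pod gets an infinispan.org/rack=<name> label and the value is passed
  # to the "rack" attribute of the JGroups transport.
  racks:
  - name: az-1
  - name: az-2
  - name: az-3
  # Hypothetical field from point 2: spread constraints applied to the
  # StatefulSet pod template so each rack lands in a distinct zone.
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: infinispan-pod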

@andrey-dubnik
Contributor Author

andrey-dubnik commented Mar 28, 2023

@ryanemerson this looks like it

There may be one nice-to-have feature: each rack running in a dedicated StatefulSet, e.g. <cluster-name>-az-1,2,3, maintaining the affinity label rules for the rack placement.

This may be needed because with the current affinity scheduling there is no 100% guarantee that the workload will be distributed across AZs, although nodes are almost always created in a cycling fashion across AZs by the cloud provider within a single AZ-aware node pool.
