Cassandra network architecture

Cassandra

Cassandra is an open source, column-oriented, distributed database designed to handle large volumes of data spread across multiple nodes while ensuring high availability, scalability.

Reduce the number of flexible nodes (Elastic Scalability) and accept some errors (Fault Tolerant). It was developed by Facebook and is still growing and used for the largest social network in the world. In 2008, Facebook passed it over to the open source community and continues to grow by Apache today. Cassandra is considered a combination of Amazon’s Dynamo and Google’s BigTable. The server nodes in the Cassandra cluster are identical in peer-to-peer design, without any system components being a bottle-neck failure.

Design of Cassandra is designed dispersed across thousands of servers without any dead spots yet focused. Cassandra is designed based on a Peer – to – Peer architecture all of the server nodes in the system have the same role and no server node acts as the central server. This server failure can lead to a complete system shutdown like traditional host-guest architectures.

Cassandra’s server nodes are independent and engage in connection with other server nodes in the system. Each node can handle data read and write operations, regardless of whether the data is physically stored on which server in the system.
When a node in the system fails and stops working, read and write operations can be handled by other nodes in the system. This process is completely transparent to the application, allowing the system failure to be hidden for those applications.

In Cassandra, each data object can be cloned and stored on multiple servers. If one of the servers that holds a data version is corrupted or is not the latest updated version, Cassandra has a sync mechanism to ensure read operations always return the latest data. . This mechanism is implemented in the read repair process instead of synchronizing in the write operation, this allows for increased performance in data writing operation.

Data scattering in Cassandra

Cassandra uses a Distributed consistent hashing mechanism that organizes server nodes in a circular format and the data is distributed around this circle according to a consistent hash function. Each circle is considered a Datacenter.

Nodes in a Cassandra cluster will be distributed on a ring called a ring (Figure above). Each node will be assigned a key value, Cassandra uses 127 bits to generate this key. Each node in the ring manages a value range of keys. The range of the determined key spreads evenly from the value of the node itself holds, going counter-clockwise until it reaches the first node then stops. Looking at Figure 8, we will see that the range of keys that node T-1 manages is in the range (T-0; T-1]
When a record is written to the Cassandra cluster. will be passed through a consistent hash function, which returns a 127bit key value, which is in the control of which node the record will be written to that node.

For example, we have the values ​​on the name fields hashed as follows:

Partition Key Hash value
Jim -2245452657672322382
Carol 7723358928203680754
Johnny -6756552657672322382
Suzy 1168658928203680754


With Cassandra, we have two fragmentation strategies (determining the position of each node in the ring)
Random partitioning: This is Cassandra’s default and recommended strategy, the location of nodes is determined entirely through MD5 hash array. Key ranges range from 0 to 2 ^ 127- 1
Ordered partitioning: The fragmentation strategy ensures that the nodes are arranged in the same order and the range of keys each node owns is the same.
With the first fragmentation strategy, if the output hash values ​​match nodes, all records will be distributed evenly across the cluster. Adding or removing each node from the cluster is also easier by not having to redistribute the other nodes.

With the second fragmentation strategy, when the nodes are evenly distributed and the key management scope is the same, those have quite obvious disadvantages:
Difficulty in balance in the cluster: Every time adding or removing One node out of the cluster, the administrator would have to manually rebalance the cluster to ensure the nodes were evenly distributed.
If the data is written sequentially, it may happen that batch of data is written to a node. Causes an imbalance in the cluster.

Comment: With both of the above fragmentation strategies, weaknesses are still revealed, when the number of nodes in the ring is too small, or the nodes are unevenly distributed according to the hash value of the inserted records, it is very easy to put in. to imbalance, overload in the cluster. In addition, when adding or removing a node from a ring, it takes a lot of cluster rebalancing

Virtual node

To solve this problem, we have a solution that is to use virtual nodes. The virtual node looks like a component of the circle in the system, but in essence the virtual node is just the mapping of one physical node to another address in the ring. When data enters the virtual node’s management area, it is stored at the virtual node’s physical node. Each physical node when participating in the ring will be assigned a position of that node and assigned a number of other positions (considered as the virtual node of that node). Cassandra configures by default each ring join node will be assigned 256 virtual nodes in the ring.
The picture above shows an inner circle with 4 physical nodes, each node is assigned 7 virtual nodes, so a total of 32 key partitions on the circle.

So what is the effect of the virtual button, when evenly distributing the virtual buttons around the circle, the number of nodes increases, making the key partition small, the small key partition means a lot in the distribution. Cassandra cluster data, smaller partitioning and closer nodes bring the system closer to the fact that all data will be distributed evenly across nodes, the probability of data being put into nodes is balanced. Each other when on a small key space we have all the virtual or physical buttons. The best case scenario is that the physical buttons have their own presence all over the ring.


The partition key manages with and without virtual buttons

Data replication in Cassandra

To satisfy the availability and continuity in Cassandra, each data object can be replicated and stored on multiple servers. If one of the servers saves a corrupted version of the data or is the old version, not the latest version with the latest data update, Cassandra has a sync mechanism to ensure read operations always return. about the latest data. At the same time, Cassandra performs a read repair which is the implicit process to update all replication servers of the data to the latest state. Cassandra organizes the server nodes into clusters in a circular format and the data is distributed around this circle according to a Distributed consistent hashing table. If each Cassandra data is backed up on N nodes, when a key k is decided to save to a node, that node will be considered the moderator node. The coordinator is responsible for distributing that record to the other N-1 node according to the principle: from the coordinator, going clockwise, data will be written to the next 2 nodes encountered.

The above figure shows that when key k is determined to be written to node B, node B will act as a coordinator, rotating that key for the next 2 nodes, node C and node D. thus, node D will store The keys in the region (A; D] The list of keys in this region is called the linked list of node D.

Passing the values ​​of key k to other nodes applies to all write, update, or delete operations. Because deciding the number of nodes to be immediately rotated each time a write takes place has a direct impact on the consistency of the system. In the configuration of Cassandra Apache we have an index of “replicatioon_factor” and “w”. Index “replication_factor will” be set right after initiating a key_space, which is the number of nodes in the ring that will be used to backup data. The metric “w” when configuring Cassandra is the number of nodes that return the result of a required write operation for the action to be considered successful.

As shown in Figure 12, when we set replication_factor = 3 and w = 2, when key k is written, there should be at least 2 of the 3 nodes B, C, and D responding to a successful write. considered successful. The setting of the index “w” shows us how much cost we can spend to ensure immediate data consistency.
Duplication of the data also affects the consistency of the system. Consistency level in both aspects that is read and write data. To maintain data consistency, Cassandra provides users with a variety of read and write consistency levels. From top to bottom, it is possible to adjust consistency based on two configuration parameters “w” and “r” along with the “replication_factor” metric. In which, “w” is the number of nodes to be returned on successful read, “r” is the number of nodes returned on successful read. If consistency is a priority, we can set “w” and “r” for sure

w + r> replication_factor

And make sure “w” or “r” is always less than the replication_factor for better latency.
Suppose that replication_factor = 3, then the best 2 values ​​of “w” and “r” will be 2. That is, each time reading and writing data, at least 2 nodes need to return the value, the task is considered success. And when a read or write task does, it’s always guaranteed to be performed on the latest data that the previous task performed.

Communication between buttons in Cassandra

Every time a Cassandra cluster adds or removes a node from the cluster, the data in the cluster will have to be redistributed. When adding a node, that node takes a piece of other node’s data when it is allocated to 256 virtual nodes. When a node is removed from the cluster, the node’s data will have to be spread evenly to the other nodes. In Cassandra, nodes communicate with each other via the Gossip protocol.
Gossip is a protocol used to update information about the state of other nodes participating in the cluster. This is a peer-to-peer communication protocol in which each node periodically exchanges its status information with the other nodes to which they are associated. The gossip process runs every second and exchanges information with at most three other nodes in the cluster. Nodes exchange information about themselves and with the nodes they have exchanged, in this way all nodes can quickly understand the state of all the remaining nodes in the cluster. A gossip packet includes its associated version, so in every gossip exchange, the old information is overwritten by the latest information in some nodes.

When a node is started, it looks at the cassandra.yaml configuration file to determine the name of the cluster that hosts it and the other nodes in the cluster are configured in the file, known as the seed node.
To prevent interruption in gossip communication, all nodes in the cluster must have the same seed node list listed in the configuration file. Because, most collisions are generated when a node is started. By default, a node will have to remember which nodes it gossiped on even upon reboot, and the seed node will have no other purpose but to update a new node when it joins the cluster. That is, when a node joins the cluster, it communicates with the seed nodes to update the status of all other nodes in the cluster.

In clusters with multiple data centers, the seed node list should contain at least one seed node per data center, otherwise when a new node joins the cluster, it will contact a seed node located on the data center. other. It is also not advisable to leave every node as a seed node as it will decrease the performance of the gossip and make it difficult to maintain. Optimizing the gossip is not as important as it is advisable to use a small list of seed nodes, usually 3 seed nodes per data center.