The cassandra.yaml File and the Nodetool Utility in Cassandra


When deploying Cassandra, configuring the cassandra.yaml file and using the nodetool utility are two of the most important tasks.

The cassandra.yaml configuration file

This file contains the settings used when creating a new cluster or when adding a new node to an existing cluster. Review it and change it as needed before a node is started for the first time. The settings control how a node operates in the cluster, including communication between nodes, data partitioning, and replica placement.

broadcast_address: If your cluster is deployed across multiple Amazon EC2 regions and uses EC2MultiRegionSnitch, set broadcast_address to the node’s public IP address and listen_address to its private IP address. If broadcast_address is not set, it defaults to the value of listen_address.
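The fallback rule for broadcast_address can be sketched in a few lines of Python. This only illustrates the behavior described above and is not Cassandra's own code; the function name effective_broadcast_address and the sample addresses are made up for the example.

```python
def effective_broadcast_address(listen_address, broadcast_address=None):
    """Return the address a node would advertise to other nodes.

    Mirrors the rule described in the text: when broadcast_address is
    unset (None or empty), the node falls back to listen_address.
    """
    if broadcast_address:
        return broadcast_address
    return listen_address


# Hypothetical multi-region EC2 node: private IP to listen on, public IP to advertise.
print(effective_broadcast_address("10.0.1.15", "203.0.113.10"))

# Hypothetical single-region node: broadcast_address left unset.
print(effective_broadcast_address("10.0.1.15"))
```

The first call shows the multi-region case, where the two addresses differ. The second shows the default, where both settings end up with the same value.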

cluster_name: The name of the cluster. This value must be the same on every node in the cluster.

commitlog_directory: The path where the commit log is stored. For the best write performance, DataStax recommends keeping the commit log on a different drive from the data files (ideally a separate physical device, not just a separate logical partition).

data_file_directories: The paths where the column family data (SSTables) is stored.

initial_token: Assigns the node's token position in the ring and, with it, the range of data the node owns when it first starts up. You can leave initial_token unset when adding a new node to an existing cluster. Otherwise, the token value depends on the partitioner in use. For RandomPartitioner, it is a number from 0 to 2^127 - 1. For ByteOrderedPartitioner, it is a byte array of hex values based on your actual row key values. For OrderPreservingPartitioner and CollatingOrderPreservingPartitioner, it is a UTF-8 string based on your actual row key values.
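A common way to choose initial_token values for the random partitioner is to space the nodes evenly around the ring. The sketch below assumes a ring of 2^127 token positions and a single data center with equally sized nodes; the function name balanced_tokens is made up for the example.

```python
def balanced_tokens(node_count):
    """Return one evenly spaced token per node for a ring of 2**127 positions.

    Assumption: every node should own an equal share of the ring, so
    node i gets the token i * 2**127 / node_count (rounded down).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]


# Print one candidate initial_token per node for a four-node cluster.
for index, token in enumerate(balanced_tokens(4)):
    print("node %d: initial_token = %d" % (index, token))
```

Each printed value could be used as the initial_token of one node. With several data centers or nodes of different capacity, this even spacing would need to be adjusted.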

listen_address: The IP address or hostname that other Cassandra nodes use to connect to this node. If you leave it blank, hostname resolution must be configured correctly on all nodes in the cluster so that the hostname resolves to this node's correct IP address (using /etc/hostname, /etc/hosts, or DNS).

partitioner: I covered this setting in the article on partitioners and replication.

rpc_address: The listen address for remote procedure calls (client connections). To listen on all configured interfaces, set the value to 0.0.0.0. If you leave it blank, Cassandra uses the node's hostname configuration, so hostname resolution must be set up correctly (using /etc/hostname, /etc/hosts, or DNS). Default value: localhost. Allowed values: an IP address, a hostname, or blank to resolve the address from the hostname configured on the node.

rpc_port: The port for remote procedure calls (client connections) and the Thrift service. The default value is 9160.

saved_caches_directory: The path where the column family key caches and row caches are saved.

seed_provider: A pluggable interface that provides the list of seed nodes. Seed nodes in the list are separated by commas.

seeds: When a node joins the cluster, it contacts the seed nodes to learn the topology of the ring and to obtain gossip information about the other nodes in the cluster. Every node in the cluster must have the same list of seed nodes. In a cluster spread across multiple data centers, each data center must contain at least one seed node.
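The two seed rules above (identical seed lists everywhere, at least one seed per data center) are easy to check before starting a cluster. The following is a minimal sketch, assuming you have already collected each node's seed list and know which data center each seed lives in; the function check_seed_config and its input format are hypothetical, not part of Cassandra.

```python
def check_seed_config(node_seeds, seed_datacenter, datacenters):
    """Return a list of problems found in a cluster's seed configuration.

    node_seeds: maps each node address to the seed list in its cassandra.yaml.
    seed_datacenter: maps each seed address to the data center it belongs to.
    datacenters: names of all data centers in the cluster.
    """
    problems = []

    # Rule 1: every node must list the same seeds (order is ignored here).
    distinct_lists = {frozenset(seeds) for seeds in node_seeds.values()}
    if len(distinct_lists) > 1:
        problems.append("nodes do not share the same seed list")

    # Rule 2: every data center must contain at least one seed node.
    all_seeds = set().union(*node_seeds.values())
    for dc in sorted(datacenters):
        if not any(seed_datacenter.get(seed) == dc for seed in all_seeds):
            problems.append("data center %s has no seed node" % dc)

    return problems


# Hypothetical two-data-center cluster with one seed in each data center.
seeds = ["10.0.0.1", "10.1.0.1"]
node_seeds = {"10.0.0.1": seeds, "10.0.0.2": seeds, "10.1.0.1": seeds}
seed_datacenter = {"10.0.0.1": "DC1", "10.1.0.1": "DC2"}
print(check_seed_config(node_seeds, seed_datacenter, ["DC1", "DC2"]))
```

An empty list means both rules hold for the configuration that was passed in.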

storage_port: The port for internal communication between nodes. The default value is 7000.

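endpoint_snitch: Sets the snitch, which determines each node's location in the network topology and how requests are routed. The snitches available in Cassandra include SimpleSnitch, PropertyFileSnitch, GossipingPropertyFileSnitch, RackInferringSnitch, EC2Snitch, and EC2MultiRegionSnitch.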





Nodetool utility

Nodetool is a command-line interface used to manage a cluster.

Command structure:

nodetool [-h HOSTNAME] [-p JMX_PORT] COMMAND [ARGUMENTS]


Most nodetool commands operate on a single node in the cluster unless -h is used to identify one or more other nodes. The following commands operate on the whole cluster: rebuild, repair, taketoken.

If you run a command on the node you want to control, you do not need the -h option. Otherwise, use -h to identify the node you want to control.
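When nodetool is called from a script, the optional -h argument can be added only when a remote node is targeted. The helper below is a small sketch of that idea; the function name nodetool_command, the host address, and the keyspace and table names are hypothetical, and the helper only builds the argument list without running anything.

```python
def nodetool_command(command, *args, host=None):
    """Build the argument list for a nodetool invocation.

    command: the nodetool command name, for example "info".
    args: any further arguments for that command.
    host: target node; omitted when the command is meant for the local node.
    """
    argv = ["nodetool"]
    if host:
        # -h identifies the node to control when it is not the local one.
        argv += ["-h", host]
    argv.append(command)
    argv.extend(args)
    return argv


# Local node: no -h needed.
print(" ".join(nodetool_command("info")))

# Remote node (hypothetical address, keyspace, and table).
print(" ".join(nodetool_command("cfstats", "demo_ks.users", host="10.0.0.2")))
```

The resulting list can be passed to subprocess.run on a machine where nodetool is installed.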

cfhistograms: Provides statistics about a table, including the number of SSTables, read/write latency, partition (row) size, and cell count.

cfstats: Provides statistics about one or more tables. You can use dot notation (keyspace.table) to specify one or more keyspaces and tables. If you do not specify a keyspace and table, Cassandra shows statistics for all tables.

cleanup: Use this command to remove unneeded data after adding a node to the cluster. Cassandra does not automatically remove data from nodes that have lost part of their partition range to a new node. Run nodetool cleanup on the source node and on the neighboring nodes that shared the same subrange, once the new node has been added. If you do not run this command after adding a node, Cassandra includes the old data when rebalancing the load on that node. Running nodetool cleanup temporarily increases disk usage in proportion to the size of the largest SSTable. Disk I/O occurs while the command runs.

Running this command affects nodes that have counter columns in a table. Cassandra assigns a new counter ID to the node.

Optionally, this command takes a list of table names. If you do not specify a keyspace, this command cleans up all keyspaces, removing data that no longer belongs to the node.

clearsnapshot: Deletes snapshots in one or more keyspaces. To delete all snapshots, omit the snapshot name.

compact: This command starts the compaction process on tables that use SizeTieredCompactionStrategy. You can specify a keyspace for compaction. If you do not specify a keyspace, nodetool uses the current keyspace. You can also select one or more tables for compaction. If you do not specify a table, all tables in the keyspace are compacted. This is called a major compaction. If you specify a table, only that table is compacted. This is called a minor compaction. A major compaction merges all existing SSTables into a single SSTable. During compaction, disk space usage and disk I/O spike temporarily because the old and new SSTables exist side by side.

compactionhistory: Provides a history of compaction attempts.

compactionstats: Shows compaction statistics. The total column shows the total number of uncompressed bytes of the SSTables being compacted. The system log lists the names of the compacted SSTables.

decommission: Deactivates a node by streaming its data to another node.

describering: Provides the partition ranges of a keyspace.

disableautocompaction: Disables autocompaction for a keyspace and one or more of its tables.

disablebackup: Disables incremental backup.

disablebinary: Disables the binary protocol, also known as the native transport.

disablegossip: Disables the gossip protocol. This command is often used to simulate a node being down.

disablehandoff: Disables the storing of future hints on the current node.

disablethrift: Disables the Thrift server.

drain: Flushes all memtables of a node and causes the node to stop accepting write operations. Read operations continue to work. You typically use this command before upgrading a node to a new version of Cassandra.

enableautocompaction, enablebackup, enablebinary, enablegossip, enablehandoff, enablethrift: Reverse the effect of the disableautocompaction, disablebackup, disablebinary, disablegossip, disablehandoff, and disablethrift commands, respectively.

flush: Flushes the memtables of one or more tables to disk.

getcompactionthreshold: Provides the minimum and maximum compaction thresholds for a table, expressed as numbers of SSTables.

getendpoints: Provides the endpoints (nodes) that own a partition key.

getsstables: Provides the SSTables that contain a partition key.

getstreamthroughput: Provides the throughput limit for streaming in the system, in megabits per second.

gossipinfo: Provides gossip information for the cluster.

info: Provides node information, including tokens, disk usage (load), the generation number (which changes each time the node starts), uptime in seconds, and heap memory usage.

invalidatekeycache: Resets the global key cache parameter, key_cache_keys_to_save, to its default, which saves all keys. By default, key_cache_keys_to_save is not set in cassandra.yaml.

invalidaterowcache: Resets the global row cache parameter, row_cache_keys_to_save, to its default.

join: Causes the node to join the ring (cluster), assuming the node was initially started outside the ring with the -Djoin_ring=false option of the cassandra utility. The node's seed list must be configured correctly