Common Errors and Solutions in Apache Cassandra
Introduction
Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling large amounts of data across multiple nodes with no single point of failure. However, like any distributed system, Cassandra can present challenges during setup, configuration, and operation.
In this article, we’ll cover the most common Cassandra errors and their solutions to help you quickly diagnose and resolve issues, ensuring optimal performance and stability.
Error Message Example:
ReadTimeoutException: Operation timed out - received only 1 responses from 3 replicas
Cause:
This occurs when the coordinator node does not receive enough replica responses within the read timeout window. It’s usually due to:
High latency between nodes
Overloaded nodes
Incorrect read consistency level
Solution:
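Increase read_request_timeout_in_ms in cassandra.yaml (named read_request_timeout in Cassandra 4.1 and later). Use with caution, since a longer timeout can hide an overloaded cluster.
Tune the consistency level (e.g., use LOCAL_QUORUM instead of QUORUM for multi-datacenter clusters).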
Optimize the query and reduce data load on hot partitions.
Identify hot partitions using nodetool toppartitions, which samples the most frequently accessed partitions of a table.
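To see why switching from QUORUM to LOCAL_QUORUM reduces the number of replicas a coordinator waits for, the arithmetic can be sketched as follows. This is a minimal illustration, not driver code: the helper names and the two-datacenter replication settings are assumptions made for the example.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_responses(consistency: str, rf_by_dc: dict, local_dc: str) -> int:
    """Number of replica responses the coordinator must collect."""
    if consistency == "ONE":
        return 1
    if consistency == "LOCAL_QUORUM":
        # Only replicas in the coordinator's own datacenter count.
        return quorum(rf_by_dc[local_dc])
    if consistency == "QUORUM":
        # Quorum is computed over the replicas of all datacenters combined.
        return quorum(sum(rf_by_dc.values()))
    if consistency == "ALL":
        return sum(rf_by_dc.values())
    raise ValueError(f"unsupported consistency level: {consistency}")


# Hypothetical keyspace with three replicas in each of two datacenters.
rf = {"dc1": 3, "dc2": 3}
print(required_responses("QUORUM", rf, "dc1"))        # 4
print(required_responses("LOCAL_QUORUM", rf, "dc1"))  # 2
```

With these assumed settings, QUORUM needs responses from replicas in the remote datacenter as well, so cross-datacenter latency counts against the read timeout, while LOCAL_QUORUM does not.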
Error Message Example:
WriteTimeoutException: Operation timed out - received only 1 acknowledgments from 3 replicas
Cause:
The coordinator didn’t receive acknowledgments from the required number of replicas before the timeout period expired.
Solution:
Check for network latency or overloaded nodes.
Increase write_request_timeout_in_ms.
Verify sufficient disk space and performance.
Balance writes across partitions — avoid hot partition keys.
Error Message Example:
UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM
Cause:
This happens when the coordinator node cannot contact enough replicas to satisfy the requested consistency level.
Solution:
Check cluster health using nodetool status.
Ensure the required number of replicas are up and reachable.
Lower the consistency level temporarily to continue reads/writes, accepting weaker consistency guarantees until the missing replicas are back.
Replace failed nodes if necessary.
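A health check of this kind can be scripted by reading the two-letter status codes that nodetool status prints in front of each node (U or D for up/down, followed by N, L, J, or M for normal, leaving, joining, moving). The sketch below parses captured output. The sample text is hand-written for illustration, with made-up addresses and placeholder host IDs, and the function name is an assumption.

```python
SAMPLE_STATUS = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1  1.1 GiB   16      ?     id-1     rack1
UN  10.0.0.2  1.0 GiB   16      ?     id-2     rack1
DN  10.0.0.3  1.2 GiB   16      ?     id-3     rack1
"""


def parse_nodetool_status(output: str) -> list:
    """Return (status_code, address) pairs for every node line."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Node lines start with a two-letter code such as UN or DN.
        if (len(fields) >= 2 and len(fields[0]) == 2
                and fields[0][0] in "UD" and fields[0][1] in "NLJM"):
            nodes.append((fields[0], fields[1]))
    return nodes


nodes = parse_nodetool_status(SAMPLE_STATUS)
down = [address for code, address in nodes if code.startswith("D")]
print(f"{len(nodes) - len(down)} of {len(nodes)} nodes up; down: {down}")
```

In practice the text would come from running nodetool status on a cluster node. Note that the count of nodes that are up only approximates replica availability, since which nodes hold replicas for a given partition depends on the keyspace's replication settings.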
Error Message Example:
ReadFailureException: Too many tombstones encountered during scan
Cause:
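Cassandra marks deleted data with tombstones instead of removing it immediately. When too many tombstones exist in a partition, reads become expensive: Cassandra logs a warning once a query scans more than tombstone_warn_threshold tombstones (1,000 by default) and aborts the read at tombstone_failure_threshold (100,000 by default).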
Solution:
Avoid frequent updates/deletes on the same partition.
Use TTL (Time to Live) wisely.
Run nodetool compact to purge tombstones. Compaction can only remove a tombstone once gc_grace_seconds (10 days by default) has elapsed since the delete.
Monitor tombstone counts using nodetool tablestats.
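The timing rule behind tombstone cleanup is simple date arithmetic: a tombstone becomes eligible for removal by compaction only after gc_grace_seconds has passed since the delete. The sketch below illustrates this with the default of 864000 seconds (10 days). The function names and the example dates are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864000  # Cassandra's default, equal to 10 days


def purgeable_after(deleted_at: datetime,
                    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Earliest time at which compaction may drop the tombstone."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at: datetime, now: datetime,
                 gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True once the grace period for this tombstone has elapsed."""
    return now >= purgeable_after(deleted_at, gc_grace_seconds)


deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(purgeable_after(deleted))  # 2024-01-11 00:00:00+00:00
print(is_purgeable(deleted, datetime(2024, 1, 5, tzinfo=timezone.utc)))  # False
```

Running a compaction before that point rewrites the SSTables but leaves the tombstones in place.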
Error Message Example:
OutOfMemoryError: Java heap space
Cause:
Occurs when Cassandra runs out of heap memory due to excessive data or large compactions.
Solution:
Tune JVM heap size in cassandra-env.sh (recommended 8GB–16GB).
Avoid oversized SSTables and use leveled compaction for better memory control.
Monitor with nodetool gcstats and external tools like Prometheus + Grafana.
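When no heap size is set explicitly, cassandra-env.sh has traditionally computed one from system memory using the rule max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB)). The sketch below reproduces that rule as described in the script's comments. The function name is an assumption, and newer releases may size the heap differently, so check the script shipped with your version.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Heap size from the rule max(min(RAM/2, 1024MB), min(RAM/4, 8GB))."""
    half = min(system_memory_mb // 2, 1024)
    quarter = min(system_memory_mb // 4, 8192)
    return max(half, quarter)


for ram_gb in (2, 16, 64):
    heap = default_max_heap_mb(ram_gb * 1024)
    print(f"{ram_gb} GB RAM -> {heap} MB heap")
# 2 GB RAM -> 1024 MB heap
# 16 GB RAM -> 4096 MB heap
# 64 GB RAM -> 8192 MB heap
```

Under this rule the automatic value never exceeds 8 GB, so a larger heap has to be configured explicitly.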
Error Message Example:
NoHostAvailableException: All host(s) tried for query failed
Cause:
Occurs when the Cassandra client driver cannot connect to any of the nodes it was configured with.
Solution:
Verify Cassandra is running using sudo systemctl status cassandra.
Check firewall and port (default: 9042 for CQL).
Confirm correct IP addresses in cassandra.yaml (listen_address, rpc_address).
Restart Cassandra after fixing configuration issues.
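The firewall and port check can be done from the client machine with a plain TCP connection attempt. The sketch below is a minimal example using only the standard library. The function name and the address 127.0.0.1 are assumptions, and a successful connection only shows that the port is reachable, not that the node is healthy.

```python
import socket


def can_connect(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical node address; replace with one of your contact points.
    node = "127.0.0.1"
    state = "reachable" if can_connect(node) else "NOT reachable"
    print(f"{node}:9042 is {state}")
```

If the port is reachable from the Cassandra host itself but not from the client, a firewall rule or an rpc_address bound to the wrong interface is the likely cause.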
Error Message Example:
WriteFailureException: Insufficient disk space
Cause:
When disk utilization is high, Cassandra cannot perform writes or compactions.
Solution:
Regularly monitor disk usage with df -h.
Move commit logs or data directories to larger drives.
Configure disk_failure_policy to “stop” or “best_effort” depending on your tolerance.
Enable automatic cleanup or data archiving policies.
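Disk monitoring is easy to automate alongside df -h. The sketch below reports the used percentage of the filesystem that holds a given path and flags it above a threshold. The 80% threshold, the function names, and the use of "/" as the path are assumptions for illustration. Point it at your actual data and commit log directories.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_warning(path: str, warn_at: float = 80.0):
    """Return a warning string if usage is at or above warn_at, else None."""
    percent = disk_usage_percent(path)
    if percent >= warn_at:
        return f"{path}: {percent:.1f}% used (threshold {warn_at:.0f}%)"
    return None


if __name__ == "__main__":
    # "/" is a placeholder; use the Cassandra data directory in practice.
    print(disk_warning("/") or "disk usage below threshold")
```

Keeping some headroom matters because compaction writes new SSTables before it deletes the old ones, so a nearly full disk can block compactions before it blocks writes.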
Error Message Example:
Node x.x.x.x is not joining the cluster due to gossip failure
Cause:
Occurs when nodes cannot exchange gossip messages, typically because of network misconfiguration, mismatched cluster names, or inconsistent seed lists.
Solution:
Check that all nodes share the same cluster name in cassandra.yaml.
Ensure consistent seeds configuration.
Verify time synchronization using NTP.
Restart nodes and validate using nodetool gossipinfo.
Error Message Example:
ReadRepair is failing or data mismatch detected between replicas
Cause:
Inconsistent replicas due to missed repairs or node outages.
Solution:
Schedule regular anti-entropy repairs using nodetool repair.
Use nodetool verify to detect corruption.
For multi-datacenter clusters, prefer incremental repair.
Best Practices
Monitor regularly with tools like Prometheus, Grafana, and DataStax OpsCenter.
Use proper data modeling — design around queries, not normalization.
Perform regular backups and repairs.
Distribute load evenly to avoid hot partitions.
Keep Cassandra and Java versions updated.
Conclusion
Apache Cassandra is built for high availability and scalability, but operational errors can degrade performance if not handled properly. By understanding the common errors and solutions, you can ensure your Cassandra cluster runs smoothly and efficiently.