Apache Cassandra Troubleshooting Guide
Introduction
Apache Cassandra is one of the most popular NoSQL distributed databases, designed to handle massive data workloads with high availability and scalability. However, due to its distributed nature, administrators often face configuration, performance, and consistency issues.
This Cassandra Troubleshooting Guide (2025) will help you diagnose and fix the most common problems in your Cassandra cluster — from node failures to data inconsistencies — ensuring smooth performance and reliability.
1. Understanding Cassandra Troubleshooting
Before you start fixing issues, it’s crucial to understand Cassandra’s architecture:
Nodes → Individual database instances.
Clusters → Groups of nodes working together.
Keyspaces & Tables → Logical data structures.
Replication → Ensures fault tolerance across nodes.
When an issue occurs, Cassandra’s logs, metrics, and nodetool commands are your best friends for troubleshooting.
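For a quick first look at a node, a minimal sketch like the one below can help; it assumes a default package installation (the log path /var/log/cassandra/system.log may differ on your setup):

```bash
# Quick first-look checks on a single node
nodetool status        # ring view: UN = Up/Normal, DN = Down/Normal
nodetool info          # uptime, heap usage, and data load for this node

# Recent warnings and errors (path assumes a default package install)
tail -n 100 /var/log/cassandra/system.log
grep -iE "ERROR|WARN" /var/log/cassandra/system.log | tail -n 20
```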
2. Common Cassandra Problems and Their Solutions
a) UnavailableException: Cannot achieve consistency level
Cause:
One or more nodes required for a specific consistency level are not available.
Solution:
Check which nodes are down:
nodetool status
Restart or repair the failed nodes:
nodetool repair
Reduce the consistency level in cqlsh if the replication factor cannot satisfy the requested one, for example:
CONSISTENCY ONE;
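Putting the steps together, a hedged sketch of the recovery sequence (keyspace and table names are placeholders):

```bash
# 1. Identify down replicas (DN in the first column of the output)
nodetool status

# 2. After restarting a failed node, bring its replicas back in sync
nodetool repair my_keyspace            # 'my_keyspace' is a placeholder

# 3. If the replication factor cannot satisfy the requested level, lower it in cqlsh:
#      cqlsh> CONSISTENCY ONE;
#      cqlsh> SELECT * FROM my_keyspace.my_table LIMIT 10;
```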
b) WriteTimeoutException / ReadTimeoutException
Cause:
Cassandra didn’t receive responses from enough replicas within the timeout period.
Solution:
Increase timeout values in cassandra.yaml:
write_request_timeout_in_ms: 10000
read_request_timeout_in_ms: 10000
Avoid heavy queries scanning large partitions.
Add nodes or improve hardware performance.
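To illustrate the "avoid heavy queries" advice, the sketch below contrasts an unbounded scan with a query restricted to one partition (the keyspace, table, and columns are invented for the example):

```bash
# Unbounded scans across many partitions are a common cause of read timeouts:
#   SELECT * FROM shop.orders;
# Restricting the query to a single partition and limiting the page size is far cheaper:
cqlsh -e "SELECT order_id, total FROM shop.orders WHERE customer_id = 42 LIMIT 100;"
```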
c) NoHostAvailableException
Cause:
Client cannot connect to any Cassandra nodes.
Solution:
Check if Cassandra service is running:
sudo service cassandra status
Ensure the native transport port 9042 is open.
Verify correct IPs and contact points in your application driver.
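A quick connectivity check might look like this (the hostname is a placeholder; run the first two commands on the Cassandra host and the rest from the client machine):

```bash
# On the Cassandra host: is the service running and listening on 9042?
sudo service cassandra status
ss -lntp | grep 9042

# From the client machine: is the port reachable, and can cqlsh connect?
nc -zv cassandra-host.example.com 9042
cqlsh cassandra-host.example.com 9042
```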
d) OutOfMemoryError
Cause:
Heap space is insufficient for the workload.
Solution:
Edit the heap size in jvm.options:
-Xms4G
-Xmx4G
Avoid large partitions and use pagination.
Use nodetool info to monitor memory usage.
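A minimal sketch of the heap check and fix, assuming a package install with jvm.options under /etc/cassandra:

```bash
# Current and maximum heap usage on this node
nodetool info | grep -i heap

# In /etc/cassandra/jvm.options, pin the heap to a fixed size (Xms = Xmx avoids resize pauses):
#   -Xms4G
#   -Xmx4G

# Restart the node so the new heap settings take effect
sudo service cassandra restart
```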
e) Corrupted SSTables or CommitLogs
Cause:
Power loss or hardware failures cause data file corruption.
Solution:
Run scrub tool:
nodetool scrub keyspace_name table_name
Backup data regularly using snapshots.
Avoid abrupt shutdowns.
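A cautious repair sequence could look like the sketch below (keyspace and table names are placeholders; sstablescrub is the offline counterpart shipped with Cassandra):

```bash
# Snapshot first so the pre-scrub files remain recoverable
nodetool snapshot -t before_scrub keyspace_name

# Rebuild readable SSTables online, skipping corrupted rows
nodetool scrub keyspace_name table_name

# If the node refuses to start, stop Cassandra and run the offline tool instead
sstablescrub keyspace_name table_name
```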
f) Tombstone Overload
Cause:
Frequent deletions create tombstones, slowing reads.
Solution:
Use TTL (Time-To-Live) for data expiry.
Avoid large deletes and updates.
Compact tables:
nodetool compact
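As an example of writing data with a TTL instead of deleting it later, the sketch below uses an invented keyspace and table; the compaction command then targets that one table:

```bash
# Data written with a TTL expires on its own, avoiding explicit DELETE tombstones
cqlsh -e "INSERT INTO app.sessions (session_id, user_id) VALUES (uuid(), 42) USING TTL 86400;"

# Major compaction of a single table; expired cells and old tombstones are only purged
# once gc_grace_seconds has passed
nodetool compact app sessions
```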
g) Disk Full or Disk Failure
Cause:
Disk usage exceeds 90% due to large SSTables or commit logs.
Solution:
Clear old snapshots:
nodetool clearsnapshot
Move commit logs to separate drives.
Monitor disk space regularly using df -h.
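A short disk-space triage sketch (the data directory path assumes a default package install):

```bash
# Which mounts are filling up?
df -h

# Snapshots share the data disk with live SSTables; list and clear old ones
nodetool listsnapshots
nodetool clearsnapshot -t old_backup   # tag is a placeholder; recent versions need --all to clear everything

# Per-keyspace disk usage of the data directory
du -sh /var/lib/cassandra/data/*
```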
3. Key Troubleshooting Tools
| Tool | Purpose |
|---|---|
| nodetool | Manage and inspect cluster status |
| cqlsh | Execute Cassandra Query Language (CQL) statements |
| sstableloader | Bulk load SSTables into a cluster |
| OpsCenter | Visualize metrics and manage clusters |
| system.log | View error messages and warnings |
4. Performance Troubleshooting
Slow Queries
Use TRACING ON in cqlsh to analyze how a query executes (CQL has no EXPLAIN PLAN); see the sketch below.
Avoid ALLOW FILTERING.
Use appropriate partition keys for efficient lookups.
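A tracing session in cqlsh could look like the following sketch (the keyspace, table, and key are invented for illustration):

```bash
# Interactive tracing in cqlsh:
#   cqlsh> TRACING ON;
#   cqlsh> SELECT * FROM app.events WHERE device_id = 'abc-123' LIMIT 50;
#   cqlsh> TRACING OFF;
# The trace lists coordinator and replica steps with timings, including how many
# live rows and tombstone cells were read.
```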
High Latency
Monitor GC (Garbage Collection) activity.
Increase concurrent_reads / concurrent_writes in cassandra.yaml.
Tune Linux kernel parameters (vm.swappiness, ulimit).
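The checks below sketch one way to inspect GC pressure and the relevant kernel settings (the log path assumes a package install; target values are common recommendations, not requirements):

```bash
# Long GC pauses are logged by GCInspector in system.log
grep -i "GCInspector" /var/log/cassandra/system.log | tail -n 20

# Swapping hurts Cassandra latency; keep swappiness at or near zero
sysctl vm.swappiness
sudo sysctl -w vm.swappiness=1

# Open-file limit for the current shell; the cassandra user usually needs a much higher limit
ulimit -n
```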
5. Preventive Maintenance Tips
Regularly run (a cron sketch follows this list):
nodetool repair
nodetool cleanup
nodetool compact
Monitor metrics using:
Prometheus + Grafana
DataStax OpsCenter
ELK Stack for log aggregation
Always test schema changes in a staging cluster before production rollout.
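For the routine commands above, one hedged way to schedule them is a cron entry per task (the schedule, keyspace name, and log paths are placeholders; adjust them to your own repair strategy):

```bash
# /etc/cron.d/cassandra-maintenance -- illustrative schedule only
# Weekly repair of one keyspace, Sundays at 02:00
0 2 * * 0  cassandra  nodetool repair my_keyspace >> /var/log/cassandra/repair.log 2>&1
# Weekly cleanup after topology changes, Sundays at 04:00
0 4 * * 0  cassandra  nodetool cleanup >> /var/log/cassandra/cleanup.log 2>&1
```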
6. Cassandra Cluster Health Checklist
| Check | Frequency | Command |
|---|---|---|
| Node Status | Daily | nodetool status |
| Disk Usage | Weekly | df -h |
| Repair Process | Weekly | nodetool repair |
| Compaction | Monthly | nodetool compact |
| Backup Snapshot | Weekly | nodetool snapshot |
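The checklist could be wrapped in a small script run on each node; the sketch below assumes the default data directory and only prints information (no destructive actions):

```bash
#!/usr/bin/env bash
# Minimal health report: ring status, disk usage, existing snapshots
set -euo pipefail

echo "== Node status =="
nodetool status

echo "== Disk usage =="
df -h /var/lib/cassandra      # data directory path assumes a package install

echo "== Snapshots on this node =="
nodetool listsnapshots | tail -n 20
```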
Conclusion
Effective Cassandra troubleshooting requires a good understanding of cluster behavior, monitoring tools, and common error patterns.
By following this guide and applying preventive maintenance, you can ensure your Cassandra cluster remains healthy, efficient, and ready for high-performance workloads.