Apache Cassandra Troubleshooting Guide

10/15/2025


Introduction

Apache Cassandra is one of the most popular NoSQL distributed databases, designed to handle massive data workloads with high availability and scalability. However, due to its distributed nature, administrators often face configuration, performance, and consistency issues.

This Cassandra Troubleshooting Guide (2025) will help you diagnose and fix the most common problems in your Cassandra cluster — from node failures to data inconsistencies — ensuring smooth performance and reliability.


1. Understanding Cassandra Troubleshooting

Before you start fixing issues, it’s crucial to understand Cassandra’s architecture:

  • Nodes → Individual database instances.

  • Clusters → Groups of nodes working together.

  • Keyspaces & Tables → Logical data structures.

  • Replication → Ensures fault tolerance across nodes.

When an issue occurs, Cassandra’s logs, metrics, and nodetool commands are your best friends for troubleshooting.
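
For example, a first pass at an unhealthy cluster usually starts from the command line. The commands below are a minimal sketch; the log path assumes a default package installation (/var/log/cassandra/system.log), so adjust it to your environment:

nodetool status                      # per-node state (UN = Up/Normal), load, and token ownership
nodetool describecluster             # cluster name, snitch, and schema agreement
grep -iE 'ERROR|WARN' /var/log/cassandra/system.log | tail -n 20   # recent errors and warnings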


2. Common Cassandra Problems and Their Solutions

a) UnavailableException: Cannot achieve consistency level

Cause:
One or more nodes required for a specific consistency level are not available.

Solution:

  • Check which nodes are down:

    nodetool status

  • Restart or repair failed nodes:

    nodetool repair

  • Lower the consistency level if the replication factor is low (see the example below):

    CONSISTENCY QUORUM;
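
For example, you can confirm the keyspace's replication factor and then set a weaker session-level consistency in cqlsh. The keyspace name my_keyspace is only a placeholder:

DESCRIBE KEYSPACE my_keyspace;   -- shows the replication strategy and factor per datacenter
CONSISTENCY LOCAL_ONE;           -- weaker per-session consistency when the replication factor is small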

b) WriteTimeoutException / ReadTimeoutException

Cause:
Cassandra didn’t receive responses from enough replicas within the timeout period.

Solution:

  • Increase timeout values in cassandra.yaml:

     
    write_request_timeout_in_ms: 10000
    read_request_timeout_in_ms: 10000
  • Avoid heavy queries scanning large partitions.

  • Add nodes or improve hardware performance.
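
To spot the heavy queries and oversized partitions that commonly cause these timeouts, per-table statistics are a good place to look. The keyspace and table names below are placeholders:

nodetool tablehistograms my_keyspace my_table   # partition size and latency percentiles
nodetool tablestats my_keyspace.my_table        # read/write counts, tombstones, max partition size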


c) NoHostAvailableException

Cause:
Client cannot connect to any Cassandra nodes.

Solution:

  • Check if Cassandra service is running:

     
    sudo service cassandra status
  • Ensure native transport port 9042 is open.

  • Verify correct IPs and contact points in your application driver.
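
As a quick connectivity check, verify that the native transport is listening and reachable from the client host. The IP address below is an example contact point:

sudo ss -ltnp | grep 9042     # is the native transport port listening on the node?
cqlsh 10.0.0.11 9042          # can cqlsh reach that node directly?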


d) OutOfMemoryError

Cause:
Heap space is insufficient for the workload.

Solution:

  • Edit heap size in jvm.options:

     
    -Xms4G
    -Xmx4G
  • Avoid large partitions and use pagination.

  • Use nodetool info to monitor memory usage.
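
To confirm whether heap pressure is really the culprit before changing jvm.options, the following read-only checks are a reasonable starting point:

nodetool info | grep -i heap    # current heap usage versus the configured maximum
nodetool gcstats                # GC pause statistics since the last time this command ran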


e) Corrupted SSTables or CommitLogs

Cause:
Power loss or hardware failures cause data file corruption.

Solution:

  • Run scrub tool:

     
    nodetool scrub keyspace_name table_name
  • Backup data regularly using snapshots.

  • Avoid abrupt shutdowns.
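
Before scrubbing, it is prudent to snapshot the affected keyspace so the original files can be recovered if needed; the tag and names below are illustrative:

nodetool snapshot -t pre_scrub my_keyspace       # tag a snapshot of the keyspace first
nodetool scrub my_keyspace my_table              # then rewrite the SSTables, skipping corrupt rows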


f) Tombstone Overload

Cause:
Frequent deletions create tombstones, slowing reads.

Solution:

  • Use TTL (Time-To-Live) for data expiry.

  • Avoid large deletes and updates.

  • Compact tables:

     
    nodetool compact
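
For example, TTLs can replace explicit deletes so data expires on its own; the table below is hypothetical:

-- expire individual rows automatically after 7 days (604800 seconds)
INSERT INTO my_keyspace.events (id, payload) VALUES (uuid(), 'example') USING TTL 604800;

-- or set a table-wide default TTL so explicit deletes (and their tombstones) are rarely needed
ALTER TABLE my_keyspace.events WITH default_time_to_live = 604800;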

g) Disk Full or Disk Failure

Cause:
Disk usage exceeds 90% due to large SSTables or commit logs.

Solution:

  • Clear old snapshots:

     
    nodetool clearsnapshot
  • Move commit logs to separate drives.

  • Monitor disk space regularly using df -h.
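
A quick disk triage might look like this, assuming the default data directory /var/lib/cassandra and an illustrative snapshot tag:

df -h /var/lib/cassandra           # usage of the data volume
nodetool listsnapshots             # which snapshots are holding space
nodetool clearsnapshot -t old_tag  # remove a specific snapshot by its tag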


๐Ÿ” 3. Key Troubleshooting Tools

Tool            Purpose
nodetool        Manage and inspect cluster status
cqlsh           Execute Cassandra Query Language (CQL) statements
sstableloader   Bulk-load SSTables into a cluster
OpsCenter       Visualize metrics and manage clusters
system.log      View error messages and warnings
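
As an example of the least familiar tool above, sstableloader streams a directory of SSTables into a live cluster. The host and path below are placeholders, and the last two path components must match the keyspace and table names:

sstableloader -d 10.0.0.11 /backups/my_keyspace/my_table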

4. Performance Troubleshooting

Slow Queries

  • Use TRACING ON in cqlsh to analyze query execution (CQL has no EXPLAIN PLAN); see the example after this list.

  • Avoid ALLOW FILTERING.

  • Use appropriate partition keys for efficient lookups.
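
For instance, tracing in cqlsh shows which nodes and steps a query touches; the table and key below are placeholders:

TRACING ON;
SELECT * FROM my_keyspace.orders WHERE customer_id = 42;   -- restricted to a single partition key
TRACING OFF;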

High Latency

  • Monitor GC (Garbage Collection) activity.

  • Increase concurrent reads/writes in cassandra.yaml.

  • Tune Linux kernel parameters (vm.swappiness, ulimit).
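
The settings below are illustrative starting points rather than recommendations; size them to your hardware and test under load:

# cassandra.yaml
concurrent_reads: 64
concurrent_writes: 96

# Linux kernel and OS limits
sudo sysctl -w vm.swappiness=1
ulimit -n 100000    # raises the open-file limit for the current shell; persist it via limits.conf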


5. Preventive Maintenance Tips

Regularly run:

nodetool repair
nodetool cleanup
nodetool compact

Monitor metrics using:

  • Prometheus + Grafana

  • DataStax OpsCenter

  • ELK Stack for log aggregation

Always test schema changes in a staging cluster before production rollout.
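
To make the routine tasks repeatable, you can schedule them. This cron entry is only a sketch; in practice, repairs should be staggered across nodes and the full path to nodetool may be required:

# Run a primary-range repair on this node every Sunday at 02:00
0 2 * * 0  nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1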


6. Cassandra Cluster Health Checklist

Check             Frequency   Command
Node Status       Daily       nodetool status
Disk Usage        Weekly      df -h
Repair Process    Weekly      nodetool repair
Compaction        Monthly     nodetool compact
Backup Snapshot   Weekly      nodetool snapshot

Conclusion

Effective Cassandra troubleshooting requires a good understanding of cluster behavior, monitoring tools, and common error patterns.
By following this guide and applying preventive maintenance, you can ensure your Cassandra cluster remains healthy, efficient, and ready for high-performance workloads.
