Fixing Cassandra ReadTimeoutExceptions Caused by Tombstones

If your Apache Cassandra cluster is suddenly throwing ReadTimeoutException or "Scanned over 100000 tombstones" errors, you are likely hitting the tombstone failure threshold. This happens when a node attempts to merge data from multiple SSTables but encounters too many "deleted" markers before finding enough live data to satisfy the query. In production environments running Cassandra 4.x or earlier, this often results in p99 latency spikes that can bring your application to a halt. You can mitigate it quickly by triggering manual compactions, or by lowering gc_grace_seconds, provided your repair schedule completes well within the shortened window.

TL;DR — Identify offending tables via nodetool tablestats, reduce gc_grace_seconds to accelerate purging, and move time-series tables to TimeWindowCompactionStrategy (TWCS) to prevent future buildup.

Symptoms of Tombstone Overload

💡 Analogy: Imagine searching for a book in a library where 90% of the shelves are filled with "Out of Print" bookmarks instead of actual books. The librarian (Cassandra) spends all their time reading the bookmarks to see if a book used to be there, eventually giving up because it's taking too long to find the one book you actually asked for.

When tombstones accumulate beyond the tombstone_failure_threshold (default is 100,000 in cassandra.yaml), the node stops the read operation to protect itself from OutOfMemory (OOM) errors. You will see the following trace in your system.log:

ERROR [ReadStage-4] 2023-10-27 10:15:22,123 StorageProxy.java:2345 - Scanned over 100000 tombstones during query 'SELECT * FROM users_keyspace.active_sessions WHERE id = ...' (last scanned row key was ...); query aborted

In your application logs, this manifests as a generic ReadTimeoutException. This is misleading because it isn't necessarily a network issue; it is a computational timeout. When I managed a 50-node cluster, we found that even 1,000 tombstones per read request could increase latencies from 5ms to 250ms, well before the hard failure threshold was reached.
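Both thresholds are configurable in cassandra.yaml. The values below are the stock defaults; verify them against your own cluster's version before relying on them:

```yaml
# Stock defaults in cassandra.yaml (check your version's file for exact names):
tombstone_warn_threshold: 1000       # log a warning once a read scans this many tombstones
tombstone_failure_threshold: 100000  # abort the read entirely beyond this point
```

The warn threshold is your early-warning system: if your logs are full of tombstone warnings, you are on the road to the hard failure described above.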

The Root Causes of Tombstone Buildup

Frequent Deletes and Updates

In Cassandra, a DELETE is actually an INSERT of a tombstone. Similarly, setting a column to null or updating a collection (List/Set/Map) creates a tombstone for the old value. If your application logic relies on "Delete before Insert" patterns or frequently clears data, you are creating a dense layer of markers that Cassandra must filter through during every read operation.
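As an illustration, assume the active_sessions table from earlier has a hypothetical last_seen column and a tags set (both invented here for the example). Each of the following statements writes a tombstone rather than removing data in place:

```sql
-- Hypothetical columns (last_seen, tags) for illustration only.
DELETE FROM users_keyspace.active_sessions WHERE id = 42;
-- Writes a row-level tombstone; nothing is physically removed yet.

UPDATE users_keyspace.active_sessions SET last_seen = null WHERE id = 42;
-- Setting a column to null writes a cell tombstone.

UPDATE users_keyspace.active_sessions SET tags = {'mobile', 'eu'} WHERE id = 42;
-- Replacing a whole collection writes a range tombstone over its previous contents.
```

The collection case surprises many teams: a read-modify-write that overwrites a Set or List on every request quietly generates a tombstone per write.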

Compaction Lag

Cassandra removes tombstones only during the compaction process, and only after the gc_grace_seconds period has passed. If your compaction throughput is lower than your write throughput, SSTables accumulate. When a read request spans five different SSTables, and each contains 20,000 tombstones for the same partition key, you hit the 100,000 threshold immediately.

Overlapping SSTables

Tombstones cannot be deleted if the data they are supposed to "cover" exists in a different SSTable that is not being compacted at the same time. This is common with SizeTieredCompactionStrategy (STCS), where old data might sit in a large SSTable for weeks, preventing the deletion of tombstones in smaller, newer SSTables.
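When overlapping SSTables are blocking purges under STCS, Cassandra's compaction subproperties can allow a single SSTable to be compacted by itself once its droppable-tombstone ratio crosses a threshold. A sketch, using the documented defaults for the two threshold values (test this on a staging cluster first, as unchecked_tombstone_compaction skips a safety check):

```sql
ALTER TABLE users_keyspace.active_sessions
WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.2',             -- consider single-SSTable compaction at 20% droppable tombstones
  'tombstone_compaction_interval': '86400', -- but only for SSTables at least a day old
  'unchecked_tombstone_compaction': 'true'  -- skip the overlap pre-check that often blocks these compactions
};
```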

How to Clear the Tombstone Backlog

Step 1: Lower gc_grace_seconds

The default gc_grace_seconds is 864,000 (10 days). This window exists so that if a node goes down, it has 10 days to recover and receive tombstone information via repair before the tombstone is purged. If you run repairs daily, or at least comfortably within the shortened window, you can safely lower this to 1 or 2 days (86,400 or 172,800 seconds). Use the following CQL command:

ALTER TABLE users_keyspace.active_sessions WITH gc_grace_seconds = 86400;
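You can confirm the change took effect by reading the schema back from the system_schema keyspace (available in Cassandra 3.0 and later):

```sql
SELECT gc_grace_seconds
FROM system_schema.tables
WHERE keyspace_name = 'users_keyspace'
  AND table_name = 'active_sessions';
```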

Step 2: Trigger User-Defined Compaction

Changing the setting doesn't delete tombstones immediately; you must trigger a compaction to sweep them up. For a quick fix, you can run a "major" compaction, though use this cautiously: it creates one giant SSTable and can cause disk I/O spikes. On STCS tables, nodetool compact accepts a -s (--split-output) flag that splits the result into multiple SSTables instead of one monolith.

nodetool compact users_keyspace active_sessions

⚠️ Common Mistake: Do not set gc_grace_seconds to 0 unless you are absolutely sure you will never need to recover a node from a snapshot or hint. Setting it to 0 can result in "zombie data," where deleted items reappear because the tombstone was purged before all replicas were updated.

Verification and Monitoring

After compaction, you should verify that the tombstone-to-live-cell ratio has improved. Use nodetool tablestats to inspect the "Maximum tombstones per slice" metric. If the number is still high, you may have "shadowed" data that hasn't been compacted yet.

nodetool tablestats users_keyspace.active_sessions | grep -i "tombstones per slice"

An ideal "Average tombstones per slice" should be less than 10. If you see values in the thousands, your read performance will remain degraded. Monitor the org.apache.cassandra.metrics.Table.TombstoneScannedHistogram metric in your dashboard (Grafana/Prometheus) to catch these spikes before they hit the failure threshold.

Long-term Prevention Strategies

Use TTL Instead of DELETE

If your data expires (e.g., session tokens, logs), use Time To Live (TTL) at the application level. Cassandra handles TTL-based expiration more efficiently than manual deletes, especially when combined with TimeWindowCompactionStrategy (TWCS).
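For example, if sessions should expire after one day, a per-write TTL (or a table-level default) lets data age out with no DELETE statements at all. The token column and the daily TWCS window below are illustrative assumptions, not a prescription:

```sql
-- Per-write TTL: the row expires 86,400 seconds after the insert.
INSERT INTO users_keyspace.active_sessions (id, token)
VALUES (42, 'abc123') USING TTL 86400;

-- Or set a table-wide default and pair it with TWCS for time-series data.
ALTER TABLE users_keyspace.active_sessions
WITH default_time_to_live = 86400
AND compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
};
```

With TWCS, data written in the same window lands in the same SSTable, so once everything in a window has expired, the whole SSTable can simply be dropped.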

Switch to LeveledCompactionStrategy (LCS)

For workloads with many updates and deletes, LeveledCompactionStrategy is often superior to the default STCS. LCS keeps SSTables small and ensures that 90% of reads can be satisfied by a single SSTable, which limits the number of tombstones the coordinator has to scan.
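Switching strategies is a single schema change, but it triggers recompaction of existing data across the cluster, so schedule it during low traffic. A minimal sketch (160 MB is the commonly cited default SSTable target size for LCS):

```sql
ALTER TABLE users_keyspace.active_sessions
WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};
```

Note that LCS trades this read benefit for higher write amplification; it is a poor fit for write-heavy, append-only workloads.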

📌 Key Takeaways
  • Identify tombstone hotspots using nodetool tablestats.
  • Lower gc_grace_seconds if you have a frequent repair schedule.
  • Run nodetool compact to force the removal of expired tombstones.
  • Avoid data models that require frequent updates to the same row.

Frequently Asked Questions

Q. Can I just increase the tombstone_failure_threshold?

A. While you can increase it in cassandra.yaml, it is rarely recommended. Increasing the threshold allows queries to scan more tombstones, which consumes massive amounts of heap memory and can lead to Stop-the-World GC pauses or node crashes. It hides the symptom rather than fixing the cause.

Q. How does gc_grace_seconds affect data consistency?

A. It defines how long a tombstone is guaranteed to stay in the cluster. If a node is down longer than this period and you haven't run a repair, that node might miss the tombstone. When it comes back online, it will treat the old data as "new" and replicate it back to other nodes.

Q. Why does SELECT COUNT(*) trigger tombstone errors?

A. A count operation must scan every single row and tombstone in the partition or table. Because it touches every marker, it is the most likely query to hit the tombstone_failure_threshold. Avoid counts on large tables with high delete volume.
