If your Apache Cassandra cluster is suddenly throwing ReadTimeoutException or "Scanned over 100000 tombstones" errors, you are likely hitting the tombstone failure threshold. This happens when a node merges data from multiple SSTables during a read and encounters too many "deleted" markers (tombstones) before finding enough live data to satisfy the query. In production environments running Cassandra 4.x or earlier, this often results in p99 latency spikes that can bring your application to a halt. You can mitigate this immediately by triggering manual compactions or, if you run regular repairs, by safely lowering gc_grace_seconds.
TL;DR — Identify offending tables via nodetool tablestats, reduce gc_grace_seconds to accelerate purging, and transition to a "Time Window Compaction Strategy" (TWCS) for time-series data to prevent future buildup.
Symptoms of Tombstone Overload
When tombstones accumulate beyond the tombstone_failure_threshold (default is 100,000 in cassandra.yaml), the node stops the read operation to protect itself from OutOfMemory (OOM) errors. You will see the following trace in your system.log:
ERROR [ReadStage-4] 2023-10-27 10:15:22,123 StorageProxy.java:2345 - Scanned over 100000 tombstones during query 'SELECT * FROM users_keyspace.active_sessions WHERE id = ...' (last scanned row key was ...); query aborted
In your application logs, this manifests as a generic ReadTimeoutException. This is misleading because it isn't necessarily a network issue; it is a computational timeout. When I managed a 50-node cluster, we found that even 1,000 tombstones per read request could increase latencies from 5ms to 250ms, well before the hard failure threshold was reached.
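To catch these aborts before users do, you can scan system.log for the abort message shown above. The following Python sketch extracts the tombstone count and table name from each matching line; the regex is an assumption based on the sample line, since the exact message format varies between Cassandra versions:

```python
import re

# Pattern modeled on the sample system.log line above; treat it as an
# assumption, not an official log format.
ABORT_RE = re.compile(
    r"Scanned over (?P<count>\d+) tombstones during query "
    r"'SELECT .* FROM (?P<table>[\w.]+)"
)

def find_tombstone_aborts(log_lines):
    """Yield (table, tombstone_count) for every aborted read in the log."""
    for line in log_lines:
        m = ABORT_RE.search(line)
        if m:
            yield m.group("table"), int(m.group("count"))

sample = [
    "ERROR [ReadStage-4] 2023-10-27 10:15:22,123 StorageProxy.java:2345 - "
    "Scanned over 100000 tombstones during query "
    "'SELECT * FROM users_keyspace.active_sessions WHERE id = ...'; query aborted"
]
print(list(find_tombstone_aborts(sample)))
# → [('users_keyspace.active_sessions', 100000)]
```

Feeding this into a counter per table gives you a quick ranking of which tables to tackle first.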
The Anatomy of a ReadTimeoutException
Frequent Deletes and Updates
In Cassandra, a DELETE is actually an INSERT of a tombstone. Similarly, setting a column to null or updating a collection (List/Set/Map) creates a tombstone for the old value. If your application logic relies on "Delete before Insert" patterns or frequently clears data, you are creating a dense layer of markers that Cassandra must filter through during every read operation.
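The "a DELETE is a write" behavior is easiest to see in a toy model. This is a deliberately simplified log-structured store, not Cassandra internals: deleting a key appends a tombstone marker rather than removing anything, and every read must still scan past those markers:

```python
TOMBSTONE = object()  # sentinel marking a deleted value

class ToyTable:
    """Toy append-only store: deletes write markers, reads filter them."""
    def __init__(self):
        self.cells = []  # append-only (key, value) log, newest last

    def insert(self, key, value):
        self.cells.append((key, value))

    def delete(self, key):
        self.cells.append((key, TOMBSTONE))  # a DELETE is really a write

    def read(self, key):
        scanned = 0
        latest = None
        for k, v in self.cells:  # newest entry wins, so scan the whole log
            if k == key:
                scanned += 1
                latest = v
        value = None if latest is TOMBSTONE else latest
        return value, scanned  # 'scanned' counts tombstones too

t = ToyTable()
t.insert("session:1", "alice")
t.delete("session:1")
print(t.read("session:1"))  # → (None, 2): the delete is scanned, not skipped
```

Note that the deleted row still costs a scan; multiply that by millions of deletes and the read-path overhead of "Delete before Insert" patterns becomes obvious.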
Compaction Lag
Cassandra removes tombstones only during the compaction process, and only after the gc_grace_seconds period has passed. If your compaction throughput is lower than your write throughput, SSTables accumulate. When a read request spans five different SSTables, and each contains 20,000 tombstones for the same partition key, you hit the 100,000 threshold immediately.
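The arithmetic above is worth making explicit. A minimal sketch of the threshold check, using the default of 100,000 from tombstone_failure_threshold in cassandra.yaml:

```python
TOMBSTONE_FAILURE_THRESHOLD = 100_000  # cassandra.yaml default

def read_aborts(tombstones_per_sstable, threshold=TOMBSTONE_FAILURE_THRESHOLD):
    """True if a read merging these SSTables would trip the failure threshold."""
    return sum(tombstones_per_sstable) >= threshold

# Five SSTables, each holding 20,000 tombstones for the same partition:
print(read_aborts([20_000] * 5))  # → True: the read is aborted
print(read_aborts([20_000] * 4))  # → False: one fewer SSTable stays under it
```

This is why compaction lag matters: reducing the number of overlapping SSTables per read directly reduces the tombstone count the read has to merge.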
Overlapping SSTables
Tombstones cannot be deleted if the data they are supposed to "cover" exists in a different SSTable that is not being compacted at the same time. This is common with SizeTieredCompactionStrategy (STCS), where old data might sit in a large SSTable for weeks, preventing the deletion of tombstones in smaller, newer SSTables.
How to Clear the Tombstone Backlog
Step 1: Lower gc_grace_seconds
The default gc_grace_seconds is 864,000 (10 days). This is designed to ensure that if a node goes down, it has 10 days to recover and receive tombstone information. If you run repairs daily or use a single datacenter, you can safely lower this to 1 or 2 days (86,400 or 172,800 seconds). Use the following CQL command:
ALTER TABLE users_keyspace.active_sessions WITH gc_grace_seconds = 86400;
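Before running the ALTER, sanity-check the new value against your repair schedule: gc_grace_seconds must stay at or above the interval between full repairs, or tombstones may be purged before every replica has seen them. A small helper sketch (the function name and margin parameter are illustrative, not part of any Cassandra tooling):

```python
DAY = 86_400  # seconds

def safe_gc_grace(gc_grace_seconds, repair_interval_seconds, margin=1.0):
    """True if gc_grace covers the repair interval (times a safety margin)."""
    return gc_grace_seconds >= repair_interval_seconds * margin

print(safe_gc_grace(1 * DAY, repair_interval_seconds=1 * DAY))  # daily repairs: OK
print(safe_gc_grace(1 * DAY, repair_interval_seconds=7 * DAY))  # weekly repairs: unsafe
```

With weekly repairs, a 1-day gc_grace_seconds is asking for trouble; keep the grace period longer than your worst-case repair gap.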
Step 2: Trigger User-Defined Compaction
Changing the setting doesn't delete tombstones immediately. You must trigger a compaction to sweep them up. For a quick fix, you can run a "major" compaction, though use this cautiously as it creates one giant SSTable and can cause disk I/O spikes.
nodetool compact users_keyspace active_sessions
Warning: Never set gc_grace_seconds to 0 unless you are absolutely sure you will never need to recover a node from a snapshot or hint. Setting it to 0 can result in "zombie data," where deleted items reappear because the tombstone was purged before all replicas were updated.
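The zombie-data failure mode can be walked through as a timeline. This is a simplified single-replica-outage model, an illustration rather than Cassandra's actual hint and repair machinery:

```python
def zombie_data(delete_time, node_recovery_time, gc_grace=86_400):
    """True if a replica that was down during a delete comes back only
    after the tombstone was purged (delete_time + gc_grace), so it never
    learns about the delete and can resurrect the row."""
    return node_recovery_time > delete_time + gc_grace

# Delete at t=0 with a 1-day gc_grace:
print(zombie_data(delete_time=0, node_recovery_time=2 * 86_400))  # → True: back after purge
print(zombie_data(delete_time=0, node_recovery_time=3_600))       # → False: back within grace
```

In the True case, the recovered replica still holds the pre-delete value, treats it as live, and read repair happily copies it back to the other nodes.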
Verification and Monitoring
After compaction, you should verify that the tombstone-to-live-cell ratio has improved. Use nodetool tablestats to inspect the "Maximum tombstones per slice (last five minutes)" metric. If the number is still high, you may have "shadowed" data that hasn't been compacted yet.
nodetool tablestats users_keyspace.active_sessions | grep -i "tombstones per slice"
An ideal "Average tombstones per slice" should be less than 10. If you see values in the thousands, your read performance will remain degraded. Monitor the org.apache.cassandra.metrics.Table.TombstoneScannedHistogram metric in your dashboard (Grafana/Prometheus) to catch these spikes before they hit the failure threshold.
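If you feed dashboards from scripts rather than eyeballing nodetool output, a small parser helps. This sketch assumes the tablestats output format shown in the sample string; field wording can vary slightly across Cassandra versions:

```python
import re

SAMPLE = """\
Table: active_sessions
    Average tombstones per slice (last five minutes): 8421.0
    Maximum tombstones per slice (last five minutes): 100001
"""

def tombstones_per_slice(tablestats_text):
    """Extract average/maximum tombstones-per-slice from nodetool output."""
    stats = {}
    for line in tablestats_text.splitlines():
        m = re.match(r"\s*(Average|Maximum) tombstones per slice.*:\s*([\d.]+)",
                     line)
        if m:
            stats[m.group(1).lower()] = float(m.group(2))
    return stats

stats = tombstones_per_slice(SAMPLE)
print(stats)  # → {'average': 8421.0, 'maximum': 100001.0}
print("degraded" if stats["average"] > 10 else "healthy")
```

Wiring this into a cron job or exporter gives you the same early warning as the TombstoneScannedHistogram metric, without any JMX plumbing.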
Long-term Prevention Strategies
Use TTL Instead of DELETE
If your data expires (e.g., session tokens, logs), use Time To Live (TTL) at the application level. Cassandra handles TTL-based expiration more efficiently than manual deletes, especially when combined with TimeWindowCompactionStrategy (TWCS).
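The efficiency win with TTLs is that the expiry time is known at write time, which is exactly what lets TWCS drop whole SSTables at once. The cell lifecycle can be sketched as follows (a simplified model; in reality, purging also depends on compaction actually running):

```python
def cell_state(write_time, ttl, gc_grace, now):
    """Classify a TTL'd cell: live -> expired (acts like a tombstone) ->
    purgeable once gc_grace has also elapsed."""
    if now < write_time + ttl:
        return "live"
    if now < write_time + ttl + gc_grace:
        return "expired"      # still counts against tombstone thresholds
    return "purgeable"        # compaction may now drop it entirely

# A session token written at t=0 with a 1-hour TTL and 1-day gc_grace:
print(cell_state(0, ttl=3_600, gc_grace=86_400, now=1_800))    # → live
print(cell_state(0, ttl=3_600, gc_grace=86_400, now=7_200))    # → expired
print(cell_state(0, ttl=3_600, gc_grace=86_400, now=100_000))  # → purgeable
```

Note the middle state: an expired TTL cell still behaves like a tombstone on the read path until gc_grace passes, so TTL alone does not eliminate tombstone pressure; it just makes the cleanup predictable.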
Switch to LeveledCompactionStrategy (LCS)
For workloads with many updates and deletes, LeveledCompactionStrategy is often superior to the default STCS. LCS keeps SSTables small and ensures that 90% of reads can be satisfied by a single SSTable, which limits the number of tombstones the coordinator has to scan.
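Switching strategies is a metadata-only ALTER; Cassandra migrates existing SSTables gradually through subsequent compactions. A hedged example for the table used throughout this post (the users_keyspace.events table and the sstable_size_in_mb value are illustrative assumptions; tune them to your workload):

```sql
-- Update/delete-heavy table: LeveledCompactionStrategy
ALTER TABLE users_keyspace.active_sessions
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};

-- Time-series or TTL-only data: TimeWindowCompactionStrategy
-- (users_keyspace.events is a hypothetical table for illustration)
ALTER TABLE users_keyspace.events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};
```

Keep in mind that LCS trades read amplification for write amplification: expect noticeably higher compaction I/O after the switch.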
- Identify tombstone hotspots using nodetool tablestats.
- Lower gc_grace_seconds if you have a frequent repair schedule.
- Run nodetool compact to force the removal of expired tombstones.
- Avoid data models that require frequent updates to the same row.
Frequently Asked Questions
Q. Can I just increase the tombstone_failure_threshold?
A. While you can increase it in cassandra.yaml, it is rarely recommended. Increasing the threshold allows queries to scan more tombstones, which consumes massive amounts of heap memory and can lead to Stop-the-World GC pauses or node crashes. It hides the symptom rather than fixing the cause.
Q. How does gc_grace_seconds affect data consistency?
A. It defines how long a tombstone is guaranteed to stay in the cluster. If a node is down longer than this period and you haven't run a repair, that node might miss the tombstone. When it comes back online, it will treat the old data as "new" and replicate it back to other nodes.
Q. Why does SELECT COUNT(*) trigger tombstone errors?
A. A count operation must scan every single row and tombstone in the partition or table. Because it touches every marker, it is the most likely query to hit the tombstone_failure_threshold. Avoid counts on large tables with high delete volume.