If your application suddenly starts failing with a ProvisionedThroughputExceededException, you are likely hitting a performance wall in Amazon DynamoDB. Even if you have provisioned thousands of Write Capacity Units (WCUs), a single "hot" partition can bring your entire ingestion pipeline to a halt. This happens when your traffic hits one specific partition key too hard, exceeding the per-partition limits of 3,000 Read Capacity Units (RCUs) or 1,000 WCUs.
To resolve this, you must move beyond simple key-value lookups and implement advanced partitioning strategies like write sharding. By distributing high-velocity writes across multiple logical keys, you ensure that no single physical server becomes a bottleneck. In this guide, you will learn how to identify hot keys, implement sharding logic, and configure your table to absorb unpredictable traffic spikes.
TL;DR — Stop using sequential or low-cardinality values as partition keys. Use Write Sharding by appending a random suffix (e.g., PartitionKey_1, PartitionKey_2) to distribute the load. For unpredictable traffic, switch to DynamoDB On-Demand capacity to eliminate manual provisioning errors.
Table of Contents
- Symptoms of a Hot Partition
- What Causes ProvisionedThroughputExceededException?
- How to Fix Hot Partitions and Throttling
- Verifying the Fix with CloudWatch
- How to Prevent Hot Keys in the Design Phase
- Frequently Asked Questions
Symptoms of a Hot Partition
💡 Analogy: The Coffee Shop Bottleneck
Imagine a coffee shop with 10 registers (partitions). If 100 people enter and all of them line up at Register #1 because they want the "Special of the Day," Register #1 will be overwhelmed. The other 9 registers sit idle, even though the shop technically has the "capacity" to serve 100 people. This is a hot partition.
The primary symptom is a 400 Bad Request error in your application logs. When you look at your AWS SDK output, you will see a specific exception message that looks like this:
{
  "__type": "com.amazonaws.dynamodb.v20120810#ProvisionedThroughputExceededException",
  "message": "The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API."
}
However, simply increasing the provisioning level often fails to fix the problem. If you increase your table capacity from 1,000 WCUs to 5,000 WCUs but 90% of your writes still target a single partition key, that specific partition remains capped at its physical limit (1,000 WCUs). You will see your CloudWatch ConsumedWriteCapacityUnits stay well below your ProvisionedWriteCapacityUnits, yet your ThrottledRequests count will continue to climb. This gap between "provisioned" and "actually used" capacity is the definitive proof of a hot partition.
Another symptom is increased latency in your application. Because the AWS SDK automatically retries throttled requests with exponential backoff, your API response times will spike. A request that usually takes 10ms might take 500ms or 2 seconds as the SDK struggles to find a window to push the data through the throttled partition.
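The latency math is easy to see with a toy backoff calculation. This is a sketch only: real SDK retry delays also include jitter and per-service caps, but the doubling pattern is the same.

```javascript
// Sketch: why throttling inflates latency. The SDK retries throttled
// requests with exponential backoff, so each failed attempt roughly
// doubles the wait before the next try. (Illustrative; real SDK
// delays also add randomized jitter.)
function backoffDelaysMs(baseMs, attempts) {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

// Five throttled retries at a 50 ms base add 50+100+200+400+800 = 1550 ms
const totalWaitMs = backoffDelaysMs(50, 5).reduce((a, b) => a + b, 0);
```

That 1.5-second penalty on top of a normally 10ms call is exactly the latency spike described above.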
What Causes ProvisionedThroughputExceededException?
Low Cardinality Partition Keys
The most common cause is choosing a Partition Key (PK) with too few unique values. If you are building an IoT platform and use status (e.g., "ACTIVE" or "INACTIVE") as your PK, every single device update will hit one of those two keys. DynamoDB hashes the PK to determine which physical partition stores the data. Since there are only two possible hashes, all your data will reside on two partitions. As your device count grows, those two partitions will inevitably exceed the 1,000 WCU limit.
Sequential Data Patterns (Hot Keys)
Using a timestamp or a date as a partition key is a classic mistake. If your PK is 2023-10-27, every single write happening today will target the same partition. Tomorrow, that partition will go cold, and a new one will become hot. This "moving hot spot" makes it impossible for DynamoDB's internal adaptive capacity to balance the load effectively, as the traffic pattern shifts faster than the system can re-partition.
Sudden Traffic Spikes
Even with a well-designed key, a sudden burst of traffic—such as a marketing push or a flash sale—can overwhelm a partition. While DynamoDB offers "Adaptive Capacity" to boost a single partition's throughput using the table's unused capacity, this process is not instantaneous. If your traffic goes from 100 to 10,000 requests per second in a single tick, the partition will throttle until the system adjusts. This is especially prevalent in tables using Provisioned Mode rather than On-Demand Mode.
How to Fix Hot Partitions and Throttling
Solution 1: Implement Write Sharding (Random Suffixes)
Write sharding is the most effective way to handle high-velocity ingestion. Instead of writing to a single key like SensorData, you append a random number or a calculated hash to the end of the key. This forces DynamoDB to spread the data across multiple physical partitions.
If you need to write 5,000 items per second to the same logical group, you can shard the key into 10 buckets (0-9). Your application logic would look like this:
// Node.js Example: Write Sharding Logic (AWS SDK v2)
const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB();

const shardCount = 10;
const randomShard = Math.floor(Math.random() * shardCount);
const partitionKey = `SensorData_${randomShard}`;

const params = {
  TableName: "MyTable",
  Item: {
    "PK": { S: partitionKey },
    "Timestamp": { N: Date.now().toString() },
    "Data": { S: "..." }
  }
};

// This distributes writes across 10 different partitions
// (call from within an async function)
await dynamodb.putItem(params).promise();
When you need to read this data back, you will perform 10 parallel queries (one for each shard) and merge the results in your application code. This is a small trade-off in read complexity for a massive gain in write scalability.
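The read-back side can be sketched as follows. The helper names (queryShard) are hypothetical; only the key-building and merge logic is shown runnable here, with the DynamoDB wiring indicated in comments.

```javascript
// Sketch: fan-out read across write shards, then merge in the app.
// Build the full list of shard keys for one logical key.
function shardKeys(baseKey, shardCount) {
  return Array.from({ length: shardCount }, (_, i) => `${baseKey}_${i}`);
}

// Merge the per-shard result arrays and restore time order.
function mergeByTimestamp(resultArrays) {
  return resultArrays.flat().sort((a, b) => a.Timestamp - b.Timestamp);
}

// Against DynamoDB this would be wired up roughly like:
//   const pages = await Promise.all(
//     shardKeys("SensorData", 10).map((pk) => queryShard(pk))
//   );
//   const items = mergeByTimestamp(pages);
// where queryShard is a hypothetical wrapper around dynamodb.query
// that fetches all items for a single PK.
```

Because the 10 queries run in parallel via Promise.all, the added read latency is roughly one query round-trip, not ten.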
Solution 2: Switch to DynamoDB On-Demand Capacity
If your traffic is unpredictable, stop using Provisioned Capacity. Provisioned mode requires you to guess your peak load. On-Demand mode, however, scales instantly to accommodate up to double your previous peak traffic. This eliminates ProvisionedThroughputExceededException caused by table-level limits.
⚠️ Common Mistake: On-Demand does not fix hot partitions. It only fixes table-wide throttling. If you target 5,000 WCUs at a single PK, On-Demand will still throttle you because the physical 1,000 WCU per partition limit is a hardware constraint, not a billing constraint.
Solution 3: Use DynamoDB Accelerator (DAX)
If your hot partition issue is caused by reads (e.g., millions of users reading the same configuration item), use DAX. DAX is a fully managed, highly available, in-memory cache for DynamoDB. It sits in front of your table and intercepts read requests. This prevents read-heavy hot keys from ever hitting the DynamoDB partition level, reducing RCU consumption and latency.
Verifying the Fix with CloudWatch
After implementing write sharding or switching to On-Demand, you must verify that your throughput is healthy. Open the CloudWatch console and look for the following metrics under the DynamoDB namespace:
- ThrottledRequests: This should trend toward zero. If it remains high, your sharding factor (the number of suffixes) might be too low.
- SuccessfulRequestLatency: A healthy table should show average latencies under 10-15ms. If this remains high, your SDK is still retrying due to throttling.
- ConsumedWriteCapacityUnits vs ProvisionedWriteCapacityUnits: In a balanced table, these lines should follow similar patterns. If consumed is low but throttling is high, you still have a hot partition.
You can also use DynamoDB Contributor Insights. This tool provides a real-time graph of the most accessed keys in your table. If you see one specific key dominating 80% of the traffic, you have found your hot key. This is the most direct way to identify exactly which data needs to be sharded.
How to Prevent Hot Keys in the Design Phase
Scalable NoSQL design requires thinking about the physical storage layer. To prevent hot partitions in future projects, follow these three rules:
- High Cardinality: Choose partition keys with thousands or millions of unique values. User IDs, Device IDs, or UUIDs are excellent partition keys.
- Avoid "Monotonic" Keys: Never use a timestamp or a sequence as a PK. If you need to query by time, put the timestamp in the Sort Key (SK) and use a more distributed value for the PK.
- Calculate Shard Requirements: If you know a specific entity will receive 5,000 writes per second, and you know one partition supports 1,000, you must shard that entity into at least 5 buckets. In practice, use 10 buckets to allow for growth.
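The third rule reduces to one line of arithmetic. This sketch targets 800 WCU per shard as a safety buffer below the 1,000 WCU hard limit:

```javascript
// Sketch: sizing the shard count from expected peak writes, targeting
// 800 WCU per shard as a safety buffer below the 1,000 WCU hard limit.
const SAFE_WCU_PER_SHARD = 800;

function requiredShards(peakWritesPerSecond) {
  return Math.ceil(peakWritesPerSecond / SAFE_WCU_PER_SHARD);
}

// 5,000 writes/sec needs at least 7 shards; rounding up to 10 leaves
// room for growth, as recommended above.
```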
📌 Key Takeaways
- Hot partitions occur when one physical partition exceeds 1,000 WCU or 3,000 RCU.
- Increasing provisioned capacity does not fix a single hot partition.
- Write Sharding distributes load by adding a random suffix to your PK.
- Contributor Insights is the best tool for identifying hot keys in real-time.
- Use On-Demand Mode for bursty workloads to avoid manual provisioning headaches.
Frequently Asked Questions
Q. How do I identify a hot partition in DynamoDB?
A. Use CloudWatch Contributor Insights for DynamoDB. It allows you to view the most frequently accessed partition keys and sort keys in your table. If a small number of keys account for the majority of your throttled requests, those are your hot partitions.
Q. Does DynamoDB On-Demand eliminate all throttling?
A. No. On-Demand eliminates throttling at the table level (provisioned capacity), but it does not remove the 1,000 WCU / 3,000 RCU limit per physical partition. You can still experience ProvisionedThroughputExceededException if you target a single key too aggressively.
Q. How many shards should I use for write sharding?
A. Divide your expected peak write throughput by 800 (a safe buffer below the 1,000 WCU limit). For example, if you expect 8,000 writes per second to a single logical group, use at least 10 shards (8000 / 800 = 10).
For more details on advanced NoSQL patterns, check out the Official AWS Partition Key Design Guide or read our related post on Optimizing DynamoDB Query Performance.