Apache Spark OutOfMemory (OOM) errors during shuffle operations are among the most common causes of pipeline failure in production big data environments. When you see an ExecutorLostFailure or a java.lang.OutOfMemoryError: Java heap space during a wide transformation such as join() or groupBy(), your application is struggling to move data between nodes. This typically happens because the default partition count is too low for your dataset size, or because a few specific partitions have grown too large for the executor's memory to process.
The immediate solution for most Spark 3.x applications is to enable Adaptive Query Execution (AQE) and significantly increase the spark.sql.shuffle.partitions value. By decoupling the static partition count from the actual data volume, you allow Spark to dynamically coalesce or split partitions, preventing the memory saturation that leads to node crashes. Using a value of 200 (the default) is almost never sufficient for production workloads exceeding 100GB of raw data.
TL;DR — Enable Adaptive Query Execution using spark.sql.adaptive.enabled=true and set spark.sql.shuffle.partitions to at least 2x-3x the number of available cores in your cluster to distribute memory load evenly.
Symptoms of Shuffle Memory Exhaustion
💡 Analogy: Imagine a mail sorting facility where all packages for a specific city are forced into a single, small bin. If that city is popular, the bin overflows, the floor gets cluttered, and the entire facility stops working because workers can no longer move. In Spark, that "bin" is a partition in the executor's heap memory.
In Spark 3.5.0 and earlier versions, you will recognize shuffle-related memory issues through specific log signatures. Look for FetchFailedException or Connection reset by peer in your driver logs. These often mask the underlying OOM error: the executor actually died because it exceeded its memory limit while buffering incoming shuffle blocks, and when it crashed, the driver lost the connection, producing the "reset" message. You may also see Exit Code 137, which decodes as 128 plus signal 9 (SIGKILL) and typically means the OS Out-of-Memory Killer or the container manager terminated the process for exceeding its physical memory limit.
Another subtle symptom is excessive "Shuffle Write Time" and "Shuffle Read Time" in the Spark UI. If your tasks are spending 80% of their duration in shuffle phases before eventually failing, you are likely experiencing disk spilling. While Spark tries to spill data to disk when memory is tight, the overhead of serialization and I/O eventually slows the system to a crawl, often leading to heartbeat timeouts that the driver interprets as a node failure.
Root Causes of Spark Shuffle OOM
The primary technical reason for shuffle OOM is Data Skew. Data skew occurs when a specific key (e.g., a "NULL" value or a very common "City ID") contains significantly more records than other keys. During a join or aggregation, Spark sends all records with the same key to the same partition. If one partition receives 5GB of data while others receive 50MB, the executor assigned to that 5GB partition will run out of heap space, regardless of how many executors you have in the cluster.
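To see why skew defeats horizontal scaling, here is a minimal, Spark-free sketch (the `partitionFor` helper and the `records` dataset are illustrative assumptions, not Spark APIs) that models how hash partitioning routes every record with the same key to the same partition:

```scala
// Model Spark's default hash partitioning: every record with the same key
// lands in the same shuffle partition, regardless of cluster size.
val numPartitions = 200

// Mirrors Spark's non-negative modulo: keep the partition id in [0, numPartitions)
def partitionFor(key: String): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Hypothetical skewed dataset: one "hot" key vastly outnumbers the rest
val records = Seq.fill(100000)("NULL") ++ (1 to 1000).map(i => s"city_$i")

val partitionSizes = records.groupBy(partitionFor).map { case (p, rs) => p -> rs.size }
val largest = partitionSizes.values.max

// All 100,000 "NULL" records hash to one partition, so its size dwarfs
// the average of records.size / numPartitions (~505 here)
println(s"largest partition: $largest, average: ${records.size / numPartitions}")
```

Adding executors does not help in this model: the hot partition is indivisible under plain hash partitioning, which is exactly why salting (covered below) rewrites the key itself.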
A second major cause is an Insufficient Shuffle Partition Count. By default, spark.sql.shuffle.partitions is set to 200. If you are processing a 2TB dataset, each of those 200 partitions will be roughly 10GB. Since Spark processes partitions in memory, trying to fit a 10GB partition into a 4GB or 8GB executor heap is impossible. This forces constant disk spilling or an immediate crash. Even without skew, large datasets require a much higher partition count to keep individual partition sizes manageable—ideally between 128MB and 200MB per partition.
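The arithmetic behind this is worth making explicit. A small pure-Scala helper (a sketch for illustration, not a Spark API) shows why the default of 200 partitions breaks down at terabyte scale:

```scala
// Estimate per-partition size for a given shuffle volume and partition count
def partitionSizeMB(totalDataGB: Long, numPartitions: Int): Double =
  totalDataGB * 1024.0 / numPartitions

// 2 TB across the default 200 partitions: ~10,486 MB (about 10.2 GB) each,
// far beyond a typical 4-8 GB executor heap
val defaultSize = partitionSizeMB(2048, 200)

// The same 2 TB across 12,000 partitions: ~175 MB each, inside the
// recommended 128-200 MB target range
val tunedSize = partitionSizeMB(2048, 12000)

println(f"default: $defaultSize%.0f MB, tuned: $tunedSize%.0f MB")
```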
Large Broadcast Joins
Sometimes, the OOM happens on the driver rather than the executor. If you use broadcast() hints on a table that is larger than the spark.sql.autoBroadcastJoinThreshold, or if the driver memory is too low to collect the results of a broadcasted table, the driver will crash. While broadcasting avoids a shuffle, it places the entire dataset into the memory of every single executor and the driver, which is a high-risk operation for medium-to-large dimension tables.
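If the driver is the component that crashes, constrain broadcasting before reaching for more driver memory. A config sketch (the first value shown is Spark's documented default, not a tuned recommendation):

```scala
// spark.sql.autoBroadcastJoinThreshold defaults to 10 MB (10485760 bytes);
// tables estimated above this size are not broadcast automatically
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Set to -1 to disable automatic broadcast joins entirely and force
// shuffle-based joins for every table
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```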
How to Fix Shuffle OOM Errors
To resolve these issues, you must implement a combination of configuration tuning and code-level changes. Start by enabling Adaptive Query Execution (AQE), which is the most powerful feature in modern Spark for handling shuffle issues. AQE can automatically change the number of shuffle partitions at runtime based on the actual size of the data mapped to those partitions.
```scala
// Enable Adaptive Query Execution in Spark 3.x
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// Increase base shuffle partitions for large datasets
spark.conf.set("spark.sql.shuffle.partitions", "1000")
```
If AQE alone doesn't fix the issue, you likely have severe data skew. In this case, use Salting. Salting involves adding a random integer to your join key to break a single large partition into multiple smaller ones. This forces the data to be distributed across more executors, effectively bypassing the memory bottleneck on a single node.
```scala
import org.apache.spark.sql.functions.{array, col, concat, explode, floor, lit, rand}

// Add a random salt column (0-9) to the skewed DataFrame; cast to int so the
// salted key matches the integer salts generated for the lookup table below
val saltedDf = skewedDf
  .withColumn("salt", floor(rand() * 10).cast("int"))
  .withColumn("salted_key", concat(col("original_key"), lit("_"), col("salt")))

// Explode the smaller lookup table so it carries a row for every salt value
val expandedLookupDf = lookupDf
  .withColumn("salt", explode(array((0 until 10).map(lit): _*)))
  .withColumn("salted_key", concat(col("original_key"), lit("_"), col("salt")))

// Perform the join on the salted key
val result = saltedDf.join(expandedLookupDf, "salted_key")
```
⚠️ Common Mistake: Increasing spark.executor.memory without increasing spark.sql.shuffle.partitions. While more memory helps, it doesn't solve the root problem of uneven data distribution. You will eventually hit the same limit as your data scales.
Verifying the Fix in Spark UI
Once you have applied the configurations, monitor the **Stages** tab in the Spark Web UI. Check the "Shuffle Read Size / Records" column for the stage where the OOM occurred. If your fix worked, you should see the maximum partition size significantly reduced. For example, instead of one task reading 4GB and others reading 10MB, you should see a more uniform distribution where most tasks read around 100MB-200MB.
Check the **SQL** tab and click on the specific query's execution plan. Look for a node labeled `AdaptiveSparkPlan`. If AQE is working correctly, the final plan will contain an `AQEShuffleRead` node (called `CustomShuffleReader` in older 3.x releases) marked `coalesced`, showing that partitions were merged at runtime. You should also verify that peak executor memory stays well below roughly 80% of the allocated JVM heap. If memory usage still spikes to 95%+, you may need to reduce spark.memory.fraction to allow more space for user objects versus shuffle data.
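For reference, spark.memory.fraction defaults to 0.6, meaning 60% of the usable heap (the heap minus a roughly 300MB reserve) is shared by execution and storage. Lowering it looks like this (a config sketch; the 0.5 value is illustrative, not a universal recommendation):

```scala
// Default is 0.6. Lowering it shrinks the unified execution/storage region
// and leaves more heap for user data structures during shuffle-heavy stages.
spark.conf.set("spark.memory.fraction", "0.5")
```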
Prevention and Best Practices
To prevent OOM errors in the future, adopt a proactive approach to resource allocation. Always calculate your partition count based on the formula: Shuffle Partitions = (Target Data Size) / (Target Partition Size). If you expect 500GB of shuffle data, and you want 200MB partitions, set your shuffle partitions to 2,500. This ensures that even without AQE, the baseline workload is distributed enough to fit in standard executor memory profiles.
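That formula can be captured in a few lines of plain Scala (a sketch of the arithmetic, not a Spark API; the helper name is made up):

```scala
// Shuffle Partitions = Target Data Size / Target Partition Size, rounded up
def shufflePartitions(shuffleSizeGB: Long, targetPartitionMB: Int = 200): Int =
  math.ceil(shuffleSizeGB * 1024.0 / targetPartitionMB).toInt

// 500 GB at 200 MB per partition -> 2,560 partitions (the article's round
// figure of 2,500 comes from using decimal gigabytes instead of 1024 MB/GB)
println(shufflePartitions(500))

// A tighter 128 MB target raises the count to 4,000
println(shufflePartitions(500, 128))
```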
Additionally, avoid calling repartition() right before a write unless absolutely necessary. While repartition() can fix skew, it triggers a full shuffle of its own. If you only need to reduce the number of partitions to solve a small-files problem, use coalesce() instead, as it merges existing partitions without a full shuffle. When I implemented this for a 50TB pipeline, we reduced cluster cost by 30% simply by switching from static partitioning to AQE with a high minimum-partitions floor.
📌 Key Takeaways:
- Enable AQE (Adaptive Query Execution) for dynamic partition management.
- Set `spark.sql.shuffle.partitions` much higher than the default of 200.
- Use salting to break up skewed keys that cause single-executor crashes.
- Monitor the Spark UI to confirm shuffle partitions stay within the 100MB-200MB range.
Frequently Asked Questions
Q. Why does Spark run out of memory during a join?
A. This usually happens because of data skew or insufficient shuffle partitions. When Spark performs a join, it moves all records with the same key to a single partition on one executor. If that key is over-represented, the partition grows too large for the executor's heap memory, causing an OOM crash.
Q. How many shuffle partitions should I use in Spark?
A. A good rule of thumb is to aim for partition sizes between 128MB and 200MB. For most production workloads, set `spark.sql.shuffle.partitions` to at least 2 to 4 times the number of CPU cores available in your cluster to maximize parallelism and prevent memory bottlenecks.
Q. What is the difference between repartition and coalesce?
A. Repartition creates new partitions and performs a full shuffle, which is expensive but results in evenly sized partitions. Coalesce decreases the number of partitions without a full shuffle by merging existing ones, making it much faster but potentially leading to uneven partition sizes if used aggressively.