I recently handled a production outage where a Spring Boot 3.2 service running on Java 21 started throwing SQLTransientConnectionException during a 5x traffic spike. The application wasn't crashing because of CPU or memory; it was starving because the HikariCP pool was exhausted, and threads were waiting 30 seconds just to get a connection. We found that the default settings, while safe for development, are a bottleneck for high-concurrency systems.
📋 Tested Environment: Spring Boot 3.2.4, PostgreSQL 16, Kubernetes (4 vCPU nodes)
Key Discovery: Setting minimumIdle equal to maximumPoolSize (fixed-size pool) reduced latency spikes by 12% during scale-up events by eliminating the overhead of dynamic connection creation.
When your logs start filling up with HikariPool-1 - Connection is not available, request timed out after 30000ms, your first instinct is usually to "crank up the pool size." This is often a mistake. A massive pool increases context switching on the database side and can lead to disk I/O saturation. We solved this by analyzing the "Wait Time" vs. "Usage Time" metrics.
The Connection Exhaustion Nightmare: Diagnosing the Bottleneck
In our high-throughput environment, the default maximumPoolSize of 10 was hit within seconds of a deployment. However, simply increasing it to 100 caused our PostgreSQL instance to spike to 100% CPU. We had to find the sweet spot. The error log we saw most frequently was this:
```
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:696)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
```
This tells us the pool is full and the connectionTimeout is too long. In a high-throughput system, if you can't get a connection in 2-3 seconds, waiting 30 seconds just blocks your application threads and leads to a cascading failure. We reduced our connection-timeout to 5000ms to fail fast and trigger circuit breakers earlier. Check out our guide on implementing Resilience4j circuit breakers to handle these failures gracefully.
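As a sketch of the fail-fast wiring, assuming the resilience4j-spring-boot3 starter is on the classpath, a circuit breaker instance (the name `database` here is hypothetical) can be configured to trip on the connection-acquisition exception so that a short connection-timeout feeds directly into the breaker:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      database:                          # hypothetical instance name
        sliding-window-size: 20
        failure-rate-threshold: 50       # open the breaker after 50% failures
        wait-duration-in-open-state: 10s
        record-exceptions:
          - java.sql.SQLTransientConnectionException
```

With the 5000ms Hikari timeout, a pool-exhaustion event now surfaces as fast failures that open the breaker in seconds instead of piling up blocked threads for 30 seconds each.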
Optimizing application.yml for Production Stability
To fix the instability, we moved away from the default dynamic scaling. In production, you want a "fixed-size" pool. This prevents the "thundering herd" problem where HikariCP tries to create 20 connections simultaneously when a burst of traffic hits an idle service. By setting minimum-idle to the same value as maximum-pool-size, the connections are warmed up and ready.
```yaml
spring:
  datasource:
    hikari:
      # Fixed-size pool for high throughput: min-idle equals max
      maximum-pool-size: 25
      minimum-idle: 25
      # Max time a thread waits for a connection (ms)
      connection-timeout: 5000
      # Retire connections before DB-side timeouts cause "Connection is closed" errors
      max-lifetime: 1800000 # 30 minutes
      # Vital for identifying leaked connections in logs (ms)
      leak-detection-threshold: 2000
      # Makes this pool easy to identify in logs and metrics
      pool-name: HighThroughputPool
```
The leak-detection-threshold was the real hero. We found a specific @Transactional method that was making an external API call while holding a DB connection open. The logs flagged it immediately: Apparent connection leak detected for connection.... Never perform long-running network I/O inside a database transaction.
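To see why holding a pooled connection across slow network I/O starves everyone else, here is a minimal, self-contained sketch. It is not HikariCP: the pool is modelled as a `Semaphore` with two permits and a 100 ms acquisition timeout, stand-ins for `maximum-pool-size` and `connection-timeout`, and the external API call is a `sleep`.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionHoldDemo {
    // Two "connections" and a 100 ms connectionTimeout, modelling a tiny pool.
    static final Semaphore pool = new Semaphore(2);

    // Anti-pattern: the slow external call (the sleep) runs while the permit is held.
    static boolean handleRequest(long externalCallMs) throws InterruptedException {
        if (!pool.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            return false; // analogous to SQLTransientConnectionException
        }
        try {
            Thread.sleep(externalCallMs); // network I/O inside the "transaction"
            return true;
        } finally {
            pool.release();
        }
    }

    // Fire 10 concurrent requests that each hold a "connection" for 500 ms.
    public static int successfulRequests() throws Exception {
        ExecutorService ex = Executors.newFixedThreadPool(10);
        AtomicInteger ok = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            ex.submit(() -> {
                try {
                    if (handleRequest(500)) ok.incrementAndGet();
                } catch (InterruptedException ignored) { }
            });
        }
        ex.shutdown();
        ex.awaitTermination(5, TimeUnit.SECONDS);
        return ok.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Successful requests: " + successfulRequests());
    }
}
```

Only the two requests that grab a permit first complete; the rest time out, because the permits are not released until the slow call finishes. Moving the external call outside the held permit lets all ten succeed, which is exactly the fix for the `@Transactional` method above.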
The Invisible Latency Killers: maxLifetime and I/O
We encountered a strange issue where every 30 minutes the application latency would jump. The cause was maxLifetime: with a fixed-size pool, all connections are created at roughly the same moment, so they all come up for retirement within the same narrow window. HikariCP does subtract a small random "jitter" from each connection's expiration to spread the churn out, but it wasn't enough for our load.
We solved this by following HikariCP's own guidance: keep max-lifetime at least 30 seconds shorter than any connection time limit imposed by the database or the infrastructure in between (a proxy, a firewall, a DB-side session timeout). For instance, if your PostgreSQL setup kills connections after 1 hour, a Hikari max-lifetime of 30 minutes gives a comfortable margin. The application then retires connections before the database kills them forcefully, avoiding "Broken Pipe" errors. For further tuning, refer to the official HikariCP sizing guide.
Another factor is the OS level. We saw measurable gains by tuning the TCP keepalive settings in our Linux containers so that half-open connections to the database were detected and cleaned up instead of lingering. You can monitor the pool itself through the Spring Boot Actuator Prometheus endpoint to see exactly how many connections are 'Active' versus 'Pending'.
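As a sketch, assuming spring-boot-starter-actuator and micrometer-registry-prometheus are on the classpath, exposing the Prometheus endpoint makes Micrometer's pool gauges (`hikaricp.connections.active`, `hikaricp.connections.pending`, `hikaricp.connections.idle`) scrapeable:

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
```

A sustained non-zero `hikaricp_connections_pending` is the earliest warning that threads are queuing for connections, well before the 30-second timeouts start appearing in the logs.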
Frequently Asked Questions
Q. How do I calculate the perfect maximumPoolSize?
A. Start with (core_count * 2) + effective_spindle_count, counting the database server's cores. For a 2-core DB with a single SSD (roughly one effective spindle), that works out to about 5. However, if your queries are I/O bound (waiting on disk or network), you may need to go higher. Always test with a load generator like JMeter to find the saturation point where latency starts to climb exponentially.
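The formula above can be wrapped in a tiny helper for use in capacity planning; the method and parameter names here are my own, and `cores`/`spindles` refer to the database host, not the application host:

```java
public class PoolSizing {
    // PostgreSQL wiki heuristic: connections = (core_count * 2) + effective_spindle_count.
    // An SSD behaves roughly like a single spindle for this purpose.
    static int recommendedPoolSize(int dbCores, int effectiveSpindles) {
        return dbCores * 2 + effectiveSpindles;
    }

    public static void main(String[] args) {
        // A 2-core DB server with one SSD.
        System.out.println(recommendedPoolSize(2, 1)); // prints 5
    }
}
```

Treat the result as a starting point for load testing, not a final answer.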
Q. Why is my leak-detection-threshold not showing anything?
A. If it's set to 0 (the default), leak detection is disabled. Set it to 2000ms or 5000ms — a value longer than the longest time any legitimate code path holds a connection — to catch transactions that are held open too long due to inefficient code or nested network calls.
Tuning HikariCP isn't a "set and forget" task. It requires looking at your database's CPU, your application's thread wait time, and your network latency. By moving to a fixed-size pool and aggressive leak detection, we transformed a jittery, error-prone service into a stable high-throughput backbone.