LESSON
Day 255: Performance Bottlenecks - Lock Contention & I/O Wait
Many systems are not slow because they are busy computing. They are slow because too much work spends too long waiting.
Today's "Aha!" Moment
The insight: Once profiling and flame graphs stop pointing at obvious CPU hotspots, the next suspects are often waiting paths: threads blocked on locks, requests parked behind queues, workers stalled on disk or network I/O, or runtimes thrashing around shared state.
Why this matters: Teams often equate performance with CPU efficiency. That is incomplete. A service can look only moderately busy in CPU metrics while users still see terrible latency because work is piling up behind a mutex, a connection pool, a filesystem call, or a synchronous network hop.
The universal pattern: concurrent work competes for a shared resource -> work queues or blocks -> latency rises and throughput flattens -> adding more concurrency often makes the bottleneck worse instead of better.
Concrete anchor: A worker fleet is underperforming. CPU stays at 35%, so everyone assumes more threads will help. Lock profiling reveals that most workers are repeatedly blocked on one shared in-memory map and a database connection pool. The system is not compute-starved. It is coordination-starved.
How to recognize when this applies:
- Latency rises without proportional CPU saturation.
- Throughput plateaus while concurrency keeps increasing.
- Profiles show blocking, off-CPU time, mutex wait, or syscall-heavy paths.
Common misconceptions:
- [INCORRECT] "Low CPU means the service has plenty of headroom."
- [INCORRECT] "If more threads do not help, the code must just be inefficient."
- [CORRECT] The truth: Lock contention and I/O wait often hide behind moderate CPU usage because the lost time is spent waiting, not executing.
Real-world examples:
- Lock contention: Shared caches, global maps, allocators, or logging paths become serialization points under load.
- I/O wait: Disk reads, network calls, DNS lookups, queue brokers, and database round trips dominate request latency even when the application code itself is cheap.
Why This Matters
The problem: Waiting bottlenecks often look deceptively harmless in standard dashboards. CPU is not maxed, memory looks acceptable, and yet the user experience is poor because the system is serialized around one scarce dependency or one shared coordination path.
Before:
- Engineers chase CPU micro-optimizations while requests are really blocked on locks or I/O.
- Adding concurrency increases queue depth and contention instead of throughput.
- The same bottleneck reappears after every "optimization" because the shared resource was never addressed.
After:
- Waiting is treated as a first-class performance cost.
- Bottlenecks are mapped to the contested resource, not just to the visible function.
- Optimization targets resource ownership, queueing, batching, and concurrency shape.
Real-world impact: Understanding lock contention and I/O wait is often the difference between "the system degrades mysteriously under load" and "we know exactly which resource saturates first and why."
Learning Objectives
By the end of this session, you will be able to:
- Explain why waiting bottlenecks matter - Distinguish CPU saturation from time lost on locks, queues, and I/O.
- Describe the common shapes of lock contention and I/O wait - Recognize serialization points, pool exhaustion, blocking syscalls, and queue amplification.
- Choose better mitigation strategies - Reduce waiting by reshaping concurrency, ownership, batching, locality, and dependency behavior instead of blindly adding workers.
Core Concepts Explained
Concept 1: Lock Contention Is Hidden Serialization
Locks exist to protect invariants, but under load they also become throughput governors.
The key mental model is:
- a contended lock turns nominally concurrent work into serialized work
This often happens in places teams underestimate:
- a global cache or registry
- an allocator or logging subsystem
- a hot queue protected by one mutex
- per-process connection bookkeeping
- data structures with coarse-grained locking
Symptoms usually look like this:
- CPU is not fully saturated
- latency rises sharply with concurrency
- throughput flattens or degrades
- lock or mutex profiles show concentrated wait time
The important insight is that lock contention is not just "some overhead." It changes the shape of the system.
If 100 workers all need the same lock regularly, then:
- the useful parallelism of those workers is much smaller than 100
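A back-of-envelope version of that claim: if each worker holds the lock for fraction s of its runtime, the lock can admit at most 1/s workers' worth of progress before it saturates. The helper below (effective_parallelism is our illustrative name, not a standard function) sketches the arithmetic:

```python
# Back-of-envelope serialization bound: a lock held for fraction s of each
# task's time can serve at most 1/s tasks' worth of concurrent progress,
# so workers beyond that just queue at the lock. Numbers are illustrative.

def effective_parallelism(workers, lock_fraction):
    # Parallelism is capped by whichever is smaller: the worker count,
    # or the lock's admission limit of 1/lock_fraction.
    return min(workers, 1.0 / lock_fraction)

for workers in (10, 50, 100, 200):
    print(workers, "->", effective_parallelism(workers, lock_fraction=0.10))
# With the lock held 10% of the time, anything past ~10 workers is wasted.
```

This is why "100 workers" can behave like 10: the other 90 are standing in line.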
This is why the best fixes often are not tiny code tweaks, but structural changes:
- shrink the critical section
- move expensive work outside the lock
- shard the protected state
- replace one hot shared structure with per-thread or per-core ownership
- reduce needless write frequency
The wrong fix is often:
- "add more workers"
because extra workers can simply create a longer line at the same serialized gate.
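The "shard the protected state" fix above can be sketched in Python. CoarseCounter and ShardedCounter are illustrative names of ours, not a library API, and CPython's GIL limits true parallelism here, so this shows the shape of the structural change rather than a benchmark:

```python
import threading

NUM_SHARDS = 16  # illustrative shard count

class CoarseCounter:
    """Every increment contends on one shared mutex: a serialized gate."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def incr(self):
        with self._lock:  # all threads queue here under load
            self._value += 1

    def value(self):
        return self._value

class ShardedCounter:
    """State is split across shards so threads rarely share a lock."""
    def __init__(self):
        self._shards = [[threading.Lock(), 0] for _ in range(NUM_SHARDS)]

    def incr(self):
        # Each thread hashes to one shard, so contention drops sharply.
        shard = self._shards[threading.get_ident() % NUM_SHARDS]
        with shard[0]:
            shard[1] += 1

    def value(self):
        # Reads aggregate across shards, outside the hot path.
        return sum(count for _lock, count in self._shards)

def hammer(counter, n):
    for _ in range(n):
        counter.incr()

def run(counter, threads=8, per_thread=10_000):
    ts = [threading.Thread(target=hammer, args=(counter, per_thread))
          for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter.value()
```

Both counters produce the same total; the difference is that the sharded one removes the single serialized gate, which is exactly the structural change (per-shard ownership) that outperforms tiny in-lock code tweaks.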
Concept 2: I/O Wait Is Latency Imported From Elsewhere
I/O wait means the process is stalled while some external component or kernel-managed resource catches up.
Examples:
- waiting for disk
- waiting for network responses
- waiting for DNS
- waiting for a database or broker
- waiting for the kernel to signal readiness on sockets or files
This matters because application code can be perfectly reasonable while the end-to-end path is still slow.
The practical model is:
- I/O-heavy systems spend much of their time negotiating with external reality
That means the right questions are often:
- how many round trips are we paying?
- how much concurrency can the dependency sustain?
- are we issuing work serially when we could pipeline or batch it?
- are we blocking threads that could have been reused?
This is where different system designs diverge sharply:
- synchronous per-request I/O can be simple but expensive in thread usage
- async or evented designs reduce blocked-thread cost but do not remove dependency latency itself
- batching reduces round trips but can raise tail latency if overused
So the performance problem is often not "our code is slow," but:
- "our code waits too often, too long, or with the wrong concurrency shape"
Concept 3: Queueing Turns Small Delays Into Large Latency
The most important bridge between lock contention and I/O wait is queueing.
A single slow shared resource creates:
- backlog
- head-of-line blocking
- burst amplification
Examples:
- a mutex causes threads to queue in-process
- a connection pool causes requests to queue before a DB call
- a disk or broker causes operations to queue in the kernel or dependency
This is why moderate increases in load can suddenly produce severe latency jumps.
The system is not degrading linearly. It is moving from "most work gets immediate service" to "most work waits in line before service even begins."
That has two important consequences:
- Tail latency often explodes first
- extra concurrency can worsen the queue instead of draining it
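That nonlinearity can be seen in the textbook M/M/1 queueing result: mean time in system is W = 1/(mu - lambda), where mu is the service rate and lambda the arrival rate. A minimal sketch with illustrative numbers:

```python
# Textbook M/M/1 queue: mean time in system W = 1/(mu - lam),
# where mu is service rate and lam is arrival rate (requests/sec).

def mean_time_in_system(service_rate, arrival_rate):
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrivals exceed service capacity")
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # server handles 100 req/s -> 10 ms pure service time
for utilization in (0.50, 0.80, 0.95, 0.99):
    lam = mu * utilization
    w_ms = mean_time_in_system(mu, lam) * 1000
    print(f"utilization {utilization:.0%}: mean latency {w_ms:6.1f} ms")
```

Service time never changes, yet mean latency grows from 20 ms at 50% utilization to a full second at 99%: a "moderate" load increase near saturation produces the severe latency jump described above.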
This links directly to the previous lessons:
- profiling identifies whether the resource question is CPU or waiting
- flame graphs help visualize where cost accumulates
- this lesson explains why the next practical step is often to inspect shared resources, pools, locks, and blocking boundaries
And it also prepares the case-study lesson:
- real optimization work usually combines caching, profiling, flame graphs, and queue/lock analysis instead of treating them as separate subjects
Troubleshooting
Issue: "CPU usage is low, so the system should have plenty of capacity."
Why it happens / is confusing: CPU is the most visible utilization number, so teams over-trust it.
Clarification / Fix: Check mutex contention, pool wait time, off-CPU profiles, queue depth, and blocking syscalls. Low CPU can coexist with severe lock or I/O bottlenecks.
Issue: "We doubled worker count, but throughput barely improved."
Why it happens / is confusing: More concurrency feels like the obvious answer.
Clarification / Fix: Look for a serialized resource: one hot lock, one database pool, one queue, one downstream dependency. More workers may just create a longer queue at the same bottleneck.
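A minimal sketch of that failure mode, using a semaphore as a stand-in for a fixed-size connection pool (the names and timings are illustrative): doubling the workers doubles the line at the pool, not the throughput.

```python
import threading
import time

POOL_SIZE = 2      # hypothetical pool with 2 connections
QUERY_TIME = 0.02  # each query holds a connection for ~20 ms

pool = threading.Semaphore(POOL_SIZE)

def handle_request():
    with pool:  # requests queue here once the pool is exhausted
        time.sleep(QUERY_TIME)

def run(workers):
    threads = [threading.Thread(target=handle_request) for _ in range(workers)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

t8 = run(8)    # ~ (8/2)  * 20 ms of wall time
t16 = run(16)  # ~ (16/2) * 20 ms: twice the workers, twice the wall time
```

Throughput in both runs is capped at roughly POOL_SIZE / QUERY_TIME requests per second; the extra workers only deepen the queue.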
Issue: "The flame graph is not showing a giant CPU hotspot, so we do not have a performance problem."
Why it happens / is confusing: People implicitly assume performance problems must be CPU-shaped.
Clarification / Fix: Use lock, block, or off-CPU profiling views. Waiting problems often dominate latency while barely appearing in standard CPU-centric flame graphs.
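One cheap first check before reaching for off-CPU tooling: compare wall-clock time against CPU time over the same path. A sketch, with a sleep standing in for a blocking dependency call:

```python
import time

def io_bound_handler():
    time.sleep(0.2)      # stands in for a blocking syscall or lock wait
    sum(range(10_000))   # a little actual CPU work

wall_start = time.perf_counter()
cpu_start = time.process_time()
io_bound_handler()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall {wall * 1000:.0f} ms, cpu {cpu * 1000:.1f} ms")
# wall is ~200 ms while cpu is tiny: this handler would be nearly
# invisible in a CPU-centric flame graph, yet it dominates latency.
```

A large wall-versus-CPU gap is exactly the signature that says "look at blocking, locks, and I/O," not at hot code.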
Advanced Connections
Connection 1: Performance Bottlenecks <-> Flame Graphs
The parallel: Flame graphs become much more informative once you know whether the width represents CPU work or waiting work. The same visual idea can expose contention or blocking if the profile source is right.
Real-world case: A CPU flame graph may look modest while an off-CPU flame graph shows huge width under filesystem reads or a single shared mutex path.
Connection 2: Performance Bottlenecks <-> Optimization Case Studies
The parallel: Real optimization work usually succeeds only after identifying which waiting resource dominates first, then choosing a structural fix instead of a cosmetic micro-optimization.
Real-world case: A system that looks "database-bound" may in fact be dominated by pool waits or lock contention in the ORM layer above the database itself.
Resources
Optional Deepening Resources
- [DOCS] Go pprof profile types
- Link: https://pkg.go.dev/runtime/pprof
- Focus: Use it to understand how CPU, heap, block, mutex, and goroutine profiles answer different resource questions.
- [DOCS] Linux futex(2) man page
- Link: https://man7.org/linux/man-pages/man2/futex.2.html
- Focus: Read it to connect high-level lock contention back to the kernel primitive that often underlies thread waiting.
- [DOCS] Python selectors module
- Link: https://docs.python.org/3/library/selectors.html
- Focus: Use it as a simple reference for readiness-based I/O multiplexing and why non-blocking designs reduce wasted thread waiting.
- [ARTICLE] Brendan Gregg: Systems Performance
- Link: https://www.brendangregg.com/perf.html
- Focus: Treat it as a high-signal reference for diagnosing off-CPU time, queueing, blocking, and scheduler effects.
Key Insights
- Waiting is a real performance cost - Time lost on locks, pools, queues, and I/O can dominate latency even when CPU looks fine.
- More concurrency is not always more throughput - Once work queues behind a scarce shared resource, extra workers may just amplify contention.
- The best fixes are usually structural - Sharding ownership, shrinking critical sections, batching I/O, and reshaping concurrency often beat micro-optimizing hot code.