Persistent Connections, HOL Blocking, and HTTP/1.1 Limits

Lesson 013 | 25 min | intermediate

The core idea: HTTP/1.1 persistent connections save setup cost by reusing TCP connections, but each connection still carries one ordered response path, so slow responses and bounded pools create head-of-line pressure.

Core Insight

Imagine a product page that loads HTML, CSS, JavaScript, product images, recommendations, inventory, and a cart summary. In a local demo, every request is quick. In production, one large image is slow and one API response streams for several seconds. The browser appears to "stall" even though the origin is healthy. Some requests wait in the browser. Others wait in a backend connection pool. Operators see open sockets, long response times, and clients retrying.

HTTP/1.1 persistent connections are the first reason this is not simply "one request equals one connection." A client can open a TCP connection to an origin and reuse it for multiple HTTP requests. That avoids paying connection setup over and over, and it lets the connection benefit from already-warmed transport state. For HTTPS, it also avoids repeating the expensive parts of TLS setup for every object.

The non-obvious part is the limit that remains. HTTP/1.1 does not multiplex independent response streams over one connection. On a single connection, the response path is ordered. If a long response body is occupying that connection, another request assigned to the same connection cannot receive its response at the same time. That is the HTTP/1.1 version of head-of-line pressure.

The trade-off is reuse efficiency versus blocked work on a limited set of connections. Reuse reduces handshake cost, socket churn, and server overhead. But when each connection has only one active response path, clients and proxies need connection pools, timeouts, and careful retry behavior. Too few connections create waiting. Too many connections create overload and unfairness. The design question becomes: how much concurrency should this client, proxy, or service create for this origin, and what happens when one response is slow?

Why Persistent Connections Exist

In the simplest mental model, a client could open a new TCP connection for every HTTP request:

request A -> open TCP -> send request -> read response -> close
request B -> open TCP -> send request -> read response -> close
request C -> open TCP -> send request -> read response -> close

That works, but it is expensive. Each new connection has setup latency. It consumes client, server, NAT, proxy, and load balancer state. It starts with conservative transport behavior (TCP slow start begins with a small congestion window) before the connection has learned anything about the path. With TLS, the setup also includes cryptographic negotiation.

A persistent connection changes the shape:

open TCP once
-> request A -> response A
-> request B -> response B
-> request C -> response C
-> close later

HTTP/1.1 made persistent connections the default unless a participant says otherwise. Either side can send Connection: close to indicate that the connection will not be reused after the current request/response exchange. Otherwise, a client can expect to send another request on the same connection once the current response is complete.
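
A minimal sketch of reuse with the Python standard library, assuming a loopback server is acceptable for a demo. The names EchoPortHandler and ports_seen_by_server are hypothetical and exist only for this illustration. The server replies with the client's source port, so identical ports across requests indicate that the same TCP connection carried them.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    # Speak HTTP/1.1 so the server keeps the connection open between requests.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # The body is the client's source port, which identifies the TCP connection.
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        # Content-Length frames the response so the connection can be reused.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def ports_seen_by_server(paths):
    """Send every path over ONE persistent connection.

    Returns the client port the server observed for each request.
    """
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoPortHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        ports = []
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # Drain the body fully before sending the next request.
            ports.append(int(response.read()))
        return ports
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

Calling ports_seen_by_server(["/a", "/b", "/c"]) should return the same port three times. Opening a fresh HTTPConnection per request would normally show a different port for each one.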

The win is visible on small objects. A product page may need dozens of assets. Reusing connections avoids a storm of repeated setup work. The server also avoids creating and destroying sockets for every image, stylesheet, or API call. Connection reuse is one reason HTTP/1.1 could support the growing web better than the older one-request-per-connection style.
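
A rough back-of-the-envelope sketch of that win, under stated assumptions: requests are sent one after another, each request costs one round trip, transfer and server time are ignored, and setup costs a fixed number of round trips (for example, one for the TCP handshake plus one for a full TLS 1.3 handshake). The function name and the numbers are illustrative, not measurements.

```python
def sequential_page_time_ms(num_requests, rtt_ms, setup_rtts, persistent):
    """Estimate total latency for small sequential requests.

    setup_rtts: round trips spent on connection setup (assumed, e.g. 2 for
    TCP + TLS 1.3). With a persistent connection, setup is paid once.
    """
    setups = 1 if persistent else num_requests
    return (setups * setup_rtts + num_requests) * rtt_ms


# Hypothetical page: 30 small assets over a 50 ms round-trip path.
without_reuse = sequential_page_time_ms(30, 50, setup_rtts=2, persistent=False)
with_reuse = sequential_page_time_ms(30, 50, setup_rtts=2, persistent=True)
print(without_reuse, with_reuse)  # 4500 1600
```

The model is deliberately crude. Real clients overlap work across several connections, but the shape holds: setup cost scales with the number of connections opened, not with the number of requests.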

But reuse is not free. A reused connection is shared state. It has an idle timeout. It can be closed by either side. A proxy can sit between client and origin with its own pool and timeout. A client might try to reuse a socket just as the server has closed it. A server might keep too many idle connections and run out of file descriptors. The performance feature becomes an operational boundary.

The Ordered Response Path

The main HTTP/1.1 limit is easier to see with a timeline.

Suppose a browser has one connection to www.shop.test and three requests:

R1: GET /product/42.html       small HTML
R2: GET /images/hero.jpg       large image
R3: GET /api/cart-summary      small JSON

Without pipelining, the client usually sends a request, waits for its response, then sends the next request on that same connection:

connection 1:
  send R1 -> receive response 1
  send R2 -> receive response 2
  send R3 -> receive response 3

If R2 is a large or slow response, R3 waits behind it on that connection. The JSON response may be tiny, but it cannot be received before the image response finishes if both are serialized through the same HTTP/1.1 connection.

HTTP/1.1 pipelining tried to improve this by allowing multiple requests to be sent before the earlier responses came back:

connection 1:
  send R1, R2, R3
  receive response 1, response 2, response 3 in order

That reduces request-send waiting, but the responses still have to come back in order. If response 2 is slow, response 3 is stuck behind it even if the server could compute response 3 quickly. This is head-of-line blocking at the HTTP message layer: later work waits because earlier work occupies the ordered line.

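In practice, browser support for HTTP/1.1 pipelining was limited: responses still blocked in order, and intermediaries handled pipelined requests unreliably, so most major browsers left it disabled by default or removed it. Instead, clients opened several connections per origin (commonly around six in modern browsers). That creates parallel lanes: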

connection 1: R1 -> response 1
connection 2: R2 -> response 2, slow body
connection 3: R3 -> response 3

Now the cart summary does not have to wait behind the large image. The cost is more connection state. The client, server, and any proxy or load balancer now have to manage multiple sockets for the same origin.

The important boundary sentence is: HTTP/1.1 connection reuse saves setup work, but it does not turn one connection into many independent response streams.
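
The timelines in this section can be sketched as a small model. It assumes every connection starts at time zero, that responses on one connection are delivered strictly in order, and that the durations (in milliseconds) are made up for illustration. The function name completion_times is hypothetical.

```python
def completion_times(lanes):
    """Return when each response finishes.

    lanes: one list per connection, each an ordered list of
    (request_name, response_duration_ms). Responses on the same
    connection are serialized; separate connections run in parallel.
    """
    finished = {}
    for lane in lanes:
        clock = 0
        for name, duration_ms in lane:
            clock += duration_ms
            finished[name] = clock
    return finished


# One connection: the small cart response waits behind the large image.
one_lane = completion_times([[("R1", 100), ("R2", 4000), ("R3", 100)]])
# Three connections: each response has its own lane.
three_lanes = completion_times([[("R1", 100)], [("R2", 4000)], [("R3", 100)]])
print(one_lane["R3"], three_lanes["R3"])  # 4200 100
```

The slow image takes the same time in both cases. Only the work queued behind it changes.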

Connection Pools Are Concurrency Controls

Browsers, API clients, reverse proxies, and service clients commonly use connection pools. A pool is a bounded set of reusable connections to a target. The pool answers a concrete question: when code wants to send another request, should it reuse an idle connection, wait for one, or open a new one up to a limit?

For a browser, the target is usually an origin. For a backend service, the target might be an upstream host, load balancer, or proxy. In both cases, the pool limit is a concurrency policy.

Consider a service calling an inventory API:

checkout service
  connection pool to inventory-api: max 20 connections

At normal load, requests reuse idle connections. At high load, all 20 connections may be busy. New calls wait in the pool queue. If one upstream response streams slowly or hangs, it occupies one connection for longer. If several hang, the pool can saturate. The checkout service then sees latency even before the inventory API is fully down.
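
A minimal simulation of that saturation, assuming first-come-first-served queueing and that each request holds its connection for a known duration. The function name pool_wait_times and the numbers are hypothetical; a real pool would also apply timeouts.

```python
import heapq


def pool_wait_times(arrivals, max_connections):
    """Return how long each request waits for a connection, in ms.

    arrivals: list of (arrival_ms, hold_ms) sorted by arrival_ms, where
    hold_ms is how long the request occupies a connection.
    """
    # Min-heap of the times at which each pooled connection becomes free.
    free_at = [0] * max_connections
    heapq.heapify(free_at)
    waits = []
    for arrival_ms, hold_ms in arrivals:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival_ms, earliest_free)
        waits.append(start - arrival_ms)
        heapq.heappush(free_at, start + hold_ms)
    return waits


# Two slow responses occupy a pool of two; a fast call arrives 10 ms later.
calls = [(0, 5000), (0, 5000), (10, 20)]
print(pool_wait_times(calls, max_connections=2))  # [0, 0, 4990]
print(pool_wait_times(calls, max_connections=3))  # [0, 0, 0]
```

The fast call needs only 20 ms of connection time, yet in the smaller pool it waits almost five seconds before it is even sent.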

This is not only a client problem. Servers also have limits: maximum connections, worker threads, event-loop capacity, file descriptors, idle connection memory, and load balancer connection tracking. If every client responds to latency by opening many more connections, the system can amplify the incident.

Good pool behavior is explicit:

max connections per target
max idle connections
idle timeout below or aligned with server timeout
request timeout
response header timeout
body read timeout for streaming cases
retry only when method and operation semantics allow it
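
One way to make those settings explicit is a small policy object with a sanity check. This is a sketch, not any particular client library's API: the class name PoolPolicy, its field names, and every default value other than the 20-connection limit from the example above are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPolicy:
    # Illustrative placeholder values, not recommendations.
    max_connections: int = 20
    max_idle_connections: int = 10
    idle_timeout_s: float = 30.0
    request_timeout_s: float = 5.0
    response_header_timeout_s: float = 2.0
    body_read_timeout_s: float = 10.0

    def problems(self, server_idle_timeout_s):
        """Return a list of policy problems; an empty list means none found."""
        found = []
        if self.max_idle_connections > self.max_connections:
            found.append("max idle connections exceeds max connections")
        if self.idle_timeout_s > server_idle_timeout_s:
            # The server may close the socket first, causing stale reuse.
            found.append("client idle timeout outlives the server idle timeout")
        if self.response_header_timeout_s > self.request_timeout_s:
            found.append("header timeout exceeds the overall request timeout")
        return found
```

For example, PoolPolicy(idle_timeout_s=90).problems(server_idle_timeout_s=60) reports that the client would keep sockets idle longer than the server allows.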

The method lessons matter here. Retrying a GET after a stale idle connection fails is often safe. Retrying a non-idempotent POST after an ambiguous timeout can duplicate side effects unless the API has idempotency keys or another safety mechanism. Connection failures and HTTP semantics meet at the retry boundary.
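
The retry boundary can be sketched as a small predicate. The method set follows the idempotent methods defined in RFC 9110; the function name may_retry and the has_idempotency_key flag are hypothetical, and the check assumes the application actually honors the method semantics it advertises.

```python
# Idempotent methods per RFC 9110; this set includes the safe methods.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def may_retry(method, has_idempotency_key=False):
    """Return True when repeating the request should not duplicate side effects."""
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key


print(may_retry("GET"))                             # True
print(may_retry("POST"))                            # False
print(may_retry("POST", has_idempotency_key=True))  # True
```

A production policy would usually add a retry budget and backoff as well, so that retries do not amplify an incident.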

Worked Path: A Product Page Stall

Trace the product page incident.

The page loads these resources:

/product/42.html
/assets/app.css
/assets/app.js
/images/hero.jpg
/api/recommendations?product=42
/api/cart-summary

The browser opens a small pool of HTTP/1.1 connections to www.shop.test:

conn A: /product/42.html -> /assets/app.css
conn B: /assets/app.js
conn C: /images/hero.jpg

The image on connection C is large and slow. Meanwhile JavaScript wants /api/cart-summary. If the browser has already reached its per-origin connection limit, the cart request waits until one connection is available. The page feels slow even though the cart endpoint itself might be fast.

On the server side, a reverse proxy has its own upstream pool:

browser connections
  -> edge proxy
  -> upstream pool to app servers

If the proxy's upstream pool is saturated by slow response bodies, quick API requests can wait behind them before the application even sees the request. The app logs may show normal processing time because the queueing happened in the proxy or client pool. That is why operators need connection-pool and proxy metrics, not just application handler duration.

A useful trace separates the waiting points:

browser queue time
-> connection assigned
-> request sent
-> proxy queue time
-> upstream connection assigned
-> app handler time
-> first byte sent
-> response body duration
-> connection returned to pool
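
A sketch of turning such a trace into per-stage durations, assuming each hop records a timestamp in milliseconds on a shared clock. The event names and values are hypothetical; real traces would come from browser timing data and proxy or client-library instrumentation.

```python
def waiting_breakdown(t):
    """Convert event timestamps (ms) into the durations that matter."""
    return {
        "browser_queue_ms": t["connection_assigned"] - t["request_created"],
        "proxy_queue_ms": t["upstream_assigned"] - t["proxy_received"],
        "handler_ms": t["handler_done"] - t["upstream_assigned"],
        "body_ms": t["body_done"] - t["first_byte"],
    }


def dominant_stage(t):
    """Name the stage where the request spent the most time."""
    breakdown = waiting_breakdown(t)
    return max(breakdown, key=breakdown.get)


# Hypothetical cart-summary request that waited for a browser connection.
trace = {
    "request_created": 0,
    "connection_assigned": 1800,
    "proxy_received": 1810,
    "upstream_assigned": 1815,
    "handler_done": 1840,
    "first_byte": 1845,
    "body_done": 1850,
}
print(dominant_stage(trace))  # browser_queue_ms
```

Here the handler took 25 ms, so application logs alone would look healthy while the user waited almost two seconds.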

The fix is not automatically "increase all connection limits." That may reduce waiting for one client while increasing load on the origin. Better fixes depend on the cause: serve large static assets from a CDN or separate host, tune pool limits, shorten or isolate slow streaming responses, add timeouts, compress or resize assets, or move to HTTP/2 where multiple streams can share one connection with different trade-offs.

The key is to identify which line is blocked. A slow image body occupying a browser connection is different from an upstream pool waiting for app servers. A stale idle socket that fails on reuse is different from a full pool. All of these can look like "HTTP is slow" from the user's chair.

Operational Failure Modes

Failure: unbounded or oversized connection pools. Opening more connections can hide blocking briefly, then overload the server, load balancer, or NAT table. Pool size is a concurrency budget, not just a performance knob.

Failure: stale idle connection reuse. A client may keep a socket idle longer than the server or proxy allows. The next request tries to reuse a connection that the other side has closed. Clients need sensible idle timeouts and retry behavior that respects method safety and idempotency.

Failure: measuring only application handler time. If a request waits in a browser, client library, reverse proxy, or upstream pool before application code runs, app logs understate user-visible latency. Measure queue time, connection acquisition time, time to first byte, and body duration.

Failure: mixing slow streams with latency-sensitive calls. One long download or streaming response can occupy a connection and pool slot. Separate hosts, pools, routes, or protocols may be needed when slow bodies and tiny interactive calls have very different latency needs.

Failure: assuming pipelining solves HTTP/1.1 concurrency. Pipelining can send requests early, but responses are still ordered and many intermediaries historically handled it poorly. It is not the same as multiplexing.

Useful signals include connection reuse ratio, active and idle connection counts, pool wait time, connection errors on first write, idle timeout closes, response time to first byte, response body duration, and per-origin connection limits. When user latency rises, these signals tell you whether the bottleneck is setup churn, pool waiting, slow bodies, or server processing.

Connections

The body lesson showed that a slow response body can last much longer than the server's first decision. This lesson shows where that slow body sits: on a finite connection path that may be reused by other requests.

The next lesson explains how HTTP/2 changes the shape by multiplexing many streams over one connection. That removes one HTTP/1.1 limit, but it introduces new shared-fate and flow-control concerns. The point is not that HTTP/2 is "faster" in the abstract; it changes which queues and blockers you have to inspect.

Close the lesson and reconstruct a slow page load from memory: number of origins, connection limits per origin, which responses have long bodies, where requests wait for a connection, and which metrics would prove the waiting point.

Resources

[RFC] HTTP/1.1 RFC 9112
- Focus: Use it for persistent connection behavior, message ordering, transfer coding, and connection management in HTTP/1.1.
[RFC] HTTP Semantics RFC 9110
- Focus: Review method semantics and why retry decisions depend on safety, idempotency, and application contracts.
[DOC] MDN: Connection management in HTTP/1.x
- Focus: Read it for short-lived connections, persistent connections, pipelining, and browser connection behavior.
[BOOK] High Performance Browser Networking: HTTP/1.X
- Focus: Use it for practical latency, connection reuse, browser limits, and head-of-line blocking intuition.

Key Takeaways

Persistent connections reduce repeated TCP and TLS setup cost, but they introduce reusable connection state that must be timed out, pooled, and observed.
HTTP/1.1 does not multiplex independent response streams on one connection; slow earlier responses can block later work assigned to the same ordered path.
Connection pools are concurrency controls, so pool size, idle timeout, request timeout, and retry policy are part of the HTTP design.
Debugging HTTP/1.1 latency requires separating connection acquisition, proxy queueing, handler time, first byte, and body duration.
