DNS Resolution for Web Delivery
The core idea: DNS is the first routing decision in many web requests, but it is a cached, delegated answer system, so changing a record updates future lookups, not every client that already learned an old answer.
Core Insight
Imagine the checkout service for shop.test has two public entry points: a primary edge in one provider and a backup edge in another. During an incident, the primary edge starts returning errors. The team changes the DNS record for api.shop.test to point at the backup. Ten minutes later, dashboards look better, but support still receives complaints from users reaching the broken path.
The surprising part is that DNS did update. The authoritative DNS provider is now serving the new answer. Some clients resolve the name again and move to the backup. Other clients, or more precisely the recursive resolvers they use, still have the old answer in cache. The system is behaving according to the contract it was given: "this answer may be reused for its TTL."
DNS looks like a simple phone book only from far away. In web delivery it is closer to a distributed cache in front of routing. A browser asks the operating system. The operating system asks a recursive resolver. The resolver may ask root, top-level domain, and authoritative name servers. Each layer can cache results. The answer may be a direct A or AAAA record, or it may be a chain of CNAME aliases that eventually leads to addresses controlled by a CDN or load balancer.
The trade-off is fast steering versus cache stability. Low TTLs make planned changes and failover converge faster, but they increase DNS query volume and can make clients depend more often on resolver availability. High TTLs reduce lookup load and smooth normal traffic, but they make emergency moves slower. A mature delivery design treats DNS as a time-delayed control plane, not as an instant switch.
The Visible Pieces
A client does not usually ask an authoritative DNS server directly. The usual path is:
browser or app
-> operating system stub resolver
-> recursive resolver
-> root servers
-> TLD servers for .test
-> authoritative servers for shop.test
The stub resolver is the small local component that knows how to ask a recursive resolver. The recursive resolver does the work of chasing the answer. It asks other DNS servers, follows delegations, caches what it learns, and returns an answer to the client. The authoritative server is the source of truth for a zone such as shop.test.
For a hostname, the answer may be direct:
api.shop.test. 300 IN A 198.51.100.10
api.shop.test. 300 IN AAAA 2001:db8::10
Or it may delegate the final delivery decision to another name:
api.shop.test. 300 IN CNAME api.global-edge.example.
api.global-edge.example. 60 IN A 198.51.100.10
That CNAME means "the canonical name is somewhere else; continue resolving there." This shape is common with CDNs and managed load balancers because the product vendor can change the final address without asking the site owner to edit every zone record. It also means there may be several TTLs in the chain. A resolver can cache the alias and the final address separately.
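As a minimal sketch of that chain, the following walks the two example records above and lists the TTL at each link. The dictionary layout and the follow_chain helper are assumptions made for illustration, not a real resolver API.

```python
# Toy zone data mirroring the example records above: name -> (type, ttl, value).
RECORDS = {
    "api.shop.test.": ("CNAME", 300, "api.global-edge.example."),
    "api.global-edge.example.": ("A", 60, "198.51.100.10"),
}

def follow_chain(name, records, max_hops=8):
    """Return the (name, type, ttl, value) steps from name to an address record."""
    steps = []
    for _ in range(max_hops):  # guard against CNAME loops
        rtype, ttl, value = records[name]
        steps.append((name, rtype, ttl, value))
        if rtype != "CNAME":
            return steps
        name = value  # continue resolving at the canonical name
    raise RuntimeError("CNAME chain too long or looping")

chain = follow_chain("api.shop.test.", RECORDS)
for step in chain:
    print(step)

# Each link is cached on its own clock; the shortest TTL is the first to be re-queried.
shortest_ttl = min(ttl for _, _, ttl, _ in chain)
print(shortest_ttl)  # 60
```

Here the vendor-owned address expires after 60 seconds while the site-owned alias lives for 300, so the vendor can move the address much faster than the site owner can move the alias.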
TTL means time to live. It is the duration, in seconds, for which a resolver may reuse an answer before it should ask again. If an authoritative server returns:
api.shop.test. 300 IN A 198.51.100.10
then a recursive resolver can return that same answer from cache for up to about five minutes. When it returns the cached answer, the remaining TTL counts down:
first answer: 300 seconds remaining
after two minutes: 180 seconds remaining
after five minutes: expired, ask again
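The countdown above can be sketched as a small cache with an explicit clock. The TtlCache class is a hypothetical illustration of the idea, not how any particular resolver stores its data; time is passed in as plain seconds so the arithmetic stays visible.

```python
class TtlCache:
    """Minimal TTL cache; `now` is passed in so the countdown is easy to follow."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl, now):
        self._entries[name] = (value, now + ttl)

    def get(self, name, now):
        """Return (value, remaining_ttl), or None if missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        remaining = expires_at - now
        if remaining <= 0:
            del self._entries[name]  # expired: the caller must ask upstream again
            return None
        return value, remaining

cache = TtlCache()
cache.put("api.shop.test.", "198.51.100.10", ttl=300, now=0)
print(cache.get("api.shop.test.", now=0))    # ('198.51.100.10', 300)
print(cache.get("api.shop.test.", now=120))  # ('198.51.100.10', 180)
print(cache.get("api.shop.test.", now=300))  # None
```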
The important boundary is that TTL controls cache reuse after an answer is learned. Lowering the TTL after resolvers already cached a record does not shorten the copies they already hold. If you want a fast planned migration, lower the TTL before the migration and wait for the previous higher TTL to age out.
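The timing of a planned migration is simple arithmetic, sketched here under the assumption that every cache honors the TTL it was given. The one-hour and 60-second values and both function names are hypothetical.

```python
def earliest_cutover(lowered_at, old_ttl):
    """Earliest time (seconds) by which copies cached with the old TTL have aged out."""
    return lowered_at + old_ttl

def worst_case_convergence(cutover_at, new_ttl):
    """Latest time a resolver that cached just before the cutover may serve the old answer."""
    return cutover_at + new_ttl

# Assumed example: the record had TTL 3600 and is lowered to 60 at t=0.
cutover = earliest_cutover(lowered_at=0, old_ttl=3600)
print(cutover)                                      # 3600
print(worst_case_convergence(cutover, new_ttl=60))  # 3660
```

Switching before the old TTL has aged out means some resolvers may still hold a copy that lasts up to an hour, regardless of the new, lower TTL.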
What Happens During Resolution
Trace one fresh lookup for api.shop.test when nothing is cached:
1. Browser needs https://api.shop.test/pay.
2. OS stub asks recursive resolver: "address for api.shop.test?"
3. Recursive resolver asks a root server: "who handles .test?"
4. Root points it to .test TLD servers.
5. Resolver asks a .test TLD server: "who handles shop.test?"
6. TLD points it to authoritative servers for shop.test.
7. Resolver asks the authoritative server: "address for api.shop.test?"
8. Authoritative server returns address records, or a CNAME that the resolver then follows to reach them.
9. Resolver caches the records and returns the answer to the client.
10. Client connects to the returned address and starts TLS/HTTP.
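Steps 3 through 8 can be sketched as a walk over in-memory referrals. This is a toy model: the server names, the dictionary layout, and the direct A record at the end are assumptions for illustration, and no network traffic is involved.

```python
# Hypothetical in-memory "servers"; names and layout are illustrative only.
SERVERS = {
    "root": {"delegations": {"test.": "tld-test"}, "records": {}},
    "tld-test": {"delegations": {"shop.test.": "auth-shop"}, "records": {}},
    "auth-shop": {
        "delegations": {},
        "records": {"api.shop.test.": ("A", 300, "198.51.100.10")},
    },
}

def resolve(name, servers, start="root", max_hops=8):
    """Walk delegations from the root; return (record, servers_asked)."""
    server, asked = start, []
    for _ in range(max_hops):
        asked.append(server)
        data = servers[server]
        if name in data["records"]:
            return data["records"][name], asked  # authoritative answer
        # Otherwise follow the delegation whose zone is a suffix of the name.
        referral = next(
            (nxt for zone, nxt in data["delegations"].items()
             if name == zone or name.endswith("." + zone)),
            None,
        )
        if referral is None:
            raise LookupError(f"{server} has no answer or referral for {name}")
        server = referral
    raise LookupError("too many referrals")

record, asked = resolve("api.shop.test.", SERVERS)
print(asked)   # ['root', 'tld-test', 'auth-shop']
print(record)  # ('A', 300, '198.51.100.10')
```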
DNS has not delivered the HTTP request. It only gave the client a starting address. The HTTPS edge from the previous lessons still has to present the right certificate, negotiate protocol, and forward the request through proxies. DNS chooses where the client begins that path.
Now replay the same lookup while the resolver has a cached answer:
1. Browser needs https://api.shop.test/pay.
2. OS stub asks recursive resolver.
3. Recursive resolver finds api.shop.test in cache.
4. Resolver returns the cached address with remaining TTL.
5. Client connects to that address.
The authoritative server is not consulted. That is the point of caching. It reduces latency and load. It also means the authoritative view and the client-visible view can differ for a while.
There is another cache that is easy to forget: the local machine or application runtime. Browsers, operating systems, JVMs, mobile SDKs, and service clients may keep DNS answers briefly. Some respect DNS TTLs closely; some apply their own caps or floors. The practical conclusion is not "TTL is useless." It is "TTL is a convergence hint with multiple caches in the path."
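A cache that applies its own floor or cap effectively clamps the record's TTL. The sketch below shows that clamp; the function name and the example limits are assumptions, and real runtimes differ in whether and how they do this.

```python
def effective_ttl(record_ttl, floor=0, cap=None):
    """TTL a local cache might actually use after applying its own limits."""
    ttl = max(record_ttl, floor)  # a floor keeps very short answers longer
    if cap is not None:
        ttl = min(ttl, cap)       # a cap forces long answers to refresh sooner
    return ttl

print(effective_ttl(5, floor=30))   # 30: held longer than the record asked for
print(effective_ttl(3600, cap=60))  # 60: refreshed sooner than the record allows
print(effective_ttl(300))           # 300: TTL respected as published
```

A floor is the case that hurts failover: it extends the window in which a client may keep using an old address.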
Worked Path: The Failover That Half Worked
The shop team starts with this record:
api.shop.test. 300 IN CNAME primary.edge.example.
primary.edge.example. 300 IN A 198.51.100.10
At 12:00, a recursive resolver used by many customers asks for api.shop.test and caches the result. At 12:02, the primary edge starts failing. At 12:03, operators change the authoritative record:
api.shop.test. 300 IN CNAME backup.edge.example.
backup.edge.example. 60 IN A 203.0.113.20
The authoritative DNS system now says backup. A resolver that did not recently ask for api.shop.test will get the new answer. But the resolver that cached the old answer at 12:00 can keep returning it until roughly 12:05. Users behind that resolver may continue to hit 198.51.100.10.
The timeline looks like this:
12:00 resolver A caches primary answer, TTL 300
12:02 primary edge fails
12:03 authoritative DNS changed to backup
12:03 resolver B has no cache, asks now, gets backup
12:04 resolver A still has 60 seconds left, returns primary
12:05 resolver A cache expires, asks again, gets backup
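The timeline can be replayed as a small simulation. It is a sketch that models only the api.shop.test alias with TTL 300 and ignores the separate TTL on the address record; the Resolver class and the seconds-since-12:00 clock are assumptions for illustration.

```python
CHANGE_AT = 180  # 12:03, in seconds after 12:00
TTL = 300        # TTL on the api.shop.test record

def authoritative(now):
    """What the authoritative server answers at time `now`."""
    return "backup" if now >= CHANGE_AT else "primary"

class Resolver:
    """Caches one answer and reuses it until its TTL runs out."""

    def __init__(self):
        self.answer = None
        self.expires_at = 0

    def lookup(self, now):
        if self.answer is None or now >= self.expires_at:
            self.answer = authoritative(now)  # cache miss: ask the source of truth
            self.expires_at = now + TTL
        return self.answer, self.expires_at - now

resolver_a, resolver_b = Resolver(), Resolver()
resolver_a.lookup(0)  # 12:00: resolver A caches the primary answer

timeline = []
for minute in (3, 4, 5):
    now = minute * 60
    timeline.append((f"12:0{minute}", resolver_a.lookup(now), resolver_b.lookup(now)))

for row in timeline:
    print(row)
# ('12:03', ('primary', 120), ('backup', 300))
# ('12:04', ('primary', 60), ('backup', 240))
# ('12:05', ('backup', 300), ('backup', 180))
```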
If the old TTL had been one hour, the incident would last much longer for some clients. If the TTL had already been 30 seconds before the incident, convergence would be faster, but resolvers would re-query more often in normal operation, sending more DNS traffic to the authoritative provider and making lookups more sensitive to any DNS outage.
Now add a second problem. Someone accidentally deletes the api.shop.test record for a minute, so resolvers receive a negative answer such as NXDOMAIN ("no such name"). Negative answers can also be cached, for a duration set by the zone's SOA record (RFC 2308). After the record is restored, some clients may still see the name as missing until that negative cache entry expires.
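Under RFC 2308, the negative caching TTL is the smaller of the SOA record's own TTL and the SOA MINIMUM field. The sketch below applies that rule; the SOA values are hypothetical and the helper names are assumptions. Resolvers may also apply their own upper limit, which this sketch ignores.

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum):
    """Negative caching TTL per RFC 2308: min(SOA TTL, SOA MINIMUM)."""
    return min(soa_record_ttl, soa_minimum)

def missing_until(restored_at, negative_ttl):
    """Upper bound: a resolver that cached the miss just before the fix."""
    return restored_at + negative_ttl

# Assumed zone: SOA TTL 3600, SOA MINIMUM 900; record deleted at t=0, restored at t=60.
neg_ttl = negative_cache_ttl(soa_record_ttl=3600, soa_minimum=900)
print(neg_ttl)                                              # 900
print(missing_until(restored_at=60, negative_ttl=neg_ttl))  # 960
```

With these assumed values, a one-minute mistake can stay visible for up to sixteen minutes on some resolvers.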
That is why DNS incident response has two questions, not one:
What does the authoritative server answer now?
What are recursive resolvers and clients still allowed to remember?
Both answers matter. Looking only at the authoritative provider can make the system look fixed before users have converged.
Steering, Aliases, and What DNS Cannot Know
DNS steering can choose different answers for different resolvers. A managed DNS service might use health checks, geography, latency measurements, weighted records, or failover policy. A CDN may return an edge address close to the resolver. This is useful because DNS happens before connection setup and can keep clients away from obviously bad regions.
The limit is that DNS usually sees the recursive resolver, not the exact end user. If many users share a resolver, the DNS service may steer them based on resolver location. EDNS Client Subnet can expose part of the client network to improve locality, but it has privacy and cache-efficiency trade-offs, and it is not universal. For most application design, the safe model is: DNS can steer groups approximately, not route each HTTP request with full context.
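A minimal sketch of resolver-based steering makes the limit concrete. The region labels, the EDGES map, and the steer function are hypothetical; a real managed DNS service has its own policy model.

```python
# Hypothetical edge map: region -> address (addresses reuse the lesson's examples).
EDGES = {"eu": "198.51.100.10", "us": "203.0.113.20"}

def steer(resolver_region, healthy_regions, edges=EDGES):
    """Choose an answer from what the DNS service sees: the resolver's region."""
    if resolver_region in edges and resolver_region in healthy_regions:
        return edges[resolver_region]
    for region in sorted(edges):  # deterministic fallback order
        if region in healthy_regions:
            return edges[region]
    raise LookupError("no healthy edge to return")

# Two users in different places share one resolver located in "us".
users = [
    {"user_region": "eu", "resolver_region": "us"},
    {"user_region": "us", "resolver_region": "us"},
]
answers = [steer(u["resolver_region"], {"eu", "us"}) for u in users]
print(answers)  # ['203.0.113.20', '203.0.113.20']
```

The user in "eu" is sent to the "us" edge because the policy never sees user_region, only the resolver's location.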
DNS also cannot know whether a specific future HTTP operation will succeed. A health-aware DNS policy may know that a region's edge is reachable. It may not know that /pay is failing because a database dependency behind that edge is broken. That is why DNS failover should be paired with load-balancer health checks, HTTP-level telemetry, and application readiness signals.
Multiple A records are not a complete failover strategy either:
api.shop.test. 300 IN A 198.51.100.10
api.shop.test. 300 IN A 203.0.113.20
Some clients may try one address and then another. Some may stick to the first returned address. Some resolvers rotate order. Some runtimes cache their chosen result. If one address is broken, "there is another address in DNS" does not guarantee smooth failover. Test the client behavior that matters to your system.
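One client behavior worth testing is "try each address in order until one connects." The sketch below shows that behavior with an injected connect function so it runs without a network; connect_first_reachable and fake_connect are hypothetical names, and many real clients do not behave this way.

```python
def connect_first_reachable(addresses, connect):
    """Try each address in order; return (address, connection) for the first success."""
    errors = []
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as exc:  # refused, unreachable, timed out, ...
            errors.append((address, str(exc)))
    raise ConnectionError(f"all addresses failed: {errors}")

def fake_connect(address):
    """Stand-in connector: the primary address is broken, the backup works."""
    if address == "198.51.100.10":
        raise OSError("connection refused")
    return f"connected:{address}"

result = connect_first_reachable(["198.51.100.10", "203.0.113.20"], fake_connect)
print(result)  # ('203.0.113.20', 'connected:203.0.113.20')
```

In real use, connect could wrap socket.create_connection with a short timeout. A client that skips this loop and keeps only the first address would fail outright in the same scenario.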
Operational Failure Modes
Failure: changing DNS during an incident and expecting instant movement. Existing resolver caches can keep old answers until TTL expiry. For planned migrations, lower TTL ahead of time and wait for the old TTL window before switching.
Failure: measuring only authoritative DNS. dig @authoritative-server api.shop.test shows what the source of truth says now. It does not show what large public resolvers, enterprise resolvers, mobile networks, or application runtimes still have cached.
Failure: CNAME chains with mismatched ownership. A site-owned name may point at a vendor-owned name. The final answer's TTL, health, and behavior may be controlled elsewhere. During incidents, know which team or provider owns each link in the chain.
Failure: negative caching after a bad delete. A short accidental NXDOMAIN can live longer than the edit itself. Monitor missing-name responses and know the negative caching policy of critical zones.
Failure: assuming DNS equals load balancing. DNS can choose a starting address. It cannot replace request-level routing, proxy health, retries, circuit breaking, or observability at the HTTP edge.
Useful signals include authoritative answers, recursive resolver answers from several networks, remaining TTL, CNAME chain, A and AAAA differences, NXDOMAIN or SERVFAIL rates, DNS query latency, health-check state behind DNS policy, connection success by returned address, and HTTP errors grouped by edge or region.
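The first two signals can be compared mechanically once the answers are collected. The sketch below assumes the answers were already gathered, for example with dig against each resolver; the vantage-point names and the stale_vantage_points helper are hypothetical.

```python
def stale_vantage_points(authoritative, observed):
    """Return vantage points whose answer set differs from the authoritative one."""
    expected = set(authoritative)
    return {
        name: sorted(answers)
        for name, answers in observed.items()
        if set(answers) != expected
    }

# Assumed observations shortly after switching to the backup address.
authoritative_answer = ["203.0.113.20"]
observed = {
    "public-resolver-1": ["203.0.113.20"],
    "office-resolver": ["198.51.100.10"],  # still serving the cached primary
    "mobile-network": ["203.0.113.20"],
}
stale = stale_vantage_points(authoritative_answer, observed)
print(stale)  # {'office-resolver': ['198.51.100.10']}
```

A mismatch is not necessarily a fault: it may simply be a cached answer inside its TTL, which is why remaining TTL belongs next to each observation.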
Readiness Check
Close the lesson and reason through one hostname you care about:
hostname:
authoritative zone owner:
record type:
CNAME chain:
TTL at each step:
normal target:
backup target:
who can change it:
which resolvers you would test:
what HTTP signal proves users moved:
Then answer this without rereading: if you changed the authoritative answer now, which users could still reach the old target, and for how long? If the answer is "I do not know," the next action is not a bigger DNS dashboard. It is to trace the actual lookup path and cache boundaries.
Connections
The previous lesson covered proxies and header trust after a request reaches your edge. DNS sits before that path: it chooses which edge address the client tries first.
The next lesson on CDNs builds on the same idea. CDN behavior often starts with DNS steering, then continues with edge cache keys, origin shielding, and request-level policy. DNS gets the client to an edge; CDN policy decides what happens once the request arrives there.
Resources
- [RFC] Domain Names: Concepts and Facilities RFC 1034
- Focus: Use it for the delegation model, resolvers, authoritative servers, and the basic DNS architecture.
- [RFC] Domain Names: Implementation and Specification RFC 1035
- Focus: Use it for record formats, query behavior, and the wire-level model behind lookups.
- [RFC] Negative Caching of DNS Queries RFC 2308
- Focus: Read it to understand why missing-name answers can persist after a bad change is fixed.
- [RFC] DNS Terminology RFC 8499 (obsoleted by RFC 9499, which carries the same title)
- Focus: Use it as a glossary for stub resolvers, recursive resolvers, authoritative servers, zones, and caching terms.
- [DOC] Google Public DNS: Performance Benefits and Caching
- Focus: Use it for a practical view of recursive resolver caching and lookup latency.
Key Takeaways
- DNS resolution is a delegated, cached lookup path; the authoritative answer is only one part of what users experience.
- TTL controls how long resolvers may reuse an answer after learning it. Lowering a TTL later does not recall already cached answers.
- DNS failover moves future lookups faster when TTLs and resolver behavior allow it, but it is not an instant switch for every client.
- Debug DNS delivery by comparing authoritative answers, recursive resolver caches, remaining TTLs, CNAME chains, and HTTP outcomes by returned address.