GeoDNS Without the Drama: Lessons from a Four-Node Mesh -- T34ch Tech

We had four nodes. Two in the United States, two in Europe. We needed users to reach the closest one. The obvious answer was GeoDNS -- return different A records based on the client's geographic location. We implemented it with PowerDNS LUA records. It worked. Then it became the single most annoying piece of our infrastructure to maintain.

Six months later we tore it out, replaced it with plain DNS round-robin and a WireGuard mesh with BGP, and our uptime improved. Not because GeoDNS is bad technology. Because GeoDNS at our scale introduced more operational complexity than it solved, and the failure modes were subtle enough to erode our confidence in the system.

This is the story of that transition. It is not an argument against GeoDNS in general. It is an argument for choosing the simplest architecture that meets your actual requirements, and for being honest about what those requirements are.

The Problem GeoDNS Solves

When a user in Paris resolves your domain name, a standard DNS configuration returns the same set of IP addresses regardless of where the query originates. The user's browser connects to one of them. If your servers are in Dallas and Frankfurt, and DNS returns both addresses, the Parisian user has a roughly 50% chance of hitting Dallas and paying 120ms of transatlantic latency on every request.

GeoDNS solves this by looking at the source IP of the DNS query (or, more precisely, the EDNS Client Subnet if present) and returning only the addresses that are geographically close to the client. The user in Paris gets the Frankfurt address. The user in Chicago gets the Dallas address. Latency drops. Everyone is happy.

The concept is straightforward. The implementation is where the complexity lives.

Key term: EDNS Client Subnet (ECS). A DNS extension (RFC 7871) that allows recursive resolvers to include a truncated version of the client's IP address in the query to the authoritative server. Without ECS, the authoritative server only sees the IP of the recursive resolver, which may be in a different country than the end user. Google Public DNS (8.8.8.8) and Cloudflare (1.1.1.1) handle ECS differently -- Google sends it by default, Cloudflare strips it for privacy. This means your GeoDNS decisions are only as good as the subnet information you receive, which varies by resolver.
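A minimal sketch of that dependency, written in Python purely for illustration (our production logic lived inside PowerDNS). The function name geolocation_subject and the sample addresses are hypothetical. The point is only that the authoritative server geolocates the ECS subnet when one is present and otherwise has nothing but the resolver's own address to go on.

```python
import ipaddress
from typing import Optional


def geolocation_subject(resolver_ip: str, ecs_subnet: Optional[str]) -> str:
    """Return the network an authoritative server would geolocate.

    With ECS, that is the truncated client subnet supplied by the resolver.
    Without ECS, only the resolver's own address is visible.
    """
    if ecs_subnet is not None:
        # strict=False tolerates stray host bits; ECS carries a truncated prefix.
        return str(ipaddress.ip_network(ecs_subnet, strict=False))
    # Treat the resolver address as a single-host network.
    return str(ipaddress.ip_network(resolver_ip))


# Documentation-range addresses (RFC 5737), not real resolvers or clients.
print(geolocation_subject("192.0.2.53", "198.51.100.0/24"))  # client subnet
print(geolocation_subject("192.0.2.53", None))               # resolver only
```

When the second case applies, every user behind that resolver is geolocated as if they were sitting in the resolver's data center.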

The LUA Record Approach

PowerDNS supports LUA records, which let you write arbitrary Lua code that executes at query time and returns dynamic DNS responses. This is extremely powerful. It is also extremely easy to shoot yourself with.

Our initial configuration looked something like this:

-- Simplified example of a PowerDNS LUA record
-- Real config had more edge cases

function preresolve(dq)
  -- Networks we treat as "European"
  local eu_nets = newNMG()
  eu_nets:addMask("2001:41d0::/32")  -- OVH EU
  eu_nets:addMask("91.134.0.0/16")   -- OVH EU
  eu_nets:addMask("57.128.0.0/11")   -- OVH EU

  -- dq.remoteaddr is the address the query came from, which is the
  -- recursive resolver and not necessarily the end user
  if eu_nets:match(dq.remoteaddr) then
    dq:addAnswer(pdns.A, "91.134.255.42")   -- eu-01
    dq:addAnswer(pdns.A, "57.129.96.158")   -- eu-02
  else
    -- Everything that is not recognized as EU falls through to the US nodes
    dq:addAnswer(pdns.A, "15.204.242.253")  -- us-01
    dq:addAnswer(pdns.A, "135.148.103.47")  -- us-02
  end
  return true  -- the query has been answered here
end

This worked for the simple two-region case. Then the requirements grew.

We needed health checks -- if a node was down, the LUA record should stop returning its address. So we added a health-check thread that probed each backend and maintained a state table the Lua code could read. Now we had two interacting systems: the health checker and the DNS server, sharing state through a Lua global table protected by a mutex.

We needed to handle the case where ECS was absent. Some resolvers do not send ECS. The query source IP is the resolver itself, not the client. Our MaxMind GeoIP database mapped some resolver IPs to the wrong continent. We added fallback logic: if the GeoIP lookup confidence was below a threshold, return all addresses and let the client sort it out.

We needed to handle IPv6 clients. Our GeoIP database had patchier coverage for IPv6 prefixes. More fallback logic.

We needed to handle the case where an entire region was down. If both EU nodes failed health checks, we should return US addresses to EU clients rather than returning nothing. Cross-region failover logic.

Each requirement was individually reasonable. Collectively, they produced a Lua codebase embedded inside our DNS server that was difficult to test, difficult to reason about, and impossible to observe directly in production. When something went wrong -- when a user in Germany was being routed to Dallas -- the debugging process involved reading Lua code, checking GeoIP database freshness, verifying health check state, and inspecting EDNS Client Subnet propagation. Multiple times, the problem turned out to be a stale GeoIP database that we had forgotten to refresh after a new release.
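To make the accumulated logic concrete, here is a rough Python sketch of the decision the requirements above add up to. It is not our Lua code. The function name select_addresses, the confidence scale, and the 0.8 threshold are all assumptions made for illustration; only the node addresses come from the example earlier in this section.

```python
from typing import Dict, List, Optional, Set

NODES: Dict[str, List[str]] = {
    "eu": ["91.134.255.42", "57.129.96.158"],    # eu-01, eu-02
    "us": ["15.204.242.253", "135.148.103.47"],  # us-01, us-02
}


def select_addresses(region: Optional[str], confidence: float,
                     healthy: Set[str], min_confidence: float = 0.8) -> List[str]:
    """Pick the A records to return for one query.

    region/confidence stand in for a GeoIP lookup result (hypothetical scale),
    healthy is the set of addresses that currently pass health checks.
    """
    all_healthy = [ip for ips in NODES.values() for ip in ips if ip in healthy]

    # Missing ECS, unknown region, or a low-confidence lookup:
    # return every healthy address and let the client sort it out.
    if region not in NODES or confidence < min_confidence:
        return all_healthy

    regional = [ip for ip in NODES[region] if ip in healthy]

    # Cross-region failover: if the whole region is down, fall back to
    # the other region rather than returning nothing.
    return regional or all_healthy


everything = {ip for ips in NODES.values() for ip in ips}
print(select_addresses("eu", 0.95, everything))         # EU nodes only
print(select_addresses("eu", 0.95, set(NODES["us"])))   # EU down, US returned
```

Even in this stripped-down form there are three inputs (GeoIP result, confidence, health state) that each change outside the DNS server and each need to be inspected when a user lands on the wrong continent.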

The GeoIP Database Problem

Every GeoDNS system depends on a GeoIP database that maps IP addresses to locations. MaxMind GeoLite2 is the most common free option. It is updated twice a week. It is not accurate for all prefixes. It is particularly unreliable for:

Anycast addresses. Cloudflare, Google, and other CDN resolvers use anycast. The same IP can be announced from multiple continents. GeoIP databases handle these inconsistently.

Mobile carriers. Many mobile networks use centralized NAT gateways. A user in Berlin may exit through a gateway that GeoIP maps to Amsterdam or even to the carrier's US headquarters.

Corporate VPNs. Users behind a corporate VPN exit from the VPN gateway, which may be in a different country. GeoIP sees the gateway location, not the user's location.

Newly allocated prefixes. RIPE, ARIN, and other RIRs assign new address blocks regularly. There is a lag before GeoIP databases incorporate new assignments.

The net effect is that GeoDNS routing is probabilistic, not deterministic. For most users, it works. For a meaningful minority, it does not. And when it does not work, the failure is silent -- the user just gets higher latency and has no idea why.
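Since stale data was our most frequent culprit, a freshness check is the kind of guard we should have had from day one. The sketch below is a minimal version under stated assumptions: it uses the file's modification time as a proxy for the database's release date, and the seven-day threshold is an arbitrary example value, not a recommendation from MaxMind.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def age_days(mtime: float, now: float) -> float:
    """Age in days of something last modified at `mtime` (epoch seconds)."""
    return (now - mtime) / SECONDS_PER_DAY


def is_stale(path: str, max_age_days: float = 7.0,
             now: Optional[float] = None) -> bool:
    """True if the file at `path` was last modified more than max_age_days ago.

    Assumption: the modification time reflects when the database was last
    refreshed. A restore from backup or a copy can break that assumption.
    """
    current = time.time() if now is None else now
    return age_days(os.path.getmtime(path), current) > max_age_days
```

Wired into monitoring, a check like this turns "the database quietly went stale" from a debugging discovery into an alert.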

Fig. 01 -- GeoDNS decision path and failure points

The full decision tree for a GeoDNS query. Each diamond is a failure point: missing ECS, stale GeoIP data, flaky health checks. The client has no visibility into which path was taken -- they only know if the response was fast or slow.

Why We Moved to Round-Robin Plus BGP

The core realization was that our four-node deployment did not need geographic routing. It needed two things: redundancy and reasonable latency. Those are different problems with different solutions.

Redundancy means that if a node goes down, traffic reaches a healthy one. Geographic routing means that traffic always reaches the closest one. At our scale -- four nodes, two continents, a few thousand daily users -- the difference between "closest" and "any healthy node" was 80-120ms of additional latency for users who hit the wrong continent. That latency was noticeable on first page load but mattered little for the API calls that followed, because we use persistent connections: the TCP and TLS handshake round trips are paid once per connection, not once per request.

The question became: is 80-120ms of additional latency for some users worth the operational cost of maintaining GeoDNS? The answer, for us, was no.

The Replacement Architecture

We replaced the LUA GeoDNS records with plain A and AAAA records that list all four node IPs. DNS round-robin distributes queries roughly evenly. The client connects to whichever address its resolver returns first.

Behind those public IPs, we run HAProxy on each node. HAProxy health-checks the application backends on all four nodes over the WireGuard mesh. If the local backend is healthy, HAProxy serves the request locally. If the local backend is down, HAProxy forwards the request to a healthy peer over WireGuard.

BGP runs over the WireGuard mesh between all nodes using BIRD. Each node announces its own /48 service prefixes. If a node disappears from the mesh, its routes are withdrawn and traffic to its service addresses stops being routed through the mesh. This does not affect the DNS-level routing -- the public IP is still in DNS -- but it means the HAProxy on any surviving node can detect the failure via health checks and stop forwarding to the dead peer.

Key term: BGP anycast vs. BGP overlay. BGP anycast advertises the same IP prefix from multiple locations. Traffic is routed to the nearest announcement by the global routing table. This is how Cloudflare and other CDNs work. A BGP overlay, which is what we use, advertises unique prefixes from each node into a private mesh. Traffic between nodes follows the overlay routes. We are not doing anycast -- we are using BGP for internal service discovery and failover within our private mesh. The public-facing IPs are still unicast, one per node.

Fig. 02 -- Four-node mesh: WireGuard overlay with BGP

The replacement architecture. DNS is dumb -- it returns all four addresses. HAProxy on each node health-checks the full mesh and forwards to a healthy peer if the local backend is down. BGP over WireGuard provides internal routing and failure detection. No GeoIP database, no LUA records, no decision logic in the DNS path.

WireGuard as the Overlay

WireGuard is the right overlay for this deployment because it is stateless, fast, and trivial to configure. Each node has a WireGuard interface with a unique IPv6 address in the fd53:: ULA prefix. Every node has every other node as a peer with AllowedIPs covering the peer's mesh plane prefixes.

We run three mesh planes per node, each on a separate /48 prefix within the WireGuard tunnel:

:1000::/48 -- the resolver plane. Application traffic between backends.

:1100::/48 -- the database plane. PostgreSQL replication and query traffic.

:1200::/48 -- the storage plane. Iroh QUIC replication for file storage.

Separating the planes means we can apply different firewall rules and monitoring to each traffic class. Database traffic never shares a prefix with application traffic. If we need to isolate the storage plane during maintenance, we drop routes for :1200::/48 without affecting the other two planes.
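The addressing scheme is regular enough to compute. The sketch below assumes the layout visible in the BIRD sample later in this article (fd53, then a node ID hextet, then a plane hextet); the helper name plane_prefixes is made up for illustration.

```python
import ipaddress
from typing import Dict

# Plane IDs as listed above: resolver, database, storage.
PLANES: Dict[str, int] = {"resolver": 0x1000, "database": 0x1100, "storage": 0x1200}


def plane_prefixes(node_id: int) -> Dict[str, ipaddress.IPv6Network]:
    """Return the three /48 mesh plane prefixes for one node.

    Assumed layout: fd53:<node id>:<plane id>::/48, so the first three
    hextets are fixed per node and plane and the remaining 80 bits are free.
    """
    prefixes = {}
    for name, plane in PLANES.items():
        value = (0xFD53 << 112) | (node_id << 96) | (plane << 80)
        prefixes[name] = ipaddress.IPv6Network((value, 48))
    return prefixes


node = plane_prefixes(0x0102)
print(node["database"])  # fd53:102:1100::/48
# A resolver-plane address never falls inside the database plane.
print(ipaddress.IPv6Address("fd53:0102:1000::10") in node["database"])  # False
```

Because each plane is its own prefix, a firewall rule or a route withdrawal can target exactly one traffic class per node.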

The WireGuard configuration on each node includes PostUp and PostDown rules that add and remove routes for each peer's three /48 planes. When the WireGuard interface comes up, the routes appear. When it goes down, they disappear. This is the primary routing mechanism -- BGP is a secondary layer that provides convergence and monitoring, not the initial route injection.

Why Not IPsec or OpenVPN

IPsec with IKEv2 would work for the encryption layer, but the configuration complexity is substantially higher. A full-mesh IPsec deployment with four nodes requires twelve tunnel configurations (each pair needs a tunnel in each direction). WireGuard requires four configurations, one per node, each listing three peers.

OpenVPN is not a serious option for this use case. It is userspace, single-threaded, and uses TLS for key exchange, which means it has its own CA and certificate lifecycle to manage. WireGuard is kernel-space, multi-threaded (one thread per CPU), and uses static Curve25519 keys with no CA infrastructure.

BIRD BGP Configuration

BIRD runs on each node with a unique private ASN. The full-mesh BGP topology means each node peers with every other node -- three peer configurations per node and six sessions in total in a four-node mesh (six per node and twenty-one sessions in our current seven-node deployment, but the four-node version is easier to reason about).
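The full-mesh counts are easy to get wrong, so here is the arithmetic spelled out. This is plain combinatorics, not anything specific to BIRD or WireGuard; the function name full_mesh is only a label for the example.

```python
from typing import NamedTuple


class MeshCounts(NamedTuple):
    peers_per_node: int    # peer entries each node has to configure
    sessions: int          # distinct node pairs (one BGP session per pair)
    directed_tunnels: int  # per-direction tunnel configs, as counted for IPsec


def full_mesh(nodes: int) -> MeshCounts:
    """Counts for a full mesh of `nodes` nodes."""
    pairs = nodes * (nodes - 1) // 2
    return MeshCounts(peers_per_node=nodes - 1,
                      sessions=pairs,
                      directed_tunnels=2 * pairs)


print(full_mesh(4))  # four-node mesh
print(full_mesh(7))  # seven-node mesh
```

For four nodes this gives three peers per node, six sessions, and twelve per-direction tunnels; the per-node figure grows linearly, the total quadratically.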

The critical configuration detail is the export filter. Each node advertises only its own static routes -- the three /48 mesh plane prefixes assigned to that node. It does not re-advertise routes learned from peers. This prevents a class of routing failures where BIRD injects learned routes into the kernel with lower metrics than the WireGuard PostUp routes, blackholing traffic.

protocol static {
    ipv6;
    # Only this node's mesh plane prefixes
    route fd53:0102:1000::/48 unreachable;
    route fd53:0102:1100::/48 unreachable;
    route fd53:0102:1200::/48 unreachable;
}

filter export_own {
    if source = RTS_STATIC then accept;
    reject;
}

protocol bgp peer_us01 {
    local as 64513;
    neighbor fd53:0101:4001::1 as 64515;
    ipv6 {
        import all;
        export filter export_own;
    };
}

The static routes are declared as unreachable because BIRD needs something to announce even when the actual host route is on the loopback interface. The real traffic routing is handled by the WireGuard PostUp routes. BIRD's role is to announce reachability to peers, not to provide the actual forwarding path.

The import filter uses local preference to prioritize routes. Same-site peers get local_pref 220, same-region peers get local_pref 180, and cross-region peers get local_pref 120. This means that if a US node needs to reach a mesh plane address, it prefers other US nodes over EU nodes, which is the correct behavior for minimizing latency on internal traffic.
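The tiering can be sketched independently of BIRD syntax (in BIRD itself the value would be assigned to the bgp_local_pref attribute on import). The Node fields and the site labels below are hypothetical; only the three preference values come from our configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    site: str  # hypothetical label for a data center within a region


def local_pref(local: Node, peer: Node) -> int:
    """Local preference assigned to routes learned from `peer`."""
    if peer.region == local.region and peer.site == local.site:
        return 220  # same site
    if peer.region == local.region:
        return 180  # same region, different site
    return 120      # cross-region


def preferred_peer(local: Node, peers: List[Node]) -> Node:
    """BGP prefers the highest local preference."""
    return max(peers, key=lambda peer: local_pref(local, peer))


us01 = Node("us-01", "us", "site-a")
peers = [Node("eu-01", "eu", "site-c"), Node("us-02", "us", "site-b")]
print(preferred_peer(us01, peers).name)  # us-02
```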

The export all Disaster

During initial deployment, we configured one node with export all instead of the filtered export. BIRD dutifully re-announced every route it learned from peers, including the unreachable static routes from other nodes. Those routes, injected into the kernel at metric 32, overrode the WireGuard PostUp routes at metric 1024. The result was that the node's kernel routed traffic for remote mesh planes to the unreachable route instead of through the WireGuard tunnel. All inter-node communication died.

The failure was silent from BGP's perspective -- all sessions were Established, all routes were being exchanged. The failure was only visible at the application layer, where every cross-node request timed out. It took us 40 minutes to identify the cause because we were looking at WireGuard, then at firewall rules, then at application logs, before finally checking the kernel routing table and seeing the metric 32 unreachable routes.

The fix was a one-line change: replace export all with export filter export_own. The lesson was more expensive: never use export all in an overlay network where BGP and another routing source (WireGuard, in our case) coexist in the same kernel routing table.
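The mechanics of the outage fit in a few lines. The model below is deliberately simplified: it only captures that, among routes for the same prefix, the kernel uses the one with the lowest metric. The Route type and the interface name wg0 are illustrative assumptions.

```python
from typing import List, NamedTuple


class Route(NamedTuple):
    prefix: str
    target: str  # outgoing device, or "unreachable"
    metric: int


def kernel_choice(routes: List[Route], prefix: str) -> Route:
    """Among routes for the same prefix, the lowest metric wins."""
    candidates = [route for route in routes if route.prefix == prefix]
    return min(candidates, key=lambda route: route.metric)


table = [
    Route("fd53:102:1000::/48", "wg0", 1024),        # WireGuard PostUp route
    Route("fd53:102:1000::/48", "unreachable", 32),  # re-announced via BGP
]
print(kernel_choice(table, "fd53:102:1000::/48").target)  # unreachable
```

Nothing in this picture is an error from either system's point of view, which is why every BGP session stayed Established while the traffic was being dropped.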

Health Checking and Failover

HAProxy runs on each node and health-checks the application backend on every node in the mesh. The health check is an HTTP GET to the /healthz endpoint on the backend's mesh plane address (the :1000::10 address). If a backend fails three consecutive checks (spaced 5 seconds apart), HAProxy marks it as down and stops routing traffic to it.

The failover path is:

Local backend healthy: HAProxy serves the request from the local backend. Zero additional latency.

Local backend down, remote backend healthy: HAProxy forwards the request over the WireGuard mesh to a healthy peer's backend. Additional latency is the WireGuard hop: 1-2ms intra-region, 80-120ms cross-region.

All backends down: HAProxy returns a 503. This has happened exactly zero times in production. With four nodes, at least one backend has so far always been available.
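The three cases above reduce to a short selection function. This is a sketch of the behavior, not HAProxy configuration, and it makes one assumption the text does not state: that a healthy peer in the same region is preferred over a cross-region one.

```python
from typing import Dict, Optional

REGION: Dict[str, str] = {"us-01": "us", "us-02": "us", "eu-01": "eu", "eu-02": "eu"}


def choose_backend(local: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Return the node whose backend should serve the request.

    None means no backend is available and the caller answers 503.
    """
    # Case 1: the local backend is healthy, serve locally.
    if healthy.get(local, False):
        return local

    # Case 2: forward over the mesh to a healthy peer.
    peers = [name for name in REGION if name != local and healthy.get(name, False)]
    if not peers:
        return None  # Case 3: everything is down.

    # Assumption: prefer a same-region peer (1-2ms) over cross-region (80-120ms).
    same_region = [name for name in peers if REGION[name] == REGION[local]]
    return (same_region or peers)[0]


state = {"us-01": True, "us-02": True, "eu-01": False, "eu-02": True}
print(choose_backend("eu-01", state))  # eu-02
```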

This is simpler than the GeoDNS health check integration. HAProxy health checking is a built-in feature with decades of battle testing. It does not depend on GeoIP databases, Lua code, or DNS propagation delays. A backend goes down; within 15 seconds HAProxy stops sending it traffic. A backend comes back; within 10 seconds HAProxy starts sending it traffic again.
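Those two timings follow from consecutive-check thresholds. The sketch below models that state machine in Python; it is not HAProxy code. The fall count of 3 and the 5-second interval come from the description above, while the rise count of 2 is inferred from the 10-second recovery figure.

```python
CHECK_INTERVAL_SECONDS = 5


class BackendHealth:
    """Consecutive-check health state: `fall` failures mark a backend down,
    `rise` successes mark it up again."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall
        self.rise = rise
        self.up = True
        self.streak = 0  # consecutive results that contradict the current state

    def observe(self, check_ok: bool) -> bool:
        """Record one check result and return the resulting up/down state."""
        if check_ok == self.up:
            self.streak = 0  # result agrees with current state, reset
        else:
            self.streak += 1
            threshold = self.rise if check_ok else self.fall
            if self.streak >= threshold:
                self.up = check_ok
                self.streak = 0
        return self.up


backend = BackendHealth()
history = [backend.observe(ok) for ok in [False, False, False, True, True]]
print(history)  # [True, True, False, False, True]
print(backend.fall * CHECK_INTERVAL_SECONDS, "seconds to mark down")
print(backend.rise * CHECK_INTERVAL_SECONDS, "seconds to mark up")
```

A single failed check never flips the state, which is what keeps one slow response from pulling a node out of rotation.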

Fig. 03 -- Failover path when local backend is down

When the local backend is down, HAProxy forwards over the WireGuard mesh to a healthy peer. Intra-region failover adds 1-2ms. Cross-region failover adds 80-120ms. Either way, the failover is invisible to DNS and happens within seconds, not minutes.

DNS TTL Considerations

With GeoDNS, TTL was a constant tension. Short TTLs (30-60 seconds) meant that GeoIP decisions propagated quickly, but also meant more DNS queries hitting our authoritative servers. Long TTLs (300+ seconds) reduced query load but meant that health-check-driven changes in the LUA records took minutes to reach clients.

With round-robin, TTL does not matter for failover. Failover happens at the HAProxy layer, below DNS. We set our TTL to 300 seconds and stopped worrying about it. If we add or remove a node, the DNS change takes up to five minutes to propagate. That is fine because HAProxy handles the transition immediately. The DNS change is cosmetic -- it removes an IP that HAProxy was already not using.

This is one of the underappreciated benefits of moving intelligence from DNS to the application layer. DNS is a caching system. It is designed for stability, not agility. Every time you put dynamic logic into DNS, you fight the caching layer. Moving the dynamic logic to HAProxy, which does not cache and makes real-time decisions, aligns the architecture with the tools' strengths.

Latency: Before and After

We measured P50 and P95 page load times from synthetic probes in six cities for two months before the migration and two months after.

Before (GeoDNS): P50 was 180ms from European probes and 140ms from US probes. P95 was 420ms from Europe and 310ms from the US. The high P95 was driven by GeoIP misrouting -- about 8% of European queries were being routed to US nodes due to resolver IP misclassification.

After (round-robin + BGP): P50 was 210ms from European probes and 150ms from US probes. P95 was 290ms from Europe and 220ms from the US. P50 increased slightly because some European users now hit US nodes. P95 dropped significantly because the misrouting tail was eliminated -- every user hits a real node, even if it is not the closest one.

The net effect was that median latency got slightly worse, but tail latency got much better. For our use case -- a web application, not a latency-sensitive API -- reducing P95 mattered more than optimizing P50. The users who were being silently misrouted and experiencing 400ms+ loads were the ones most likely to abandon the page. That failure mode no longer exists.

Key term: Tail latency. The latency experienced by the slowest requests (typically measured at P95 or P99). Tail latency often has a larger impact on user experience than median latency because it determines the worst case that real users encounter. A system with P50=100ms and P95=500ms feels worse than a system with P50=150ms and P95=200ms, even though the median is higher in the second case. GeoDNS can improve P50 by routing most users to the closest node, but it can worsen P95 by misrouting a minority of users to the wrong continent.
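For readers who want to compute these numbers from their own probe data, here is a small nearest-rank percentile function. The two sample sets are synthetic, constructed only to reproduce the P50/P95 pairs from the definition above; they are not our measurements.

```python
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds, 20 requests each.
spiky = [100] * 18 + [500] * 2    # fast median, bad tail
steady = [150] * 18 + [200] * 2   # slower median, tight tail

print(percentile(spiky, 50), percentile(spiky, 95))    # 100 500
print(percentile(steady, 50), percentile(steady, 95))  # 150 200
```

In the first set, one request in ten takes five times longer than the median, and that is the experience users remember.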

When GeoDNS Actually Matters

Our experience does not generalize to all deployments. GeoDNS is the right tool when:

You have enough nodes that round-robin produces unacceptable P50. If you have twelve nodes spread evenly across four continents, round-robin means a 75% chance of hitting the wrong continent. That is not a tail latency problem -- that is a baseline latency problem. GeoDNS or anycast becomes necessary.

Your application is latency-critical below 50ms. Trading APIs, real-time multiplayer games, and live video ingest cannot tolerate 120ms of unnecessary latency on any request. For these workloads, every request must hit the closest node, and the complexity of GeoDNS is justified by the requirement.

You have regulatory requirements for data locality. GDPR, data residency laws, and similar regulations may require that European user data stays in Europe. GeoDNS keeps European users on European nodes in the common case; round-robin makes no attempt to. Given the misrouting described earlier, treat GeoDNS as one part of a data-locality design rather than a guarantee on its own. Note that this is a compliance requirement, not a latency requirement -- the solution is the same, but the motivation is different.

You can afford the operational overhead. Running GeoDNS well requires maintaining GeoIP databases, writing and testing Lua or configuration logic, building health check integration, handling ECS edge cases, and monitoring for misrouting. If you have a dedicated infrastructure team, this is tractable. If you are a team of three running everything, it is a tax on every other priority.
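The first criterion in that list is the easiest to quantify. The sketch below computes the chance that plain round-robin sends a client to another continent, assuming every address is equally likely to be picked. The continent labels in the twelve-node example are placeholders.

```python
from fractions import Fraction
from typing import Dict


def wrong_continent_probability(nodes_per_continent: Dict[str, int],
                                client_continent: str) -> Fraction:
    """Probability that uniform round-robin picks a node outside the
    client's own continent."""
    total = sum(nodes_per_continent.values())
    local = nodes_per_continent.get(client_continent, 0)
    return Fraction(total - local, total)


four_nodes = {"us": 2, "eu": 2}
twelve_nodes = {"na": 3, "eu": 3, "asia": 3, "sa": 3}  # hypothetical even split

print(wrong_continent_probability(four_nodes, "eu"))    # 1/2
print(wrong_continent_probability(twelve_nodes, "eu"))  # 3/4
```

With our four nodes, half of all connections cross the Atlantic; with twelve nodes on four continents, three quarters leave the continent, which is where round-robin stops being a reasonable default.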

When It Does Not Matter

For most small-to-medium deployments with fewer than eight nodes, the latency difference between "closest node" and "any healthy node" is smaller than people think. A transatlantic round trip is roughly 80-120ms. For a web application that renders on the client and makes API calls over persistent connections, the multiple round trips of TCP and TLS setup are paid once and amortized across the session; each later request pays only a single extra round trip. The perceived performance difference is negligible for most users.

If your site is primarily serving static content through a CDN, the origin server's geographic location is irrelevant for most requests. The CDN handles locality. GeoDNS on the origin is solving a problem the CDN already solved.

If your deployment is single-region (multiple nodes, one geography), GeoDNS is completely irrelevant. Round-robin with health checking is strictly simpler and equally effective.

The lesson from our four-node mesh is not that GeoDNS is bad. It is that operational complexity has a cost that is easy to underestimate and hard to measure. Every layer of dynamic logic in your infrastructure is a layer that can fail silently, that requires monitoring you probably have not built yet, and that makes debugging harder when something else breaks. Round-robin with a BGP overlay is not as clever as GeoDNS. It is more predictable, easier to debug, and -- for our scale -- produces better outcomes measured by the metric that matters: P95 latency as experienced by real users.

Operational Simplicity as a Design Goal

The deeper lesson here is about choosing simplicity deliberately, not by default. We did not end up on round-robin because we were lazy. We started with GeoDNS because it seemed like the right engineering choice. We moved to round-robin after six months of operating GeoDNS and discovering that the operational cost exceeded the latency benefit.

Simplicity is not the absence of engineering. It is the result of engineering that understands its own constraints. Our constraints were: small team, four nodes, moderate traffic, tolerance for 100ms of additional latency. Given those constraints, the simplest architecture that met our requirements was the correct architecture.

If your constraints are different -- large team, dozens of nodes, latency-critical workloads -- your architecture should be different too. The point is not that round-robin is always right. The point is that you should be able to articulate, in concrete terms, what GeoDNS buys you and what it costs you. If the answer to "what does it buy us" is "30ms of P50 improvement for 60% of users," and the answer to "what does it cost us" is "a Lua codebase inside our DNS server, a GeoIP database update process, custom health check integration, and an extra hour of debugging every time something goes wrong," then you have the information you need to make the decision.

We made ours. Six months later, we have not looked back.

The Checklist

For anyone considering the same migration, here is what we did, in order:

Measure first. Before removing GeoDNS, we ran synthetic probes from multiple geographies for two months. We knew our P50, P95, and misrouting rate before we changed anything. This gave us a baseline to compare against.

Build the mesh first. WireGuard and BGP were deployed and stable for three weeks before we touched DNS. We verified full-mesh connectivity, BGP route exchange, and HAProxy health checking across all nodes. The overlay was production-ready before it carried production traffic.

Add round-robin records alongside GeoDNS. We added a test hostname (test.example.com) with all four IPs in round-robin and pointed our synthetic probes at it. This let us measure round-robin latency in parallel with GeoDNS latency for the production hostname.

Switch DNS in one step. Once we were confident in the measurements, we replaced the LUA record with plain A records. The TTL ensured full propagation within five minutes. HAProxy handled the transition transparently.

Monitor for two weeks. We watched P50, P95, error rates, and HAProxy backend health status for two weeks after the switch. P95 improved within the first day. P50 increased by 10-30ms, as expected. Error rates were unchanged.

Remove the Lua code. Only after confirming stable operation did we remove the LUA records, the GeoIP database update cron job, and the custom health check integration from the DNS server. Removing code felt better than adding it.

From first synthetic probe to final cleanup, the migration took roughly three months, most of it the two-month baseline measurement. The hardest part was not the technical work. It was convincing ourselves that "simpler" was not the same as "worse."