Misbehaving DNS – Dario Solera

Large cloud infrastructures like Amazon EC2 and Microsoft Windows Azure share, at least for certain features, a common trait: DNS is used aggressively to quickly provision and deprovision instances/services/load balancers/endpoints.

The trick is to keep DNS records’ TTL values very low. Azure, for instance, keeps the TTL at 10 seconds, meaning that a client connecting to the service is expected to refresh its cached record every 10 seconds. I’m not entirely sure the DNS system was designed with this kind of usage in mind, but the net result is that some clients, every once in a while, misbehave and send requests to the wrong IP address, simply because they haven’t updated their DNS cache in time.
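To make the failure mode concrete, here is a minimal sketch of a TTL-based cache. It is a toy model, not how any particular resolver or client library is implemented: the class name `TtlDnsCache`, the `honor_ttl` switch, the host `service.example` and the 203.0.113.x addresses are all made up for illustration, and a fake clock stands in for real time.

```python
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Toy DNS cache: keeps an answer until its TTL expires, then re-resolves."""

    def __init__(self,
                 resolve: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic,
                 honor_ttl: bool = True) -> None:
        self._resolve = resolve      # hypothetical resolver: host -> (ip, ttl_seconds)
        self._clock = clock
        self._honor_ttl = honor_ttl  # False models a client that never refreshes
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (ip, expires_at)

    def lookup(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None:
            ip, expires_at = cached
            # A well-behaved client reuses the answer only while it is fresh.
            if not self._honor_ttl or now < expires_at:
                return ip
        ip, ttl = self._resolve(host)
        self._entries[host] = (ip, now + ttl)
        return ip


def simulate_move() -> Tuple[str, str]:
    """Move a service to a new IP; compare a well-behaved and a sticky client."""
    records = {"service.example": ("203.0.113.10", 10.0)}  # made-up record, TTL 10 s
    now = [0.0]  # fake clock, so the example needs no real waiting

    def resolve(host: str) -> Tuple[str, float]:
        return records[host]

    def clock() -> float:
        return now[0]

    good = TtlDnsCache(resolve, clock)
    sticky = TtlDnsCache(resolve, clock, honor_ttl=False)
    good.lookup("service.example")
    sticky.lookup("service.example")

    records["service.example"] = ("203.0.113.20", 10.0)  # provider re-provisions
    now[0] = 11.0                                         # the TTL has expired
    return good.lookup("service.example"), sticky.lookup("service.example")
```

Calling `simulate_move()` returns `("203.0.113.20", "203.0.113.10")`: the client that honors the TTL follows the service to its new address, while the sticky one keeps talking to whoever now owns the old IP.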

In the soon-to-be-discontinued Amanuens system, we treat 404s like any other error: an email is automatically delivered to us, and it includes all the important details:

[Screenshot: 404 error report email]

This report simply means that a client requested a URL from the IP address that serves Amanuens, which is a load balancer in front of two Azure web role instances. The requested resource was not found.

We see this kind of error very often, with various requested URLs, user agents, and so on. Most requests are from bots looking for known software with known vulnerabilities, but some are seemingly legit HTTP requests for resources that could never have been available on our servers. I mean, just look at the host header in the report: it names sciencefriday.com, not Amanuens!
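Spotting these requests automatically comes down to comparing the Host header with the names the server actually answers for. The sketch below is only an illustration, not the code Amanuens runs: `normalize_host`, `is_misdirected` and the served name `amanuens.example` are hypothetical, and sciencefriday.com, the host discussed in this post, is used as the test value.

```python
from typing import Iterable

# Hypothetical list of names this server is supposed to answer for.
SERVED_HOSTS = ("amanuens.example", "www.amanuens.example")


def normalize_host(host_header: str) -> str:
    """Lower-case the Host value and drop any port and trailing dot."""
    host = host_header.strip().lower()
    if host.startswith("["):
        # IPv6 literal such as [::1]:8080: keep everything up to the bracket.
        host = host[: host.find("]") + 1]
    else:
        host = host.split(":", 1)[0]
    return host.rstrip(".")


def is_misdirected(host_header: str,
                   served_hosts: Iterable[str] = SERVED_HOSTS) -> bool:
    """True when the request names a host this server does not serve."""
    served = {normalize_host(h) for h in served_hosts}
    return normalize_host(host_header) not in served
```

With these assumptions, `is_misdirected("www.sciencefriday.com")` is true while `is_misdirected("Amanuens.example:443")` is false, which is enough to separate stray requests from ordinary broken links in the error emails.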

Ignoring for a moment that the requested URL returns a 404 even on the real website, I decided to have a quick look and understand what happened in detail.

Assuming that it wasn’t simply buggy, it’s obvious that the client, when it made the request, held an outdated DNS cache, so at first I assumed that sciencefriday.com was hosted on Azure as well. A quick traceroute revealed that sciencefriday.com points to an Amazon EC2 domain.

[Screenshot: traceroute output for sciencefriday.com]
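The same check can be approximated without traceroute by resolving the name and looking at the reverse (PTR) name of each address. This is a rough sketch using only Python's standard `socket` module; the helper names are made up, it needs network access to return anything, and treating a reverse name ending in `amazonaws.com` as "hosted on AWS" is a heuristic, not a guarantee.

```python
import socket
from typing import List, Tuple


def resolve_with_ptr(host: str) -> List[Tuple[str, str]]:
    """Return (ip, reverse_name) pairs for host; reverse_name is '' without a PTR."""
    _, _, addresses = socket.gethostbyname_ex(host)
    pairs = []
    for ip in addresses:
        try:
            reverse_name = socket.gethostbyaddr(ip)[0]
        except OSError:  # covers socket.herror and socket.gaierror
            reverse_name = ""
        pairs.append((ip, reverse_name))
    return pairs


def looks_like_aws(reverse_name: str) -> bool:
    """Heuristic: the reverse name sits under amazonaws.com."""
    name = reverse_name.lower().rstrip(".")
    return name == "amazonaws.com" or name.endswith(".amazonaws.com")
```

Something like `[looks_like_aws(ptr) for _, ptr in resolve_with_ptr("sciencefriday.com")]` would then show whether the addresses currently behind the name appear to belong to Amazon. The answer reflects DNS at the moment the code runs, not at the time of the stray request.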

My knowledge of DNS is pretty solid, and I think it’s very unlikely that DNS servers returned records that were so wrong.
By the same token, it’s unlikely that an IP address that was previously assigned to Microsoft, used in Azure, and serving Amanuens was later reassigned to Amazon for use in EC2.
Similarly, I don’t think it’s probable that routers across the web failed to route the data to the appropriate destination. Although outdated or corrupted routing tables might cause exactly that, we’d probably have seen more invalid requests, not just one.
Some kind of malware could have hijacked the client machine and re-routed requests, but then why point to a perfectly legit website?

In this case, given the user agent, I suspect it’s a broken client, but in general it’s a very interesting phenomenon that’s very hard to figure out, let alone fix. The fault could be anywhere, and the more I think about it, the more confused I get. Perhaps there’s a trivial explanation that I’m missing.

The scary part is that a request made, or routed, to the wrong server can disclose information to a random third party.