For the last few weeks I have been encountering a strange problem with making IP WHOIS queries against the RIPE database, which covers all European IPs.
I first encountered the problem during a routine server upgrade and reboot. Suddenly some of the software we run on these servers started producing errors saying that WHOIS lookups could not be performed.
After some investigation it transpired that only IP WHOIS lookups against the RIPE database were failing. What is more, it was only happening on a couple of our servers, even though they all sit behind the same shared firewall and are source NATed to the same public IP.
As time went by I upgraded more servers, and each time the newly upgraded server also started exhibiting the behaviour. Naturally my first thought was that something in the kernel upgrades that had warranted the reboot was to blame.
I began to downgrade some of the servers to their previous kernel versions, but this did not fix the issue. Stranger still, some of the servers running the new kernels started working again, but only intermittently!
Break out TCPDUMP
To understand what was going on, I ran tcpdump on the firewall server to compare the traffic from a working server with the traffic from a non-working one.
The results of a working server looked like this:
13:14:24.291132 IP x.x.x.x.40474 > 18.104.22.168.43: S 1723346221:1723346221(0) win 5840 mss 1460,sackOK,timestamp 3097608306 0,nop,wscale 4
The results of a non-working server looked like this:
12:58:26.886531 IP x.x.x.x.47159 > 22.214.171.124.43: S 9443771:9443771(0) win 5840 mss 1460,sackOK,timestamp 2068177 0,nop,wscale 4
Initially the packets looked the same, with nothing obviously wrong.
The only difference was that the timestamp in the packet from the newly rebooted, non-working server was much lower than the one from the server that had been running for months and could perform WHOIS lookups fine.
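To make that comparison less error-prone than reading the captures by eye, here is a minimal Python sketch that pulls the timestamp value out of tcpdump lines in the format shown above. The function name extract_tsval and the regular expression are my own illustration, and it assumes the option is always printed as "timestamp <value> <echo>".

```python
import re

# Matches the TCP timestamp option as printed in the captures above:
# "timestamp <TSval> <TSecr>". Assumes this exact tcpdump output format.
TIMESTAMP_RE = re.compile(r"timestamp (\d+) (\d+)")


def extract_tsval(line):
    """Return the sender's timestamp value from one tcpdump line,
    or None if the line carries no timestamp option."""
    match = TIMESTAMP_RE.search(line)
    return int(match.group(1)) if match else None


working = ("13:14:24.291132 IP x.x.x.x.40474 > 18.104.22.168.43: "
           "S 1723346221:1723346221(0) win 5840 "
           "mss 1460,sackOK,timestamp 3097608306 0,nop,wscale 4")
broken = ("12:58:26.886531 IP x.x.x.x.47159 > 22.214.171.124.43: "
          "S 9443771:9443771(0) win 5840 "
          "mss 1460,sackOK,timestamp 2068177 0,nop,wscale 4")

print(extract_tsval(working))  # 3097608306
print(extract_tsval(broken))   # 2068177
```

Run against the two captures, it shows the long-running server sending 3097608306 and the freshly rebooted one sending 2068177.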
Surely this is perfectly acceptable, even behind NAT, because TCP connections use packet sequence numbers, not timestamps, to order packets? If this wasn't the case, surely NAT would break things all the time?
As it turns out (after much searching), there is an extension to TCP called PAWS (Protection Against Wrapped Sequences), which uses TCP timestamps to prevent older packets from the same connection interfering with current TCP communication on high-bandwidth, high-latency links.
Unfortunately, it seems that the RIPE network has PAWS enabled, and that WHOIS requests from multiple servers behind the same public IP cause packets to be dropped because they carry conflicting combinations of timestamp and sequence numbers.
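To illustrate why this bites behind NAT, here is a rough Python model of a timestamp check of this kind. It is only a sketch under assumptions: I am assuming the remote end remembers the most recent timestamp per source address (I have no knowledge of how RIPE's servers actually implement it), the class and function names are invented, and the timestamp values and the 203.0.113.10 address are illustrative rather than taken from the captures.

```python
def is_older(tsval, ts_recent):
    """True if tsval is 'before' ts_recent when compared as 32-bit
    counters that are allowed to wrap around."""
    return ((tsval - ts_recent) & 0xFFFFFFFF) > 0x7FFFFFFF


class PerAddressTimestampCheck:
    """Hypothetical model: one remembered timestamp per source address."""

    def __init__(self):
        self.ts_recent = {}  # source IP -> last accepted timestamp

    def accept(self, source_ip, tsval):
        last = self.ts_recent.get(source_ip)
        if last is not None and is_older(tsval, last):
            return False  # looks like an old packet, so it is dropped
        self.ts_recent[source_ip] = tsval
        return True


check = PerAddressTimestampCheck()
nat_ip = "203.0.113.10"  # illustrative public address shared by all servers

# Long-running server connects first, with a large timestamp (illustrative).
print(check.accept(nat_ip, 900_000_000))  # True
# Freshly rebooted server behind the same NAT sends a small timestamp.
print(check.accept(nat_ip, 2_068_177))    # False
# The long-running server keeps working.
print(check.accept(nat_ip, 900_000_100))  # True
```

In this model every server behind the NAT looks like one host whose clock keeps jumping backwards, so whichever machine has the lower timestamp gets its packets rejected.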
The resolution to this problem turned out to be very simple: disable TCP timestamps in our outgoing packets.
This means that PAWS cannot operate, and then immediately all the servers were able to perform WHOIS lookups with no problems.
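For reference, here is a small Python sketch of checking and changing that setting. It assumes a Linux host, where the setting is exposed as the sysctl net.ipv4.tcp_timestamps under /proc/sys; the function names are my own, writing the value needs root, and the change does not survive a reboot unless it is also added to the sysctl configuration.

```python
from pathlib import Path

# Assumption: Linux, where TCP timestamps are controlled by the
# net.ipv4.tcp_timestamps sysctl, exposed at this path.
TCP_TIMESTAMPS_PATH = Path("/proc/sys/net/ipv4/tcp_timestamps")


def timestamps_enabled(path=TCP_TIMESTAMPS_PATH):
    """Return True if the setting is non-zero (timestamps are sent)."""
    return int(path.read_text().strip()) != 0


def disable_timestamps(path=TCP_TIMESTAMPS_PATH):
    """Write 0 to the setting. Needs root on a real system and is
    lost at the next reboot."""
    path.write_text("0\n")


# Example use (as root on the affected server):
#   if timestamps_enabled():
#       disable_timestamps()
```

The path parameter is only there so the functions can be tried against an ordinary file before touching the real setting.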
Interestingly, enabling PAWS on your network can potentially introduce a DoS attack vector: an attacker can forge a packet that sets a host's timestamp artificially high, preventing future genuine communication.