Advanced Sysadmin Troubleshooting: Slow Websites
Oh boy, it's been an intense four weeks. I was pulled in to help troubleshoot several major failures at work across loosely related systems, including my own. All the systems are public facing, hence the intensity.
This article attempts to capture a high-level view of one of the problems, and the methods and tools used to try and solve it (not how to use the tools, that's for another time). I'm not putting all the details in. Some are emphatically not appropriate; others won't increase understanding of the methodology and process taken.
In the interest of helping other sysadmins, I've linked to the various tools used. Some are well known, others more obscure.
- Two unrelated websites run "slow" from various locations around the 'Net. BUT not all locations. The "slowness" is also highly erratic.
- One is Domino on Win2K3, the other Apache on Solaris. We have limited ability to troubleshoot at/with the Win2K3 servers, so all server-side work is done on the Solaris boxen.
- Both systems are hosted off the same DMZ firewall, albeit on different ports.
- Internal monitoring (very extensive) failed to notice the slowness - neither the multiple types and locations of http ping equivalents NOR Apache's logging of the time taken to serve each request showed anything.
Let the hunt begin!
The obvious things are tried:
- Sniff from multiple locations, servers, clients and so on. (wireshark, tcpdump, snoop)
- Ask others to trial from their locations and note results (Thanks SAGE-AU!)
No joy, but we get a better idea of the problem set. Some places work perfectly (Rockhampton!!!), others don't.
Around this point in time it is noted that the problem is far more widespread than first suspected. All other sites tried, hosted at the same location, exhibit the same symptoms. Generally not as bad as the first two. Thoughts focus on Routes/Upstream et al.
Spend a bit of time checking for funky networking stuff - dealing with pretty heavy duty firewalls, so all sorts of edge cases could be at work.
The sniffs taken do show lots of dropped packets. Eventually, with more detailed captures time-synced at both ends, it is noted that the Time Sequence Graphs (tcptrace/xplot) show packets arriving at the client/browser side "out of order". Ouch! The client has to hold the later segments until the missing ones turn up, which in turn "imposes" a filled TCP Window at the client end. Hence the slowdown.
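For anyone who wants to see the idea without firing up xplot, here is a minimal sketch of what "out of order" means in a capture. It assumes you have already extracted the TCP sequence numbers for one direction of one connection, in arrival order, into a plain list. The function name and the sample numbers are made up for illustration and are not from our captures. It also ignores sequence number wraparound, and it cannot tell a reordered segment from a retransmitted one, so treat the count as a hint only.

```python
def count_out_of_order(seqs):
    """Count segments that arrive with a sequence number below the
    highest one already seen (i.e. they turned up late)."""
    highest = None
    late = 0
    for seq in seqs:
        if highest is not None and seq < highest:
            late += 1          # arrived after a later segment
        else:
            highest = seq      # new high-water mark (or a duplicate of it)
    return late


# Made-up arrival order for a single connection, one direction only.
arrivals = [1000, 2460, 5380, 3920, 6840, 8300, 7760]
print(count_out_of_order(arrivals))
```

A handful of late segments per connection is normal on the Internet. It's the connections where the count is a large share of the total that are worth graphing.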
A mate (via SAGE-AU) who just happens to work for my ISP offers his assistance. He modifies the routes used for their clients in an attempt to bypass a possibly faulty network circuit. No joy. But Internode scores major brownie points at work - keeping in mind I'm just a residential customer; work has zero financial stake with 'em. You can't pay to get quality service like that.
An effort is made to capture "flow" information rather than individual packets (argus), in the hope that it will be more useful. It is. Flow analysis suggests that anywhere from a fifth to a quarter of all principal users would be impacted by this problem. Quadruple ouch.
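The "what share of users is hurting" question boils down to a small aggregation once the flow records are exported. The sketch below assumes each record has been reduced to a (client, packets, retransmits) tuple. That layout, the 5% threshold and the sample records are all hypothetical. It is not argus output and not our real numbers.

```python
def impacted_share(flows, threshold=0.05):
    """Fraction of clients whose overall retransmit ratio exceeds threshold.

    flows: iterable of (client, packets, retransmits) tuples (assumed layout).
    """
    totals = {}
    for client, packets, retrans in flows:
        p, r = totals.get(client, (0, 0))
        totals[client] = (p + packets, r + retrans)
    if not totals:
        return 0.0
    # Skip clients with zero packets so we never divide by zero.
    bad = sum(1 for p, r in totals.values() if p > 0 and r / p > threshold)
    return bad / len(totals)


# Made-up flow summaries: (client, packets, retransmits).
flows = [
    ("a", 200, 1),
    ("b", 150, 30),
    ("a", 100, 2),
    ("c", 300, 0),
    ("d", 80, 3),
]
print(impacted_share(flows))
```

Where to put the threshold is a judgment call. Try a few values and see how sensitive the answer is before quoting a figure to management.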
Start playing the numbers back and forth. How many retransmits, histograms and so on. (gawk, R)
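I used gawk and R for this, but the same kind of number play can be sketched in a few lines of Python. This is only an illustration of the histogram step, assuming you already have one retransmit count per flow. The function name, bin width and sample counts are invented.

```python
from collections import Counter


def retransmit_histogram(counts, bin_width=5):
    """Bucket per-flow retransmit counts into bins of bin_width.

    Returns {bin_start: number_of_flows}, ordered by bin_start.
    """
    hist = Counter((c // bin_width) * bin_width for c in counts)
    return dict(sorted(hist.items()))


# Made-up per-flow retransmit counts.
counts = [0, 1, 0, 3, 7, 12, 0, 2, 9, 0]
bin_width = 5
for start, n in retransmit_histogram(counts, bin_width).items():
    print(f"{start:3d}-{start + bin_width - 1:3d} {'#' * n}")
```

A long tail to the right is the thing to look for: most flows near zero, with a stubborn minority retransmitting heavily.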
Then it struck me - another of our monitoring programs (orca) tracks all sorts of network counts at a low level - including retransmits. Identified! We are able to observe when the problem started. Unfortunately it coincided with a failed upgrade to the orca collector (division by zero doesn't work too well. Oops. My Bad. Murphy Rulez.) So we can only identify a 3-4 day period when the problem started. Otherwise we'd be able to tie down to a far more detailed level. Bummer. BUT! We still have a line in the sand. The DMZ hoster is advised and starts checking change control.
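The "line in the sand" logic can be sketched as follows. It assumes the counters have been pulled out as (label, retransmits, segments_sent) samples, with None where the collector was down. The field layout, the 2% limit and the data are hypothetical, and this is not how orca itself stores or computes anything. Note the guard for zero segments sent, which is exactly the sort of thing that bites you otherwise.

```python
def retransmit_ratio(retrans, segs_out):
    """Retransmits per segment sent, or None when nothing was sent."""
    if segs_out == 0:
        return None            # avoid dividing by zero
    return retrans / segs_out


def onset_window(samples, limit=0.02):
    """Return (last_good_label, first_bad_label) for a counter series.

    samples: list of (label, retransmits, segments_sent); the two counters
    are None for periods where no data was collected.
    """
    last_good = None
    for label, retrans, segs_out in samples:
        if retrans is None or segs_out is None:
            continue           # collector gap, nothing to judge
        ratio = retransmit_ratio(retrans, segs_out)
        if ratio is None:
            continue
        if ratio > limit:
            return last_good, label
        last_good = label
    return last_good, None


# Made-up daily samples with a two-day collector outage in the middle.
samples = [
    ("day1", 50, 10000),
    ("day2", 40, 9000),
    ("day3", None, None),
    ("day4", None, None),
    ("day5", 900, 11000),
]
print(onset_window(samples))
```

With a gap in the data, the best you can say is that the problem started somewhere between the last good sample and the first bad one, which is the position we found ourselves in.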
The very next morning (Friday, 28th of September), looking at the same retransmit graphs, albeit for a 10 day period, I immediately note that the problem went away about 36 hours previously. WOOOOOT!
Immediately rerun the same tests that showed the problem over and over and over again. All gone. Life is wonderful.
Sadly, we have no idea why. Something broke, and something fixed. At least we now know what to look for if it recurs.
Welcome to the joy of being a System Administrator.