lame sysadmin
Weird stuff has been going on in my network recently.
It started last night while I was fiddling with my lab network - dark witchcraft including OSPF, EIGRP, NAT, ACLs, PPP/CHAP, etc. I dismissed it as my laptop still fussing with its gateways and default routes... I was tired and didn't care. So I just turned things off, went to bed and forgot about it.
[side note: I should post my network diagram someday, so everyone can appreciate a true overkill geek home network...]
This morning, though, my wife couldn't reach Facebook from her laptop. Whoops, I must have done *something*...
Symptoms were:
- able to ping external hosts;
- unable to access websites;
First things first: check the squid proxy. It seemed OK.
Lazy as I am, I decided to just reboot the routers, because I might have left some not-very-well-thought-out configs there. The routers came back up and the situation got worse! Now I had no comms between the core and uplink routers. WTF?
- Moral of the story #1
- Having bpduguard on switch ports that link to routers is not necessarily safe.
The router was sending BPDUs out of its ports, triggering bpduguard on the switch, which in turn disabled every switch port connected to the router (the ones on which BPDUs were received). This effectively cut the router off from the switch.
As a quick & dirty solution, I enabled bpdufilter on those switch ports, so that received BPDUs are simply ignored instead of getting the port disabled (bpduguard is set up globally on the switch, rather than port by port). The proper solution is to shut down STP on the router itself, with "no spanning-tree vlan X" commands.
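To make the difference between the two features concrete, here is a toy Python model of a single switch port. It is only a sketch, not real switch code: the SwitchPort class and the port names are made up for illustration, and it assumes the filter is evaluated before the guard, which is consistent with the quick fix described above.

```python
from dataclasses import dataclass


@dataclass
class SwitchPort:
    """Toy model of one switch port (hypothetical, not a real switch API)."""
    name: str
    bpduguard: bool = True    # inherited from the global setting
    bpdufilter: bool = False  # per-port override
    state: str = "up"

    def receive_bpdu(self) -> str:
        """Return what the port does with an incoming BPDU."""
        if self.state != "up":
            return "ignored (port is down)"
        if self.bpdufilter:
            # Assumption: the filter wins, so the guard never sees the BPDU.
            return "dropped"
        if self.bpduguard:
            self.state = "err-disabled"
            return "port shut down"
        return "processed by STP"


guarded = SwitchPort("Fa0/1")                    # guard only
filtered = SwitchPort("Fa0/2", bpdufilter=True)  # guard plus filter

print(guarded.receive_bpdu(), "->", guarded.state)    # port shut down -> err-disabled
print(filtered.receive_bpdu(), "->", filtered.state)  # dropped -> up
```

The filtered port stays up, but it also no longer reacts to BPDUs at all, which is why turning STP off on the router is the cleaner fix.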
Layer 2 issues sorted, I was back to last night's situation. Being a little more pragmatic this time, I disabled WCCP redirects on the core router. Sure enough, everything was back to normal. The proxy was good, but the connection between the core router and the proxy service/host was not (this connection being a GRE tunnel - did I mention how nice WCCP is??).
- Moral of the story #2
- Be pragmatic, not lazy.
I had a massive update pending on the server, including a kernel upgrade. I had to reboot now. Naturally, the server did not come back up as expected, because the updated version of udev required a kernel >=2.6.27. The latest accessible kernel was 2.6.26 - d'oh!!
So there I was, on a serial console, manually creating md nodes in /dev so I could mount the disks, copy over the newest available kernel, modify grub's configuration and try again with the new kernel. It worked! Everything came back to normal: the GRE tunnel came back up, as did all the other services on the server.
- Moral of the story #3
- Don't slack on your sysadmin duties: update often and check that things are still working after updating. Make sure you reboot from time to time to verify that everything is starting up nicely.
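The udev surprise is the kind of thing a small pre-reboot check could catch. Below is a rough Python sketch that compares a kernel release string against a required minimum. The function names and the "2.6.26-2-686" release string are made up for illustration; only the 2.6.26 and 2.6.27 version numbers come from the story above.

```python
import re


def kernel_version(release: str) -> tuple:
    """Extract the leading numeric part of a release string like '2.6.26-2-686'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing third component (e.g. '3.2') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(release: str, minimum: str) -> bool:
    """True if the kernel release is at least the required minimum."""
    return kernel_version(release) >= kernel_version(minimum)


# Hypothetical release string for the old kernel; udev wanted >= 2.6.27.
print(meets_minimum("2.6.26-2-686", "2.6.27"))  # False
print(meets_minimum("2.6.27", "2.6.27"))        # True
```

The comparison is on integer tuples rather than strings, so 2.6.9 correctly sorts below 2.6.27. The release to feed in is that of the kernel that will actually boot next, which is not necessarily the one currently running.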
Along with all this, I forgot about the new terms of service of editdns.net (my DNS provider) and they cancelled my account. My domain pjvenda.net has been unavailable since 3rd January 2010 and should only be restored after tonight (10th January 2010).
- Moral of the story #4
- Make sure you have a working DNS service for your domains. Otherwise they don't work.
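A periodic check would have flagged the dead domain well before a week had gone by. Here is a minimal Python sketch, assuming that a plain lookup through the system resolver is a good enough test. The function name and the injectable lookup parameter are my own invention; socket.getaddrinfo is the standard library call.

```python
import socket


def resolves(hostname: str, lookup=socket.getaddrinfo) -> bool:
    """Return True if the name resolves to at least one address."""
    try:
        return len(lookup(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False
```

Calling resolves("pjvenda.net") from a cron job and mailing yourself when it returns False would do. The lookup parameter is only there so the function can be exercised without touching the network.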
I'm lame sometimes. Must be the cold.
Cheers, PJ