The joys of hardware RAID
Hardware RAID
After using software RAID on Linux for longer than I'd care to admit, I decided to go professional and get a proper RAID controller. With a decent motherboard offering a good deal of unused bandwidth (a two-channel PCI-X bus), it seemed only fair to make use of it.
I was primarily looking for a good SATA-II PCI-X controller with more than 4 ports. The short list came down to the LSI Logic MegaRAID 300-8X, the Adaptec 2820SA and the 3ware 9550SX-8. Availability and cost ended up amounting to the same thing here: most can be bought new, but at extortionate prices. Alternatively there is the second-hand market on eBay... but few cards of this type show up there. Eventually I got the 12-port version of the 3ware card (9550SX-12) plus a cache battery (!!).
Advantages
The whole point of this was to free up system resources from RAID duties (mostly kernel tasks eating away at system time, which isn't that much for RAID 1) but, more importantly, to gain performance by spreading more disks over multiple high-bandwidth channels. The 3ware controller delivered on this: it does a wonderful job of managing devices and RAID volumes on its own, independently of the operating system. In addition, the Linux kernel includes a driver that supports the card, and the vendor's management tool (tw_cli) is very good.
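For reference, a few quick checks to confirm that both the kernel driver and the CLI can see the card; the controller ID /c0 matches my setup and may differ on yours:

# dmesg | grep 3w-9xxx        # kernel messages from the 3ware driver
# lsmod | grep 3w             # confirm the driver module is loaded
# tw_cli show                 # list the controllers the CLI can see
# tw_cli /c0 show             # unit, port and BBU summary for controller 0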
Below is a quick listing of me detaching two independent disks and reattaching them as a RAID 1 array. The backup battery unit had not yet been charge-tested by the controller (a 20+ hour process), so it refused to enable functionality that depended on it.
# tw_cli /c0 show

Unit  UnitType  Status   %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK       -       -       -       1862.63   OFF    OFF
u1    JBOD      OK       -       -       -       931.513   OFF    OFF
u2    JBOD      OK       -       -       -       931.513   OFF    OFF
u3    RAID-1    OK       -       -       -       186.254   OFF    OFF

Port   Status        Unit   Size       Blocks       Serial
---------------------------------------------------------------
p0     NOT-PRESENT   -      -          -            -
p1     OK            u0     1.82 TB    3907029168   WD-WCAZA3206335
p2     NOT-PRESENT   -      -          -            -
p3     OK            u0     1.82 TB    3907029168   WD-WCAZA3189743
p4     NOT-PRESENT   -      -          -            -
p5     NOT-PRESENT   -      -          -            -
p6     OK            u1     931.51 GB  1953525168   5QJ0RVB7
p7     OK            u2     931.51 GB  1953525168   5QJ0ZA08
p8     NOT-PRESENT   -      -          -            -
p9     NOT-PRESENT   -      -          -            -
p10    OK            u3     189.92 GB  398297088    B41AARNH
p11    OK            u3     189.92 GB  398297088    B41AB7KH

Name  OnlineState  BBUReady  Status   Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Testing  OK    OK    0      xx-xxx-xxxx

# tw_cli /c0/u2 del
Deleting /c0/u2 will cause the data on the unit to be permanently lost.
Do you want to continue ? Y|N [N]: Y
Deleting unit c0/u2 ...Done.

# tw_cli /c0/u1 del
Deleting /c0/u1 will cause the data on the unit to be permanently lost.
Do you want to continue ? Y|N [N]: Y
Deleting unit c0/u1 ...Done.

# tw_cli /c0 add type=raid1 disk=6-7 storsave=balance
Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.
Setting Storsave policy to [balance] for the new unit ... Done.
Setting default Command Queuing policy for unit /c0/u1 to [on] ... Done.
Setting write cache=ON for the new unit ...Failed.
BBU is not ready. Use /c0/u1 set cache=ON command to change the write cache
policy when the BBU is ready.

# tw_cli /c0 show

Unit  UnitType  Status   %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK       -       -       -       1862.63   OFF    OFF
u1    RAID-1    OK       -       -       -       931.312   OFF    OFF
u3    RAID-1    OK       -       -       -       186.254   OFF    OFF

Port   Status        Unit   Size       Blocks       Serial
---------------------------------------------------------------
p0     NOT-PRESENT   -      -          -            -
p1     OK            u0     1.82 TB    3907029168   WD-WCAZA3206335
p2     NOT-PRESENT   -      -          -            -
p3     OK            u0     1.82 TB    3907029168   WD-WCAZA3189743
p4     NOT-PRESENT   -      -          -            -
p5     NOT-PRESENT   -      -          -            -
p6     OK            u1     931.51 GB  1953525168   5QJ0RVB7
p7     OK            u1     931.51 GB  1953525168   5QJ0ZA08
p8     NOT-PRESENT   -      -          -            -
p9     NOT-PRESENT   -      -          -            -
p10    OK            u3     189.92 GB  398297088    B41AARNH
p11    OK            u3     189.92 GB  398297088    B41AB7KH

Name  OnlineState  BBUReady  Status   Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Testing  OK    OK    0      xx-xxx-xxxx

# dmesg | tail
(...)
[58946.312871] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001A): Drive inserted:port=7.
[58946.371418] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001F): Unit operational:unit=2.
[58946.396867] sd 0:0:2:0: [sdc] Attached SCSI disk
[59352.626254] scsi 0:0:1:0: Direct-Access     AMCC     9550SX-12  DISK  3.08 PQ: 0 ANSI: 5
[59352.626400] sd 0:0:1:0: Attached scsi generic sg1 type 0
[59352.626770] sd 0:0:1:0: [sdc] 1953103872 512-byte logical blocks: (999 GB/931 GiB)
[59352.627651] sd 0:0:1:0: [sdc] Write Protect is off
[59352.627654] sd 0:0:1:0: [sdc] Mode Sense: 23 00 00 00
[59352.628233] sd 0:0:1:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[59352.783431]  sdc: unknown partition table
[59352.886156] sd 0:0:1:0: [sdc] Attached SCSI disk
Disadvantages
Downsides of this solution were few and, at the time, mostly negligible. The 3ware driver for Linux is functional, but there are reports of implementation issues related to interrupt handling and PCI interaction. It is a universal 3ware driver, maintained by the vendor, that supports a multitude of similar controllers, but updates seem to focus on supporting new cards. Another con of the hardware RAID route is that the on-disk format of the data is managed by the card, which means there is a strong possibility that disks and RAID volumes will only be readable by compatible 3ware controllers (using the same on-disk format). This reduces flexibility and increases risk in case the controller fails. There is documentation on the Internet that illustrates this.
In use
Despite the disadvantages, which I weighed at first but digested over the initial period of testing, I decided to go ahead and move my server from software to hardware RAID. Both my 1 TB and 2 TB disk pairs were made into RAID 1 volumes, which the operating system happily uses as if they were single disks, which is very cool. I used these volumes as simple disks, which I partitioned and handed over to LVM.
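For illustration, the rough sequence looks something like the sketch below; the device node (/dev/sdb) and the volume group and logical volume names are placeholders for whatever your exported RAID unit ends up as:

# parted -s /dev/sdb mklabel gpt mkpart primary 1MiB 100%   # one big partition on the exported unit
# pvcreate /dev/sdb1                                        # LVM physical volume on top of the RAID 1 unit
# vgcreate vg_data /dev/sdb1                                # volume group
# lvcreate -L 200G -n lv_vm1 vg_data                        # logical volume, e.g. for a Xen guest image
# mkfs.ext4 /dev/vg_data/lv_vm1                             # file system inside the logical volume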
The card supports and handles hot-swapping and moving disks between physical ports well. I disconnected and reconnected disks while the volumes were up and everything went smoothly. I can't be sure now, but I don't think the card rebuilt the entire volumes, just the blocks that had changed. Swapping ports was no trouble either (even online): all disks were recognised and put back into the correct volumes. Booting worked well too, so no complaints in terms of functionality.
However, and in line with reports on the Internet, performance under concurrent access was not great: the system would more or less lock up while multiple heavy I/O operations were taking place. Sure, every system becomes sluggish when lots of I/O is happening, but operations in memory, on cached files, in the shell and so on normally keep working smoothly as long as they don't need to touch the disks. Not so with the 3ware, where the shell would become unresponsive to keyboard input. Single-stream performance, on the other hand, was great! I can't remember the numbers; I must have written them down somewhere.
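If I were to measure it again, something simple along these lines would show the contrast; the logical volume and mount point are placeholders, and direct I/O is used to keep the page cache out of the picture:

# dd if=/dev/vg_data/lv_vm1 of=/dev/null bs=1M count=4096 iflag=direct   # single-stream sequential read
# dd if=/dev/zero of=/mnt/data/ddtest bs=1M count=4096 oflag=direct      # single-stream sequential write
# iostat -x 5                                                            # per-device utilisation and await while several of these run at once

Running one dd at a time gives the single-stream numbers; running a few in parallel while watching iostat (and trying to type in another shell) reproduces the multi-access behaviour described above.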
Breakage
A few months into the break-in period I started finding ext4 errors reported by the kernel, made worse by my own fault of not having enabled automatic fsck in fstab (in a nutshell, it's the last column of each fstab entry: '2' for non-root volumes, '1' for the root file system, '0' for swap; 'man fstab' for more info).
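For illustration, a minimal fstab along those lines; device names and mount points are placeholders, and the last column is the fsck pass number:

# <file system>            <mount point>  <type>  <options>  <dump>  <pass>
/dev/vg_data/lv_root       /              ext4    defaults   0       1
/dev/vg_data/lv_home       /home          ext4    defaults   0       2
/dev/mapper/vg_data-swap   none           swap    sw         0       0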
Keep calm and carry on.
Not happy about it, I set out to fix the errors, scan the disks, go through the host and guest system logs, and look for hardware faults in the controller logs. None were found. A bit of research into Xen, ext4, LVM, 3ware, etc. revealed few clues.
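For the record, these are the sort of checks I mean; device nodes, port numbers, logical volume names and log paths below are placeholders for my setup:

# tw_cli /c0 show alarms                      # controller event log: look for drive or cache errors
# smartctl -a -d 3ware,6 /dev/twa0            # SMART data for the disk on port 6, through the controller
# fsck.ext4 -fn /dev/vg_data/lv_home          # read-only check of an unmounted file system (no repairs)
# xl dmesg | tail                             # hypervisor log (xm on older Xen toolstacks)
# grep -iE 'ext4|3w-9xxx' /var/log/kern.log   # kernel-side errors from the file system or the driver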
Assuming it might be an ext4 issue, I tried changing a few less important file systems back to ext3, which may be worse in many ways, but _not_ in stability. Soon into this operation the errors became frequent, appeared under ext3 too, and, worryingly, operations on one file system were generating errors in other file systems (eek!!). So something was badly wrong. At this point the host OS's root file system started to fall apart and important files went missing.
Now panic.
In disaster-recovery mode, I decided not to touch anything else, verified that the file server VM was still working (it was), bought a large external disk and proceeded to copy all the important information out of it over the network (which took the best part of two days). This is exactly the kind of trouble that RAID 1 won't get you out of: file system corruption. Fortunately the Xen guest images were largely unaffected, so I was mostly OK, although I was not fully aware of the extent of the damage at the time.
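The copy itself was nothing fancy, roughly along these lines; host names, paths and the external disk's device node are placeholders:

# mount /dev/sdX1 /mnt/external                                   # external disk on the machine doing the rescue
# rsync -aH --progress root@fileserver:/srv/ /mnt/external/srv/   # pull everything over SSH, preserving ownership and hard links
# rsync -aH --progress root@fileserver:/srv/ /mnt/external/srv/   # second pass should transfer (almost) nothing if the first completed cleanly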
Incident analysis
Frankly, I don't know what caused the file system corruption. However, the simple fact that corruption happened under ext4 *and* ext3, and that operations on one file system caused problems in other file systems, leads me to look away from the file system itself and towards some lower layer of code. Below the file system sit the VFS, LVM and the 3ware driver in the kernel. Further downstream there are the controller itself and the disks. Any of these can touch more than one file system at once, and probably would if something misbehaved. Another variable to throw into the mix is, of course, Xen 4.1.1.
Given that I don't often have this type of issue, I decided to roll back the last change I had introduced: the hardware RAID 1 implementation.
I went back to software RAID, reinstalled the server and ran some tests, which went well. I'm using the same disks, as I didn't find any fault in them, and also the same controller card, except that all disks are now exported directly rather than as RAID volumes (some would call these JBOD exports). I couldn't resist keeping the controller's 1 GB of battery-backed read/write cache in use... hopefully that is not the faulty part.
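In rough terms, the rebuilt layout goes something like this; device names are placeholders, the tw_cli unit type for a plain single-disk export is 'single' if I remember the syntax correctly, and the mdadm.conf path varies by distribution:

# tw_cli /c0 add type=single disk=6                                      # export port 6 as a plain single-disk unit
# tw_cli /c0 add type=single disk=7                                      # and port 7
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc   # Linux md RAID 1 on top of the exported disks
# cat /proc/mdstat                                                       # watch the initial resync
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf                         # persist the array definition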
Conclusion
If the same problem does not happen again, then I'll have to assume that something in the driver or in the hardware RAID 1 implementation is broken, or does not play nicely with Linux and/or Xen. In the meantime I will also try to buy another SATA-II PCI-X card, but this time RAID support is purely optional.