
27 November 2011

The joys of hardware RAID

Hardware RAID

After having used software RAID on Linux for longer than I'd care to admit, I decided to go professional and get a proper RAID controller. With a decent motherboard offering plenty of unused bandwidth (a dual-channel PCI-X bus), it seemed only fair to make use of it.

I was primarily looking for a good SATA-II PCI-X controller with more than 4 ports. The shortlist came down to the LSI Logic MegaRAID 300-8X, the Adaptec 2820SA and the 3ware 9550SX-8. Availability and cost turned out to be the same problem in this case: most of these cards can be bought new, but at extortionate prices, and the alternative is the second-hand market on eBay... where few cards of this type ever show up. Eventually I got the 12-port version of the 3ware card (9550SX-12) plus a cache battery (!!).

3ware 9550SX-12

Advantages

The whole point of this was to free up system resources from RAID duties (mostly kernel tasks eating away system time, which isn't that much for RAID 1) but, more importantly, to gain performance by spreading more disks over multiple high-bandwidth channels. The 3ware controller achieved this: it does a wonderful job of managing devices and RAID volumes on its own, independently of the operating system. In addition, the Linux kernel includes a driver that supports the card, and the vendor's management tool (tw_cli) is very good.
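
For reference, checking that the kernel driver and the vendor CLI can both see the card goes roughly like this: lsmod/modinfo confirm the 3w-9xxx driver is present, and 'tw_cli show' lists the controllers it can talk to (the controller ID /c0 used throughout this post is simply what tw_cli reported on my system).

# lsmod | grep 3w
# modinfo 3w-9xxx
# tw_cli show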

Below is a quick listing of me deleting two independent single-disk (JBOD) units and re-adding the disks as a RAID 1 array. The backup battery unit had not yet been capacity-tested by the controller (a 20+ hour process), so it refused to enable functionality that depended on it.

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       1862.63   OFF    OFF    
u1    JBOD      OK             -       -       -       931.513   OFF    OFF    
u2    JBOD      OK             -       -       -       931.513   OFF    OFF    
u3    RAID-1    OK             -       -       -       186.254   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     NOT-PRESENT      -      -           -             -
p1     OK               u0     1.82 TB     3907029168    WD-WCAZA3206335     
p2     NOT-PRESENT      -      -           -             -
p3     OK               u0     1.82 TB     3907029168    WD-WCAZA3189743     
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     OK               u1     931.51 GB   1953525168    5QJ0RVB7            
p7     OK               u2     931.51 GB   1953525168    5QJ0ZA08            
p8     NOT-PRESENT      -      -           -             -
p9     NOT-PRESENT      -      -           -             -
p10    OK               u3     189.92 GB   398297088     B41AARNH            
p11    OK               u3     189.92 GB   398297088     B41AB7KH            

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Testing   OK       OK       0      xx-xxx-xxxx  

# tw_cli /c0/u2 del
Deleting /c0/u2 will cause the data on the unit to be permanently lost.
Do you want to continue ? Y|N [N]: Y
Deleting unit c0/u2 ...Done.
# tw_cli /c0/u1 del
Deleting /c0/u1 will cause the data on the unit to be permanently lost.
Do you want to continue ? Y|N [N]: Y
Deleting unit c0/u1 ...Done.
# tw_cli /c0 add type=raid1 disk=6-7 storsave=balance
Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.
Setting Storsave policy to [balance] for the new unit ... Done.
Setting default Command Queuing policy for unit /c0/u1 to [on] ... Done.
Setting write cache=ON for the new unit ...Failed
.  BBU is not ready. Use /c0/u1 set cache=ON command 
  to change the write cache policy when the BBU is ready.

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       1862.63   OFF    OFF    
u1    RAID-1    OK             -       -       -       931.312   OFF    OFF    
u3    RAID-1    OK             -       -       -       186.254   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     NOT-PRESENT      -      -           -             -
p1     OK               u0     1.82 TB     3907029168    WD-WCAZA3206335     
p2     NOT-PRESENT      -      -           -             -
p3     OK               u0     1.82 TB     3907029168    WD-WCAZA3189743     
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     OK               u1     931.51 GB   1953525168    5QJ0RVB7            
p7     OK               u1     931.51 GB   1953525168    5QJ0ZA08            
p8     NOT-PRESENT      -      -           -             -
p9     NOT-PRESENT      -      -           -             -
p10    OK               u3     189.92 GB   398297088     B41AARNH            
p11    OK               u3     189.92 GB   398297088     B41AB7KH            

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Testing   OK       OK       0      xx-xxx-xxxx  

# dmesg | tail
(...)
[58946.312871] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001A): Drive inserted:port=7.
[58946.371418] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001F): Unit operational:unit=2.
[58946.396867] sd 0:0:2:0: [sdc] Attached SCSI disk
[59352.626254] scsi 0:0:1:0: Direct-Access     AMCC     9550SX-12  DISK  3.08 PQ: 0 ANSI: 5
[59352.626400] sd 0:0:1:0: Attached scsi generic sg1 type 0
[59352.626770] sd 0:0:1:0: [sdc] 1953103872 512-byte logical blocks: (999 GB/931 GiB)
[59352.627651] sd 0:0:1:0: [sdc] Write Protect is off
[59352.627654] sd 0:0:1:0: [sdc] Mode Sense: 23 00 00 00
[59352.628233] sd 0:0:1:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[59352.783431]  sdc: unknown partition table
[59352.886156] sd 0:0:1:0: [sdc] Attached SCSI disk
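
Once the BBU eventually finished its capacity test, the write cache could be switched on per unit, exactly as the controller's own message above suggests - along these lines, re-running '/c0 show' first to confirm that BBUReady has turned to Yes:

# tw_cli /c0 show
# tw_cli /c0/u1 set cache=on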

Disadvantages

The downsides of this solution were few and, at the time, mostly negligible. The 3ware driver for Linux is functional, but there are reports of implementation issues related to interrupt management and PCI interaction. It is a universal 3ware driver maintained by the vendor that supports a multitude of similar controllers, but updates seem focused on supporting new cards. Another con of the hardware RAID route is that the on-disk format of the data is managed by the card, which means there is a strong possibility that disks and RAID volumes become readable only by compatible 3ware controllers (ones using the same on-disk format). This reduces flexibility and increases risk in case the controller fails; there is documentation on the Internet describing exactly this scenario.

In use

Despite the disadvantages, which I weighed at first and digested over the initial period of testing, I decided to go ahead and migrate my server from software to hardware RAID. Both my 1TB and 2TB disks were made into RAID 1 volumes, which the operating system happily uses as if they were single disks, which is very cool. I treated these volumes as simple disks, partitioned them and handed them to LVM.
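
The handoff to LVM was the usual routine; a minimal sketch, with device names, volume group and sizes as placeholders rather than my actual layout:

# pvcreate /dev/sdb1
# vgcreate vg_data /dev/sdb1
# lvcreate -L 200G -n lv_storage vg_data
# mkfs.ext4 /dev/vg_data/lv_storage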

The card supports and handles hot-swapping and moving disks between physical ports well. I disconnected and reconnected disks while the volumes were up and all went smoothly. I can't be sure now, but I don't think the card rebuilt the entire volumes - just the blocks that had changed. Swapping ports was no trouble either (even online): all disks were recognised and put into the correct volumes. Booting worked well too, so no complaints in terms of functionality.
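
From memory, getting the controller to notice a re-inserted disk was just a matter of asking it to rescan its ports and checking the listing again - nothing beyond the commands already shown above, plus a rescan:

# tw_cli /c0 rescan
# tw_cli /c0 show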

However, and in line with reports on the Internet, performance under multiple concurrent accesses was not great - the system more or less locked up while several heavy I/O operations were taking place. Sure, every system becomes sluggish when lots of I/O is happening, but operations in memory, on cached files, in the shell and so on normally keep working smoothly as long as they don't need to touch the disks. Not so with the 3ware, where the shell would become unresponsive to keyboard input. Single-stream performance, on the other hand, was great! I can't remember the numbers - I must have written them down somewhere.
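
I no longer have the benchmark output, but the sort of quick-and-dirty check I would run is along these lines, with the test path and sizes as placeholders (oflag=direct bypasses the page cache so the controller actually gets exercised, and iostat shows per-device utilisation while the load is running):

# dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096 oflag=direct
# iostat -x 5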

Breakage

A few months into the break-in period, the kernel started reporting ext4 errors - made worse by my own fault of not having enabled automatic fsck in fstab (in a nutshell, it's the last column of each fstab entry: '1' for the root file system, '2' for other volumes, '0' to skip checking, e.g. for swap; 'man fstab' for more info).
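
For reference, this is roughly what the relevant fstab entries should have looked like, with devices and mount points purely illustrative:

# <file system>      <mount point>  <type>  <options>  <dump>  <pass>
/dev/vg0/root        /              ext4    defaults   0       1
/dev/vg0/home        /home          ext4    defaults   0       2
/dev/vg0/swap        none           swap    sw         0       0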

Keep calm and carry on.

Not happy about it, I decided to fix the errors, scan the disks, look through host and guest system logs and check the controller logs for hardware faults. Nothing was found. A bit of research into Xen, ext4, LVM, 3ware, etc. revealed few clues.
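
The per-disk checks can be done with smartctl straight through the controller; something like this, where the port numbers come from the tw_cli listing and the character device (/dev/twa0) depends on the driver in use:

# smartctl -H -d 3ware,6 /dev/twa0
# smartctl -a -d 3ware,7 /dev/twa0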

Suspecting ext4, I tried moving a few less important file systems back to ext3, which may be worse in many ways, but _not_ in stability. Soon into this operation errors became frequent, appeared under ext3 too and, worryingly, operations on one file system were generating errors in other file systems (eek!!). Something was seriously wrong. At this point the host OS's root file system started to fall apart and important files went missing.
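
As far as I remember, the switch back to ext3 amounted to backing each volume up, recreating the file system and restoring - roughly like this, with names as placeholders:

# mkfs.ext3 /dev/vg_data/lv_scratch
# mount /dev/vg_data/lv_scratch /mnt/scratch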

Now panic.

In disaster recovery mode, I decided not to touch anything, verified that the file server VM was still working (it was), bought a large external disk and proceeded to copy all the important information out of it over the network (which took the best part of 2 days). This is exactly the type of trouble that RAID 1 won't get you out of: file system corruption. Fortunately the Xen guest images were largely unaffected, so I was mostly OK, although I was not fully aware of the extent of the damage at the time.
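
The copy itself was just a long-running network transfer; something along these lines, with host names and paths as placeholders:

# rsync -aHv --progress fileserver:/srv/data/ /mnt/external/data/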

Incident analysis

Frankly, I don't know what caused the file system corruption. However, the simple fact that corruption happened under ext4 *and* ext3, and that operations on one file system caused problems in other file systems, leads me to look away from the file systems themselves and towards some lower layer of code. Beneath the file systems sit the kernel's block layer, LVM (device mapper) and the 3ware driver; further downstream are the controller itself and the disks. Any of these can touch more than one file system at once, and likely would if something misbehaved. Another variable to throw into the mix is, of course, Xen 4.1.1.

Given that I don't often have this type of issue, I decided to roll back the last change I had introduced: the hardware RAID 1 implementation.

I went back to software RAID, reinstalled the server and performed some tests, which went well. I'm using the same disks, since I found no fault in them, and the same controller card, except that all disks are now exported directly rather than as part of RAID volumes (some would call these JBOD exports). I couldn't resist keeping the controller's 1GB of battery-backed read/write cache in use... hopefully it is not the faulty part.
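
The software RAID side is plain mdadm again; a minimal sketch, assuming two of the directly-exported disks show up as /dev/sdb and /dev/sdc:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
# mdadm --detail /dev/md0
# pvcreate /dev/md0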

Conclusion

If the same problem does not happen again, then I have to assume that something in the driver or in the hardware RAID 1 implementation is wrong or does not play nicely with Linux and/or Xen. In the meantime I will also try to buy another SATA-II PCI-X card, but this time RAID support is purely optional.