Clover
Currently on clover in /etc/cron.daily there is a script called
raid_status_email_clover
which checks the raid array both for redundancy and for the presence of a hot spare, and then e-mails folks if there is a problem.
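The script itself isn't reproduced on this page, but a minimal sketch of this kind of check might look like the following. The recipient address and the exact grep patterns here are placeholders and assumptions, not the contents of the real script:

#!/bin/bash
# Hypothetical sketch of a check along the lines of raid_status_email_clover;
# the real script on clover may differ.  Assumes megacli is on root's PATH
# and uses a placeholder recipient address.

RECIPIENTS="admins@example.com"   # placeholder, not the real address
PROBLEMS=""

# 1. Is the logical drive still redundant?  A healthy array reports "Optimal".
LD_STATE=$(megacli -ldinfo -lAll -a0 | grep -i '^State' | awk -F': ' '{print $2}')
if [ "$LD_STATE" != "Optimal" ]; then
    PROBLEMS="Logical drive state is '$LD_STATE' (expected Optimal)\n"
fi

# 2. Is there still a hot spare?  A configured spare normally shows up in
#    -pdlist with a firmware state containing "Hotspare".
if ! megacli -pdlist -a0 | grep -qi 'Hotspare'; then
    PROBLEMS="${PROBLEMS}No hot spare found in the physical drive list\n"
fi

# 3. Mail only when something is wrong.
if [ -n "$PROBLEMS" ]; then
    echo -e "$PROBLEMS" | mail -s "RAID problem on clover" "$RECIPIENTS"
fi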
The first command raid_status_email_clover calls is:
root@clover cron.daily > megacli -ldinfo -lAll -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:10.908 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:7
Span Depth:1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disabled
Encryption Type: None

Exit Code: 0x00
Here we've asked the raid controller for the status of all of the logical drives on adapter 0. Notice that the state is Optimal, which means there is redundancy: if one drive fails we will not lose any data. Notice also that it says there are 7 drives, although if you look inside the front cover of clover you will see 8 drives. This is because one of the drives is normally set aside as a backup, or 'hot spare'.
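If you just want a quick answer rather than the full report, the same command can be filtered down to the two lines that matter here (the field names are taken from the output above):

# Quick check: just the array state and drive count
megacli -ldinfo -lAll -a0 | grep -E '^State|Number Of Drives'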
The next command the script executes is:
root@clover cron.daily > megacli -pdlist -a0

Adapter #0

Enclosure Device ID: 248
Slot Number: 0
Device Id: 0
Sequence Number: 6
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Unconfigured(bad)
SAS Address(0): 0x1221000000000000
Connected Port Number: 0
Inquiry Data: JK1171YAG7M0VSHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: Foreign
Foreign Secure: Drive is not secured by a foreign lock key
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000001000000
Connected Port Number: 1
Inquiry Data: JK1171YAGABE6SHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 2
Device Id: 2
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000002000000
Connected Port Number: 2
Inquiry Data: JK1171YAG7X0KSHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 3
Device Id: 6
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000003000000
Connected Port Number: 3
Inquiry Data: JK1171YAG79DKSHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 4
Device Id: 3
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000004000000
Connected Port Number: 4
Inquiry Data: JK1171YAG7S06SHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 5
Device Id: 4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000005000000
Connected Port Number: 5
Inquiry Data: JK1171YAG7UGBSHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 6
Device Id: 5
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000006000000
Connected Port Number: 6
Inquiry Data: JK1171YAG77AJSHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 248
Slot Number: 7
Device Id: 7
Sequence Number: 4
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000007000000
Connected Port Number: 7
Inquiry Data: JK1171YAG6B6ESHitachi HDS722020ALA330 JKAOA20N
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Exit Code: 0x00
This queries the raid controller for the status of each individual physical drive on adapter 0. Note the Slot Number, Device Id, and Firmware state for each one. If you look at the front of clover you will see that each drive is labeled with a device ID and each slot is labeled with a slot number, so you know which drive to pull when one goes bad. Looking at the output, there are 7 drives whose firmware state is Online; this means we have redundancy and explains why the logical drive status is Optimal. If we look at the firmware state of the drive in slot 0, we can see that it is 'Unconfigured(bad)'. This is because the drive recently went bad (or the raid controller thought it did). When this happened, the controller took the hot spare (which was the drive in slot 7) and began rebuilding the array onto it in place of the bad drive. Now, however, we would like to have a new hot spare on hand in case another drive fails. To do this we need to shut down clover, pull out the bad drive (making sure the slot number matches), and put a new one in its place.
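When deciding which drive to pull, it helps to see just the slot numbers, device IDs, and firmware states side by side rather than the full listing; filtering the same command works (the field names are taken from the output above):

# List only the slot number, device id, and firmware state of each drive
megacli -pdlist -a0 | grep -E 'Slot Number|Device Id|Firmware state'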
After we swap in the new drive it will still be in an unconfigured state. To make it the hot spare we need to run
megacli -PDHSP -Set -PhysDrv[A:B] -aC
where
- A: the Enclosure #
- B: the Slot #
- C: the Adapter #
so in this case we would run
megacli -PDHSP -Set -PhysDrv[248:0] -a0
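You can confirm that the controller accepted the drive as the spare by checking its firmware state again; assuming megacli's -pdinfo option behaves like the commands above, it should now report something containing 'Hotspare' rather than an unconfigured state:

# Check just the firmware state of the newly inserted drive in slot 0
megacli -pdinfo -PhysDrv[248:0] -a0 | grep 'Firmware state'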
During the rebuild process the firmware state for drive 7 would have been 'Rebuild'. To monitor its progress we could have run (as root)
megacli -PDRbld -ShowProg -PhysDrv [248:7] -a0

Rebuild Progress on Device at Enclosure 248, Slot 7 Completed 36% in 559 Minutes.

Exit Code: 0x
Note, however, that the minutes counter only counts up to 1092 and then resets to zero.
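If you want to keep an eye on the rebuild without retyping the command, a simple loop works; the 10-minute interval here is an arbitrary choice:

# Re-run the progress query every 10 minutes until interrupted with Ctrl-C
while true; do
    megacli -PDRbld -ShowProg -PhysDrv [248:7] -a0
    sleep 600
done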
Grass
Currently on grass in /root, you can open the raid array's manager GUI by typing
./megarcmgr
Within the GUI you can check the status of the logical and physical drives, rebuild failed drives, and designate drives as hot spares.
From the main page you can navigate the "Management Menu" with the arrow keys, and select the menu you want to open with "enter". To return to a previous menu press the Esc key. Esc also allows you to quit the program if you are at the top of the menu tree.
To check the configuration of the raid array, go to Configure>View/add Configuration:
Here you can check the status of the drives:
There are 8 hard drives in the array, with ports numbered 1-8 (note that the ID labels instead run from 0-7). In this image, the 8th port contains the hot spare. In a good configuration, all of the drives are marked as "online". From this screen, note the options at the bottom; you can hit F3 to check the status of the logical drives:
The raid array is healthy when all of the logical drives report an optimal status.
In the event a drive fails, you can rebuild the array by taking the following steps:
- Open the GUI and find which drive failed (this can be done by noting the port of the failed drive, as in the menu of the screenshot above). This drive should be marked with a status such as "unconfigured state (bad)". Record the port number and turn grass off.
- Open the front of grass and find the port (or slot) of the bad drive. Pull this drive out and replace it with a new drive.
- Turn Grass back on.
- The next step depends on the state of the array after you replaced the failed drive. There are two options; make sure you choose the correct one for your situation.
Option A. Open the manager again to check the new status of the array. If you correctly identified the bad drive, the status of the replaced drive should now read something like "unconfigured - good". Now it is time to rebuild and designate this new drive as the hot spare. From the Management Menu go to Objects>Physical Drive, scroll down to the replaced drive, and hit enter to bring up a new menu bar. Select rebuild and hit enter:
This drive should now be designated as the new hot spare. If it is not, highlight the drive again, and this time select "Make Hot Spare" from the menu.
Option B. If the wrong drive was pulled, there will now be TWO drives marked as bad. This is very dangerous, as redundancy is lost in this state. Before proceeding, turn Grass off again and swap back the drives you just exchanged. You must rebuild the redundancy before attempting to pull the drive again. Do this by going to Management Menu>Objects>Physical Drive and highlighting the bad drive (there should only be one now, the one you originally took out and put back). Hit enter and select rebuild from the menu. Once redundancy has been restored, it is safe to try to pull the bad drive again. It may be that the ID or slot numbers are mislabeled on Grass, in which case you must repeat this procedure until the correct drive has been pulled (you will know because only one drive will show as unconfigured after the swap). Then you may follow Option A.