Physical Disk Failures
Posted: Mon Feb 10, 2014 12:58 pm
PD failures are common, and I would like to discuss, share, and learn how to properly audit and verify the PD failure and recovery process. I have seen drives fail, come back online, then fail again a week later. What is the workflow for a PD failure? Where in that workflow does the system try to spin the drive back up, and where does it move readable chunklets off the drive versus rebuild them from parity? At what point does it stop trying to spin the drive up and just rebuild everything from parity?
How many different ways can a PD fail, and how does the system react to each? I can think of a few: a single-port failure (port A or port B), both ports failing, more than 5 chunklets going bad on the drive (media errors), and a failed drive that can no longer be read at all.
So once a drive fails and an alert has been sent to the customer and HP, what happens next?
To see which drives are failed:
showpd -failed -degraded
To see chunklets that have moved, are scheduled to move, or are moving:
showpdch -mov
To see chunklets that have moved, or are moving, from a specific PD:
showpdch -from <pdid>
It appears that "showpdch -sync" may reveal which chunklets are being rebuilt from parity.
It appears that "showpdch -log" may show which chunklets are offline but still being serviced through parity reads and logged writes, as happens to the other 3 drives on a 4-drive magazine during a service mag.
One thing I would like to be able to do is confirm for the field tech that the system is ready for him to come onsite. To check this I currently do a couple of things, because I am not 100% confident that any one of them is conclusive on its own.
1: showpd -space <failed pd #>
ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440
If the Volume column is not 0, I assume the drive evacuation/rebuild is not complete yet.
2: showpdch -mov <failed pd #>
ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440
ESFWT800-1 cli% showpdch -mov
Pdid Chnk LdName        LdCh State  Usage Media Sp Cl    From To
  42  584 tp-2-sd-0.144  514 normal ld    valid  Y  N 285:793 ---
  42  792 tp-2-sd-0.69   726 normal ld    valid  Y  N 285:488 ---
  42 1084 tp-2-sd-0.86   478 normal ld    valid  Y  N 285:521 ---
 102  574 tp-2-sd-0.140  917 normal ld    valid  Y  N 285:785 ---
 102  771 tp-5-sd-0.31   181 normal ld    valid  Y  N 285:190 ---
 102 1085 tp-2-sd-0.41   438 normal ld    valid  Y  N 285:418 ---
 109  580 tp-5-sa-0.3     42 normal ld    valid  Y  N  285:47 ---
 109  771 tp-2-sd-0.130  696 normal ld    valid  Y  N 285:697 ---
...
...
---------------------------------------------------------------------------
Total chunklets: 824
If I see any chunklets still on PDID 285 (the failed one), or any rows with data in the To field, I assume the rebuild/evac is not done yet.
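For anyone who wants to script the two checks above, here is a rough Python sketch that parses saved CLI output. It only encodes my own heuristic (Volume at 0, nothing left on the failed PD, empty To field), not any documented HP readiness test. The column positions are assumed from the sample output in this post, and the function names (volume_mb, pending_chunklets, ready_for_replacement) are hypothetical helpers I made up for illustration.

```python
# Sketch only: column positions are assumed from the sample output in this
# post, and every function name here is a hypothetical helper.

SPACE_OUT = """\
ESFWT800-1 cli% showpd -space 285
-----------------(MB)------------------
Id CagePos Type -State- Size Volume Spare Free Unavail Failed
285 7:9:1 FC failed 285440 0 0 0 0 285440
----------------------------------------------------------------
1 total 285440 0 0 0 0 285440
"""

MOV_OUT = """\
ESFWT800-1 cli% showpdch -mov
Pdid Chnk LdName LdCh State Usage Media Sp Cl From To
42 584 tp-2-sd-0.144 514 normal ld valid Y N 285:793 ---
109 580 tp-5-sa-0.3 42 normal ld valid Y N 285:47 ---
"""


def volume_mb(showpd_space_output, pd_id):
    """Return the Volume (MB) column for pd_id from `showpd -space` output."""
    for line in showpd_space_output.splitlines():
        fields = line.split()
        # Assumed data row: Id CagePos Type State Size Volume Spare Free Unavail Failed
        if len(fields) == 10 and fields[0] == str(pd_id):
            return int(fields[5])
    raise ValueError(f"PD {pd_id} not found in showpd -space output")


def pending_chunklets(showpdch_mov_output, pd_id):
    """Return rows that still sit on pd_id or have a non-empty To field."""
    pending = []
    for line in showpdch_mov_output.splitlines():
        fields = line.split()
        # Assumed data row: Pdid Chnk LdName LdCh State Usage Media Sp Cl From To
        if len(fields) != 11 or not fields[0].isdigit():
            continue  # skip prompt, header, separator, and total lines
        still_on_failed_pd = fields[0] == str(pd_id)
        move_in_flight = fields[10] != "---"
        if still_on_failed_pd or move_in_flight:
            pending.append(line.strip())
    return pending


def ready_for_replacement(space_output, mov_output, pd_id):
    """Apply both checks: Volume is 0 and no chunklet is pending."""
    return (volume_mb(space_output, pd_id) == 0
            and not pending_chunklets(mov_output, pd_id))


print(ready_for_replacement(SPACE_OUT, MOV_OUT, 285))
```

With the sample output from PD 285 this prints True, since Volume is 0 and every listed chunklet already lives on another PD with "---" in the To field. I would still treat a True here as "probably ready" rather than proof.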
Is there any way to view the tasks that relocate/rebuild these chunklets? I don't see anything in my showtask history.