
Physical Disk Failures

Posted: Mon Feb 10, 2014 12:58 pm
by Richard Siemers
PD failures are common, and I wanted to discuss/share/learn how to properly audit and verify the PD failure and recovery process. I have seen drives fail, come back online, then fail again a week later. I am curious what the workflow is for a PD failure: where in that workflow does the system try to re-spin the drive, or move readable chunklets off the drive versus rebuilding them from parity? At what point will it stop trying to re-spin the drive and just rebuild everything from parity?

How many different ways can a PD fail, and how does the system react differently to each? I can think of a few: a port-A or port-B failure, both ports failing, more than 5 chunklets going bad on the drive (media errors), and a fully failed drive that can no longer be read from at all.

So at the point of a drive failure, after an alert is sent out to the customer and HP, what happens next?

To see which drives are failed or degraded:
showpd -failed -degraded

To see chunklets that have been moved, are scheduled to move, or are moving:
showpdch -mov

To see chunklets that have moved, or are moving, from a specific PD:
showpdch -from <pdid>

It appears that "showpdch -sync" may reveal which chunklets are being rebuilt from parity.

It appears that "showpdch -log" may show which chunklets are offline, but being serviced through parity reads, and logged writes, as in what happens during a service mag to the other 3 drives on a 4 drive magazine.


One thing I would like to be able to do is confirm for the field tech that the system is ready for him to come onsite. What I currently do to "check" this is a couple of things, because I am not 100% confident the first few are definitive.

1: showpd -space <failed pd #>

Code: Select all

ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440

If I don't see Volume at 0, I assume the drive evacuation/rebuild is not complete yet.
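
A rough sketch of automating that first check, run from a management host. This assumes SSH access to the array CLI and that Volume is the 6th column of the PD row, as in the output above; the host, user and interval are placeholders:

Code: Select all

#!/bin/bash
# Rough sketch only: poll a failed PD until showpd -space reports 0 MB of Volume space.
PDID=285
ARRAY=3paradm@inserv          # placeholder user/host
while :; do
    VOL=$(ssh "$ARRAY" showpd -space "$PDID" | awk -v id="$PDID" '$1 == id {print $6}')
    echo "PD $PDID still holds ${VOL:-?} MB of volume space"
    [ "$VOL" = "0" ] && break
    sleep 300                 # re-check every 5 minutes
done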

2: showpdch -mov <failed pd #>

Code: Select all

ESFWT800-1 cli% showpdch -mov
Pdid Chnk                 LdName LdCh  State Usage Media Sp Cl     From  To
  42  584          tp-2-sd-0.144  514 normal    ld valid  Y  N  285:793 ---
  42  792           tp-2-sd-0.69  726 normal    ld valid  Y  N  285:488 ---
  42 1084           tp-2-sd-0.86  478 normal    ld valid  Y  N  285:521 ---
 102  574          tp-2-sd-0.140  917 normal    ld valid  Y  N  285:785 ---
 102  771           tp-5-sd-0.31  181 normal    ld valid  Y  N  285:190 ---
 102 1085           tp-2-sd-0.41  438 normal    ld valid  Y  N  285:418 ---
 109  580            tp-5-sa-0.3   42 normal    ld valid  Y  N   285:47 ---
 109  771          tp-2-sd-0.130  696 normal    ld valid  Y  N  285:697 ---
...
...
---------------------------------------------------------------------------
Total chunklets: 824

If I see any chunklets still referencing PDID 285 (the failed one) in the From column, or any rows with data in the To field, I assume the rebuild/evacuation is not done yet.
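
A crude way to count how many rows in that output still reference the failed PD in the From column (From is pdid:chunklet, so this greps for " 285:"); run it from a management host so the output can be piped through grep, with user/host as placeholders:

Code: Select all

ssh 3paradm@inserv showpdch -mov | grep -c ' 285:'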


Is there any way to view the tasks that relocate/rebuild these chunklets? I don't see anything in my showtask history.

Re: Physical Disk Failures

Posted: Tue Feb 11, 2014 7:00 am
by ailean
I tend to see three methods:

1) Disk fails with little warning, and the system auto-rebuilds from parity.
2) Disk is failing, sometimes with warnings, and the system automatically moves data elsewhere.
3) Disk is not happy, with maybe a few warnings or being made unavailable for allocations, but requires manual servicing to start the data rebuild/move process before the engineer arrives.

Support are typically aware of whether the disk is ready for replacement, but I am not sure what info from the SP uploads they check for that; I suspect the estimates they sometimes give are generic, based on disk type/size.

The fun tends to begin when the extra load from the rebuild fails another disk, and/or when inserting the new disk doesn't go to plan. Three different service companies and over a dozen different engineers in 5 years have led to random events during replacements, but no data loss. ;)

Re: Physical Disk Failures

Posted: Fri Feb 14, 2014 3:26 am
by eve
Hi,

Run a showpd -c pdid (e.g. showpd -c 285) and check that the following columns are zero:
* NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
* SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE

If one is not zero, the drive is not ready to be swapped.
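
For example, combining this with the checks from the first post against the failed PD used earlier in the thread:

Code: Select all

showpd -c 285
showpd -space 285
showpdch -mov 285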


Cheers

Re: Physical Disk Failures

Posted: Mon Feb 24, 2014 5:57 pm
by Richard Siemers
Thanks for that feedback.

What determines which sort of servicemag they will do? I have seen cases where they used logging, and the servicemag wasn't initiated until the tech and the part were onsite, and other cases where they did a full servicemag several hours before the tech arrived.

I presume it's based on the activity of the system; how does one determine which to use and when?

Re: Physical Disk Failures

Posted: Tue Feb 25, 2014 5:52 am
by ailean
It's been a while since I've seen a full evac of a mag; I'd guess performance, load and % full would be considered. Logging seems to be the norm now. I know they used to have concerns about how long you could run with logging on, but we've had failed inserts of new disks that left us running with logging for several hours until the engineer was able to get hold of someone who knew enough about the internals of 3PAR to work around the problem.
It may be that I've only seen a full evac when replacing an entire mag (there was a time when FC450 disks weren't available, so any failures were replaced with FC600 disks, and all the disks in the mag had to be the same size), or on some early disk replacements where the mag was only maybe 10% full and logging was still a new feature.
I have also had to start the servicemag manually and tell the engineer to come back in a few hours when Support forgot, or had them ask me to do it because certain Support staff were using a remote portal that broke often (other teams appeared to have access to better tools at the time and didn't have a problem :) ).

Re: Physical Disk Failures

Posted: Tue Feb 25, 2014 9:48 am
by corge
eve wrote:Hi,

Run a showpd -c pdid (e.g. showpd -c 285) and check that the following columns are zero:
* NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
* SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE

If one is not zero, the drive is not ready to be swapped.


Cheers


For a drive to be ready for replacement, only the Used OK and Used FAIL counts need to be 0. HP doesn't care about the others.

Re: Physical Disk Failures

Posted: Tue Feb 25, 2014 9:53 am
by corge
Our reporting shows that a drive has failed.

I double check on the Inserv via command line.

Code: Select all

showpd -failed -degraded


I check servicemag to make sure nothing is running currently

Code: Select all

servicemag status


I issue the command below to get the model number of the drive, since HP will not have it for the company I work for

Code: Select all

showpd -i <PD#>


I issue the following to get the drive position, drive state, and chunklet status.

Code: Select all

showpd -c <PD#>


The following two commands are also needed by support

Code: Select all

showversion

Code: Select all

showsys


If replacing a single drive, I issue

Code: Select all

servicemag start -log -pdid <PD#>

The Inserv will begin preparing to take the magazine offline and log the chunklets normally bound for this magazine to other magazines in the system.
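
While the magazine is out with logging active, the showpdch -log command Richard mentioned earlier in the thread should list the chunklets currently being handled through logged writes, so you can keep an eye on progress (I'm going by the description above):

Code: Select all

showpdch -log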

Verify the magazine is ready to be pulled by issuing

Code: Select all

servicemag status

You should see SUCCEEDED when it's ready to be pulled. The orange indicator light on the magazine will be lit.

Replace the drive in the magazine, put the magazine back in the Inserv, and wait for the orange light to go away, or make sure all of the lights on the magazine are green and NOT blinking. Blinking lights indicate the drives are still spinning up after the magazine is first inserted.

Back at the command line, type:

Code: Select all

cmore showpd

You should see the drive placement at the top with a state of new. This just shows that the Inserv sees the new drive and is ready to go.

Issue the following to have the servicemag script resume the magazine.

Code: Select all

servicemag resume <CAGE#> <MAGAZINE#>


That is how it is done here.
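
For what it's worth, the information-gathering half of that procedure is easy to bundle into one pass so it can be pasted straight into the support case. A rough sketch, assuming SSH access from a management host; the host and user are placeholders:

Code: Select all

#!/bin/bash
# Rough sketch: gather the pre-replacement info from the steps above in one pass.
PDID=$1
ARRAY=3paradm@inserv          # placeholder user/host
for CMD in "showpd -failed -degraded" \
           "servicemag status" \
           "showpd -i $PDID" \
           "showpd -c $PDID" \
           "showversion" \
           "showsys"
do
    echo "=== $CMD ==="
    ssh "$ARRAY" $CMD
done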

Re: Physical Disk Failures

Posted: Thu Mar 06, 2014 2:41 pm
by Richard Siemers
Excellent thank you for the step by step write up!

Re: Physical Disk Failures

Posted: Tue Mar 11, 2014 6:07 am
by eve
corge wrote:For a drive to be ready for replacement, only the Used OK and Used FAIL counts need to be 0. HP doesn't care about the others.

HP does care about the others. I have been servicing 3PAR for quite a few years, and I know from experience that you may run into issues if they are not all zero.

Re: Physical Disk Failures

Posted: Tue Mar 11, 2014 6:18 am
by eve
corge wrote:Our reporting shows that a drive has failed. [...] That is how it is done here.

A few things to add:

* showpd -c
Be sure to check that the following columns are zero:
- NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
- SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE
If one is not zero, the drive is not ready to be swapped.

* servicemag start -log -pdid xx
The -log option is only needed on S, T and V-class systems, where you have four drives on a magazine.
-log will divert write I/O for the three remaining drives on that mag to other disks, and that logged I/O is played back to the disks during the resume (read I/O is served from parity for all four drives).
-log is the 3PAR recommended option for large drives.
If the -log option is left out on S, T and V-class systems, you will issue a full servicemag, which copies all data from the three remaining drives on that mag to other disks. This will take hours.

* If you run a showpd after replacing the failed drive, the new drive may show the status "degraded" instead of "new". This means the drive is running old firmware. Just continue with the servicemag resume; the drive will be upgraded first during the resume.

If the drive shows "failed", try a reseat.
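
As a final illustration, a resume sequence for the example PD from earlier in the thread (285 at cage position 7:9:1, so cage 7, magazine 9) would roughly be the following; adjust the cage and magazine numbers for the real drive:

Code: Select all

showpd -failed -degraded
servicemag resume 7 9
servicemag status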