RAID 5 with 2 dead HDDs

Erice · Jan 23, 2014

Suppose that a friend has a server set up to run from a set of four 300GB SAS hard drives configured for RAID5, and two HDDs failed. Perhaps one had failed earlier, but was undetected, then a second one failed.

Is there any hope for data recovery? Going to the backup, unfortunately, is not an option.

TMetzinger · Jan 23, 2014

Erice said:
Suppose that a friend has a server set up to run from a set of four 300GB SAS hard drives configured for RAID5, and two HDDs failed. Perhaps one had failed earlier, but was undetected, then a second one failed.

Is there any hope for data recovery? Going to the backup, unfortunately, is not an option.

Probably not, for a full recovery.


docmirror · Jan 23, 2014

Nope, you're dead. The only way a two-drive failure works out is if you had a hot spare in the same disk group, and auto failover enabled on the RAID engine to use the hot spare when the first drive went TU.

Erice · Jan 23, 2014

That's what I was afraid of.

CT4ME · Jan 23, 2014

'Sux when that happens... of course, they're backed-up, right? If they've successfully ignored that blinking "I have a dead drive" light for months/years, it's quite likely that nobody has been checking backups either.

PresandCeo · Jan 24, 2014

There are (expensive) services that will physically open a failed hard drive in a clean room and attempt to image data directly from the disk platters with a new read head. There are also recovery services that can attempt to reassemble data from a 2 drive failure in a RAID5, but they are rarely very successful at obtaining useful data.

In a 4 drive RAID5 configuration, every stripe holds 3 chunks of data plus 1 parity chunk (the XOR of the other 3), and the parity chunk rotates across all 4 drives. If any one drive fails, the missing chunk of each stripe can be recomputed from the other 3 drives.

In a 2 drive failure, every stripe is missing two chunks and the math can only solve for one, so nothing can be rebuilt. At best about half of the data, in little bits and pieces, is still in existence on the two surviving drives. Problem is that half of a binary file is generally corrupt garbage.
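For anyone who wants to see why one dead drive is survivable and two are not, here is a minimal Python sketch of a single stripe. It assumes plain XOR parity, which is how RAID 5 is normally described, and made-up 4-byte chunks. The names `xor_chunks` and `rebuild` are just for illustration and are not taken from any real controller.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# One stripe on a hypothetical 4-drive array: three data chunks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_chunks(data)
stripe = data + [parity]


def rebuild(stripe, missing):
    """Recompute the chunk at index `missing` from the other three chunks."""
    survivors = [chunk for i, chunk in enumerate(stripe) if i != missing]
    return xor_chunks(survivors)


# Single failure: any one chunk (data or parity) can be recomputed.
recovered = [rebuild(stripe, i) for i in range(4)]
single_failure_ok = recovered == stripe

# Double failure (drives 0 and 1 gone): XOR of the two survivors only gives
# the XOR of the two missing chunks, not either chunk on its own.
survivors_xor = xor_chunks([stripe[2], stripe[3]])
missing_xor = xor_chunks([stripe[0], stripe[1]])
```

With two chunks gone there is one equation and two unknowns, which is why a recovery shop has to get at least one of the dead drives readable again before the array can be reassembled.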

Sorry. Time to revert to backups or pony up $$$ for a risky recovery attempt.

RJM62 · Jan 24, 2014

I used to deal with a place in CT that was able to do a recovery like that for one of my clients years ago. (It wasn't cheap.)

-Rich

ja_user · Jan 24, 2014

This is fairly common, the "goes undetected" part. Probably the same reason the backup can't be used?

kgruber · Jan 24, 2014

What does it look like when one HD of a RAID goes bad?

John221us · Jan 24, 2014

Yes, it is possible to recover the data. It is VERY expensive. The last time one of my customers had this done, it was in the neighborhood of $10K. They won't guarantee they can recover it, but they will be able to tell if they can do the recovery for a minimal price, before proceeding to the full recovery.

RJM62 · Jan 24, 2014

John221us said:
Yes, it is possible to recover the data. It is VERY expensive. The last time one of my customers had this done, it was in the neighborhood of $10K. They won't guarantee they can recover it, but they will be able to tell if they can do the recovery for a minimal price, before proceeding to the full recovery.

The problem is that at a minimum, I think they'd need to resurrect the more recently failed of the two drives to even be able to make that determination with any degree of certainty.

The nice thing about the place I dealt with was that they didn't charge for the evaluation, and if there was no recovery, there was no fee. I'm sure they ate a few mistakes over the years, but I felt comfortable referring clients to them.

As an aside, I've been preaching for years that RAID is not backup. It's a downtime-prevention tool. I've lost track of how many times RAID controllers have gone bad in spectacular ways, taking their whole arrays down in flames with them.

-Rich

ja_user · Jan 24, 2014

kgruber said:
What does it look like when one HD of a RAID goes bad?

As long as you notice it and swap the drive, you don't lose or have to do anything. The controller will rebuild it based upon the other drives.

Normally there is a light or alarm on the actual PC itself, or software that warns you.

John221us · Jan 25, 2014

RJM62 said:
The problem is that at a minimum, I think they'd need to resurrect the more recently failed of the two drives to even be able to make that determination with any degree of certainty.

The nice thing about the place I dealt with was that they didn't charge for the evaluation, and if there was no recovery, there was no fee. I'm sure they ate a few mistakes over the years, but I felt comfortable referring clients to them.

As an aside, I've been preaching for years that RAID is not backup. It's a downtime-prevention tool. I've lost track of how many times RAID controllers have gone bad in spectacular ways, taking their whole arrays down in flames with them.

-Rich

The recovery we had done was a double drive failure on RAID 5, and the data lived in a virtual drive on a VMFS file system. They can do amazing things. It is just expensive. This recovery place was near Livermore, CA. I could get the name. It was about two years ago.

There is no substitute for a good backup.

LDJones · Jan 25, 2014

kgruber said:
What does it look like when one HD of a RAID goes bad?

My NAS units flash a message and one of them sends me a text message and an email if anything goes wrong inside.

Ghery · Jan 26, 2014

My RAID 5 box was on-site back-up. The power supply quit last April and it hasn't spoken to me since. However, I have another external drive connected to my laptop plus everything is backed up to Carbonite. On-site backup is great, but if you have a fire, guess what? Need off-site, as well.

ejensen · Jan 26, 2014

When I worked with RAID 5 you could tell when a drive was down cause the system slowed to a crawl. Servers nowadays may be faster.

docmirror · Jan 26, 2014

ejensen said:
When I worked with RAID 5 you could tell when a drive was down cause the system slowed to a crawl. Servers nowadays may be faster.

This is because, with a drive down, the RAID engine is recreating the missing data from the XOR engine on each read. In a healthy array, the RAID engine simply uses the parity stripe as a checkpoint, and not for recreation.
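As a rough back-of-the-envelope illustration of that extra work, the Python sketch below counts disk reads per data chunk returned. It is a simplified model of my own, not a description of any particular RAID engine: it assumes parity rotates evenly, exactly one drive is dead, and every chunk that lived on the dead drive is rebuilt by reading the surviving chunks of its stripe. The function name `reads_per_chunk` is hypothetical.

```python
from fractions import Fraction


def reads_per_chunk(n_drives, degraded):
    """Average disk reads needed to return one data chunk.

    Simplified model: parity rotates evenly, so in degraded mode 1/n of the
    data chunks sat on the dead drive, and each of those costs n-1 reads
    (every surviving chunk in the stripe) to XOR back into existence.
    """
    if not degraded:
        return Fraction(1)
    on_dead_drive = Fraction(1, n_drives)
    rebuild_cost = n_drives - 1
    return (1 - on_dead_drive) * 1 + on_dead_drive * rebuild_cost


healthy_reads = reads_per_chunk(4, degraded=False)
degraded_reads = reads_per_chunk(4, degraded=True)
```

The model ignores caching, controller CPU limits, and queueing, so a real box can slow down by a lot more than the read count alone suggests.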

Also, I should have mentioned that there are companies that try to recover data from RAID drives with a dual failure with limited success. As mentioned, they are very expensive per 100MB.

denverpilot · Jan 27, 2014

RJM62 said:
As an aside, I've been preaching for years that RAID is not backup. It's a downtime-prevention tool. I've lost track of how many times RAID controllers have gone bad in spectacular ways, taking their whole arrays down in flames with them.

RAID isn't even that great at lowering downtime if you consider any time with a disk dead as "downtime". Sure, the server is running, but you're paying someone to go swap the dead stuff. If the server being down costs the business more than the overhead of keeping people around to swap things, there's a break-even there somewhere.

There have been some studies and math showing that any one spinning hard disk fails at a roughly fixed rate. So naturally, if you multiply that rate by the number of disks in your RAID, arrays with larger numbers of disks experience disk failures more often overall. This means added run time in the highly risky degraded state.

Without monitoring and someone to immediately swap disks, data loss risk goes up quickly on large multi-disk RAID 5 arrays as you add more disks. A 3 disk RAID 5 with two hot spare disks is mathematically superior to a five disk RAID 5, for example.

Assuming, as Rich pointed out, the failure isn't in the controller itself.
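The arithmetic behind that comparison can be sketched in a few lines of Python. Everything numeric here is an assumption for illustration: the 5% annual failure rate, the 30 days of nobody noticing, and the 12 hour hot-spare rebuild are made-up inputs, and the model treats disks as independent with a constant failure rate, which real drives from the same batch often are not.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def p_second_failure(n_disks, afr, degraded_hours):
    """Chance that another disk dies while the array is already degraded.

    Assumes independent disks with a constant failure rate derived from the
    annual failure rate `afr` (a fraction, e.g. 0.05 for 5%).
    """
    rate_per_hour = -math.log(1 - afr) / HOURS_PER_YEAR
    return 1 - math.exp(-(n_disks - 1) * rate_per_hour * degraded_hours)


def expected_first_failures_per_year(n_disks, afr):
    """Expected number of disk failures per year across the whole array."""
    return n_disks * -math.log(1 - afr)


AFR = 0.05  # hypothetical 5% annual failure rate, not a measured figure

# 5-disk RAID 5 where the dead drive goes unnoticed for 30 days (720 hours)
five_unwatched = p_second_failure(5, AFR, degraded_hours=720)

# 3-disk RAID 5 with hot spares, rebuild assumed to finish in 12 hours
three_with_spare = p_second_failure(3, AFR, degraded_hours=12)
```

With these made-up inputs the unwatched five-disk array is on the order of a hundred times more likely to lose a second disk during any one degraded spell, and it also gets into that state more often because it has more disks to fail in the first place.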

wsuffa · Jan 27, 2014

Oh, and avoid Seagate drives. I saw something recently that revealed that Seagate (still) has the highest failure rate of any. I've had my share of failures with WD, too.

zaitcev · Jan 27, 2014

kgruber said:
What does it look like when one HD of a RAID goes bad?

Both write and read performance nosedive. Other than that, nothing. One must watch out for it with management tools.

zaitcev · Jan 27, 2014

wsuffa said:
Oh, and avoid Seagate drives. I saw something recently that revealed that Seagate (still) has the highest failure rate of any. I've had my share of failures with WD, too.

Next time, post a link yourself

http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/

jsstevens · Jan 27, 2014

zaitcev said:
Both write and read performance nosedive. Other than that, nothing. One must watch out for it with management tools.

From a user's perspective, yes. But the array hardware & software should be screaming bloody murder unless you have a hot spare configured. And even then it should be whining.

And as was said above - it ain't a backup. It is an availability tool. Backups are bought for catastrophic failures (meteor through the machine room). They are used (most often) because "I really wish I hadn't deleted that." RAID will not save you in either case.

John (Who spent 10 years building backup SW and RAID systems.)

jesse · Jan 27, 2014

FWIW, in my years of IT, I've come to the conclusion that RAID card failures are just as common as a hard drive in the array failing. We do everything we can to avoid RAID now and generally just spec out one solid state drive for application servers which has proven to be a more available solution.

docmirror · Jan 27, 2014

Well, empirical evidence, logic, and my experience show that hardware failures in the RAID engine circuitry that would cause a significant loss of data are vanishingly rare when compared to hardware failures in rotating media, which is the common makeup of a RAID system.

Most RAID engines are now redundant, and also have a failsafe in the event of many faults that will flush cache and protect customer data at all cost. As well, the RAID systems I work on regularly have fully redundant pathing, and load balancing as well. The rotating media is by far the most prone to failure.

YMMV

zaitcev · Jan 27, 2014

Some apps can nowadays take advantage of cloudy tech that trickles down in the object store area, such as OpenStack Swift. Its resilience is amazing. But then again, you still need spares all the time, because drives die.

jesse · Jan 27, 2014

zaitcev said:
Some apps can nowadays take advantage of cloudy tech that trickles down in the object store area, such as OpenStack Swift. Its resilience is amazing. But then again, you still need spares all the time, because drives die.

I'm a big AWS fan and user myself.

TMetzinger · Jan 27, 2014

jesse said:
FWIW, in my years of IT, I've come to the conclusion that RAID card failures are just as common as a hard drive in the array failing. We do everything we can to avoid RAID now and generally just spec out one solid state drive for application servers which has proven to be a more available solution.

Huh. I've had bad luck with SSDs.

My Fujitsu SAN has been terrific. But that is an enterprise device with redundant controllers and such. I agree that single controllers are just as much a failure point as good hard disks.

jesse · Jan 27, 2014

TMetzinger said:
Huh. I've had bad luck with SSDs.

My Fujitsu SAN has been terrific. But that is an enterprise device with redundant controllers and such. I agree that single controllers are just as much a failure point as good hard disks.

We've had really, really good luck just using consumer SSDs in servers where we don't care about the data and just want decent IO and good availability. Previously we'd often do RAID because one spinning drive just didn't meet the availability requirements, but year after year we replaced just as many failed RAID cards as failed drives.

We've made the software in our technology smarter over the years, to the point where RAID on the servers doesn't make sense. Our largest storage servers are MogileFS https://code.google.com/p/mogilefs/ clusters just full of typical drives, no RAID, and they work beautifully. I'll take building smarter software versus enterprise hardware any day.

RJM62 · Jan 27, 2014

jesse said:
FWIW, in my years of IT, I've come to the conclusion that RAID card failures are just as common as a hard drive in the array failing. We do everything we can to avoid RAID now and generally just spec out one solid state drive for application servers which has proven to be a more available solution.

Just as common, and often far more devastating when it happens. RAID controllers are fickle and vindictive. They often die in a blaze of glory and take their whole arrays down with them, sometimes in ways that defy recovery.

I used to preach this until I was blue in the face to clients I picked up from another consultant who went out of business, but a few of them still refused to do good backup, having been sold a RAID bill of goods. They thought I was trying to upsell them. Until the controllers failed and the arrays were hosed, and they found out how much recovering a hosed RAID array costs.

On many graves the headstones glisten, of those who heard but wouldn't listen.

-Rich

denverpilot · Jan 28, 2014

jesse said:
I'll take building smarter software versus enterprise hardware any day.

That really depends on what you're doing. In your experience, commodity hardware is the right call.

Building a Public Safety Radio dispatch system? Maybe the SuperMicro isn't the right box for the job. Motorola vs ______? Who's going to ship ten engineers to your doorstep when something goes wrong and lives are literally on the line?

Layers 8 & 9 of the OSI model: Religion & Politics.

Some shops have to buy certain brands because their buddy on the Board of Directors sells that brand, too. Heh. Or it's a large customer. Stupid, but it is life in IT.

I'll give you one guess why the above conversation didn't include Cisco as an option. And it starts with AT & ends with T.

Big companies get into bed with other big companies, right or wrong, they do it. I've been fortunate enough to see both worlds, commodity hardware cranking out piles of cash, and name brand hardware cranking out piles of cash but only because if you stopped using it you'd lose your sweetheart deal with the customer who wants you to use the name brand junk.

Engineers can ride in the back of the bus when the CxO's golfing buddy sells brand X and the Engineers want Brand Y. Or the CxO sees his bonus disappearing into a higher than expected capital cost for Brand Y.

Enjoy being private while you can. Public companies waste incredible amounts of money on stupid crap that lowers their bottom line. I'll just toss the wrong-headed ideas about what Sarbanes-Oxley requires out there as one solid example. SOX compliance audits are about one half a step higher than "utterly retarded" at most companies, because they listen to consultants who get paid by the hour to implement it.

jesse · Jan 28, 2014

denverpilot said:
That really depends on what you're doing. In your experience, commodity hardware is the right call.

Building a Public Safety Radio dispatch system? Maybe the SuperMicro isn't the right box for the job. Motorola vs ______? Who's going to ship ten engineers to your doorstep when something goes wrong and lives are literally on the line?

Layers 8 & 9 of the OSI model: Religion & Politics.

Some shops have to buy certain brands because their buddy on the Board of Directors sells that brand, too. Heh. Or it's a large customer. Stupid, but it is life in IT.

I'll give you one guess why the above conversation didn't include Cisco as an option. And it starts with AT & ends with T.

Big companies get into bed with other big companies, right or wrong, they do it. I've been fortunate enough to see both worlds, commodity hardware cranking out piles of cash, and name brand hardware cranking out piles of cash but only because if you stopped using it you'd lose your sweetheart deal with the customer who wants you to use the name brand junk.

Engineers can ride in the back of the bus when the CxO's golfing buddy sells brand X and the Engineers want Brand Y. Or the CxO sees his bonus disappearing into a higher than expected capital cost for Brand Y.

Enjoy being private while you can. Public companies waste incredible amounts of money on stupid crap that lowers their bottom line. I'll just toss the wrong-headed ideas about what Sarbanes-Oxley requires out there as one solid example. SOX compliance audits are about one half a step higher than "utterly retarded" at most companies, because they listen to consultants who get paid by the hour to implement it.

I have no interest in that side of the industry. I'd just move on to somewhere else where you can actually build cool **** and have fun

denverpilot · Jan 28, 2014

jesse said:
I have no interest in that side of the industry. I'd just move on to somewhere else where you can actually build cool **** and have fun

The golden handcuffs can be strong. Ha. People willing to pay for name brand stuff are often willing to pay pretty handily for people to work on their name brand stuff. Heh.

See: BMW dealership.

Of course, both the Pugh wrenching on the Ford and the guy wrenching on the BMW both know the crap is going to break...

Which brings us back to RAID. Ha.

Hope the poor guy figured out how badly he needed his data and how much his mistake of not having backups cost, so he won't do that again. Losing data sucks, but it's infinitely avoidable. Hell, I'd back up to an $80 3TB single USB drive before I'd go without backups.

No backup and lost data? Been there, done that, have the t-shirt. You know, the one that says, "Here's your sign."

denverpilot · Jan 28, 2014

iPad changed "dude" to "Pugh"? Really autocorrect? Strange.

SCCutler · Jan 28, 2014

In my 15 years of law-firming, I had one hard drive fail with a buttload of non-backed-up data; shut it down, allowed it to cool, booted it back up and copied all from it before it lunched again. Lesson (kind of) learned.

Bought a Compaq DL380 server, RAID5, and placed it under maintenance. One drive failed one day, they overnighted me another one and it was fully built back within a couple of hours of being plugged in. I sweated while waiting, but it never slowed. If I recall correctly, that CPQ had redundant power supplies, too. Nice box, running Linux, and the only time it was ever shut down was when we wanted it to be. But we still weren't properly backing up.

Now, we have three servers, all RAID, and all backed up every night, off-site and automagic. Much better feeling. And, when someone deletes something they should not have, it's a phone call away.

docmirror · Jan 28, 2014

Heh, you guys are working on some small scale retail stuff from Fry's. Fuggeddabouttit. Chump change for busted RAID gear.

Last week I was working on the storage network for a trading house that executes 1000 trades a second or more. Also working on a system that supports all the power grid for seven southern states. One of my biggest accounts is a well known entertainment production company that you see every day, all around the world.

We build and maintain stuff for every bit that routes around the internet. Five nines reliability is far too low for this. Yes, most of it is stored on RAID of some kind, but we don't shop at Fry's.

wsuffa · Jan 28, 2014

denverpilot said:
That really depends on what you're doing. In your experience, commodity hardware is the right call.

Building a Public Safety Radio dispatch system? Maybe the SuperMicro isn't the right box for the job. Motorola vs ______? Who's going to ship ten engineers to your doorstep when something goes wrong and lives are literally on the line?

Layers 8 & 9 of the OSI model: Religion & Politics.

Some shops have to buy certain brands because their buddy on the Board of Directors sells that brand, too. Heh. Or it's a large customer. Stupid, but it is life in IT.

I'll give you one guess why the above conversation didn't include Cisco as an option. And it starts with AT & ends with T.

Big companies get into bed with other big companies, right or wrong, they do it. I've been fortunate enough to see both worlds, commodity hardware cranking out piles of cash, and name brand hardware cranking out piles of cash but only because if you stopped using it you'd lose your sweetheart deal with the customer who wants you to use the name brand junk.

Engineers can ride in the back of the bus when the CxO's golfing buddy sells brand X and the Engineers want Brand Y. Or the CxO sees his bonus disappearing into a higher than expected capital cost for Brand Y.

Enjoy being private while you can. Public companies waste incredible amounts of money on stupid crap that lowers their bottom line. I'll just toss the wrong-headed ideas about what Sarbanes-Oxley requires out there as one solid example. SOX compliance audits are about one half a step higher than "utterly retarded" at most companies, because they listen to consultants who get paid by the hour to implement it.

"Nobody got fired for buying IBM".

Isn't that how it goes?

kgruber · Jan 28, 2014

Okay.....Sorry about this nube question.

For my scanning project I AM using a 10 yr old Fry's XP computer with RAID. I back up a day's scanning on a 1.5 TB external HD I just bought at Costco.

So if I go ....My computer....C....properties and it says 360 gigs total. Does that mean I have two hard drives of 360 Gig each? And one just a copy of the other? And, if one goes out during the day I can still do a good copy to my external HD?

I just scanned a photo of Chief White Calf sitting on his horse.

The actual scan is huge, and there are stories within stories in these negatives.

RJM62 · Jan 28, 2014

kgruber said:
Okay.....Sorry about this nube question.

For my scanning project I AM using a 10 yr old Fry's XP computer with RAID. I back up a day's scanning on a 1.5 TB external HD I just bought at Costco.

So if I go ....My computer....C....properties and it says 360 gigs total. Does that mean I have two hard drives of 360 Gig each? And one just a copy of the other? And, if one goes out during the day I can still do a good copy to my external HD?

I just scanned a photo of Chief White Calf sitting on his horse.

The actual scan is huge, and there are stories within stories in these negatives.

It depends what kind of RAID. There's striping (simply stated, splitting the data between drives), which improves I/O but provides no redundancy; there's mirroring, which provides redundancy; and there are combinations of the two (none of which you have if there are only two physical drives).
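To put numbers on that, here is a small Python sketch of usable capacity for a few common RAID levels. It assumes equal-size drives and ignores formatting overhead and the GB versus GiB difference, so treat the results as approximate. The function name `usable_gb` is made up for this example.

```python
def usable_gb(level, drives, drive_gb):
    """Approximate usable capacity for common RAID levels, equal-size drives."""
    if level == 0:   # striping: all space usable, no redundancy
        return drives * drive_gb
    if level == 1:   # mirroring: one drive's worth of space
        return drive_gb
    if level == 5:   # one drive's worth of space goes to parity
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb
    if level == 10:  # striped mirrors: half the raw space
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return drives // 2 * drive_gb
    raise ValueError(f"unsupported RAID level: {level}")


# Two different 2-drive setups that would both show a 360 GB C: drive
mirror = usable_gb(1, drives=2, drive_gb=360)   # two 360 GB drives, mirrored
stripe = usable_gb(0, drives=2, drive_gb=180)   # two 180 GB drives, striped

# The 4 x 300 GB RAID 5 from the start of the thread
original_array = usable_gb(5, drives=4, drive_gb=300)
```

Since a mirror and a stripe can report the same total, the size shown in Properties cannot tell you which one you have. The RAID setup utility at boot or the controller's management software is where that would show up.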

And no matter what kind of RAID you have, there's the RAID controller, which is the SPOF that can bring the whole array down in flames.

You should, at a minimum, back up all your data to the external drive, not just one day's scans. Better still, do regular drive imaging and online backup, in addition to local data backups.

-Rich
