RAID 1 and you.

wbarnhill

So I get into work today and the PowerEdge 1650 we're building the Cisco Secure Access Control System on is happily rebooting itself over and over. I KVM over to the screen and it blue screens while loading 2003, then restarts. I start watching the boot sequence a little closer and see:

The following containers have been degraded:

1) CSACS Mirrored 36.1GB Status: Critical

OK, no big deal, so a drive died. RAID 1 should be able to boot off the other drive. But it still blue screens and reboots. Hrm. I hit F8 to disable automatic restart on failure and see the blue screen in all its glory:

UNMOUNTABLE BOOT VOLUME

WHAT? I start looking for ways to remove the drive from the RAID without killing the data, but can't find anything. I try safe mode and some other options; still nothing. Finally I go looking through the disk utilities on the RAID controller and find a "Verify Disk Media" command. I run that, it finds a sector that needs to be remapped, the box restarts, chkdsk runs, and Windows boots. Woot!

Now the question is... why the heck did I have to go through all that trouble? I thought the whole purpose of RAID 1 was to prevent downtime like that. The only thing I could figure is that the dying hard drive spewed crap onto the controller before it died, which affected the other drive. Any other ideas?
 
Perhaps the controller (or, if in software, the OS) couldn't distinguish which drive was bad. In which case it went "Pbbbtttt".

If this system boots from one drive or the other, and you lost sectors on the boot drive, that can result in a similar event.

In the end, it appears you didn't lose data, so all is well.

I assume you'll be replacing the drive that had to be remapped.
 
wsuffa said:
Perhaps the controller (or, if in software, the OS) couldn't distinguish which drive was bad. In which case it went "Pbbbtttt".

If this system boots from one drive or the other, and you lost sectors on the boot drive, that can result in a similar event.

In the end, it appears you didn't lose data, so all is well.

I assume you'll be replacing the drive that had to be remapped.

It's a hardware controller... but it knew which drive was gone, because it listed it as "missing". Removed it from the bay (booting off the one good drive) and still the same thing. We're going to order a new drive to rebuild the mirror, but it doesn't appear that any damage was done... But it's something to remember if we have the same problem in the future.
 
wbarnhill said:
It's a hardware controller... but it knew which drive was gone, because it listed it as "missing". Removed it from the bay (booting off the one good drive) and still the same thing. We're going to order a new drive to rebuild the mirror, but it doesn't appear that any damage was done... But it's something to remember if we have the same problem in the future.

"Unmountable Boot Volume" sounds like a clue that it's trying to boot from the bad drive. There's probably a way around it, but my guess is that the booter points to the drive that was bad, then loads the RAID drivers from there. Can't get the RAID drivers, can't boot into RAID.

Might look at the controller docs and see what needs to be done to boot from the other drive... might be as simple as changing the drive address (you'll probably have to figure this out when you get the replacement drive).
 
wsuffa said:
"Unmountable Boot Volume" sounds like a clue that it's trying to boot from the bad drive. There's probably a way around it, but my guess is that the booter points to the drive that was bad, then loads the RAID drivers from there. Can't get the RAID drivers, can't boot into RAID.

Might look at the controller docs and see what needs to be done to boot from the other drive... might be as simple as changing the drive address (you'll probably have to figure this out when you get the replacement drive).

Yeah, I'm looking over exactly what Verify Disk Media does (looks for bad blocks and remaps them) and wondering if the second drive is really dead. Still no clue exactly what caused all this mess, but the server had been sitting in the server room turned off for a few months before we decided to assign it a new task. Boss says to just get another drive and if the good one dies, we'll swap out again. We'll see what happens tho. :/

Back to configuring switches :D
 
wsuffa said:
"Unmountable Boot Volume" sounds like a clue that it's trying to boot from the bad drive. There's probably a way around it, but my guess is that the booter points to the drive that was bad, then loads the RAID drivers from there. Can't get the RAID drivers, can't boot into RAID.
If that were the case, RAID controllers would never work, since they would require the drivers to see the drives...and have to get the drivers from the drives they were trying to see!

RAID controllers typically have an onboard BIOS for booting and controlling purposes. You can upgrade the BIOS and even access it (varies per controller, but Ctrl-D sometimes works) for array-building purposes.

I suspect that it was simply a bad block in a critical position (in the MBR most likely). It happens and is fairly common. A good PM plan should have a BIOS level integrity check of drives every quarter or six months, depending on how much data is being moved around.
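
For the OS-visible side of that PM plan, even a simple scheduled SMART health poll will catch a lot of drives before they get this far. A minimal sketch, assuming a Linux box with smartmontools installed and drives the OS can see directly (drives hidden behind a hardware RAID card usually need the controller-specific passthrough options, which vary per controller); the device names are just examples:

#!/usr/bin/env python3
# Minimal SMART health poll -- a sketch to hang off cron, not a full monitor.
# Assumes smartmontools is installed; adjust DEVICES for the box in question.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # example device names

def smart_health(dev):
    # "smartctl -H" prints the drive's overall health self-assessment.
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    healthy = "PASSED" in result.stdout or "OK" in result.stdout
    return healthy, result.stdout.strip()

if __name__ == "__main__":
    for dev in DEVICES:
        ok, report = smart_health(dev)
        status = "OK" if ok else "CHECK ME"
        print(f"{dev}: {status}")
        if not ok:
            print(report)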
 
What brand of controller is this? I've never heard of such an issue with our Compaq/HP Smart Array cards.

It is a fatal flaw. I would look to get a new controller card and back up everything to a new array.

The one I have at home is an NVidia system supported by the motherboard that only works in Windows, I guess, because it needs Windows drivers for the smarts to manage the RAID. I suppose they have enough in ROM to get Windows booted. I wouldn't bet a business on that system.
 
mikea said:
What brand of controller is this? I've never heard of such an issue with our Compaq/HP Smart Array cards.

It is a fatal flaw. I would look to get a new controller card and back up everything to a new array.

That's something I hadn't considered. Aren't RAID/SCSI controllers supposed to map bad blocks on the fly via CRC or something?
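
At least, that's my mental model of it: the controller (or really the drive's own ECC) notices a read that doesn't check out, serves the data from the other half of the mirror, and rewrites the bad copy, which is what lets the drive remap the sector to a spare. A toy sketch of that idea in Python, purely for illustration (made-up data structures, nothing to do with the PERC's actual firmware):

import zlib

# Toy two-way mirror with per-sector checksums. "Disks" are just dicts
# mapping sector number -> (data, stored_crc). Real controllers lean on
# the drive's own ECC rather than keeping their own CRCs, but the
# fallback-and-repair idea is the same.

def write_sector(disks, sector, data):
    crc = zlib.crc32(data)
    for disk in disks:
        disk[sector] = (data, crc)

def read_sector(disks, sector):
    # Try each copy; on a CRC miscompare, fall back to the other copy and
    # repair the bad one (a real drive remaps the sector when it's rewritten).
    for disk in disks:
        data, stored_crc = disk.get(sector, (None, None))
        if data is not None and zlib.crc32(data) == stored_crc:
            for other in disks:
                if other is not disk:
                    other[sector] = (data, stored_crc)
            return data
    raise IOError(f"sector {sector}: no good copy on either disk")

if __name__ == "__main__":
    disks = [{}, {}]
    write_sector(disks, 7, b"hello raid 1")
    # Simulate silent corruption of the first copy.
    disks[0][7] = (b"hexlo raid 1", disks[0][7][1])
    print(read_sector(disks, 7))  # served from the good copy; bad copy repaired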
 
Brian Austin said:
I suspect that it was simply a bad block in a critical position (in the MBR most likely). It happens and is fairly common. A good PM plan should have a BIOS level integrity check of drives every quarter or six months, depending on how much data is being moved around.

Just odd that the bad block occurred at the same time the second drive in the array went kaput. :dunno:
 
mikea said:
What brand of controller is this? I've never heard of such an issue with our Compaq/HP Smart Array cards.

PERC3/Di (PowerEdge Expandable RAID Controller)
 
wbarnhill said:
Just odd that the bad block occurred at the same time the second drive in the array went kaput. :dunno:

Unless it's not checking it correctly to begin with. Might want to run an OS level integrity check, too.
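
On the Windows side that check can be as light as a read-only chkdsk pass now and then, with the output eyeballed for trouble. A rough sketch, assuming Python is on the box and using C: as an example volume:

import subprocess

# Read-only chkdsk pass (no /F), so nothing on the volume gets changed.
VOLUME = "C:"  # example volume

result = subprocess.run(["chkdsk", VOLUME], capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    # A nonzero exit generally means chkdsk found something; time for a
    # real /F pass and a hard look at the underlying drives.
    print(f"chkdsk exited with code {result.returncode} -- investigate {VOLUME}")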
 
I'm thinking of buying a RAID array for RAID 5.
You say Compaq/HP Smart Array is good? Are they linux-friendly?

--Kath
 
Brian Austin said:
If that were the case, RAID controllers would never work, since they would require the drivers to see the drives...and have to get the drivers from the drives they were trying to see!

RAID controllers typically have an onboard BIOS for booting and controlling purposes. You can upgrade the BIOS and even access it (varies per controller, but Ctrl-D sometimes works) for array-building purposes.

I suspect that it was simply a bad block in a critical position (in the MBR most likely). It happens and is fairly common. A good PM plan should have a BIOS level integrity check of drives every quarter or six months, depending on how much data is being moved around.

I know how it's SUPPOSED to work, but I've also seen stranger things happen in real life ('specially with Compaq, which doesn't always conform to the rest of the world). I've also seen configuration errors in the BIOS trigger this sort of thing before. Had it happen on a SCSI server...

I agree that it "shouldn't" happen, but obviously something did.
 
kath said:
I'm thinking of buying a RAID array for RAID 5.
You say Compaq/HP Smart Array is good? Are they linux-friendly?

--Kath

All of the hardware-based controllers are good and shouldn't care what OS you put on them, because they'll look like a single drive to the OS. You do want to check for driver support for your OS, though, because with that the controller can signal the OS to tell you you've lost a disk, and maybe let you hot-swap it and rebuild the array while the OS keeps chugging. That's the way it's supposed to work.
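
And to the Linux question specifically: because the array is assembled in the controller's firmware, the OS only ever sees one logical disk, so a mirrored pair of 36 GB drives shows up as a single ~36 GB device. A quick way to see that on a Linux box (just walks /sys/block, which lists sizes in 512-byte sectors):

import os

# List the block devices the kernel exposes and their sizes. A hardware
# RAID 1 pair shows up here as ONE device -- the mirroring is invisible
# to the OS.
SYS_BLOCK = "/sys/block"

for name in sorted(os.listdir(SYS_BLOCK)):
    try:
        with open(os.path.join(SYS_BLOCK, name, "size")) as f:
            sectors = int(f.read().strip())
    except (OSError, ValueError):
        continue
    # The "size" file is in 512-byte sectors regardless of the drive's
    # physical sector size.
    print(f"{name}: {sectors * 512 / 1e9:.1f} GB")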

The HP Smart Array controller comes with HP servers, either built into the motherboard or as an option. They have support for most major OSs, including Linux (Red Hat, for sure).
A good, inexpensive server is the DL380:
http://h10010.www1.hp.com/wwpc/pscmisc/vac/us/en/ss/proliant/dl380g4-models.html
With 6 drives you can make one a hot spare that will be used as a replacement for a bad one with no human intervention.

I think the name brand controllers like Adaptec and Promise are good. They're just not priced for mere civilians, which is why I don't have one at home.
 
kath said:
I'm thinking of buying a RAID array for RAID 5.
You say Compaq/HP Smart Array is good? Are they linux-friendly?

--Kath

I've found the HP Smart controllers to be good and OS-agnostic. I've never had one fail to boot. That said, it's a good idea to do PM quarterly, to check integrity. Usually I will validate the status of the array using the OS-based utilities, then reboot and run the controller-hardware utilities.
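
The OS-based pass doesn't have to be fancy, either; between the real utility runs, even a scrape of the kernel log for disk errors will flag a drive that's started throwing read errors. A rough sketch of that kind of spot check (Linux; the error strings are just examples, not an exhaustive list):

import re
import subprocess

# Crude OS-level spot check: scan the kernel ring buffer for disk errors.
ERROR_PATTERNS = [
    r"I/O error",
    r"Medium Error",
    r"Unrecovered read error",
]

def scan_kernel_log():
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    pattern = re.compile("|".join(ERROR_PATTERNS), re.IGNORECASE)
    return [line for line in log.splitlines() if pattern.search(line)]

if __name__ == "__main__":
    hits = scan_kernel_log()
    if hits:
        print(f"{len(hits)} suspicious kernel log lines -- time for the vendor utilities:")
        for line in hits[-10:]:
            print(" ", line)
    else:
        print("No disk errors in the kernel log (still run the controller tools on schedule).")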
 
I have a Compaq DL380 (first gen) server with RAID 5, running Red Hat Linux. In five years or so, I have never-but-never had the system glitch or fail, never had to reboot, and the only shut-downs were for planned events (once to move the server rack to a new office, once because building power was going to be down for a day).

Had one of the SCSI drives fail (looked at it, saw blinking red light), popped it out, popped in a new one and, just like they promised, it started rebuilding itself.
 
SCCutler said:
I have a Compaq DL380 (first gen) server with RAID 5, running Red Hat Linux. In five years or so, I have never-but-never had the system glitch or fail, never had to reboot, and the only shut-downs were for planned events (once to move the server rack to a new office, once because building power was going to be down for a day).

Had one of the SCSI drives fail (looked at it, saw blinking red light), popped it out, popped in a new one and, just like they promised, it started rebuilding itself.

Still don't know if the drive we pulled out is dead for sure, but we popped in another 36 gig and it's rebuilding now. Probably just a freak incident :)
 