Re: FYI: RAID5 unusably unstable through 2.6.14

Phillip Susi wrote:

Your understanding of statistics leaves something to be desired. As you add disks, the probability of a single failure grows linearly, but the probability of double failure grows much more slowly. For example:
If 1 disk has a 1/1000 chance of failure, then
2 disks have a (1/1000)^2 chance of double failure, and
3 disks have a (1/1000)^2 * 3 chance of double failure
4 disks have a (1/1000)^2 * 7 chance of double failure

After the first drive fails you have no redundancy; the chance of an additional failure is linear in the number of remaining drives.

Assume:
  p - probability of a drive failing in unit time
  n - number of drives
  F - probability of double failure

The chance of a single drive failure is n*p. After that you have a new "independent trial" for the failure of any one of the remaining n-1 drives, so the chance of a double drive failure is actually:

  F = (n*p) * (n-1)*p
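
To put numbers on that, here is a minimal sketch in Python. The function name is mine, and the formula is only an approximation that holds while n*p is small; p = 1/1000 is the figure from the original post.

```python
def double_failure_probability(n: int, p: float) -> float:
    """Approximate F = (n*p) * ((n-1)*p).

    n*p is the chance that any one of n drives fails; (n-1)*p is the
    chance that one of the survivors then fails. Only meaningful while
    n*p is small, since these are linear approximations.
    """
    if n < 2:
        return 0.0  # a double failure needs at least two drives
    return (n * p) * ((n - 1) * p)


if __name__ == "__main__":
    p = 1 / 1000  # per-drive failure probability, per the original post
    for n in (2, 3, 4, 8):
        print(f"n={n}: F ~ {double_failure_probability(n, p):.2e}")
```

Note that F grows roughly as n squared, while the single-failure chance n*p grows only linearly.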

But wait, there's more:
  p - chance of a drive failing in unit time
  n - number of drives
  R - the time to rebuild to a hot spare in the same units as p
  F - probability of double failure

So:

  F = n*p * (n-1)*(R * p)
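
The same formula as a rough Python sketch, so the effect of R is easy to see. The function name and the sample values of R are my own illustrative assumptions, not measurements; R is expressed as a fraction of the unit time used for p.

```python
def double_failure_with_rebuild(n: int, p: float, r: float) -> float:
    """Approximate F = n*p * (n-1)*(r*p).

    p is the per-drive failure probability in unit time, and r is the
    rebuild time in the same units, so r*p is the chance that a given
    surviving drive fails during the rebuild window.
    """
    if n < 2:
        return 0.0  # a double failure needs at least two drives
    return (n * p) * (n - 1) * (r * p)


if __name__ == "__main__":
    n, p = 4, 1 / 1000  # 4 drives, p per the original post
    # Hypothetical rebuild times, as fractions of the unit time.
    for r in (1.0, 0.5, 0.1, 0.01):
        f = double_failure_with_rebuild(n, p, r)
        print(f"R={r}: F ~ {f:.2e}")
```

With R = 1 this reduces to the earlier formula; halving the rebuild time halves F.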

If you rebuild a track at a time, each track takes the time to read the slowest drive plus the time to write the spare. If the array remains in use, the load increases those times.

And the ugly part is that p is changing all the time: there's infant mortality on new drives, a fairly constant probability of electronic failure, and an increasing probability of mechanical failure over time. If all of your drives are the same age they are less reliable than mixed-age drives.

Thus the probability of double failure on this 4 drive array is ~142 times less than the odds of a single drive failing. As the probability of a single drive failing becomes more remote, the ratio of that probability to the probability of double fault in the array grows exponentially.
( I think I did that right in my head... will check on a real calculator later )
This is why raid-5 was created: because the array has a much lower probability of double failure, and thus data loss, than a single drive. Then of course, if you are really paranoid, you can go with raid-6 ;)

If you're paranoid you mirror over two RAID-5 arrays. The mirrors are on independent controllers. RAID-10.

Michael Loftis wrote:
Absolutely not. The more spindles, the more chances of a double failure. Simple statistics will mean that unless you have mirrors, the more drives you add the more chance of two of them (really) failing at once and choking the whole system.
That said, there very well could be (are?) cases where md needs to do a better job of handling the world unravelling.

A small graph of the effect of the rebuild time on RAID-5 is attached. It assumes a probability of failure of 1/1000, per the original post, and shows how the probability of failure drops for various rebuild times.

--
   -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
 last possible moment - but no longer"  -me

