## RAID1 increases chances of disk failure

**fact**

When I was first told this, my initial response was “*rubbish*”: there was no way that using RAID1 (mirroring) could increase the chance of disk failure, was there?

Then I actually thought about the statement again, and **I was wrong**!

Consider 2 hard disks **A** and **B**. The make/model/manufacturer is irrelevant for this *proof*.

Given that **P(e)** is the *probability of event e occurring*, we can state that:

P(disk **A** fails) = **a**

P(disk **B** fails) = **b**

Trivially, in a single-disk system using either disk **A** or disk **B**, the probability of disk failure will be **a** or **b**; for ease of demonstration, we shall take the average of the two.

Therefore, the average chance of a single disk machine suffering disk failure is **(a+b)/2**.

Now, consider the RAID1 system containing disks **A** and **B**.

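The chance of a disk failing is the probability of disk **A** failing *or* disk **B** failing. Assuming the two disks fail independently:

P(disk **A** fails or disk **B** fails) = P(disk **A** fails) + P(disk **B** fails) - P(both fail) = **a** + **b** - **a * b**.

For the small probabilities typical of disks, the **a * b** term is negligible, so this is approximately **a** + **b**. In other words, the chance of disk failure in a RAID1 system roughly *doubles* compared with the single-disk average of **(a+b)/2**. Hardly surprising if you stop and think about it.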
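As a quick numeric check of the reasoning above, here is a minimal Python sketch. It assumes the two disks fail independently, and the failure probabilities (0.03 and 0.05) are made up purely for illustration, as are the function names. It uses the exact form a + b - a*b for "A or B fails", which is very close to a + b when both probabilities are small.

```python
def p_single_average(a: float, b: float) -> float:
    """Average failure probability of a single-disk machine using A or B."""
    return (a + b) / 2


def p_any_failure(a: float, b: float) -> float:
    """Probability that at least one of two independent disks fails."""
    # Inclusion-exclusion; the same value as 1 - (1 - a) * (1 - b).
    return a + b - a * b


# Hypothetical failure probabilities, chosen only for illustration.
a, b = 0.03, 0.05

single = p_single_average(a, b)
mirrored = p_any_failure(a, b)

print(f"single disk (average): {single:.4f}")
print(f"RAID1, any disk fails: {mirrored:.4f}")
print(f"ratio:                 {mirrored / single:.2f}")
```

With these assumed numbers the ratio comes out just under 2, which is what "roughly doubles" means here.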

Now, the statement I had initially inferred was actually: *RAID1 increases chances of data loss*, which is obviously rubbish, as can easily be shown.

We know that the chance of disk failure (and in this case *data loss*) with a single disk is **(a+b)/2**.

Now, using RAID1, the chance of data loss is defined as:

*The probability of both disk A and disk B failing.*

This is defined as:

P(disk **A** failing and disk **B** failing) = P(disk **A** failing) * P(disk **B** failing) = **a * b**, assuming the two failures are independent.

It can be stated that any probability must be in the range 0 <= P(**e**) <= 1.

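For any two numbers **i**, **j** satisfying 0 <= **i** <= 1 and 0 <= **j** <= 1:

**i * j <= i**; **i * j <= j**

with strict inequality whenever both lie strictly between 0 and 1. Since **a * b** is therefore no larger than the smaller of **a** and **b**, it is also no larger than their average **(a+b)/2**.

Therefore, the probability of *data loss* is lower when using RAID1; however, the chance of *disk failure* roughly doubles!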
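The data-loss comparison can be checked the same way. The sketch below is only an illustration under the same assumptions as the argument above (independent failures, data lost only when both mirrored disks fail); the function names and sample probabilities are hypothetical.

```python
def p_loss_single(a: float, b: float) -> float:
    """Data-loss probability of a single-disk machine, averaged over A and B."""
    return (a + b) / 2


def p_loss_raid1(a: float, b: float) -> float:
    """Data-loss probability of a two-disk mirror: both disks must fail."""
    return a * b


# Hypothetical failure probabilities, chosen only for illustration.
a, b = 0.03, 0.05
print(f"single disk: {p_loss_single(a, b):.4f}")
print(f"RAID1:       {p_loss_raid1(a, b):.4f}")

# Sweep a coarse grid over [0, 1] x [0, 1] to confirm a * b never
# exceeds (a + b) / 2 for any pair of valid probabilities.
grid = [k / 20 for k in range(21)]
never_worse = all(
    p_loss_raid1(x, y) <= p_loss_single(x, y) for x in grid for y in grid
)
print(f"RAID1 never worse on the grid: {never_worse}")
```

The grid sweep is not a proof, just a sanity check of the inequality argued above.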

May 23rd, 2008 at 10:36 pm

Actually, Andy, I think there are several misunderstandings here!

My comparison was between a single disk and a RAID 1 system.

The data loss probability for a single disk is the same as the rate of failure – call it P(disk).

The data loss probability for a RAID 1 system is the sum of the failure probabilities of both disks, and the probabilities of one disk failing and the RAID controller failing to reconcile, plus the probability of the RAID controller failing catastrophically. That’s P(disk) ^ 2 + 2 * P(disk) + 2 * P(controller, reconcile) + P(controller, catastrophic). And that’s a very primitive model. You’d also have to factor in the false sense of security RAID creates.

It’s also worth pointing out that I was talking specifically about consumer level systems, not professional, though I did not make that explicit.

So, we’ve established that the chance of disk failure is greater with RAID 1, and that the chance of data loss is higher if P(disk) ^ 2 + 2 * P(disk) + 2 * P(controller, reconcile) + P(controller, catastrophic) > P(disk), which I can’t be bothered calculating since I am drunk.

May 23rd, 2008 at 10:37 pm

Math error. The long equation should be:

P(disk) ^ 2 + 2 * P(disk) * P(controller, reconcile) + P(controller, catastrophic)
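For anyone who wants to plug numbers into that corrected expression, here is a rough Python sketch. Every probability in it is invented for illustration, the function name is hypothetical, and the expression itself is a simple sum of terms as written above rather than a rigorous probability model.

```python
def raid1_loss_simple_model(
    p_disk: float, p_reconcile: float, p_catastrophic: float
) -> float:
    """Evaluate P(disk)^2 + 2 * P(disk) * P(reconcile) + P(catastrophic)."""
    return p_disk ** 2 + 2 * p_disk * p_reconcile + p_catastrophic


# All values below are assumptions for illustration, not measured rates.
p_disk = 0.04
scenarios = {
    "reliable controller": (0.01, 0.001),
    "flaky controller": (0.2, 0.03),
}

for name, (p_reconcile, p_catastrophic) in scenarios.items():
    raid = raid1_loss_simple_model(p_disk, p_reconcile, p_catastrophic)
    worse = raid > p_disk
    print(f"{name}: RAID1 {raid:.4f} vs single {p_disk:.4f}, worse: {worse}")
```

Under this model, whether RAID 1 comes out ahead depends almost entirely on the controller terms, since P(disk) ^ 2 is tiny compared with P(disk).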