Re: limits on raid — Linux Kernel

Neil Brown wrote:

On Friday June 15, [email protected] wrote:

                                                  As I understand the way
raid works, when you write a block to the array, it will have to read all
the other blocks in the stripe and recalculate the parity and write it out.


Your understanding is incomplete.


Does this help?
[for future reference so you can paste a url and save the typing for code :) ]

http://linux-raid.osdl.org/index.php/Initial_Array_Creation

David



Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity is what's called the "initial resync".

The kernel takes one (or two for raid6) disks and marks them as 'spare'; it then creates the array in degraded mode. It then marks the spare disks as 'rebuilding' and starts to read from the 'good' disks, calculates the parity, determines what should be on any spare disks and then writes it. Once all this is done the array is clean and all disks are active.

This can take quite a time and the array is not fully resilient whilst this is happening (it is however fully usable).
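As a rough illustration of what the rebuild step computes, here is a minimal Python sketch. It is not the kernel's code: the disks are modelled as in-memory lists of byte blocks, and the name rebuild_spare is made up for this example. The point it shows is that the block written to the rebuilding disk is the XOR of the corresponding blocks on all the 'good' disks, whether that block holds data or parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_spare(good_disks):
    """Return the blocks to write to the rebuilding disk, one per stripe.

    good_disks is a list of disks; each disk is a list of blocks (hypothetical
    in-memory model). For every stripe, the missing block is the XOR of the
    blocks at the same position on all the good disks.
    """
    stripes = len(good_disks[0])
    return [xor_blocks([disk[s] for disk in good_disks]) for s in range(stripes)]


# Two good disks, two stripes each.
good = [
    [b"\x01\x02", b"\x10\x20"],
    [b"\x04\x08", b"\x40\x80"],
]
spare = rebuild_spare(good)
```

Once the spare has been filled this way, every stripe XORs to zero across all disks, which is what "clean" means for single-parity levels.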


--assume-clean

Some people have noticed the --assume-clean option in mdadm and speculated that this can be used to skip the initial resync. Which it does. But this is a bad idea in some cases - and a *very* bad idea in others.


raid5

For raid5 especially it is NOT safe to skip the initial sync. The raid5 implementation optimises use of the component disks and it is possible for all updates to be "read-modify-write" updates which assume the parity is correct. If it is wrong, it stays wrong. Then when you lose a drive, the parity blocks are wrong so the data you recover using them is wrong. In other words - you will get data corruption.

For raid5 on an array with more than 3 drives, if you attempt to write a single block, it will:


    * read the current value of the block, and the parity block.

    * "subtract" the old value of the block from the parity, and "add" the new value.

    * write out the new data and the new parity.

If the parity was wrong before, it will still be wrong. If you then lose a drive, you lose your data.
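The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: one-byte blocks, a single 4-drive stripe held in memory, and helper names (xor, read_modify_write) invented for the example. For XOR parity, "subtract" and "add" are both just XOR.

```python
def xor(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_modify_write(old_data, old_parity, new_data):
    """New parity = old parity, minus the old block, plus the new block."""
    return xor(xor(old_parity, old_data), new_data)


# One stripe on a 4-drive raid5: three data blocks plus one parity block.
data = [b"\x11", b"\x22", b"\x44"]
synced_parity = xor(xor(data[0], data[1]), data[2])  # parity after a resync
stale_parity = b"\x00"                               # parity never synced

# Write a new value to block 0, once against each parity.
new_block = b"\x99"
parity_after_synced = read_modify_write(data[0], synced_parity, new_block)
parity_after_stale = read_modify_write(data[0], stale_parity, new_block)
data[0] = new_block

# Now lose the drive holding block 1 and recover it from the survivors.
recovered_synced = xor(xor(data[0], data[2]), parity_after_synced)
recovered_stale = xor(xor(data[0], data[2]), parity_after_stale)

print(recovered_synced == b"\x22")  # True: the original block comes back
print(recovered_stale == b"\x22")   # False: silent corruption
```

The write itself gives no hint that anything is wrong in the stale case; the error only shows up when the parity is actually needed.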


linear, raid0,1,10

These raid levels do not need an initial sync.

linear and raid0 have no redundancy.

raid1 always writes all data to all disks.

raid10 always writes all data to all relevant disks.


Other raid levels

Probably the most noticeable effect for the other raid levels is that if you don't sync first, then every check will find lots of errors. (Of course you could 'repair' instead of 'check'. Or do that once. Or something.)
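For reference, 'check' and 'repair' are requested by writing to the array's sync_action file in sysfs, and the number of mismatches found is reported in mismatch_cnt. A small Python sketch of that follows; the function names and the sysfs_root parameter are made up for this example (the parameter only exists so the code can be tried against a scratch directory), and writing to the real files needs root.

```python
from pathlib import Path


def md_attr(md_name, attr, sysfs_root="/sys/block"):
    """Path of an md attribute file, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / md_name / "md" / attr


def start_scrub(md_name, action="check", sysfs_root="/sys/block"):
    """Ask the kernel to 'check' or 'repair' the array."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    md_attr(md_name, "sync_action", sysfs_root).write_text(action + "\n")


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read the mismatch counter left by the last check or repair."""
    return int(md_attr(md_name, "mismatch_cnt", sysfs_root).read_text())
```

Typical use would be start_scrub("md0"), waiting for the pass to finish (progress shows in /proc/mdstat), and then looking at mismatch_count("md0"). The array name md0 is just a placeholder.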

For raid6 it is also safe to not sync first, though with the same caveat. Raid6 always updates parity by reading all blocks in the stripe that aren't known and calculating P and Q. So the first write to a stripe will make P and Q correct for that stripe. This is current behaviour. There is no guarantee it will never change (so theoretically one day you may upgrade your kernel and suffer data corruption on an old raid6 array).
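To see why this kind of update repairs a stripe that was never synced, here is a minimal Python sketch of a reconstruct-write. It is deliberately simplified: it only computes the XOR parity P (the real Q syndrome is a Reed-Solomon code over GF(2^8) and is left out), the blocks are one byte long, and reconstruct_write is a name invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_write(stripe_data, index, new_block):
    """Replace one data block and recompute P from every data block.

    The old parity is never read, so it does not matter whether it was
    correct before the write.
    """
    data = list(stripe_data)
    data[index] = new_block
    return data, xor_blocks(data)


# A stripe whose parity on disk was never initialised.
data = [b"\x11", b"\x22", b"\x44"]
data, parity = reconstruct_write(data, 0, b"\x99")

# Lose the drive holding block 1 and recover it from the survivors plus P.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == b"\x22")  # True
```

Compare this with the read-modify-write case for raid5, where the new parity is derived from the old one and so inherits any error in it.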


Summary

In summary, it is safe to use --assume-clean on a raid1 or raid10, though a "repair" is recommended before too long. For other raid levels it is best avoided.


Potential 'Solutions'

There have been 'solutions' suggested including the use of bitmaps to efficiently store 'not yet synced' information about the array. It would be possible to have a 'this is not initialised' flag on the array, and if that is not set, always do a reconstruct-write rather than a read-modify-write. But the first time you have an unclean shutdown you are going to resync all the parity anyway (unless you have a bitmap....) so you may as well resync at the start. So essentially, at the moment, there is no interest in implementing this since the added complexity is not justified.


What's the problem anyway?

First of all RAID is all about being safe with your data.

And why is it such a big deal anyway? The initial resync doesn't stop you from using the array. If you wanted to put an array into production instantly and couldn't afford any slowdown due to resync, then you might want to skip the initial resync.... but is that really likely?


So what is --assume-clean for then?

Disaster recovery. If you want to build an array from components that used to be in a raid then this stops the kernel from scribbling on them. As the man page says:


"Use this only if you really know what you are doing."
