Title of Invention

A METHOD FOR DISTRIBUTING PARITY ACROSS A STORAGE ARRAY

Abstract

A method for distributing parity across a storage array, the method comprising the steps of: adding a new storage device to a number of pre-existing storage devices of the array; dividing each storage device into blocks, the blocks being organized into stripes such that each stripe contains one block from each storage device; and distributing, by a processor (122) of a storage system (200), parity among blocks of the new and pre-existing storage devices without recalculation or moving of any blocks containing data, the method characterized in that the distribution of parity is by reassigning every Nth parity block (P) from each of the pre-existing storage devices (202-206) to the new storage device (208) to arrange each storage device of the array with approximately 1/N of the parity blocks (P), where N is equal to the number of the pre-existing storage devices (202-206) plus the new storage device (208). Fig. 2
The present invention relates to a method for distributing parity across a storage array.
FIELD OF THE INVENTION
The present invention relates to arrays of storage systems and, more specifically,
to a system that efficiently assigns parity blocks within storage devices of a storage
array.
BACKGROUND OF THE INVENTION
A storage system typically comprises one or more storage devices into which
information may be entered, and from which information may be obtained, as desired.
The storage system includes a storage operating system that functionally organizes the
system by, inter alia, invoking storage operations in support of a storage service implemented
by the system. The storage system may be implemented in accordance with
a variety of storage architectures including, but not limited to, a network-attached storage
environment, a storage area network and a disk assembly directly attached to a client
or host computer. The storage devices are typically disk drives organized as a disk
array, wherein the term "disk" commonly describes a self-contained rotating magnetic
media storage device. The term disk in this context is synonymous with hard disk drive
(HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or
more storage "volumes" that comprise a cluster of physical disks, defining an overall
logical arrangement of disk space. The disks within a volume are typically organized
as one or more groups, wherein each group is operated as a Redundant Array of Independent
(or Inexpensive) Disks (RAID). In this context, a RAID group is defined as a
number of disks and an address/block space associated with those disks. The term
"RAID" and its various implementations are well-known and disclosed in A Case for
Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and
R. H. Katz, Proceedings of the International Conference on Management of Data
(SIGMOD), June 1988.
The storage operating system of the storage system may implement a file system
to logically organize the information as a hierarchical structure of directories, files
and blocks on the disks. For example, each "on-disk" file may be implemented as a set
of data structures, i.e., disk blocks, configured to store information, such as the actual
data for the file. The storage operating system may also implement a RAID system that
manages the storage and retrieval of the information to and from the disks in accordance
with write and read operations. There is typically a one-to-one mapping between
the information stored on the disks in, e.g., a disk block number space, and the information
organized by the file system in, e.g., volume block number space.
A common type of file system is a "write in-place" file system, an example of
which is the conventional Berkeley fast file system. In a write in-place file system, the
locations of the data structures, such as data blocks, on disk are typically fixed.
Changes to the data blocks are made "in-place"; if an update to a file extends the quantity
of data for the file, an additional data block is allocated. Another type of file system
is a write-anywhere file system that does not overwrite data on disks. If a data
block on disk is retrieved (read) from disk into a memory of the storage system and
"dirtied" with new data, the data block is stored (written) to a new location on disk to
thereby optimize write performance. A write-anywhere file system may initially assume
an optimal layout such that the data is substantially contiguously arranged on
disks. The optimal disk layout results in efficient access operations, particularly for
sequential read operations, directed to the disks. An example of a write-anywhere file
system that is configured to operate on a storage system is the Write Anywhere File
Layout (WAFL™) file system available from Network Appliance, Inc., Sunnyvale,
California.
Most RAID implementations enhance the reliability/integrity of data storage
through the redundant writing of data "stripes" across a given number of physical disks
in the RAID group, and the appropriate storing of redundant information with respect to
the striped data. The redundant information, e.g., parity information, enables recovery
of data lost when a disk fails. A parity value may be computed by summing (usually
modulo 2) data of a particular word size (usually one bit) across a number of similar
disks holding different data and then storing the results on an additional similar disk.
That is, parity may be computed on vectors 1-bit wide, composed of bits in corresponding
positions on each of the disks. When computed on vectors 1-bit wide, the parity
can be either the computed sum or its complement; these are referred to as even and
odd parity respectively. Addition and subtraction on 1-bit vectors are both equivalent
to exclusive-OR (XOR) logical operations. The data is then protected against the loss
of any one of the disks, or of any portion of the data on any one of the disks. If the disk
storing the parity is lost, the parity can be regenerated from the data. If one of the data
disks is lost, the data can be regenerated by adding the contents of the surviving data
disks together and then subtracting the result from the stored parity.
Typically, the disks are divided into parity groups, each of which comprises one
or more data disks and a parity disk. A parity set is a set of blocks, including several
data blocks and one parity block, where the parity block is the XOR of all the data
blocks. A parity group is a set of disks from which one or more parity sets are selected.
The disk space is divided into stripes, with each stripe containing one block from each
disk. The blocks of a stripe are usually at the same locations on each disk in the parity
group. Within a stripe, all but one block contains data ("data blocks"), while the one
block contains parity ("parity block") computed by the XOR of all the data.
As used herein, the term "encoding" means the computation of a redundancy
value over a predetermined subset of data blocks, whereas the term "decoding" means
the reconstruction of a data or parity block by the same process as the redundancy
computation using a subset of data blocks and redundancy values. If one disk fails in
the parity group, the contents of that disk can be decoded (reconstructed) on a spare
disk or disks by adding all the contents of the remaining data blocks and subtracting the
result from the parity block. Since two's complement addition and subtraction over 1-bit
fields are both equivalent to XOR operations, this reconstruction consists of the
XOR of all the surviving data and parity blocks. Similarly, if the parity disk is lost, it
can be recomputed in the same way from the surviving data.
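By way of illustration only (this sketch is not part of the original disclosure, and the helper names and block size are assumptions), the XOR relationships just described can be expressed as follows: the parity block is the XOR of the data blocks of a stripe, and any single lost block is the XOR of the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    # XOR corresponding bytes of all blocks (equivalent to modulo-2 addition).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def compute_parity(data_blocks):
    # Encoding: the parity block is the XOR of all data blocks in the stripe.
    return xor_blocks(data_blocks)

def reconstruct_lost_block(surviving_blocks):
    # Decoding: the XOR of all surviving data and parity blocks recreates the
    # single missing block, whether it held data or parity.
    return xor_blocks(surviving_blocks)

# Example: a stripe with three data disks and 4-byte blocks (sizes chosen arbitrarily).
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
p = compute_parity([d0, d1, d2])
assert reconstruct_lost_block([d0, d2, p]) == d1   # recover a lost data block
assert reconstruct_lost_block([d0, d1, d2]) == p   # recover a lost parity block
```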
If the parity blocks are all stored on one disk, thereby providing a single disk
that contains all (and only) parity information, a RAID-4 level implementation is provided.
The RAID-4 implementation is conceptually the simplest form of advanced
RAID (i.e., more than striping and mirroring) since it fixes the position of the parity
information in each RAID group. In particular, a RAID-4 implementation provides
protection from single disk errors with a single additional disk, while making it easy to
incrementally add data disks to a RAID group.
If the parity blocks are contained within different disks in each stripe, in a rotating
pattern, then the implementation is RAID-5. Most commercial implementations
that use advanced RAID techniques use RAID-5 level implementations, which distribute
the parity information. A motivation for choosing a RAID-5 implementation is
that, for most static file systems, using a RAID-4 implementation would limit write
throughput. Such static file systems tend to scatter write data across many stripes in the
disk array, causing the parity disks to seek for each stripe written. However, a write-anywhere
file system, such as the WAFL file system, does not have this issue since it
concentrates write data on a few nearby stripes.
Use of a RAID-4 level implementation in a write-anywhere file system is a desirable
way of allowing incremental capacity increase while retaining performance;
however, there are some "hidden" downsides. First, where all the disks in a RAID
group are available for servicing read traffic in a RAID-5 implementation, one of the
disks (the parity disk) does not participate in such traffic in the RAID-4 implementation.
Although this effect is insignificant for large RAID group sizes, those group sizes
have been decreasing because of, e.g., a limited number of available disks or increasing
reconstruction times of larger disks. As disks continue to increase in size, smaller
RAID group configurations become more attractive. But this increases the fraction of
disks unavailable to service read operations in a RAID-4 configuration. The use of a
RAID-4 level implementation may therefore result in significant loss of read operations
per second. Second, when a new disk is added to a full volume, the write-anywhere file
system tends to direct most of the write data traffic to the new disk, which is where
most of the free space is located.
The RAID system typically keeps track of allocated data in a RAID-5 level implementation
of the disk array. To that end, the RAID system reserves parity blocks in
a fixed pattern that is simple to compute and that allows efficient identification of the
non-data (parity) blocks. However, adding new individual disks to a RAID group of a
RAID-5 level implementation typically requires repositioning of the parity information
across the old and new disks in each stripe of the array to maintain the fixed pattern.
Repositioning of the parity information typically requires use of a complex (and costly)
parity block redistribution scheme that "sweeps through" the old and new disks, copying
both parity and data blocks to conform to the new distribution. The parity redistribution
scheme further requires a mechanism to identify which blocks contain data and
to ensure, per stripe, that there are not too many data blocks allocated so that there is
sufficient space for the parity information. As a result of the complexity and cost of
such a scheme, most RAID-5 implementations relinquish the ability to add individual
disks to a RAID group and, instead, use a fixed RAID group size (usually in the 4-8
disk range). Disk capacity is then increased a full RAID group at a time. Yet, the use
of small RAID groups translates to high parity overhead, whereas the use of larger
RAID groups means having a high cost for incremental capacity.
Therefore, it is desirable to provide a distribution system that enables a storage
system to distribute parity evenly, or nearly evenly, among disks of the system, while
retaining the capability of incremental disk addition.
In addition, it is desirable to provide a distribution system that enables a write-anywhere
file system of a storage system to run with better performance in smaller
(RAID group) configurations.
SUMMARY OF THE INVENTION
The present invention overcomes the disadvantages of the prior art by providing
a semi-static distribution technique that distributes parity across disks of an array. According
to an illustrative embodiment of the technique, parity is distributed (assigned)
across the disks of the array in a manner that maintains a fixed pattern of parity blocks
among stripes of the disks. When one or more disks are added to the array, the semi-static
technique redistributes parity in a way that does not require recalculation of parity
or moving of any data blocks. Notably, the parity information is not actually moved;
the technique merely involves a change in the assignment (or reservation) for some of
the parity blocks of each pre-existing disk to the newly added disk. For example, a pre-existing
block that stored parity on, e.g., a first pre-existing disk, may continue to store
parity; alternatively, a block on the newly added disk can be assigned to store parity for
the stripe, which "frees up" the pre-existing parity block on the first disk to store file
system data.
Advantageously, semi-static distribution allows those blocks that hold parity (in
the stripe) to change when disks are added to the array. Reassignment occurs among
blocks of a stripe to rebalance parity to avoid the case where a disk with a preponderance
of parity gets "hot", i.e., more heavily utilized than other disks, during write traffic.
The novel distribution technique applies to single disk failure correction and can be
extended to apply to double (or greater) disk loss protection. In addition, the semi-static
distribution technique has the potential to improve performance in disk-bound
configurations while retaining the capability to add disks to a volume one or more disks
at a time.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood by
referring to the following description in conjunction with the accompanying drawings
in which like reference numerals indicate identical or functionally similar elements:
Fig. 1 is a schematic block diagram of a storage system that may be advantageously
used with the present invention;
Fig. 2 is a schematic diagram of a disk array illustrating parity assignments according
to a semi-static distribution technique of the present invention;
Fig. 3 is a flowchart illustrating a sequence of steps for distributing parity
among disks of an array in accordance with an illustrative embodiment of the semi-static
distribution technique; and
Fig. 4 is a diagram of a parity assignment table illustrating a repeat interval for
various group sizes in accordance with the semi-static distribution technique.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
Fig. 1 is a schematic block diagram of a storage system 100 that may be advantageously
used with the present invention. In the illustrative embodiment, the storage
system 100 comprises a processor 122, a memory 124 and a storage adapter 128 interconnected
by a system bus 125. The memory 124 comprises storage locations that are
addressable by the processor and adapter for storing software program code and data
structures associated with the present invention. The processor and adapter may, in
turn, comprise processing elements and/or logic circuitry configured to execute the
software code and manipulate the data structures. It will be apparent to those skilled in
the art that other processing and memory means, including various computer readable
media, may be used for storing and executing program instructions pertaining to the
inventive technique described herein.
A storage operating system 150, portions of which are typically resident in
memory and executed by the processing elements, functionally organizes the system
100 by, inter alia, invoking storage operations executed by the storage system. The
storage operating system implements a high-level module to logically organize the information
as a hierarchical structure of directories, files and blocks on disks of an array.
The operating system 150 further implements a storage module that manages the storage
and retrieval of the information to and from the disks in accordance with write and
read operations. It should be noted that the high-level and storage modules can be implemented
in software, hardware, firmware, or a combination thereof.
Specifically, the high-level module may comprise a file system 160 or other
module, such as a database, that allocates storage space for itself in the disk array and
that controls the layout of data on that array. In addition, the storage module may comprise
a disk array control system or RAID system 170 configured to compute redundant
(e.g., parity) information using a redundant storage algorithm and recover from disk
failures. The disk array control system ("disk array controller") or RAID system may
further compute the redundant information using algebraic and algorithmic calculations
in response to the placement of fixed data on the array. It should be noted that the term
"RAID system" is synonymous with "disk array control system" or "disk array controller"
and, as such, use of the term "RAID system" does not imply employment of the
inventive semi-static parity distribution technique. As described herein, the file system
or database makes decisions about where to place data on the array and forwards those
decisions to the RAID system.
In the illustrative embodiment, the storage operating system is preferably the
NetApp® Data ONTAP™ operating system available from Network Appliance, Inc.,
Sunnyvale, California that implements a Write Anywhere File Layout (WAFL™) file
system having an on-disk format representation that is block-based using, e.g., 4 kilobyte
(kB) WAFL blocks. However, it is expressly contemplated that any appropriate
storage operating system including, for example, a write in-place file system may be
enhanced for use in accordance with the inventive principles described herein. As
such, where the term "WAFL" is employed, it should be taken broadly to refer to any
storage operating system that is otherwise adaptable to the teachings of this invention.
As used herein, the term "storage operating system" generally refers to the computer-executable
code operable to perform a storage function in a storage system, e.g.,
that manages file semantics and may, in the case of a file server, implement file system
semantics and manage data access. In this sense, the ONTAP software is an example
of such a storage operating system implemented as a microkernel and including a
WAFL layer to implement the WAFL file system semantics and manage data access.
The storage operating system can also be implemented as an application program operating
over a general-purpose operating system, such as UNIX® or Windows NT®, or
as a general-purpose operating system with configurable functionality, which is configured
for storage applications as described herein.
The storage adapter 128 cooperates with the storage operating system 150 executing
on the system 100 to access information requested by a user (or client). The information
may be stored on any type of attached array of writeable storage device media
such as video tape, optical, DVD, magnetic tape, bubble memory, electronic random
access memory, micro-electro mechanical and any other similar media adapted to
store information, including data and parity information. However, as illustratively described
herein, the information is preferably stored on the disks, such as HDD and/or
DASD, of array 200. The storage adapter includes input/output (I/O) interface circuitry
that couples to the disks over an I/O interconnect arrangement, such as a conventional
high-performance, Fibre Channel serial link topology.
Storage of information on array 200 is preferably implemented as one or more
storage "volumes" (e.g., VOL1-2 140) that comprise a cluster of physical storage disks,
generally shown at 130 and defining an overall logical arrangement of disk space.
Each volume is generally, although not necessarily, associated with its own file system.
The disks within a volume/file system are typically organized as one or more groups,
wherein each group is comparable to a RAID group. Most RAID implementations enhance
the reliability/integrity of data storage through the redundant writing of data
"stripes" across a given number of physical disks in the RAID group, and the appropriate
storing of parity information with respect to the striped data.
Specifically, each volume 140 is constructed from an array of physical disks
130 that are divided into blocks, with the blocks being organized into stripes. The disks
are organized as groups 132, 134, and 136. Although these groups are comparable to
RAID groups, a semi-static distribution technique described herein is used within each
group. Each stripe in each group has one or more parity blocks, depending on the degree
of failure tolerance required of the group. The selection of which disk(s) in each
stripe contains parity is not determined by the RAID configuration, as it would be in a
conventional RAID-4 or RAID-5 array.

The present invention relates to the semi-static distribution technique that distributes
parity across disks of an array. The inventive technique is preferably implemented
by the RAID system 170 that, among other things, computes parity in stripes
across the disks, distributes the parity among those stripes as described herein and reconstructs
disks lost as a result of failure. The semi-static distribution technique does
not require the participation of the file system 160 and, as such, is also suitable for deployment
in RAID code embodied as, e.g., a RAID controller that may be internally or
externally coupled to the storage system 100.
According to the technique, parity is distributed (assigned) across the disks of
the array in a manner that maintains a fixed pattern of parity blocks among stripes of
the disks. When one or more disks are added to the array, the semi-static technique redistributes
parity in a way that does not require recalculation of parity or moving of any
data blocks. Notably, the parity information is not actually moved; the technique
merely involves a change in the assignment (or reservation) for some of the parity
blocks of each pre-existing disk to the newly added disk. For example, a pre-existing
block that stored parity on, e.g., a first pre-existing disk, may continue to store parity;
alternatively, a block on the newly added disk can be assigned to store parity for the
stripe, which "frees up" the pre-existing parity block on the first disk to store file system
data. Note that references to the file system data do not preclude data generated by
other high-level modules, such as databases.
Assuming data is allocated densely across the disks of array 200, the storage
operating system 150 can choose to assign parity evenly across the disks in a fixed pattern.
However, the fixed pattern changes when one or more disks are added to the array.
In response, the semi-static distribution technique redistributes (reassigns) parity
in a manner that maintains a fixed pattern of parity blocks among the stripes of the
disks. Note that each newly added disk is initialized to a predetermined and fixed
value, e.g., zeroed, so as to not affect the fixed parity of the stripes. It should be further
noted that the fixed parity may be even or odd, as long as the parity value is known
(predetermined); the following description herein is directed to the use of even parity.
In addition, initializing of the newly added disk allows reassignment of parity blocks in
some stripes (e.g., 1/N of the stripes, where N is equal to the number of disks) to the
new disk without any calculation or writing of parity.
According to the invention, the reassignment algorithm only ever changes a parity
block to a data block and never changes a data block to a parity block. For example,
in response to adding a new Nth disk to a group 132-136, the file system 160 can
reassign every Nth parity block of each existing disk to the new disk. Such reassignment
does not require any re-computation or data movement as the new disk only contains
free blocks and parity blocks, so existing parity blocks can get reassigned for use
for data, but not vice versa. This reassignment (construction) algorithm forms a pattern
of parity that is deterministic for each group size and evenly distributes parity among
all the disks in the group.
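As an illustrative sketch only (the function name and the 'P'/'D' representation are assumptions, not part of the original disclosure), the reassignment rule described above can be expressed as follows: when an Nth disk is added, every Nth parity block of each pre-existing disk is reassigned to the new disk, and no data block ever becomes a parity block. Growing from one disk to four also reproduces the kind of pattern shown in Fig. 2.

```python
def add_disk_and_reassign(layout):
    """layout[d][s] is 'P' (parity) or 'D' (data/free) for disk d, stripe s.
    Adds a new, zeroed disk and reassigns every Nth parity block of each
    pre-existing disk to it, where N is the new number of disks."""
    stripes = len(layout[0])
    n = len(layout) + 1                    # N = pre-existing disks + the new disk
    new_disk = ['D'] * stripes             # zeroed disk: free blocks only
    for disk in layout:
        count = 0
        for s in range(stripes):
            if disk[s] == 'P':
                count += 1
                if count % n == 0:         # every Nth parity block on this disk
                    disk[s] = 'D'          # the old parity block becomes a free data block
                    new_disk[s] = 'P'      # parity for this stripe is now kept on the new disk
    layout.append(new_disk)
    return layout

# Start with one all-parity disk, then grow the group to four disks.
layout = [['P'] * 12]                      # 12 stripes chosen purely for illustration
for _ in range(3):
    add_disk_and_reassign(layout)
print([d.count('P') for d in layout])      # each disk ends up with roughly 1/4 of the parity
```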
Fig. 2 is a schematic diagram of disk array 200 illustrating parity assignments
according to the semi-static distribution technique of the present invention. Assume the
array 200 initially comprises one disk 202 and it is desirable to store redundant (parity)
information; therefore, each block on the disk stores parity (P) information. When a
second disk 204 is added to expand the array, the parity blocks may be distributed between
the two disks. Likewise, when a third disk 206 and, thereafter, a fourth disk 208
are added to the expanded array, the parity blocks may be distributed among those
disks. As disks are added to the array 200, parity is not stored in a block that contains
file system data. The semi-static distribution technique is directed to only reassigning
parity blocks, which frees up blocks to use for data. In other words, the technique
never reassigns a data block, which is in contrast to the expansion of conventional
RAID-5 level implementations.
Parity may be distributed among the disks in accordance with a construction algorithm
of the inventive technique that reassigns one of N parity blocks from each pre-existing
disk to the new disk, wherein N is equal to the number of disks in the expanded
array. Overall, one of N parity blocks is reassigned to the new disk, with each pre-existing
disk continuing to hold exactly 1/N of the parity blocks in the expanded array.
For a 2-disk array, every other parity block on the first disk 202 is moved to the second
disk 204. When the third disk 206 is added to the expanded array 200, thereby creating
a 3-disk array, every third remaining parity block on the first disk 202, as well as every
third parity block on the second disk 204, is moved to the third disk 206. When the
fourth disk 208 is added to the array, creating a 4-disk array, every fourth remaining
parity block from each disk (disks 1-3) is moved to the fourth disk 208. As a result of
this reassignment, the amount of parity on each disk is substantially the same. The location
of the parity block also changes from stripe to stripe across the disks of the array
in a predictable and deterministic pattern.
Fig. 3 is a flowchart illustrating a sequence of steps for distributing parity
among disks of an array in accordance with an illustrative embodiment of the semi-static
distribution technique of the present invention. Here, a new Nth disk is added to
a group 132-136 of the array and, as described above, one out of every N parity blocks
is assigned to the new disk, wherein N is equal to the number of disks in the array. As
noted, there is no need to actually move the parity information among the disks; the inventive
semi-static distribution technique contemplates merely a change in the assignment
(or reservation) for each parity block on the newly added disk.
The sequence starts in Step 300 and proceeds to Step 302 where the new disk is
added to the group of N disks in the array. In Step 304, the new disk is initialized (e.g.,
zeroed) to ensure that the parity of the blocks on each stripe is unaffected. There may
be multiple blocks within a stripe that do not contain data (i.e., unallocated data blocks)
and that could potentially store parity. The stripe will contain at least one unallocated
block, which is the parity block, and one or more unallocated blocks that are freed data
blocks. All blocks contribute to, e.g., even parity, so the parity block(s) and the freed
data blocks are all equivalent. The file system (or high-level module, if there is no file
system) determines which disks contain free blocks in the stripe in response to a write
request to store write data in the stripe. In Step 306, the file system 160 reserves as
many free blocks as required by the redundant storage algorithm to store parity, arbitrarily.
For example, a pre-existing block that stored parity on, e.g., a first pre-existing
disk, may continue to store parity; alternatively, a block on the newly added disk can be
assigned to store parity for the stripe, which "frees up" the pre-existing parity block on
the first disk to store the data.
Note that any parity algorithm that protects against two (or more) disk failures
may be used with the semi-static distribution technique, as long as the algorithm allows
any two (or more) blocks in the stripe to store the parity. An example of a double failure
correcting algorithm that may be advantageously used with the present invention is
uniform and symmetric row-diagonal (SRD) parity described in U.S. Patent Application
Serial No. (112056-0141) titled Uniform and Symmetric Double Failure Correcting
Technique for Protecting against Two Disk Failures in a Disk Array, by Peter F.
Corbett et al. Here, the inventive technique is not dependent upon the uniformity or
symmetry of the parity algorithm, although it can take advantage of it. When using a
double failure correcting algorithm with the semi-static distribution technique, the file
system reserves two unallocated data blocks to be assigned to store parity. A non-uniform
double or higher failure correcting algorithm can be used since the location of
the parity blocks is known deterministically. However, using such an algorithm may
sacrifice the advantage that parity need not be recalculated when a disk is added to the
array.
Another technique is to employ the non-uniform algorithm such that data blocks
are written to any of the blocks of the array, even those that typically would be used to
store redundant information. Since the multiple failure correcting algorithm can restore
the contents of any missing disks, the remaining blocks can be used to store redundant
information, even if they are constructed using the technique usually intended to reconstruct
lost data blocks. Using a non-uniform algorithm in this way may result in an implementation
that is much more complex than can be achieved by using a uniform and
symmetric algorithm, such as SRD.
In Step 308, the write allocator 165 of the file system arranges the write data for
storage on the disks in the stripe. In Step 310, the file system provides an indication of
the reserved block(s) to the RAID system (storage module) via a write request message
issued by the file system. In Step 312, the RAID system provides the parity information
(and write data) to the disk driver system for storage on the disks. In particular, in
Step 314, the parity is distributed among the blocks of the disks such that 1/N of the
parity blocks is stored on each disk to thereby balance the data across the disks of the
array. Moreover, the locations of the parity blocks "move" among the stripes of the
array in a predictable pattern that appears complicated, but is easy to compute. The sequence
then ends at Step 316.
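The per-stripe write flow of Fig. 3 might be sketched roughly as below; this is an interpretive illustration only, collapsing the file system/RAID/driver interaction into a single function, and the helper names (write_stripe, xor_blocks) are hypothetical rather than taken from the disclosure. It assumes the single-parity case in which one free block per stripe is reserved for parity.

```python
from functools import reduce

def xor_blocks(blocks):
    # Even parity: XOR of corresponding bytes of all contributing blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def write_stripe(stripe_blocks, parity_disk, write_data):
    """stripe_blocks: dict disk -> bytes (current block contents, zeroed if free).
    parity_disk: disk whose block the file system reserved for parity in this
    stripe (chosen arbitrarily among the free blocks, Step 306).
    write_data: dict disk -> bytes for the data blocks being written (Step 308)."""
    for disk, block in write_data.items():
        assert disk != parity_disk, "data is never written to the reserved parity block"
        stripe_blocks[disk] = block
    # Compute parity over the data blocks and store it in the reserved block (Steps 312-314).
    data = [b for d, b in stripe_blocks.items() if d != parity_disk]
    stripe_blocks[parity_disk] = xor_blocks(data)
    return stripe_blocks

# Example: a 4-disk stripe where disk 3 holds parity for this particular stripe.
stripe = {d: b"\x00" * 4 for d in range(4)}
stripe = write_stripe(stripe, parity_disk=3,
                      write_data={0: b"\x01\x01\x01\x01", 2: b"\x02\x02\x02\x02"})
```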
Additional techniques by which a balanced semi-static distribution of redundant
or parity blocks can be achieved in a double failure correcting array that has two redundant
blocks per stripe include a technique that simply replaces each single disk in a
single failure-correcting semi-static array with a pair of disks in the double failure correcting
array. Here, the role of each pair of disks is identical to the role of the corresponding
single disk in the single failure-correcting array. Balance is maintained by
using the same number of rows used in the single failure-correcting array; however, this
technique is limited to adding disks to the array in multiples of two.
Another technique constructs a balanced or nearly balanced array by starting
with two initial ("old") disks that are completely filled with parity blocks, then adding a
third disk and moving every third parity block from each of the two initial disks to the
new disk. This technique distributes one-third of the parity blocks to each disk, occupying
two-thirds of the space on each disk. When reassigning parity blocks from an old
disk, it may be discovered that the block on the new disk has already been designated
as parity. In this case, the next possible parity block is reassigned from the old disk to
the new disk, at the next row where the new disk does not yet contain parity and the old
disk does.
This latter technique can be further extrapolated to build a deterministic set of
parity assignments for any number of disks, with two redundant (e.g., parity) blocks per
stripe and with the redundant blocks balanced or nearly balanced across the array.
Similarly, for three or greater numbers of redundant blocks per stripe, the same technique
can be employed to determine a placement of redundant blocks in a larger array
of any size, in such a way that the number of redundant blocks per disk is balanced or
nearly balanced. Moreover, the technique allows any number of disks to be added
without ever changing a data block into a parity block, while continuing to keep the
number of redundant blocks per disk balanced or nearly balanced.
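Purely as an interpretive sketch of the two-redundant-blocks construction described above (one plausible reading; the data representation, the choice of which block counts as "every third", and the wrap-around when sliding are all assumptions), the following builds the initial double-parity layout and handles the collision case where the new disk already holds parity in a row.

```python
def add_disk_double_parity(layout):
    """layout[r] is a list of 'P'/'D' roles for row r, with two parity blocks per row.
    Adds a new disk and moves every third parity block of each old disk to it; when
    the new disk already holds parity in that row, the move slides forward (wrapping,
    an assumption) to the next row where the new disk has none and the old disk has one."""
    rows = len(layout)
    new = len(layout[0])
    for row in layout:
        row.append('D')                    # the new disk starts with free blocks only
    for old in range(new):
        candidates = [r for r in range(rows) if layout[r][old] == 'P']
        for i, r in enumerate(candidates, start=1):
            if i % 3 != 0:                 # move every third parity block of this old disk
                continue
            t = r
            for _ in range(rows):
                if layout[t][new] != 'P' and layout[t][old] == 'P':
                    layout[t][old] = 'D'
                    layout[t][new] = 'P'
                    break
                t = (t + 1) % rows
    return layout

# Two initial disks completely filled with parity blocks; then add a third disk.
layout = [['P', 'P'] for _ in range(12)]
add_disk_double_parity(layout)
print([sum(row[d] == 'P' for row in layout) for d in range(3)])   # [8, 8, 8]: two-thirds each
```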
Other similar techniques can be developed to determine the roles of blocks as
data blocks or redundant blocks in any size array, while preserving the property that the
array can be expanded incrementally as the distribution of both data and redundant
blocks are kept balanced or nearly balanced, and without ever changing a data block
into a redundant block. Any of these assignment techniques can be implemented by
storing or generating a data structure (e.g., a table) of parity assignments
for a specific number of rows in an array of specific size. It is also possible to
store in a single table all possible assignments of redundant blocks for any array size up
to a certain limit. Here, for example, the table may store a bitmap for each row, where
the one (or more) highest numbered bit set is selected that is less than N, wherein N is
the number of disks in the array. In general, any table-based parity assignment that
maintains balance of distributed data and redundant blocks, while allowing expansion
without changing data blocks to redundant (parity) blocks, is contemplated by the present
invention, regardless of the number of redundant blocks per row (i.e., the number
of failures the array can tolerate).
The parity assignments for the semi-static distribution technique are calculated
for a known size of a group 132-136 of the disk array or for a maximum group size of
the array; either way, as noted, the calculated parity assignments may be stored in a table.
A parity distribution pattern defined by the stored assignments and, in particular, a
repeat interval of the pattern can be used to determine the location of parity storage on
any disk in the array for a given group size and for a given stripe. That is, the pattern
can be used to indicate which block in each stripe is used for parity or a different pattern
can be used for several stripes.
Fig. 4 is a diagram of a parity assignment table 400 illustrating the repeat interval
for various group sizes in accordance with the semi-static distribution technique.
The parity distribution pattern repeats at a repetition interval dependent upon the group
size of the array. If a group of size N repeats every K stripes, then the group of size
(N+1) will repeat in the smallest number that both K and (N+1) evenly divide. Notably,
the content of the table does not repeat until it reaches a number (repeat interval)
dependent on the value of N, where N equals the number of disks. For example, in a 2-disk
array (i.e., a group size of two), the parity distribution pattern repeats every two
stripes. When a third disk is added (for a group size of three), the parity pattern repeats
every six stripes. When a fourth disk is added (for a group size of four), the parity pattern
repeats every twelve stripes. It can be seen from table 400 that for a group size of
five (and six), the parity pattern repeats every sixty stripes.
The repeat interval as a function of group size is determined in accordance with
the set of unique prime factors ("primes") up to N, where N equals the number of disks.
The repeat interval (which is equivalent to the number of entries in table 400) is less
than N factorial and, in fact, is equal to the product of all primes less than or equal to N,
with each prime raised to the largest power possible such that the result is less than or
equal to N. As some of the numbers between one and N are prime numbers, it is clear

that the repeat interval may get large, making the table large. For example, for N = 10,
the table size is 2^3 x 3^2 x 5^1 x 7^1 = 8 x 9 x 5 x 7 = 2520. Similarly, for N = 32,
the table size is 2^5 x 3^3 x 5^2 x 7^1 x 11^1 x 13^1 x 17^1 x 19^1 x 23^1 x 29^1 x
31^1 = 32 x 27 x 25 x 7 x 11 x 13 x 17 x 19 x 23 x 29 x 31 ≈ 144 x 10^12.
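A brief illustrative calculation (not from the disclosure; the function name is an assumption) of the repeat interval as the product of all primes up to N, each raised to the largest power not exceeding N, reproduces the figures quoted above:

```python
def repeat_interval(n):
    """Product of all primes p <= n, each raised to the largest power <= n
    (equivalently, the least common multiple of 1..n)."""
    primes = [p for p in range(2, n + 1)
              if all(p % q for q in range(2, int(p ** 0.5) + 1))]
    result = 1
    for p in primes:
        power = p
        while power * p <= n:
            power *= p
        result *= power
    return result

print(repeat_interval(2))    # 2
print(repeat_interval(3))    # 6
print(repeat_interval(4))    # 12
print(repeat_interval(6))    # 60
print(repeat_interval(10))   # 2520
print(repeat_interval(32))   # 144403552893600, i.e. about 144 x 10^12
```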
A tradeoff may then be made between the table size of the pattern and precision
of balancing; the table can be terminated at a reasonable point and the group size at that
particular repeat interval can be used. Thereafter, even if there are more disks than the
group size, the technique can continue to repeat the pattern and still realize nearly uniform
balance of data across the array within, e.g., a half percent. For example, as noted
above, a group size of ten translates into a parity distribution pattern that repeats every
2,520 stripes. A table of this size (i.e., 2,520 entries) is relatively compact in memory
124 and can be computed relatively quickly at start-up using appropriate software code.
In contrast, the table for a group size of 32 (i.e., 144 x 10^12 entries) is too large to
store in memory.
The 2,520 entry table works well with any reasonable number of disks to provide
good data balance; however, it should be noted that this size table is not the only
choice and other sized tables may also be used. The 2,520 entry pattern is perfectly
balanced for N disks up to ten; for N greater than 10, the pattern provides good data
balance even though the pattern has not repeated. In other words, although the parity
assignment table for a 17-disk group is rather large (7.7 MB with 5 bits per pattern), if
only a fraction of the table is used, good parity balance can still be achieved. Cutting
off the pattern at 2,520, for example, yields perfect balance for all group sizes up to 10
disks, and less than 1% imbalance for larger groups, while limiting the table size to
2,520 x 4 bits = 1,260 bytes for N = 11 and 5 x 2,520 bits = 1,575 bytes for N = 17 to 32.
The parity assignment table 400 can be encoded as a single number indicating a
bit position of parity for a particular value of N. The table could also be coded as a bit
vector, with one or two (or more) bits set indicating the position of a single or double
(or greater) parity block providing single or double (or greater) disk failure protection.
Moreover, the table can be encoded as a single table indicating (for all disk array sizes
up to some limit, e.g., 32 disks) what disks possibly contain parity in each stripe. The
determination of which disk actually contains parity for a specific value of N is then
made by masking off the high order 32-N bits and selecting the highest order remaining
one or two (or more) bits.
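One way to read the bit-vector encoding just described is sketched below; this is an illustrative interpretation only, where the 32-bit width follows the 32-disk example above and the bit ordering and function name are assumptions: mask off the high-order 32-N bits of a stripe's bitmap and take the highest remaining set bit(s) as the parity disk(s).

```python
def parity_disks(bitmap, n, count=1):
    """bitmap: 32-bit vector where bit i set means disk i may hold parity in this stripe.
    n: actual number of disks in the group (N).
    count: number of parity blocks per stripe (1 for single, 2 for double failure protection).
    Returns the disk indices that hold parity for this stripe."""
    masked = bitmap & ((1 << n) - 1)        # mask off the high-order 32-N bits
    disks = []
    for bit in range(n - 1, -1, -1):        # select the highest-order remaining bits
        if masked & (1 << bit):
            disks.append(bit)
            if len(disks) == count:
                break
    return disks

# Example: a stripe whose candidate-parity bitmap has bits 2, 9 and 20 set.
bitmap = (1 << 2) | (1 << 9) | (1 << 20)
print(parity_disks(bitmap, n=10))           # [9]: disk 9 holds parity in a 10-disk group
print(parity_disks(bitmap, n=32, count=2))  # [20, 9] in a 32-disk group with double parity
```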
In sum, semi-static distribution strives to keep the number of data blocks per
disk roughly matched across the array to thereby "spread" the read load across all disks
of the array. As a result, the technique eliminates any "bottleneck" in the array caused
by throughput of any single disk in the array, while also eliminating the parity disk(s)
as hot spot(s) for write operations. The general technique can be applied using a symmetric
algorithm, such as SRD parity, or an asymmetric double failure-correcting algorithm,
such as Row-Diagonal (RD) parity. The RD parity technique is described in
U.S. Patent Application Serial No. 10/035,607 titled Row-Diagonal Parity Technique
for Enabling Efficient Recovery from Double Failures in a Storage Array, by Peter F.
Corbett et al., filed on December 28, 2001.
When employing a non-uniform algorithm, such as RD parity, the role of the
disk in storing either data or redundant blocks in any particular block might be ignored
with respect to the typical role of the disk in the asymmetric parity algorithm. Since
any double failure correcting algorithm can construct missing "data" for any two missing
disks of an array, the contents of all the blocks in the row that are assigned the role
of storing data are fixed and the contents of the two redundant blocks are computed using
the double failure correcting algorithm, which is applied differently depending on
the positions of the disks in the row. Having stored two redundant blocks in each row,
the array can tolerate two disk failures, recovering the lost data or redundant blocks regardless
of the roles of the lost blocks in any particular stripe.
Alternatively, since the roles of the disks are deterministically defined, any algorithm
that allows any two or more disks in the array to contain the redundant information
can be employed. Using such an algorithm may require the recomputation of
parity in stripes where the parity blocks move, but it does preserve the advantage of the
invention that no data blocks are moved. SRD has the additional advantage that no parity
blocks need be recomputed when parity block(s) are assigned to the newly added
disk(s).
The distribution technique described herein is particularly useful for systems
having fewer disks yet that want to utilize all read operations per second (ops) that are
available from those disks. Performance of smaller arrays is bounded by the ops that
are achievable from disks (disk-bound). Yet even in large arrays where disks get larger,
because of reconstruction times, the tendency is to reduce the number of disks per group
132-136. This results in an increase in redundancy overhead (the percentage of
disks in a group devoted to redundancy increases). Therefore, it is desirable to take advantage
of the read ops available in those redundant disks. Another advantage of the
distribution technique is that reconstruction and/or recovery occurs "blindly" (i.e.,
without knowing the roles of the disks).
Semi-static distribution may be advantageously used with arrays having low
numbers of large disks, since the technique balances data across the array. Using larger
disks is required to get reasonable capacity, but that also means using smaller groups to
limit reconstruction time. If a 14-disk configuration uses two groups and one spare,
then over 20% of the disks are unavailable for use in storing or retrieving data. Configurations
with eight disks are even worse.
As noted, the semi-static distribution technique allows incremental addition of
disks to a distributed parity implementation of a disk array. An advantage of the inventive
distribution technique over a RAID-5 level implementation is that it allows easy
expansion of the array, avoiding the need to add an entire group to the array or to perform
an expensive RAID-5 reorganization. The semi-static distribution technique may
be used in connection with single/double failure error correction. In addition, the technique
allows use of multiple disk sizes in the same group 132-136.
While there have been shown and described illustrative embodiments of a semi-static
distribution technique that distributes parity across disks, it is to be understood
that various other adaptations and modifications may be made within the spirit and
scope of the invention. For example, the distribution technique described herein may
apply to block-based RAID arrays to, e.g., allow easy addition of disks to RAID
groups. Block-based RAID arrays generally are not aware of which of the blocks they are
asked to store contain file system data. Instead, the arrays must assume that all blocks
not previously designated as parity blocks contain file system data. Therefore, they
usually pre-allocate which blocks will be used for parity. For a given array, these pre-allocated
blocks remain fixed. Normally this is done using some predetermined algorithm
so that the system does not have to keep track of each parity block.
According to the invention, the RAID system may move the parity designation
of some of the blocks in the existing disks to the new disks using the semi-static distribution
technique. The RAID system must also ensure that logical unit number (lun)
block offsets of non-parity blocks in the existing disks are not changed. The new space
will then be distributed among all the disks. This non-linear mapping is usually not desirable
in block-based arrays, as file systems cannot compensate for it. However, this
effect can be mitigated if the parity blocks are allocated contiguously in large chunks
(e.g., at least a track size).
It will be understood to those skilled in the art that the inventive technique described
herein may apply to any type of special-purpose (e.g., file server or storage appliance)
or general-purpose computer, including a standalone computer
or portion thereof, embodied as or including a storage system 100. An example
of a multi-protocol storage appliance that may be advantageously used with the present
invention is described in U.S. Patent Application Serial No. 10/215,917 titled Multiprotocol
Storage Appliance that provides Integrated Support for File and Block Access
Protocols, filed on August 8, 2002. Moreover, the teachings of this invention can be
adapted to a variety of storage system architectures including, but not limited to, a network-attached
storage environment, a storage area network and disk assembly directly
attached to a client or host computer. The term "storage system" should therefore be
taken broadly to include such arrangements in addition to any subsystems configured to
perform a storage function and associated with other equipment or systems.
The foregoing description has been directed to specific embodiments of this invention.
It will be apparent, however, that other variations and modifications may be
made to the described embodiments, with the attainment of some or all of their advantages.
For instance, the semi-static distribution technique can be generalized to other
applications involving the distribution of data structures among persistent storage, e.g.,
disks, or non-persistent storage, e.g., memory, of a system. Broadly, the technique may
apply to the redistribution of any commodity over any set of containers as more containers
are added to the system. As an example, the semi-static technique may apply to
a system having units and containers, wherein the units are distributed uniformly over
the containers and wherein it is desirable to maintain a balanced rate of assignment of
units to containers along some numbered dimension. When a new container is added to
the system, the technique may be employed to transfer some of the existing units to the
new container in such a way that overall and localized balance is maintained.
More specifically, the semi-static technique can be applied to distribution of
data structures, such as inode file blocks, among persistent storage devices, such as
disks, of an array coupled to a plurality of storage entities, such as storage "heads".
Note that a "head" is defined as all parts of a storage system, excluding the disks. An
example of such an application involves distributing existing inode file blocks over the
plurality of (N) storage heads, which includes one or more newly added storage heads.
Here, the inventive semi-static distribution technique may be used to move only 1/N of
any existing inode file blocks to the newly added storage head.
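As a loose illustration of this generalization (the helper name and the unit numbering below are assumptions introduced here, not taken from the disclosure), the same every-Nth rule can rebalance arbitrary units, such as inode file blocks, when a new container, such as a storage head, is added:

```python
def rebalance_to_new_container(containers):
    """containers: list of lists of units. Adds a new container and moves every Nth
    unit of each pre-existing container to it, where N is the new total number of
    containers, so each container ends up holding roughly 1/N of the units."""
    n = len(containers) + 1
    new_container = []
    for units in containers:
        kept = []
        for i, unit in enumerate(units, start=1):
            (new_container if i % n == 0 else kept).append(unit)
        units[:] = kept
    containers.append(new_container)
    return containers

# Example: 300 inode file blocks spread over two storage heads; add a third head.
heads = [list(range(0, 150)), list(range(150, 300))]
rebalance_to_new_container(heads)
print([len(h) for h in heads])   # roughly [100, 100, 100]
```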
It is expressly contemplated that the teachings of this invention can be implemented
as software, including a computer-readable medium having program instructions
executing on a computer, hardware, firmware, or a combination thereof. Accordingly,
this description is to be taken only by way of example and not to otherwise limit
the scope of the invention. Therefore, it is the object of the appended claims to cover
all such variations and modifications as come within the true spirit and scope of the invention.







WE claim:
1. A method for distributing parity across a storage array, the method comprising the steps of:
adding a new storage device to a number of pre-existing storage devices of the array;
dividing each storage device into blocks, the blocks being organized into stripes such that each
stripe contains one block from each storage device; and
distributing, by a processor (122) of a storage system (200), parity among blocks of the new and
pre-existing storage devices without recalculation or moving of any blocks containing data,
the method characterized in that
the distribution of parity is by reassigning every Nth parity block (P) from each of the pre-existing
storage devices (202-206) to the new storage device (208) to arrange each storage device
of the array with approximately 1/N of the parity blocks (P), where N is equal to the number of
the pre-existing storage devices (202-206) plus the new storage device (208).
2. The method as claimed in claim 1, wherein the step of distributing comprises the step of
distributing parity among blocks of the new and pre-existing storage devices in a manner that
maintains a fixed pattern of parity blocks among stripes of the storage devices.
3. The method as claimed in claim 1, wherein the step of distributing comprises the step of
changing an assignment for one or more blocks containing parity of each pre-existing storage
device to the newly added storage device.
4. The method as claimed in claim 2 wherein the step of adding comprises the step of initializing
the added storage device so as to not affect parity of the stripes.
5. The method as claimed in claim 4 wherein the step of initializing comprises the step of
reassigning blocks containing parity in certain stripes to the new storage device without
calculation or writing of parity.
6. The method as claimed in claim 5 wherein the certain stripes comprise 1/N of the stripes,
where N is equal to the number of storage devices in the array.
7. The method as claimed in claim 5 wherein the step of reassigning comprises the step of
changing a block containing parity (parity block) to a block containing data (data block) and not
changing a data block to a parity block.
8. The method as claimed in claim 1 wherein the step of distributing comprises the step of
reassigning one of N blocks containing parity (parity blocks) from each pre-existing storage
device to the added storage device, wherein N is equal to the number of storage devices in the
array.
9. The method as claimed in claim 8 wherein the step of reassigning comprises the step of
reassigning one of N parity blocks to the new storage device, with each pre-existing storage
device continuing to hold 1/N of the parity blocks in the array.
10. A system adapted to distribute parity across storage devices of a storage system, the system
comprising:
a storage array comprising pre-existing storage devices and at least one new storage device; and
a storage module configured to compute parity in blocks of stripes across the storage devices and
reconstruct blocks of storage devices lost as a result of failure, the storage module further
configured to assign the parity among the blocks of the new and pre-existing storage devices
without recalculation or moving of any data blocks,
characterized in that
the distribution of parity is by reassigning every Nth parity block (P) from each of the pre-existing
storage devices (202-206) to the new storage device (208) to arrange each storage device
of the array with approximately 1/N of the parity blocks (P), where N is equal to the number of
pre-existing devices (202-206) plus the new storage device (208).
11. The system of claim 10 further comprising a table configured to store parity assignments
calculated for one of a known group size of the storage array and a maximum group size of the
array, the stored parity assignments defining a repeat interval of a parity distribution pattern used
to determine locations of parity storage on any storage device in the array.
12. The system of claim 10 wherein the storage module is embodied as a RAID system of the
storage system.
13. The system of claim 10 wherein the storage module is embodied as an internal storage array
controller of the storage system.

14. The system of claim 10 wherein the storage module is embodied as a storage array control
system externally coupled to the storage system.
15. The system of claim 10 wherein the disk array is a block-based RAID array.
16. A method for distributing commodities over containers of a system, the method comprising
the steps of:
adding a new container to pre-existing containers of the system to thereby provide N containers;
and
moving only 1/N of the commodities to the new container,
the method characterized in that:
the distribution of commodities is by reassigning every Nth commodity from each of the pre-existing
containers to the new container to arrange each container with approximately 1/N of the
commodities.

17. The method of claim 16 wherein the system is a storage system, the commodities are data
structures adapted for storage on storage devices of an array, and the containers are storage
entities coupled to the array.
18. The method of claim 17 wherein the storage entities are storage heads.
19. The method of claim 17 wherein the data structures are inode file blocks.
Dated: 09/05/2006
[R. MAHESH]
OF REMFRY & SAGAR
ATTORNEY FOR THE APPLICANT[s]



Patent Number 259793
Indian Patent Application Number 2591/DELNP/2006
PG Journal Number 13/2014
Publication Date 28-Mar-2014
Grant Date 27-Mar-2014
Date of Filing 09-May-2006
Name of Patentee NETWORK APPLIANCE INC.
Applicant Address 495 EAST JAVA DRIVE, SUNNYVALE, CA 94089, UNITED STATES OF AMERICA
Inventors:
# Inventor's Name Inventor's Address
1 PETER F. CORBETT 33 SUMMER STREET, LEXINGTON, MA 02420, UNITED STATES OF AMERICA
2 ROBERT M. ENGLISH, 4 EAST CREEK PLACE, MENLO PARK, CA 94025 UNITED STATES OF AMERICA.
3 STEVEN, R.KLEIMAN, 157 EL MONTE COURT, LOS ALTOS, CA 94022 UNITED STATES OF AMERICA
PCT International Classification Number H02M 1/084
PCT International Application Number PCT/US2004/039618
PCT International Filing date 2004-11-24
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 10/720,364 2003-11-24 U.S.A.