Title of Invention

PROCESSOR AND METHOD OF USING TWO SETS OF REGISTERS FOR MATRIX PROCESSING BY A PROCESSOR

Abstract A processor comprising: a first set of registers, the first set storing a matrix of data; a second set of registers coupled to the first set, the second set storing a transposed copy of the matrix of data; and logic that automatically modifies the second set of registers, so as to maintain the transposition relationship between the matrix of data stored in the first set of registers and the transpose of the matrix of data stored in the second set of registers, in response to a modification of the first set of registers.
Full Text FORM 2
THE PATENTS ACT 1970 [39 OF 1970]
COMPLETE SPECIFICATION [See Section 10]
"PEOCESSOR AND METHOD OF TWO SETS OF REGISTRES
FOR MATRIX PROCESSING BY A PROCESSOR"
INTEL CORPORATION, of 2200 Mission College Boulevard, Santa Clara, California 95052, United States of America,
The following specification particularly describes and ascertains the nature of the invention and the manner in which it is to be performed:-

REGISTERS FOR 2-D MATRIX PROCESSING
BACKGROUND
1. FIELD
The present invention relates generally to computer systems and, more specifically, to processor architectures.
2. DESCRIPTION
Some processors are designed to provide' extensions to their
instruction set architecture (ISA) for multimedia operations. For example,
the MMX™ instructions supported by Pentium® II, Pentium® III, and
Celeron™ processors, commercially available from Intel Corporation of Santa
Clara, California, implement various functions useful for multimedia
applications, such as digital signal processing, and audio and video
processing. These instructions support "single instruction multiple data"
(SIMD) operations on multimedia and communications data types. Although
the use of these instructions provide an improvement over combinations of
pre-existing instructions to perform a given function, and individual MMX™
instructions are efficient for some types of processing, various impediments
to faster multimedia processing still remain. For example, many
implementations of block-based image and video processing algorithms
(such as joint photographic experts group (JPEG) and moving picture expert
group (MPEG) schemes) result in the data, stored in a set of registers
accessible as operands for the MMX™ instructions, being transposed during
matrix mathematical operations. The transposition of data among registers
incurs significant overhead, however, thereby slowing overall processor
throughput for multimedia processing. Therefore, any techniques for
avoiding or minimizing these delays would be a valuable advance in the
processor art. I

SUMMARY
An embodiment of the present invention is a processor having a first set of registers, the first set storing a matrix of data, and a second set of registers coupled to the first set, the second set storing a transposed copy of the matrix of data.
Another embodiment of the present invention is a method of using two sets of registers for matrix processing by a processor. The method includes storing a matrix of data into a first set of registers, the first set of registers having a first number of registers, each register comprising a first number of storage units, each storage unit storing an element of the matrix, and transposing the matrix of data into a second set of registers, the second set of registers having a second number of registers, each register comprising a second number of storage units. The method also includes referencing one of the first set of registers to operate on a row of the matrix of data, and referencing one of the second set of registers to operate on a column of the matrix of data.
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of the present invention will become apparent from the following detailed description of the present invention in which:
Figure 1 is a diagram of a set of MMX™ registers according to the
prior art;
Figure 2 is a diagram of the set of MMX™ registers storing an 8 pixel
by 8 pixel block of image data;


Figure 3 is a diagram of the set of MMX™ registers storing the transposed 8 pixel by 8 pixel block of image data;
Figure 4 is a diagram of a virtual MMX™ register set linked to th MMX™ register set according to an embodiment of the present invention;
Figure 5 is a diagram of the set of virtual MMX™ registers storing the transposed 8 pixel by 8 pixel block of image data; and
Figure 6 is a diagram of a system having a processor with a MMX™ register set and a virtual MMX™ register set according to an embodiment of the present invention.
DETAILED DESCRIPTION
An embodiment of the present invention comprises a method anoV apparatus for extending MMX™ registers to be more efficient for two dimensional (2-D) matrix operations.
Reference in the specification to "one embodiment" or "an embodiment" of the present invention means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase "in one embodiment" appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
When executing an instruction, a processor typically references one or more register operands. For MMX™ instructions, the register operands may be one or more of a special set of registers called MMX™ registers. Figure 1 is a diagram of a set of MMX™ registers according to the prior art. In the register set 10 shown in Figure 1, there are eight MMX™ registers labeled mmO 12 through mm7 14. In other embodiments, the number of registers may be higher or lower than eight. Each register includes a

plurality of units of data, ordered from a low unit 1 6 to a high unit 1 8 as shown. In one embodiment, a unit comprises a byte. In other embodiments, a unit may comprise a word, a double word, or other storage unit. In at least one known system, the number of units (e.g., bytes) per MMX™ register is eight, although in other systems other numbers of units may be employed. To efficiently implement SIMD multimedia processing using the MMX™ registers, data to be processed should be aligned in such a way that multiple related data items are arranged in a single MMX™ register. For example, suppose an 8 pixel by 8 pixel block of image data is laid out in the MMX™ registers as shown in Figure 2, with each pixel value P(i, j) being represented in one unit and the overall set of registers representing a matrix. The 8 pixel by 8 pixel block may be a portion of a larger image. In this example, the first row of image data of the block is stored in the first MMX™ register, mmO 12, with the first column of the first row being stored in the low unit of mmO and the last column of the first row being stored in the high unit of mmO, the second row of image data is stored in the second MMX™ register, mm1 20, with the first column of the second row being stored in the low unit of mml and the last column of the second row being stored in the high unit of mm1, and so on.
Once the data is stored in the MMX™ registers as shown, the processor may execute instructions to efficiently operate on the 8x8 matrix one row at a time. This type of processing is commonly used in block-based imaging applications, for example, as well as other applications. For example, all of the data of row 0 may be added to the data of row 3 using a single MMX™ instruction as shown below,
PADDB MMO, MM3 ; add row 0 to row 3 and store results in
row 0.



However, to operate on the 8x8 matrix one column at a time presents problems because each column's data is spread out among the eight MMX™ registers. For example, the first column's data is distributed among the low units of mmO 12 through mm7 14, and the last column's data is distributed among the high units of mmO through mm7, respectively. To continue to gain the benefits of using MMX™ SIMD processing, it becomes necessary to transpose the 8x8 matrix as shown in Figure 3. Matrix transposition is well known in mathematics. After transposition, the first MMX™ register mmO 12 stores the original first column of data, with the first row of the original first column being stored in the low unit 16 and the last row of the original first column being stored in the high unit 18 as shown. Similarly, the other MMX™ registers store columns of the 8x8 matrix as shown.
Although an 8x8 matrix transposition may be performed using known pack and unpack instructions, this processing is very inefficient and incurs* significant processing overhead. Generally, transposition processing .for an 8x8 matrix may be implemented by the performance of four 4x4 matrix transpositions, with each transposition requiring at least 12 processing cycles on a Pentium®lll processor. Thus, at least 64 cycles are used merely to reposition the data so that an MMX™ instruction may be used to operate on the eight elements of a given column.
Embodiments of the present invention overcome the need to transpose the data in the MMX™ registers through a series of pack and unpack instructions by providing an additional set of registers in the processor to assist in multimedia processing. In embodiments of the present invention, an equivalent "mirror" set of the MMX™ registers is designed into the processor architecture. This register set may be called the virtual MMX register set, or VMX set. Figure 4 is a diagram of the virtual MMX register set linked to the MMX™ register set according to an embodiment of the present invention. In one embodiment, the number of MMX™ registers in MMX™ register set 10 is the same as the number of


VMX registers provided in VMX register set 22. In addition, in one embodiment, the number of VMX registers is greater than or equal to the number of units in each MMX™ register (with a unit being a byte, word, double word, or other storage unit, depending on the implementation). VMX register set 22 stores the transposed matrix data from the MMX™ register set and may be automatically updated by register update logic 23 whenever any unit of any register in the MMX™ set is modified. Hence, a load of one row of MMX™ register set 10 automatically results in a load of one column of VMX register set 22. For example, the data from the first MMX™ register mmO 1 2 may be automatically stored in the low units of the VMX registers VMO through VM7, the data from the second MMX™ register mm1 20 may be automatically stored in the second lowest units of the VMX registers, and so on.
Referring back to Figure 2, if the MMX™ registers are loaded with an 8x8 matrix designated P0,0 to P7,7 as shown, then the VMX registers may be automatically loaded by register update logic with the transposed matrix as shown in Figure 5. To operate on row elements of the matrix, a program may simply reference one or more of MMX™ registers mmO 12 through mm7 14. However, to operate on column elements of the matrix, a program references the VMX registers vmO 24 through vm7 26, instead. Since the MMX™ registers are mirrored in the processor hardware with the VMX registers, there are no coherency issues. All MMX™ instructions may be operated with either of the register sets by using references to the appropriate registers as needed. No changes to the instruction set of the processor are needed, only operand references may be changed in a program.
One advantage of embodiments of the present invention over existing processor architectures is that the invention provides greater parallelism for matrix operations. This is achieved by avoiding costly transposes for


column operations through the execution of multiple pack and unpack instructions.
One example of how the present invention may be employed is in the Discrete Cosine Transform (DCT) algorithm used for processing 8 pixel by 8 pixel blocks in many video processing schemes. Currently, to perform an 8x8 DCT, processing includes performing a 1x8 column transform first, transposing the 8x8 matrix, performing another 1x8 column transform, and then transposing the results again to get the DCT coefficients. The most optimized DCT currently running on Pentium® class processors takes about 300 cycles of processing. Of this amount, about 100 cycles are used to perform the transpose operations for an 8x8 matrix. Thus, implementation of the present invention yields an approximately 30% improvement for the DCT processing. Similar performance gains may be achieved for inverse DCTs. Although the example of a DCT is discussed herein, embodiments of the present invention may be useful for any matrix operations, including those used in various image and video compressions algorithms.
Although the present invention has been discussed above in the context of a two dimensional (2-D) matrix, the concept may be adapted to three or more dimensions. For example, a third register set may be included in the design of the processor to store other transpositions of the matrix data.
Figure 6 is a diagram of a system having a processor with a MMX™ register set and a virtual MMX™ register set according to an embodiment of the present invention. System 50 may include processor 52 coupled to memory 54. Processor 52 comprises various components well-known in the art, many of which have been omitted from Figure 6 for clarity. Instruction memory 56 stores instructions that may reference one or more of MMX™ registers 10 or one or more of VMX registers 22. Register update fogic 23 coordinates the automatic updating of the VMX registers 22 when the MMX™ registers change. Multiplexor (MUX) 58 selects data from


either MMX™ registers or VMX registers for input to arithmetic logic unit (ALU) 60. The ALU produces data for data memory 52.
The present invention allows MMX™ registers to be manipulated in a more intuitive manner. With the addition of the mirror register set, implementation of any block-based algorithms may be simplified and their performance increased. Some examples of applications which may benefit from the present invention include discrete cosine transforms (DCTs) used in video compression algorithms, matrix transforms in three dimensional {3-D) graphics algorithms, and others.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments of the invention, which are apparent to persons skilled in the art to which the inventions pertains are deemed to lie within the spirit* and scope of the invention.


WE CLAIM :
1. A processor comprising:
a first set of registers, the first set storing a matrix of data;
a second set of registers coupled to the first set, the second set storing a transposed copy of the matrix of data; and
logic that automatically modifies the second set of registers, so as to maintain the transposition relationship between the matrix of data stored in the first set of registers and the transpose of the matrix of data stored in the second set of registers, in response to a modification of the first set of registers.
2. The processor as claimed in claim 1, wherein a modification of a row of the matrix in the first set of registers automatically results in a corresponding modification of a column of the transposed copy of the matrix in the second set of registers.
3. The processor as claimed in claim 1, wherein the first set of registers comprises a first number of registers, each register comprising a first number of storage units, the second set of registers comprises a second number of registers, each register comprising a second number of storage units, and the second number of storage units is greater than or equal to the first number of registers.
4. The processor as claimed in claim 3, wherein the first number of registers is equal to the second number of registers.
5. The processor as claimed in claim 4 wherein the first set of registers comprise multi-media extensions registers, the first number of registers is eight, and the matrix of data comprises image data.
6. The processor as claimed in claim 1, wherein the processor executes an instruction referencing one of the first set of registers to operate on a row of the matrix of data and executes an instruction referencing one of the second set of registers to operate on a column of the matrix of data.

7. A method of using two sets of registers for matrix processing by a processor as defined in claim 1 comprising:
storing a matrix of data into a first set of registers, the first set of registers comprising a first number of registers, each register comprising a first number of storage units, each storage unit storing an element of the matrix;
transposing the matrix of data into a second set of registers, the second set of registers comprising a second number of registers, each register comprising a second number of storage units;
automatically modifying the second set of registers, so as to maintain the transposition relationship between the matrix of data stored in the first set of registers and the transpose of the matrix of data stored in the second set of registers, in response to a modification of the first set of registers;
referencing one of the first set of registers to operate on a row of the matrix of data; and
referencing one of the second set of registers to operate on a column of the matrix of data.
8.
The method as claimed in claim 7, wherein modifying a row of the matrix in the first set of registers and automatically modifying a corresponding column of the transposed matrix in the second set of registers.
9. The method as claimed in claim 7, wherein performing a transform operation on column data stored in one of the second set of registers.
10. The method as claimed in claim 9, wherein performing the transform operation comprises performing a discrete cosine transform operation on column data stored in one of the second set of registers.
11. The method as claimed in claim 7, wherein the second number of storage units is greater than or equal to the first number of registers.
12. The method as claimed in claim 7, wherein the first number of registers is equal to the second number of registers.

13. The method as claimed in claim 7, wherein the first set of registers comprise multi-media extensions registers, the first number of registers is eight, and the matrix of data comprises image data.
14. A system comprising: a memory;
a processor as defined in claim 1 coupled to the memory, the processor
comprising:
a first set of registers, the first set storing a matrix of data; and
a second set of registers coupled to the first set, the second set storing
a transposed copy of the matrix of data; and
logic that automatically modifies the second set of registers so as to
maintain the transposition relationship between the matrix of data stored in
the first set of registers and the transpose of the matrix of data stored in the
second set of registers, in response to a modification of the first set of
registers.
15. The system as claimed in claim 14, wherein a modification of a row of the matrix in the first set of registers automatically results in a corresponding modification of a column of the transposed copy of the matrix in the second set of registers.
16. The system as claimed in claim 14, wherein the first set of registers comprises a first number of registers, each register comprising a first number of storage units, the second set of registers comprises a second number of registers, each register comprising a second number of storage units, and the second number of storage units is greater than or equal to the first number of registers.
17. The system as claimed in claim 16, wherein the first number of registers is equal to the second number of registers.
18. The system as claimed in claim 17, wherein the first set of registers comprise multi-media extensions registers, the first number of registers is eight, and the matrix of data comprises image data.


19. The system as claimed in claim 14, wherein the processor executes an instruction referencing one of the first set of registers to operate on a row of the matrix of data and executes an instruction referencing one of the second set of registers to operate on a column of the matrix of data.
20. A method of using two sets of registers for discrete cosine transform (DCT) processing of a matrix of image data by a processor as defined in claim 1 comprising:
storing the matrix into a first set of registers, the first set of registers comprising a first number of registers, each register comprising a first number of storage units, each storage unit storing an element of the matrix;
transposing the matrix into a second set of registers, the second set of registers comprising a second number of registers, each register comprising a second number of storage units;
automatically modifying the second set of registers, so as to maintain the transposition relationship between the matrix of data stored in the first set of registers and the transpose of the matrix of data stored in the second set of registers, in response to a modification of the first set of registers; and
performing DCT processing, at least in part, by referencing one of the second set of registers to operate on a column of the matrix.
21. The method as claimed in claim 20, wherein modifying a row of the matrix in
the first set of registers and automatically modifying a corresponding column
of the transposed matrix in the second set of registers.

Dated this 22nd day 0f January, 2002.



(JAYANTA PAL) OF REMFRY & SAGAR ATTORNEY FOR THE APPLICANTS

Documents:

abstract1.jpg

in-pct-2002-00083-mum-cancelled pages(22-01-2002).pdf

in-pct-2002-00083-mum-claims(granted)-(22-01-2002).doc

in-pct-2002-00083-mum-claims(granted)-(22-01-2002).pdf

in-pct-2002-00083-mum-correspondence(13-01-2005).pdf

IN-PCT-2002-00083-MUM-CORRESPONDENCE(15-9-2009).pdf

IN-PCT-2002-00083-MUM-CORRESPONDENCE(9-6-2009).pdf

in-pct-2002-00083-mum-correspondence(ipo)-(20-01-2004).pdf

in-pct-2002-00083-mum-drawing(22-01-2002).pdf

in-pct-2002-00083-mum-form 1(22-01-2002).pdf

in-pct-2002-00083-mum-form 13(15-9-2009).pdf

in-pct-2002-00083-mum-form 19(23-01-2003).pdf

in-pct-2002-00083-mum-form 2(granted)-(22-01-2002).doc

in-pct-2002-00083-mum-form 2(granted)-(22-01-2002).pdf

IN-PCT-2002-00083-MUM-FORM 26(15-9-2009).pdf

in-pct-2002-00083-mum-form 3(11-01-2005).pdf

in-pct-2002-00083-mum-form 3(22-01-2002).pdf

in-pct-2002-00083-mum-form 5(22-01-2002).pdf

in-pct-2002-00083-mum-form-pct-ipea-409(22-01-2002).pdf

in-pct-2002-00083-mum-form-pct-isa-210(22-01-2002).pdf

in-pct-2002-00083-mum-petition under rule 137(13-01-2005).pdf

in-pct-2002-00083-mum-petition under rule 138(13-01-2005).pdf

in-pct-2002-00083-mum-power of authority(03-08-2001).pdf

in-pct-2002-00083-mum-power of authority(11-01-2005).pdf


Patent Number 205594
Indian Patent Application Number IN/PCT/2002/00083/MUM
PG Journal Number 26/2007
Publication Date 29-Jun-2007
Grant Date 05-Apr-2007
Date of Filing 22-Jan-2002
Name of Patentee INTEL CORPORATION
Applicant Address 2200 MISSION COLLEGE BOULEVARD, SANTA CLARA, CALIFORNIA 95052, UNITED STATES OF AMERICA.
Inventors:
# Inventor's Name Inventor's Address
1 GEORGE K. CHEN 15115 NW FRANCESCA DRIVE, PORTLAND, OR 97229, UNITED STATES OF AMERICA.
PCT International Classification Number G06F 9/30
PCT International Application Number PCT/US00/17630
PCT International Filing date 2000-06-26
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 09/360,612 1999-07-26 U.S.A.