Title of Invention	A PROCESS OF CACHE MANAGEMENT
Abstract	A system and method for flushing a cache line associated with a linear memory address from all caches in the coherency domain. A cache controller receives a memory address and determines whether the memory address is stored within the closest cache memory in the coherency domain. If a cache line stores the memory address, it is flushed from the cache. The flush instruction is allocated to a write-combining buffer within the cache controller. The write-combining buffer transmits the information to the bus controller. The bus controller locates instances of the memory address stored within external and internal cache memories within the coherency domain; these instances are flushed. The flush instruction can then be evicted from the write-combining buffer. Control bits may be used to indicate whether a write-combining buffer is allocated to the flush instruction, whether the memory address is stored within the closest cache memory, and whether the flush instruction should be evicted from the write-combining buffer.

Full Text

FORM 2
THE PATENTS ACT 1970
[39 OF 1970]
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
[See Section 10; rule 13]
"CACHE LINE FLUSH MICRO-ARCHITECTURAL IMPLEMENTATION METHOD AND SYSTEM"
We, INTEL CORPORATION, a Delaware Corporation, of 2200 Mission College Boulevard, Santa Clara, California 95054, United States of America,
The following specification particularly describes the invention and the manner in which it is to be performed:

CLFLUSH MICRO-ARCHITECTURAL IMPLEMENTATION METHOD AND SYSTEM
BACKGROUND
Field of the Invention
The present invention relates in general to computer architecture, and in particular to a method and system that allow a processor to flush a cache line associated with a linear memory address from all caches in the coherency domain.
Description of the Related Art
A cache memory device is a small, fast memory that is available to contain the most frequently accessed data (or "words") from a larger, slower memory.
Dynamic random access memory (DRAM) provides large amounts of storage capacity at a relatively low cost. Unfortunately, access to dynamic random access memory is slow relative to the processing speed of modern microprocessors. A cost-effective solution is to provide a static random access memory (SRAM) cache memory, or cache memory physically located on the processor. Even though the storage capacity of the cache memory may be relatively small, it provides high-speed access to the data stored therein.
The operating principle behind cache memory is as follows. The first time an instruction or data location is addressed, it must be accessed from the lower speed memory. The instruction or data is then stored in cache memory. Subsequent accesses to the same instruction or data are done via the faster cache memory, thereby minimizing access time and enhancing overall system performance.
However, since the storage capacity of the cache is limited, and typically is much smaller than the storage capacity of system memory, the cache is often filled and some of its contents must be changed as new instructions or data are accessed.
The cache is managed, in various ways, so that it stores the instruction or data most likely to be needed at a given time. When the cache is accessed and contains the requested data, a cache "hit" occurs. Otherwise, if the cache does not contain the requested data, a cache "miss" occurs. Thus, the cache contents are typically managed in an attempt to maximize the cache hit-to-miss ratio.
With current systems, flushing a specific memory address in a cache requires knowledge of the cache memory replacement algorithm.
A cache, in its entirety, may be flushed periodically, or when certain predefined conditions are met. Furthermore, individual cache lines may be flushed as part of a replacement algorithm. In systems that contain a cache, a cache line is the complete data portion that is exchanged between the cache and the main memory. In each case, dirty data is written to main memory. Dirty data is defined as data, not yet written to main memory, in the cache to be flushed or in the cache line to be flushed. Dirty bits, which identify blocks of a cache line containing dirty data, are then cleared. The flushed cache or flushed cache lines can then store new blocks of data.
If a cache flush is scheduled or if predetermined conditions for a cache flush are met, the cache is flushed. That is, all dirty data in the cache is written to the main memory.
For the Intel family of P6 microprocessors (e.g., Pentium II, Celeron), for example, there exists a set of micro-operations used to flush cache lines at specified cache levels given a cache set and way; however, there is no such micro-operation to flush a cache line given its memory address.
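The terms used above (hit, miss, dirty data, flush) can be made concrete with a small model. The following is a minimal sketch, assuming a direct-mapped write-back cache with 32-byte lines; the class and method names (`ToyCache`, `flush_address`) and the cache geometry are illustrative and do not come from the specification.

```python
from dataclasses import dataclass, field

LINE_SIZE = 32  # bytes per cache line (assumed for illustration)
NUM_SETS = 4    # deliberately tiny cache


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0
    data: int = 0  # one integer stands in for the whole line's contents


@dataclass
class ToyCache:
    memory: dict = field(default_factory=dict)  # line address -> data
    sets: list = field(default_factory=lambda: [Line() for _ in range(NUM_SETS)])

    def _locate(self, addr):
        line_addr = addr // LINE_SIZE
        return line_addr % NUM_SETS, line_addr // NUM_SETS  # (index, tag)

    def _evict(self, index):
        line = self.sets[index]
        if line.valid and line.dirty:
            # Dirty data has not reached main memory yet: write it back.
            self.memory[line.tag * NUM_SETS + index] = line.data
        line.valid = line.dirty = False

    def write(self, addr, value):
        """Write-back policy: update only the cache and mark the line dirty.
        Returns True on a hit, False on a miss."""
        index, tag = self._locate(addr)
        line = self.sets[index]
        hit = line.valid and line.tag == tag
        if not hit:
            self._evict(index)
            line.valid, line.tag = True, tag
        line.data, line.dirty = value, True
        return hit

    def flush_address(self, addr):
        """Flush the line that holds addr, if cached. Returns True on a hit."""
        index, tag = self._locate(addr)
        line = self.sets[index]
        if line.valid and line.tag == tag:
            self._evict(index)
            return True
        return False
```

In this toy model a flush by address is easy because the mapping from address to line is fixed; with a set-associative cache and a replacement policy, the way that holds a given address is not known in advance, which is the difficulty the paragraph above points to.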
Systems that require high data access continuously flush data as it becomes dirty. The situation is particularly acute in systems that require high data flow between the processor and system memory, such as the case in high-end graphics pixel manipulation for 3-D and video performances. The problems with current systems are that high bandwidth between the cache and system memory is required to accommodate the copies from write combining memory and write back memory.
Thus, what is needed is a method and system that allow a processor to flush the cache line associated with a linear memory address from all caches in the coherency domain.
SUMMARY
The cache line flush (CLFLUSH) micro-architectural implementation process and system allow a processor to flush a cache line associated with a linear memory address from all caches in the coherency domain. The processor receives a memory address. Once the memory address is received, it is determined whether the memory address is stored within a cache memory. If the memory address is stored within the cache, the memory address is flushed from the cache.
BRIEF DESCRIPTION OF THE DRAWINGS
The inventions claimed herein will be described in detail with reference to the drawings in which reference characters identify correspondingly throughout and wherein:
FIG. 1 illustrates a microprocessor architecture; and
FIG. 2 flowcharts an embodiment of the cache line flush process.
DETAILED DESCRIPTION
By definition, a cache line is either completely valid or completely invalid; a cache line may never be partially valid. For example, when the processor only wishes to read one byte, all the bytes of an applicable cache line must be stored in the cache; otherwise, a cache miss will occur. The cache line forms the actual cache memory. A cache directory is used only for cache management. Cache lines usually contain more data than it is possible to transfer in a single bus cycle.
For this reason, most cache controllers implement a burst mode, in which pre-set address sequences enable data to be transferred more quickly through a bus. This is used for cache line fills, or for writing back cache lines, because these cache lines represent a continuous and aligned address area.
A technique to flush the cache line can be associated with a linear memory address. Upon execution, the technique flushes the cache line associated with the operand from all caches in the coherency domain. In a multi-processor environment, for example, the specified cache line is flushed from all cache hierarchy levels in all microprocessors in the system (i.e., the coherency domain), depending on processor state. The MESI (Modified, Exclusive, Shared, Invalid) protocol, a write-invalidate protocol, gives every cache line one of four states which are managed by two MESI-bits. The four states also identify the four possible states of a cache line. If the processor is found in "exclusive" or "shared" states, the flushing equates to the cache line being invalidated. Another example is true when the processor is found in "modified" state. If a cache controller implements a write-back strategy and, with a cache hit, only writes data from the processor to its cache, the cache line content must be transferred to the main memory, and the cache line is invalidated.
When compared to other memory macroinstructions, the cache line flush (CLFLUSH) method is not strongly ordered, regardless of the memory type associated with the cache line flush macroinstruction. In contrast, the behavior in the memory sub-system of the processor is weakly ordered. Other macro-instructions can be used to strongly order and guarantee memory access loads, stores, fences, and other serializing instructions, immediately prior to and right after CLFLUSH.
A micro-operation, named "clflush_micro_op", is used to implement the CLFLUSH macroinstruction.
Moving to FIG. 1, an example microprocessor's memory and bus subsystems is shown with the flow of loads and stores. In FIG. 1, two cache levels are assumed in the microprocessor: an on-chip ("L1") cache being the cache level closest to the processor, and second level ("L2") cache being the cache level farthest from the processor. An instruction fetch unit 102 fetches macroinstructions for an instructions decoder unit 104. The decoder unit 104 decodes the macroinstructions into a stream of microinstructions, which are forwarded to a reservation station 106, and a reorder buffer and register file 108. As an instruction enters the memory subsystem, it is allocated in the load 112 or store buffer 114, depending on whether it is a read or a write memory macroinstruction, respectively. In the unit of the memory subsystem where such buffers reside, the instruction goes through memory ordering checks by the memory ordering unit 110. If no memory dependencies exist, the instruction is dispatched to the next unit in the memory subsystem after undergoing the physical address translation. At the L1 cache controller 120, it is determined whether there is an L1 cache hit or miss. In the case of a miss, the instruction is allocated into a set of buffers, from where it is dispatched to the bus sub-system 140 of the microprocessor. In case of a cacheable load miss, the instruction is sent to read buffers 122, or in the case of a cacheable store miss, the instruction is sent to write buffers 130. The write buffers may be either weakly ordered write combining buffers 132 or non-write combining buffers 134. In the bus controller unit 140, the read or write micro-operation is allocated into an out-of-order queue 144. If the micro-operation is cacheable, the L2 cache 146 is checked for a hit/miss. If a miss, the instruction is sent through an in-order queue 142 to the frontside bus 150 to retrieve or update the desired data from main memory.
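Returning to the MESI states described earlier in this section, the effect of a flush on a single line can be summarized as a small state function. This is a sketch only; the enum and the function name `flush_line` are assumptions made for illustration, and treating a flush of an already invalid line as a no-op is an inference rather than a statement from the text.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def flush_line(state):
    """Return (write_back_needed, new_state) for flushing a line in `state`.

    Modified lines hold the only up-to-date copy, so their content must be
    transferred to main memory before invalidation. Exclusive and shared
    lines are clean and are simply invalidated. An invalid line is left
    invalid (assumed no-op).
    """
    if state is Mesi.MODIFIED:
        return True, Mesi.INVALID
    return False, Mesi.INVALID
```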
The flow of the "clflush_micro_op" micro-operation through the processor memory subsystem is also described in FIG. 2. Initially, the instruction fetch unit 102 retrieves a cache line flush instruction, block 202. In block 204, the cache line flush instruction is decoded into the "clflush_micro_op" micro-operation by the instructions decoder unit 104. The micro-operation is then forwarded to a reservation station 106, and a reorder buffer and register file 108, block 206. The "clflush_micro_op" micro-operation is dispatched to the memory subsystem on a load port, block 208. It is allocated an entry in the load buffer 112 in the memory ordering unit 110. For split accesses calculation in the memory ordering unit 110, the data size of the micro-operation is masked to one byte in order to avoid cache line splits; however, upon execution, the whole cache line will be flushed.
The behavior of the "clflush_micro_op" in the memory-ordering unit 110 is speculative. Simply put, this means that the "clflush_micro_op" can execute out of order with respect to other CLFLUSH macroinstructions, loads and stores. Unless memory access fencing (termed "MFENCE") instructions are used appropriately (immediately before and after the CLFLUSH macro-instruction), execution of the "clflush_micro_op" with respect to other memory loads and stores is not guaranteed to be in order, provided there are no address dependencies. The behavior of CLFLUSH through the memory subsystem is weakly ordered. The following tables list the ordering constraints on CLFLUSH. Table 1 lists the ordering constraint effects of later memory access commands compared to an earlier CLFLUSH. Table 2 lists the converse of Table 1, displaying the ordering constraint effects of earlier memory access commands compared to a later CLFLUSH instruction. The memory access types listed are uncacheable (UC) memory, write back (WB) memory, and uncacheable speculative write combining (USWC) memory accesses.
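The ordering rules that Tables 1 and 2 tabulate can also be expressed as a lookup. The sketch below is one possible encoding, not part of the specification: the key strings and the function name `later_can_pass` are hypothetical, and the MFENCE entry of Table 1 is read as "cannot pass", which is an interpretation of a poorly legible cell.

```python
# Table 1: may a LATER access pass an EARLIER (older) CLFLUSH?
LATER_PASSES_OLDER_CLFLUSH = {
    "UC load": False, "UC store": False,
    "WB load": True, "WB store": True,
    "USWC load": True, "USWC store": True,
    "CLFLUSH": True,
    "MFENCE": False,  # assumed reading of the garbled table cell
}

# Table 2: may a LATER (younger) CLFLUSH pass an EARLIER access?
YOUNGER_CLFLUSH_PASSES = {
    "UC load": True, "UC store": True,
    "WB load": True, "WB store": True,
    "USWC load": True, "USWC store": True,
    "CLFLUSH": True,
    "MFENCE": False,
}


def later_can_pass(earlier, later):
    """True if `later` may be reordered ahead of `earlier`.

    Only pairs in which at least one side is CLFLUSH are covered,
    because that is all the two tables describe.
    """
    if earlier == "CLFLUSH":
        return LATER_PASSES_OLDER_CLFLUSH[later]
    if later == "CLFLUSH":
        return YOUNGER_CLFLUSH_PASSES[earlier]
    raise ValueError("tables only cover pairs that involve CLFLUSH")
```

The two `MFENCE` entries are what make the fence-before and fence-after pattern described above work: nothing the tables list can cross a fence together with CLFLUSH.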
                  Later access
Earlier access    UC memory       WB memory       USWC memory     CLFLUSH   MFENCE
                  Load    Store   Load    Store   Load    Store
CLFLUSH           N       N       Y       Y       Y       Y       Y         N

Note: N = cannot pass, Y = can pass.
Table 1: Memory ordering of instructions with respect to an older CLFLUSH

Earlier access            Later access: CLFLUSH
UC memory      Load       Y
               Store      Y
WB memory      Load       Y
               Store      Y
USWC memory    Load       Y
               Store      Y
CLFLUSH                   Y
MFENCE                    N

Table 2: Memory ordering of instructions with respect to a younger CLFLUSH

From the memory-ordering unit 110, the "clflush_micro_op" micro-operation is dispatched to the L1 cache controller unit 120, block 210. The "clflush_micro_op" micro-operation is dispatched on the load port; however, it is allocated in a write combining buffer 132, as if it were a store. From the L1 cache controller unit forward, the "clflush_micro_op" is switched from the load to the store pipe.
Decision block 212 determines whether any write combining buffers 132 are available. If none are available, flow returns to block 210. Otherwise, flow continues into block 214. Regardless of the memory type and whether it hits or misses the L1 cache, a write combining buffer 132 is allocated to service an incoming "clflush_micro_op," block 214. A control field is added to each write combining buffer 132 in the L1 cache controller unit to determine which self-snoop attributes need to be sent to the bus controller 140. This control bit, named "clflush_miss," is set exclusively for a "clflush_micro_op" that misses the L1 cache.
Upon entering the memory sub-system of the microprocessor, several bits of the address that enable cache line access of a "clflush_micro_op" are zeroed out, block 216. In the Pentium Pro family of microprocessors, these would be the lower five bits of the address (address[4:0]). This is done in both the L1 cache and L2 cache controller units 120, upon executing the flush command. The zeroing out helps to determine a cache line hit or miss.
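As a small worked example of the address zeroing just described, clearing address[4:0] maps every byte address within a 32-byte line to the same line address. The helper names below are illustrative only.

```python
LINE_OFFSET_BITS = 5                     # address[4:0], as in the Pentium Pro example
LINE_MASK = (1 << LINE_OFFSET_BITS) - 1  # 0x1F


def line_address(addr):
    """Zero the low five bits so that only the line-granular address remains."""
    return addr & ~LINE_MASK


def same_line(a, b):
    """Two byte addresses fall in the same cache line if their line addresses match."""
    return line_address(a) == line_address(b)
```

For instance, `line_address(0x12345678)` yields `0x12345660`, and the addresses `0x1000` through `0x101F` all compare equal under `same_line`, while `0x1020` starts the next line.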
Since only tag match determines a hit or miss, no byte enable comparison is needed. Note that by definition, no partial hit is possible. A hit or miss is always a full line hit or miss. Zeroing out the address bits [4:0] also provides an alternative mechanism to the one used in the memory ordering unit 110 to mask line split accesses. In split accesses the data size of the transaction is masked to one byte.
Another control bit added to each write combining buffer 132 in the L1 cache controller unit 120 is used to differentiate between a write combining buffer 132 allocated for a "clflush_micro_op" and another one allocated for a write combining store, block 218. This control bit, named "clflush_op," is exclusively set for those write combining buffers allocated to service a "clflush_micro_op". It is used to select the request type and flush attributes sent from the L1 cache controller 120 to the bus controller 140.
In the case of an L1 cache hit, as determined by decision block 222, both "flush L1" and "flush L2" attributes are sent to the bus controller 140 upon dispatch from the L1 cache controller unit 120, blocks 224 and 226. The bus controller 140 contains both the L2 cache 146 and external bus controller units.
Alternatively, in the case of an L1 cache miss, as determined by decision block 222, the "clflush_miss" control bit is set, and only the "flush L2" attribute is sent, blocks 228 and 232. This helps improve performance by omitting the internal self-snoop to the L1 cache.
Upon its dispatch from the memory-ordering unit 110, the "clflush_micro_op" micro-operation is blocked by the L1 cache controller unit 120 if there are no write combining buffers 132 available, block 212. In such a case, it also evicts a write-combining buffer 132, as pointed to by the write-combining circular allocation pointer. This guarantees no deadlock conditions due to the lack of free write combining buffers 132.
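The blocking and forced-eviction behavior at block 212 can be sketched as a tiny buffer pool. This is an illustrative model under stated assumptions: the class `WcBufferPool`, its first-free search, and the way the circular pointer advances are choices made here for the example and are not details given in the specification.

```python
class WcBufferPool:
    """Toy pool of write-combining buffers with a circular eviction pointer."""

    def __init__(self, count):
        self.in_use = [False] * count
        self.evict_ptr = 0     # circular pointer: next buffer to evict when full
        self.evicting = set()  # buffers whose eviction is in flight

    def try_allocate(self):
        """Return a free buffer index, or None if the micro-op must block."""
        for index, busy in enumerate(self.in_use):
            if not busy:
                self.in_use[index] = True
                return index
        # Pool is full: start evicting the buffer under the pointer so that
        # one is guaranteed to free up, then block the requester.
        self.evicting.add(self.evict_ptr)
        self.evict_ptr = (self.evict_ptr + 1) % len(self.in_use)
        return None

    def eviction_done(self, index):
        """Eviction finished: the buffer is free and a blocked micro-op may retry."""
        self.evicting.discard(index)
        self.in_use[index] = False
```

Because a blocked request always triggers an eviction, some buffer eventually becomes free and the retry succeeds, which is the no-deadlock property the paragraph above describes.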
If blocked, the "clflush_micro_op" is redispatched once the blocking condition is removed. An example of an event that would cause the dispatching of the "clflush_micro_op" instruction is the completed eviction of a previously allocated write-combining buffer 132.
The "clflush_micro_op" micro-operation is retired by the memory subsystem upon being allocated into a write-combining buffer 132 in the L1 cache controller 120. This allows pipelining: subsequent instructions can proceed with execution prior to completion of the "clflush_micro_op" micro-operation. The pipelining improves the overall performance of the system.
There are two methods to evict a write-combining buffer servicing a "clflush_micro_op" micro-operation.
A write combining buffer 132 servicing a "clflush_micro_op" is evicted under the same eviction conditions that currently apply to write combining buffers 132 in the family of Intel P6 microprocessors. Moreover, fencing microinstructions also evict a write-combining buffer that services a "clflush_micro_op" micro-operation.
Additionally, some embodiments evict a "clflush_micro_op" exclusively. This is done to avoid leaving a write combining buffer servicing a "clflush_micro_op" stranded (pending) for a long period of time when the programmer does not want to enforce ordering and a fencing instruction is not used. A control bit, named "clflush_evict", is associated with each write-combining buffer 132 servicing a "clflush_micro_op". This control bit is set when a write combining buffer 132 is allocated to a "clflush_micro_op." Once the "clflush_evict" bit is set, the corresponding write-combining buffer is marked for eviction and the control bit is reset, block 230. This eviction condition applies exclusively to write combining buffers 132 servicing a "clflush_micro_op" micro-operation.
It improves performance of programs using CLFLUSH by not allowing "clflush_micro_op" micro-operations to take up the write combining buffer 132 resources for extended periods of time, and consequently, freeing them up for other write combining operations.

"clflush_miss"   "clflush_op"   Request type   "Flush L1"   "Flush L2"   New transaction
control bit      control bit                   attribute    attribute
'0               '0             Non-CLFLUSH    -            -            NO
'0               '1             CLFLUSH        '1           '1           YES
'1               '0             N/A            N/A          N/A          Illegal combination
'1               '1             CLFLUSH        '0           '1           YES

Table 3: Memory to Bus Transactions for CLFLUSH

Note that if "clflush_miss" = "clflush_op" = '0' the request type is any of the existing transactions in the P6 family of microprocessors (but not CLFLUSH), and the flush attributes will be set/cleared accordingly.
Table 4 below shows the conditions under which the three write combining buffer 132 control bits are set and reset. The "clflush_evict" control bit can only be set after the "clflush_micro_op" control bit. The "clflush_micro_op" control bit will be set on speculative write combining buffer 132 allocations, while "clflush_evict" will exclusively be set on a real write combining buffer 132 allocation for a "clflush_op". The "clflush_miss" control bit is also set on speculative write combining buffer 132 allocations, if the "clflush_micro_op" misses the L1 cache. Both the "clflush_miss" and "clflush_op" control bits are cleared upon speculative allocation of a write-combining buffer 132 to service any instruction other than a "clflush_micro_op." Functionally, this is similar to clearing such control bits upon deallocation of a write-combining buffer servicing a "clflush_micro_op." In a processor implementation, the same write buffers 130 are shared for write combining and non-write combining micro-operations. The "clflush_miss" and "clflush_micro_op" bits are cleared upon speculative allocation of any write buffer 130, not just a write combining buffer 132.
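Table 3 above amounts to a two-bit decode. A minimal sketch of that decode follows; the function name `bus_request` and the dictionary layout are hypothetical, and raising an exception for the illegal combination is a choice made for the example.

```python
def bus_request(clflush_miss, clflush_op):
    """Map the two write-combining buffer control bits to the request
    attributes sent to the bus controller, following Table 3."""
    if not clflush_op:
        if clflush_miss:
            # '1/'0 is listed as an illegal combination in Table 3.
            raise ValueError("clflush_miss set without clflush_op")
        # Any existing (non-CLFLUSH) transaction; flush attributes are
        # determined elsewhere, so they are left unspecified here.
        return {"request": "non-CLFLUSH", "flush_l1": None,
                "flush_l2": None, "new_transaction": False}
    return {"request": "CLFLUSH",
            "flush_l1": not clflush_miss,  # skip the L1 self-snoop on an L1 miss
            "flush_l2": True,
            "new_transaction": True}
```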
This behavior ensures that the three control bits can never be set for a write buffer 130 not servicing a "clflush_micro_op." In a processor implementation where all L1 cache controller buffers are shared for both reads and writes, such as in the family of P6 microprocessors, the "clflush_miss" and "clflush_micro_op" control bits only need to be cleared upon allocation of a buffer to service a store, block 234. Buffers allocated to service loads ignore the value of these three new control bits.

Control bit: "clflush_op"
  Set: Upon allocation of a write combining buffer to service a "clflush_micro_op"
  Clear: Upon allocation of a write buffer for something other than a "clflush_micro_op"
Control bit: "clflush_evict"
  Set: Immediately after allocation of a write combining buffer to service a "clflush_micro_op" (i.e., WC buffer allocated, "in use", and "clflush_op" control bit set)
  Clear: Upon eviction of a write combining buffer (i.e., "WC mode" control bit set)
Control bit: "clflush_miss"
  Set: Upon allocation in a write combining buffer of a "clflush_micro_op" that misses the L1 cache
  Clear: Upon allocation of a write buffer for something other than a "clflush_micro_op"

Note that all three new WC buffer control bits are cleared upon a "reset" sequence as well.
Table 4: Conditions to set/clear the new control bits of a write-combining buffer in the L1 cache controller

Embodiments may be implemented utilizing the bus controller 140. When a write-combining buffer 132 servicing a "clflush_micro_op" is marked for eviction, it is dispatched to the bus controller 140, block 236. The request sent is the same as if it was for a full line cacheable write combining transaction, except for the self-snoop attributes. Snooping is used to verify whether a specific memory address is present in the applicable cache. For a "clflush_micro_op" eviction, the bus controller 140 self-snoops the L1 and L2 caches based on the "flush L1" and "flush L2" request attributes, block 250.
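The set and clear conditions of Table 4 can be sketched as a small state holder for one write buffer. The class and method names are hypothetical. One simplification goes beyond the table and is flagged in the code: the sketch also clears "clflush_miss" when a new CLFLUSH allocation hits the L1 cache.

```python
from dataclasses import dataclass


@dataclass
class WcControlBits:
    clflush_op: bool = False
    clflush_evict: bool = False
    clflush_miss: bool = False

    def allocate(self, is_clflush, l1_hit=True):
        """Apply the Table 4 rules when the buffer is allocated."""
        if is_clflush:
            self.clflush_op = True
            # Set on an L1 miss. Clearing it on a hit is a simplification
            # of this sketch and is not stated in Table 4.
            self.clflush_miss = not l1_hit
            # Set immediately after allocation, marking the buffer for eviction.
            self.clflush_evict = True
        else:
            # Any other use of the write buffer clears the two bits.
            self.clflush_op = False
            self.clflush_miss = False

    def evict(self):
        """Eviction of the buffer clears the eviction request bit."""
        self.clflush_evict = False

    def reset(self):
        """A reset sequence clears all three bits."""
        self.clflush_op = self.clflush_evict = self.clflush_miss = False
```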
Furthermore, the bus controller 140 issues a "bus read invalidate line" on the external bus, block 236. If the L1 cache controller unit 120 determines an L1 cache miss, for example, no "flush L1" message is sent. The "bus read invalidate line" transaction flushes hits to the same line in any other caches in the coherency domain. On the external bus transaction, all byte enables are deasserted, masking the data phase from the core. Decision blocks 238 and 252 determine whether a hit for a modified cache line (HITM) has occurred in another cache within the coherency domain (i.e., not the L1 or L2 caches in the requesting microprocessor). If the HITM occurs, the cache that is hit does a write back to main memory, and data is returned to the requesting microprocessor in blocks 244 and 254. The write combining buffer 132 in the L1 cache controller unit 120 remains allocated until completion of the snoop phase and possible transfer of data back from another cache in the coherency domain, for example, a HITM on an external bus. Data coming back to the write-combining buffer 132 as a result of the snoop phase or inquiry cycle is ignored, blocks 246 and 248.
All flushes are then completed, and the write combining buffers 132 are deallocated in block 260.
Table 5 below shows how the external bus controller 140 treats all write-combining evictions. The request from the L1 cache to the bus controller 140 for a "clflush_micro_op" eviction, such as the CLFLUSH macro-instruction, can be overloaded on the same request signals as that for a full line cacheable write combining eviction; however, the self-snoop attributes differ.

Table 5: External bus controller transactions for write combining evictions

Request type: Partial cacheable write combining
  External bus transaction: Read Invalidate
  Transaction length: 32 byte
  Byte enables: All byte enables asserted
  Flush L1: NO
  Flush L2: NO
  New: NO
Request type: Full line cacheable write combining
  External bus transaction: Invalidate
  Transaction length: 32 byte
  Byte enables: All byte enables deasserted
  Flush L1: NO
  Flush L2: NO
  New: NO
Request type: Partial uncacheable write combining
  External bus transaction: Memory write (write type)
  Transaction length: [illegible in source]
  Byte enables: Byte enables as sent from L1 cache controller
  Flush L1: [illegible in source]
  Flush L2: Only non-temporal stores that miss L1 cache
  New: [illegible in source]
Request type: Full line uncacheable write combining
  External bus transaction: Memory write (writeback type)
  Transaction length: 32 byte
  Byte enables: All byte enables asserted
  Flush L1: NO
  Flush L2: Only non-temporal stores that miss L1 cache
  New: NO
Request type: CLFLUSH
  External bus transaction: Bus Read Invalidate
  Transaction length: 32 byte
  Byte enables: All byte enables deasserted
  Flush L1: Only L1 hits
  Flush L2: YES
  New: YES

Note: USWC stores are not memory aliased in the P6 family of microprocessors, and therefore, they are not self-snooped.
For testability and debug purposes, a non-user visible mode bit can be added to enable/disable the CLFLUSH macroinstruction. If disabled, the L1 cache controller unit 120 treats the incoming "clflush_micro_op" micro-operation as a No-Operation-Opcode ("NOP"), and it never allocates a write-combining buffer 132. This NOP behavior can be implemented on uncacheable data prefetches.
The previous description of the embodiments is provided to enable any person skilled in the art to make or use the system and method. It is well understood by those in the art that the preceding embodiments may be implemented using hardware, firmware, or instructions encoded on a computer-readable medium. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

We Claim:
1. A process comprising:
receiving a memory address by a cache controller;
determining whether the memory address is stored within the closest cache memory in the coherency domain;
flushing the cache line associated with the memory address stored within the closest cache memory in the coherency domain;
allocating a flush instruction with the memory address to a write combining buffer;
transmitting the flush instruction with the memory address to a bus controller through the write combining buffer;
locating instances of the memory address stored within external cache memory within the coherency domain; and
flushing instances of the memory address stored within the external cache memory within the coherency domain.
2. The process of claim 1, wherein a first control bit indicates to the bus controller whether a write combining buffer is allocated to the flush instruction with the memory address.
3. The process of claim 2, wherein a second control bit indicates to the bus controller whether the memory address is stored within the closest cache memory in the coherency domain.

Full Text

FORM 2
THE PATENTS ACT 1970
[39 OF 1970]
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
[See Section 10; rule 13]
'CACHE LINE FLUSH MICRO-ARCHITECTURAL IMPLEMENTATION
1 METHOD AND SYSTEM"
We, INTEL CORPORATION, a Delaware Corporation, of 2200 Mission College Boulevard, Santa Clara, California 95054, United States of America,
The following specification particularly describes the invention and the manner in which it is to be performed:

CLFLUSH MICRO-ARCHITECTURAL IMPLEMENTATION METHOD AND SYSTEM
BACKGROUND
Field of the Invention
The present invention relates in general to computer architecture, and in particular to a method and system that allow a processor to flush a cache line associated
5 with a linear memory address from all caches in the coherency domain.
Description of the Related Art
A cache memory device is a small, fast memory that u available to contain the most frequently accessed data (or '^words'") from a larger, slower memory.
Dynamic random access memory (DRAM) provides large amounts of storage
10 capacity at a relatively low cost. Unfortunately, access to dynamic random access memory is slow relative to the processing speed of modem microprocessors. A cost-effective solution providing cache memory is to provide a static random access memory (SRAM) cache memory, or cache memory physically located on the processor. Even though the storage capacity of the cache memory may be relatively small, it provides
15 high-speed access1 to the data stored therein.
The operating principle behind cache memory is as follows. The first time an instruction or data location is addressed* it must be accessed from the lower speed memory- The instruction or data is then stored in cache memory. Subsequent accesses to the same instruction or data are done via the faster cache memory, thereby
20 minimizing access ante and enhancing overall system performance. However, since the storage capacity of the cache is limited, and typically is much smaller than the storage capacity of system memory, the cache is often filled and some of its contents must be changed as new instructions or data are accessed.
The cache is managed, in various ways, so max it stores the instruction or data
25 most likely to be needed ai a given rime. When the cache is accessed and contains the requested data, a cache "hit'1 occurs. Otherwise, if the cache does not contain the
1

requested data, a cache "miss" occurs. Thus, the cache contents are typically managed
m an attempt to maximize the cache hit-to-miss ratio.
With current systems, flushing a_specific.memory address, in a cache requires
knowledge of the cache memory replacement algorithm.
5 A cache, in its entirety, may be flushed periodically, or when certain predefined
conditions are met. Furthermore, individual cache lines may be flushed as part of a
replacement algorithm. In systems that contain a cache, a cache line is the complete
data portion that is exchanged between the cache and the main memory. In each case,
dirty data is written to main memory. Dirty data is defined as data, not yet written to
10 main memory, in the cache to be flushed or in the cache line to be flushed. Dirty bits,
which identify blocks of a cache line containing dirty data, are then cleared. The flushed
cache or flushed cache lines can then store new blocks of data.
If a cache flush is scheduled or if predetermined conditions for a cache flush are i met, the cache is flushed. That is, all dirty data in the cache is written to the main
15 memory.
For the Intel family of P6 microprocessors (e.g., Pentium n, Celeron), for example, there exists a set of micro-operations used to flush cache lines at specified cache levels given |a cache set and way; however, there is not such a micro-operauon to flush a cache line given its memory address.
Systems that require high data access continuously flush data as it becomes dirty. The situation is particularly acute in systems that require high data flow between the processor and system memory, such as the case in high-end graphics pixel manipulation for 3-D and video performances. The problem with current systems is that high bandwidth between the cache and system memory is required to accommodate the copies from write combining memory and write back memory.
Thus, what is needed is a method and system that allow a processor to flush the cache line associated with a linear memory address from all caches in the coherency domain.

SUMMARY
The cache line flush (CLFLUSH) micro-architectural implementation process and system allow a processor to flush a cache line associated with a linear memory address from all caches in the coherency domain. The processor receives a memory address. Once the memory address is received, it is determined whether the memory address is stored within a cache memory. If the memory address is stored within the cache, the memory address is flushed from the cache.
BRIEF DESCRIPTION OF THE DRAWINGS
The inventions claimed herein will be described in detail with reference to the drawings, in which reference characters identify correspondingly throughout and wherein:
FIG. 1 illustrates a microprocessor architecture; and
FIG. 2 flowcharts an embodiment of the cache line flush process.
DETAILED DESCRIPTION
By definition, a cache line is either completely valid or completely invalid; a cache line may never be partially valid. For example, when the processor only wishes to read one byte, all the bytes of an applicable cache line must be stored in the cache; otherwise, a cache miss will occur. The cache lines form the actual cache memory. A cache directory is used only for cache management. Cache lines usually contain more data than it is possible to transfer in a single bus cycle. For this reason, most cache controllers implement a burst mode, in which pre-set address sequences enable data to be transferred more quickly through a bus. This is used for cache line fills, or for writing back cache lines, because these cache lines represent a continuous and aligned address area.
A technique to flush the cache line can be associated with a linear memory address. Upon execution, the technique flushes the cache line associated with the operand from all caches in the coherency domain. In a multi-processor environment, for example, the specified cache line is flushed from all cache hierarchy levels in all microprocessors in the system (i.e., the coherency domain), depending on processor state. The MESI (Modified, Exclusive, Shared, Invalid) protocol, a write-invalidate protocol, gives every cache line one of four states, which are managed by two MESI bits. The four states also identify the four possible states of a cache line. If the processor is found in the "exclusive" or "shared" state, the flushing equates to the cache line being invalidated. Another example is true when the processor is found in the "modified" state. If a cache controller implements a write-back strategy and, with a cache hit, only writes data from the processor to its cache, the cache line content must be transferred to the main memory, and the cache line is invalidated.
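The state-dependent behavior described above can be summarized as a small decision function. The following is a minimal sketch, assuming a flush always leaves the line invalid; the `Mesi` enum and the `flush_action` name are hypothetical and chosen only for this illustration.

```python
from enum import Enum


class Mesi(Enum):
    """The four MESI states (two bits are enough to encode them)."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def flush_action(state: Mesi) -> tuple:
    """Return (write_back_needed, resulting_state) for a flushed line."""
    if state is Mesi.MODIFIED:
        # Modified content must be transferred to main memory first.
        return True, Mesi.INVALID
    # Exclusive and shared lines are only invalidated.
    # An already invalid line needs no further action.
    return False, Mesi.INVALID
```

For example, `flush_action(Mesi.SHARED)` yields `(False, Mesi.INVALID)`, while `flush_action(Mesi.MODIFIED)` yields `(True, Mesi.INVALID)`.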
When compared to other memory macroinstructions, the cache line flush (CLFLUSH) method is not strongly ordered, regardless of the memory type associated with the cache line flush macroinstruction. In contrast, the behavior in the memory sub-system of the processor is weakly ordered. Other macroinstructions can be used to strongly order and guarantee memory access loads, stores, fences, and other serializing instructions immediately prior to and right after CLFLUSH.
A micro-operation, named "clflush_micro_op", is used to implement the CLFLUSH macroinstruction.
Moving to FIG. 1, an example microprocessor's memory and bus subsystems are shown with the flow of loads and stores. In FIG. 1, two cache levels are assumed in the microprocessor: an on-chip ("L1") cache being the cache level closest to the processor, and a second level ("L2") cache being the cache level farthest from the processor. An instruction fetch unit 102 fetches macroinstructions for an instructions decoder unit 104. The decoder unit 104 decodes the macroinstructions into a stream of microinstructions, which are forwarded to a reservation station 106, and a reorder buffer and register file 108. As an instruction enters the memory subsystem, it is allocated in the load buffer 112 or store buffer 114, depending on whether it is a read or a write memory macroinstruction, respectively. In the unit of the memory subsystem where such buffers reside, the instruction goes through memory ordering checks by the memory ordering unit 110. If no memory dependencies exist, the instruction is dispatched to the next unit in the memory subsystem after undergoing the physical address translation. At the L1 cache controller 120, it is determined whether there is an L1 cache hit or miss. In the case of a miss, the instruction is allocated into a set of buffers, from where it is dispatched to the bus sub-system 140 of the microprocessor. In the case of a cacheable load miss, the instruction is sent to read buffers 122, or in the case of a cacheable store miss, the instruction is sent to write buffers 130. The write buffers may be either weakly ordered write combining buffers 132 or non-write combining buffers 134. In the bus controller unit 140, the read or write micro-operation is allocated into an out-of-order queue 144. If the micro-operation is cacheable, the L2 cache 146 is checked for a hit/miss. If a miss, the instruction is sent through an in-order queue 142 to the frontside bus 150 to retrieve or update the desired data from main memory.
The flow of the "clflush_micro_op" micro-operation through the processor memory subsystem is also described in FIG. 2. Initially, the instruction fetch unit 102 retrieves a cache line flush instruction, block 202. In block 204, the cache line flush instruction is decoded into the "clflush_micro_op" micro-operation by the instructions decoder unit 104. The micro-operation is then forwarded to a reservation station 106, and a reorder buffer and register file 108, block 206. The "clflush_micro_op" micro-operation is dispatched to the memory subsystem on a load port, block 208. It is allocated an entry in the load buffer 112 in the memory ordering unit 110. For split accesses calculation in the memory ordering unit 110, the data size of the micro-operation is masked to one byte in order to avoid cache line splits; however, upon execution, the whole cache line will be flushed.
The behavior of the "clflush_micro_op" in the memory-ordering unit 110 is
speculative. Simply put, this means that the "clflush_micro_op" can execute out of order with respect to other CLFLUSH macroinstructions, loads, and stores. Unless memory access fencing (termed "MFENCE") instructions are used appropriately (immediately before and after the CLFLUSH macroinstruction), execution of the "clflush_micro_op" with respect to other memory loads and stores is not guaranteed to be in order, provided there are no address dependencies. The behavior of CLFLUSH through the memory subsystem is weakly ordered. The following tables list the ordering constraints on CLFLUSH. Table 1 lists the ordering constraint effects of later memory access commands compared to an earlier CLFLUSH. Table 2 lists the converse of Table 1, displaying the ordering constraint effects of earlier memory access commands compared to a later CLFLUSH instruction. The memory access types listed are uncacheable (UC) memory, write back (WB) memory, and uncacheable speculative write combining (USWC) memory accesses.

                 Later access
Earlier access   UC memory      WB memory      USWC memory    CLFLUSH  MFENCE
                 Load   Store   Load   Store   Load   Store
CLFLUSH          N      N       Y      Y       Y      Y       Y        N
Note: N = cannot pass, Y = can pass.
Table 1: Memory ordering of instructions with respect to an older CLFLUSH

Earlier access          Later access
                        CLFLUSH
UC memory     Load      Y
              Store     Y
WB memory     Load      Y
              Store     Y
USWC memory   Load      Y
              Store     Y
CLFLUSH                 Y
MFENCE                  N
Table 2: Memory ordering of instructions with respect to a younger CLFLUSH
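The two ordering tables can be read as a lookup from an (earlier, later) pair of accesses to a "can pass" answer. The sketch below encodes them that way. The function name `can_pass` and the string labels are hypothetical, and the MFENCE entry of Table 1 is taken as "cannot pass", which is an assumption because that cell is damaged in the source text.

```python
MEMORY_ACCESSES = [
    "UC load", "UC store",
    "WB load", "WB store",
    "USWC load", "USWC store",
]

# Table 1: may the later access pass an earlier CLFLUSH?
LATER_VS_OLDER_CLFLUSH = {
    "UC load": False, "UC store": False,
    "WB load": True, "WB store": True,
    "USWC load": True, "USWC store": True,
    "CLFLUSH": True,
    "MFENCE": False,  # assumed; this cell is garbled in the source
}

# Table 2: may a later CLFLUSH pass the earlier access?
YOUNGER_CLFLUSH_VS_EARLIER = {access: True for access in MEMORY_ACCESSES}
YOUNGER_CLFLUSH_VS_EARLIER["CLFLUSH"] = True
YOUNGER_CLFLUSH_VS_EARLIER["MFENCE"] = False


def can_pass(earlier: str, later: str) -> bool:
    """Return True if the later access may pass the earlier one."""
    if earlier == "CLFLUSH":
        return LATER_VS_OLDER_CLFLUSH[later]
    if later == "CLFLUSH":
        return YOUNGER_CLFLUSH_VS_EARLIER[earlier]
    raise ValueError("the tables only cover pairs that involve CLFLUSH")
```

Under these tables, only UC accesses and MFENCE are held behind an older CLFLUSH, and only an earlier MFENCE holds back a younger CLFLUSH.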
From the memory-ordering unit 110, the "clflush_micro_op" micro-operation is dispatched to the L1 cache controller unit 120, block 210. The "clflush_micro_op" micro-operation is dispatched on the load port; however, it is allocated in a write combining buffer 132, as if it were a store. From the L1 cache controller unit forward, the "clflush_micro_op" is switched from the load to the store pipe.

Decision block 212 determines whether no write combining buffers 132 are available. If none are available, flow returns to block 210. Otherwise, flow continues into block 214. Regardless of the memory type and whether it hits or misses the L1 cache, a write combining buffer 132 is allocated to service an incoming "clflush_micro_op," block 214. A control field is added to each write combining buffer 132 in the L1 cache controller unit to determine which self-snoop attributes need to be sent to the bus controller 140. This control bit, named "clflush_miss," is set exclusively for a "clflush_micro_op" that misses the L1 cache.
Upon entering the memory sub-system of the microprocessor, several bits of the address that enable cache line access of a "clflush_micro_op" are zeroed out, block 216. In the Pentium Pro family of microprocessors, these would be the lower five bits of the address (address[4:0]). This is done in both the L1 cache and L2 cache controller units 120, upon executing the flush command. The zeroing out helps to determine a cache line hit or miss. Since only a tag match determines a hit or miss, no byte enable comparison is needed. Note that by definition, no partial hit is possible. A hit or miss is always a full line hit or miss. Zeroing out the address bits [4:0] also provides an alternative mechanism to the one used in the memory ordering unit 110 to mask line split accesses. In split accesses the data size of the transaction is masked to one byte.
Another control bit added to each write combining buffer 132 in the L1 cache
controller unit 120 is used to differentiate between a write combining buffer 132 allocated for a "clflush_micro_op" and another one allocated for a write combining store, block 218. This control bit, named "clflush_op," is exclusively set for those write combining buffers allocated to service a "clflush_micro_op". It is used to select the request type and flush attributes sent from the L1 cache controller 120 to the bus controller 140.
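The zeroing of address[4:0] described earlier amounts to a simple bit mask. The sketch below illustrates it for a 32-byte line, matching the five low-order bits named in the text; the names `LINE_OFFSET_BITS`, `line_address`, and `same_line` are hypothetical.

```python
LINE_OFFSET_BITS = 5  # address[4:0] selects a byte within a 32-byte line
LINE_MASK = ~((1 << LINE_OFFSET_BITS) - 1)


def line_address(address: int) -> int:
    """Zero address[4:0] so that only the cache-line bits remain."""
    return address & LINE_MASK


def same_line(a: int, b: int) -> bool:
    """Two addresses hit or miss together if their line addresses match."""
    return line_address(a) == line_address(b)
```

Because every byte address within a line maps to the same masked value, a hit or miss can be decided for the whole line without comparing byte enables.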
In the case of an L1 cache hit, as determined by decision block 222, both "flush L1" and "flush L2" attributes are sent to the bus controller 140 upon dispatch from the L1 cache controller unit 120, blocks 224 and 226. The bus controller 140 contains both the L2 cache 146 and external bus controller units.
Alternatively, in the case of an L1 cache miss, as determined by decision block 222, the "clflush_miss" control bit is set, and only the "flush L2" attribute is sent, blocks 228 and 232. This helps improve performance by omitting the internal self-snoop to the L1 cache.
Upon its dispatch from the memory-ordering unit 110, the "clflush_micro_op" micro-operation is blocked by the L1 cache controller unit 120 if there are no write combining buffers 132 available, block 212. In such a case, it also evicts a write-combining buffer 132, as pointed to by the write-combining circular allocation pointer. This guarantees no deadlock conditions due to the lack of free write combining buffers 132. If blocked, the "clflush_micro_op" is redispatched once the blocking condition is removed. An example that would cause the dispatching of the "clflush_micro_op" instruction is the completed eviction of a previously allocated write-combining buffer 132.
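The block, evict, and redispatch sequence described above can be modeled as follows. This is a simplified sketch under stated assumptions: the class name `WriteCombiningBuffers`, the default of four buffers, and the first-free allocation policy are all hypothetical; only the use of a circular pointer to choose the eviction victim comes from the text.

```python
class WriteCombiningBuffers:
    """Toy model of write combining buffer allocation for a flush."""

    def __init__(self, count: int = 4):
        self.in_use = [False] * count
        self.evicting = [False] * count
        self.pointer = 0  # circular allocation pointer

    def dispatch_clflush(self):
        """Try to allocate a buffer; return its index, or None if blocked."""
        for index, busy in enumerate(self.in_use):
            if not busy:
                self.in_use[index] = True
                return index
        # No free buffer: start evicting the buffer the pointer selects,
        # then report that the micro-operation is blocked.
        self.evicting[self.pointer] = True
        self.pointer = (self.pointer + 1) % len(self.in_use)
        return None

    def complete_evictions(self) -> None:
        """Finish pending evictions, which frees those buffers."""
        for index, pending in enumerate(self.evicting):
            if pending:
                self.in_use[index] = False
                self.evicting[index] = False
```

Because a blocked dispatch always triggers an eviction, a later redispatch is guaranteed to find a free buffer once that eviction completes, which is the deadlock-avoidance property the text describes.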

The "clflush_micro_op" micro-operation is retired by the memory subsystem upon being allocated into a write-combining buffer 132 in the L1 cache controller 120. This allows pipelining: subsequent instructions can proceed with execution prior to completion of the "clflush_micro_op" micro-operation. The pipelining improves the overall performance of the system.
There are two methods to evict a write-combining buffer servicing a "clflush_micro_op" micro-operation.
A write combining buffer 132 servicing a "clflush_micro_op" will be evicted by the same eviction conditions that currently apply to write combining buffers 132 in the family of Intel P6 microprocessors. Moreover, fencing microinstructions also evict a write-combining buffer that services a "clflush_micro_op" micro-operation.
Additionally, some embodiments evict a "clflush_micro_op" exclusively. This is done to avoid leaving a write combining buffer servicing a "clflush_micro_op" stranded (pending) for a long period of time when the programmer does not want to enforce ordering and a fencing instruction is not used. A control bit, named "clflush_evict", is associated with each write-combining buffer 132 servicing a "clflush_micro_op". This control bit is set when a write combining buffer 132 is allocated to a "clflush_micro_op." Once the "clflush_evict" bit is set, the corresponding write-combining buffer is marked for eviction and the control bit is reset, block 230. This eviction condition applies exclusively to write combining buffers 132 servicing a "clflush_micro_op" micro-operation. It improves performance of programs using CLFLUSH by not allowing "clflush_micro_op" micro-operations to take up the write combining buffer 132 resources for extended periods of time, and consequently, freeing them up for other write combining operations.

"clflush_miss"   "clflush_op"   Request type   "Flush L1"   "Flush L2"   New transaction
control bit      control bit                   attribute    attribute
0                0              Non-CLFLUSH    -            -            NO
0                1              CLFLUSH        1            1            YES
1                0              N/A            N/A          N/A          Illegal combination
1                1              CLFLUSH        0            1            YES
Table 3: Memory to Bus Transactions for CLFLUSH
Note that if "clflush_miss" = "clflush_op" = '0', the request type is any of the existing transactions in the P6 family of microprocessors (but not CLFLUSH), and the flush attributes will be set/cleared accordingly.
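Table 3 can be expressed as a small decoding function from the two control bits to the request sent to the bus controller. The sketch below is illustrative only; the function name `bus_request` and the tuple it returns are assumptions, and `None` is used for the attributes that the table leaves to the existing, non-CLFLUSH transactions.

```python
def bus_request(clflush_miss: int, clflush_op: int) -> tuple:
    """Map the control bits to (request type, flush L1, flush L2)."""
    if not clflush_op:
        if clflush_miss:
            # Table 3 marks clflush_miss = 1 with clflush_op = 0 as illegal.
            raise ValueError("illegal combination of control bits")
        # Existing transaction types set their own flush attributes.
        return ("non-CLFLUSH", None, None)
    # On an L1 miss the internal self-snoop of the L1 cache is omitted.
    flush_l1 = 0 if clflush_miss else 1
    return ("CLFLUSH", flush_l1, 1)
```

For instance, `bus_request(1, 1)` gives `("CLFLUSH", 0, 1)`, the row in which only the "flush L2" attribute is sent.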
Table 4 below shows the conditions under which the three write combining buffer 132 control bits are set and reset. The "clflush_evict" control bit can only be set after the "clflush_micro_op" control bit. The "clflush_micro_op" control bit will be set on speculative write combining buffer 132 allocations, while "clflush_evict" will exclusively be set on a real write combining buffer 132 allocation for a "clflush_op". The "clflush_miss" control bit is also set on speculative write combining buffer 132 allocations, if the "clflush_micro_op" misses the L1 cache. Both the "clflush_miss" and "clflush_op" control bits are cleared upon speculative allocation of a write-combining buffer 132 to service any instruction other than a "clflush_micro_op." Functionally, this is similar to clearing such control bits upon deallocation of a write-combining buffer servicing a "clflush_micro_op." In a processor implementation, the same write buffers 130 are shared for write combining and non-write combining micro-operations. The "clflush_miss" and "clflush_micro_op" bits are cleared upon speculative allocation of any write buffer 130, not just a write combining buffer 132. This behavior ensures that the three control bits can never be set for a write buffer 130 not servicing a "clflush_micro_op." In a processor implementation where all L1 cache controller buffers are shared for both reads and writes, such as in the family of P6 microprocessors, the "clflush_miss" and "clflush_micro_op" control bits only need to be cleared upon allocation of a buffer to service a store, block 234. Buffers allocated to service loads ignore the value of these three new control bits.

Control bit: "clflush_op"
  Set:   Upon allocation of a write combining buffer to service a "clflush_micro_op"
  Clear: Upon allocation of a write buffer for something other than a "clflush_micro_op"
Control bit: "clflush_evict"
  Set:   Immediately after allocation of a write combining buffer to service a "clflush_micro_op" (i.e., WC buffer allocated, "in use", and "clflush_op" control bit set)
  Clear: Upon eviction of a write combining buffer (i.e., "WC mode" control bit set)
Control bit: "clflush_miss"
  Set:   Upon allocation in a write combining buffer of a "clflush_micro_op" that misses the L1 cache
  Clear: Upon allocation of a write buffer for something other than a "clflush_micro_op"
Note that all three new WC buffer control bits are cleared upon a "reset" sequence as well.
Table 4: Conditions to set/clear the new control bits of a write-combining buffer in the L1 cache controller
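The set and clear conditions of Table 4 can be sketched as a small state holder for one write buffer. This is a simplified illustration: the class name `WcBufferBits` and its method names are hypothetical, and the distinction the text draws between speculative and real allocations is collapsed into a single `allocate` step.

```python
from dataclasses import dataclass


@dataclass
class WcBufferBits:
    """The three control bits attached to one write buffer."""
    clflush_op: bool = False
    clflush_evict: bool = False
    clflush_miss: bool = False

    def allocate(self, is_clflush: bool, l1_hit: bool = True) -> None:
        """Apply the allocation rules of Table 4."""
        # Set for a flush; cleared when the buffer serves anything else.
        self.clflush_op = is_clflush
        self.clflush_miss = is_clflush and not l1_hit
        if is_clflush:
            # Set immediately after allocation to service a flush.
            self.clflush_evict = True

    def evict(self) -> None:
        """Eviction of the buffer clears the eviction request."""
        self.clflush_evict = False

    def reset(self) -> None:
        """A reset sequence clears all three bits."""
        self.clflush_op = False
        self.clflush_evict = False
        self.clflush_miss = False
```

In this model, reusing a buffer for an ordinary store clears "clflush_op" and "clflush_miss", so stale flush state never carries over to a buffer that is not servicing a flush.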
Embodiments may be implemented utilizing the bus controller 140. When a write-combining buffer 132 servicing a "clflush_micro_op" is marked for eviction, it is dispatched to the bus controller 140, block 236. The request sent is the same as if it were for a full line cacheable write combining transaction, except for the self-snoop attributes. Snooping is used to verify whether a specific memory address is present in the applicable cache. For a "clflush_micro_op" eviction, the bus controller 140 self-snoops the L1 and L2 caches based on the "flush L1" and "flush L2" request attributes, block 250. Furthermore, the bus controller 140 issues a "bus read invalidate line" on the external bus, block 236. If the L1 cache controller unit 120 determines an L1 cache miss, for example, no "flush L1" message is sent. The "bus read invalidate line" transaction flushes hits to the same line in any other caches in the coherency domain. On the external bus transaction, all byte enables are deasserted, masking the data phase from the core. Decision blocks 238 and 252 determine whether a hit for a modified cache line (HITM) has occurred in another cache within the coherency domain (i.e., not the L1 or L2 caches in the requesting microprocessor). If the HITM occurs, the cache that is hit does a write back to main memory, and data is returned to the requesting microprocessor in blocks 244 and 254. The write combining buffer 132 in the L1 cache controller unit 120 remains allocated until completion of the snoop phase and possible transfer of data back from another cache in the coherency domain, for example, a HITM on an external bus. Data coming back to the write-combining buffer 132 as a result of the snoop phase or inquiry cycle is ignored, blocks 246 and 248.
All flushes are then completed, and the write combining buffers 132 are deallocated in block 260.
Table 5 below shows how the external bus controller 140 treats all write-combining evictions. The request from the L1 cache to the bus controller 140 for a "clflush_micro_op" eviction, such as the CLFLUSH macroinstruction, can be overloaded on the same request signals as that for a full line cacheable write combining eviction; however, the self-snoop attributes differ.
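Table 5: External bus controller transactions for write combining evictions

Request type: Partial cacheable write combining
  External bus transaction: Read Invalidate
  Transaction length:       32 byte
  Byte enables:             All byte enables asserted
  Flush L1:                 NO
  Flush L2:                 NO
  New:                      NO
Request type: Full line cacheable write combining
  External bus transaction: Invalidate
  Transaction length:       32 byte
  Byte enables:             All byte enables deasserted
  Flush L1:                 NO
  Flush L2:                 NO
  New:                      NO
Request type: Partial uncacheable write combining
  External bus transaction: Memory write (write type)
  Transaction length:       [illegible]
  Byte enables:             Byte enables as sent from L1 cache controller
  Flush L1:                 [illegible]
  Flush L2:                 Only non-temporal stores that miss L1 cache
  New:                      [illegible]
Request type: Full line uncacheable write combining
  External bus transaction: Memory write (writeback type)
  Transaction length:       32 byte
  Byte enables:             All byte enables asserted
  Flush L1:                 NO
  Flush L2:                 Only non-temporal stores that miss L1 cache
  New:                      NO
Request type: CLFLUSH
  External bus transaction: Bus Read Invalidate
  Transaction length:       32 byte
  Byte enables:             All byte enables deasserted
  Flush L1:                 Only L1 hits
  Flush L2:                 YES
  New:                      YES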
Note: USWC stores are not memory aliased in the P6 family of microprocessors, and therefore, they are not self-snooped.

For testability and debug purposes, a non-user visible mode bit can be added to enable/disable the CLFLUSH macroinstruction. If disabled, the L1 cache controller unit 120 treats the incoming "clflush_micro_op" micro-operation as a No-Operation-Opcode ("NOP"), and it never allocates a write-combining buffer 132. This NOP behavior can be implemented on uncacheable data prefetches.
The previous description of the embodiments is provided to enable any person skilled in the art to make or use the system and method. It is well understood by those in the art that the preceding embodiments may be implemented using hardware, firmware, or instructions encoded on a computer-readable medium. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

We Claim:
1. A process comprising:
receiving a memory address by a cache controller;
determining whether the memory address is stored within the closest cache memory in the coherency domain;
flushing the cache line associated with the memory address stored within the closest cache memory in the coherency domain;
allocating a flush instruction with the memory address to a write combining buffer;
transmitting the flush instruction with the memory address to a bus controller through the write combining buffer;
locating instances of the memory address stored within external cache memory within the coherency domain; and
flushing instances of the memory address stored within the external cache memory within the coherency domain.
2. The process of claim 1, wherein a first control bit indicates to the bus controller whether a write combining buffer is allocated to the flush instruction with the memory address.
3. The process of claim 2 wherein a second control bit indicates to the bus controller whether the memory address is stored within the closest cache memory in the coherency domain.
