This article gives a brief introduction to the cache architecture of multi-core CPUs and the basic concepts of cache coherence, walks through the Valid/Invalid, MSI, and MESI coherence protocols, and then explains how store buffers and invalidate queues undermine cache coherence and how that damage is repaired.
1. CPU Caches
In a computer, the storage hierarchy forms a classic pyramid. From top to bottom, ordered by speed, it consists of CPU registers, the CPU caches (L1/L2/L3), main memory (RAM), SSDs, and traditional mechanical hard drives (HDDs). The higher a level sits in the pyramid, the faster it is, but also the more expensive and the smaller its capacity.
On Linux, the CPU caches can be inspected under sysfs:
lyf@DESKTOP-MCH0FUN:~$ ls /sys/devices/system/cpu/cpu0/cache/
index0/ index1/ index2/ index3/
# L1 data cache
lyf@DESKTOP-MCH0FUN:~$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
# L1 instruction cache
lyf@DESKTOP-MCH0FUN:~$ cat /sys/devices/system/cpu/cpu0/cache/index1/size
32K
# L2
lyf@DESKTOP-MCH0FUN:~$ cat /sys/devices/system/cpu/cpu0/cache/index2/size
1024K
# L3
lyf@DESKTOP-MCH0FUN:~$ cat /sys/devices/system/cpu/cpu0/cache/index3/size
16384K
The CPU cache is associated with main memory through a mapping policy. Data read from memory into the cache is stored in fixed-size blocks; such a block is the smallest unit of CPU caching and has a dedicated name: the cache line. A cache line is typically 64 bytes.
The chunks of memory handled by the cache are called cache lines. The size of these chunks is called the cache line size. Common cache line sizes are 32, 64 and 128 bytes.
Querying the cache line size:
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
(figure: structure of a per-CPU cache: cache lines with state/tag/data fields, plus the cache controller)
As the figure above shows, for each CPU's cache:
- the cache contains many cache lines, presumably so called because the data looks like it is laid out row by row;
- each cache line holds a state field, a tag, and the data; the tag is generally derived from the data's address, and the data field holds the corresponding values;
- each cache has a dedicated piece of hardware called the cache controller, which talks to both the CPU and the bus.
As for the cache controller: it is a hardware block that mediates between the CPU and the shared bus. It:
- reads code and data from main memory into the cache;
- services the CPU's load and store instructions, and sends and handles bus messages (for cache hits and misses).
The cache controller is a hardware block responsible for managing the cache memory, in a way that is largely invisible to the program. It automatically writes code or data from main memory into the cache. It takes read and write memory requests from the core and performs the necessary actions to the cache memory or the external memory.
When it receives a request from the core, it must check to see whether the requested address is to be found in the cache. This is known as a cache look-up. It does this by comparing a subset of the address bits of the request with tag values associated with lines in the cache. If there is a match, known as a hit, and the line is marked valid, then the read or write occurs using the cache memory.
When the core requests instructions or data from a particular address, but there is no match with the cache tags, or the tag is not valid, a cache miss results and the request must be passed to the next level of the memory hierarchy, an L2 cache, or external memory. It can also cause a cache linefill. A cache linefill causes the contents of a piece of main memory to be copied into the cache. At the same time, the requested data or instructions are streamed to the core. This process occurs transparently and is not directly visible to a software developer. The core need not wait for the linefill to complete before using the data. The cache controller typically accesses the critical word within the cache line first. For example, if you perform a load instruction (a read is a load, a write a store) that misses in the cache and triggers a cache linefill, the core first retrieves that part of the cache line that contains the requested data. This critical data is supplied to the core pipeline, while the cache hardware and external bus interface then read the rest of the cache line, in the background.
2. Cache Coherence
A CPU with caches carries out a computation roughly as follows:
- the program and its data are loaded into main memory;
- instructions and data are loaded into the CPU's caches;
- the CPU executes instructions and writes the results into the cache;
- the cached data is written back to main memory.
There are two ways to get data from the cache into memory:
- Write-through: the CPU writes main memory immediately whenever it writes the cache;
- Write-back: only the cache is written, and main memory is updated later at some "suitable time". Write-through updates main memory on every store, which is unnecessary in some scenarios (e.g. single-threaded programs, or the private data of one thread in a multi-threaded program) and hurts performance.
The benefit of write-through to main memory is that it simplifies the design of the computer system. With write-through, the main memory always has an up-to-date copy of the line. So when a read is done, main memory can always reply with the requested data.
If write-back is used, sometimes the up-to-date data is in a processor cache, and sometimes it is in main memory. If the data is in a processor cache, then that processor must stop main memory from replying to the read request, because the main memory might have a stale copy of the data. This is more complicated than write-through.
Also, write-through can simplify the cache coherency protocol because it doesn't need the Modify state. The Modify state records that the cache must write back the cache line before it invalidates or evicts the line. In write-through a cache line can always be invalidated without writing back since memory already has an up-to-date copy of the line.
One more thing - on a write-back architecture software that writes to memory-mapped I/O registers must take extra steps to make sure that writes are immediately sent out of the cache. Otherwise writes are not visible outside the core until the line is read by another processor or the line is evicted.
In short, write-through is less efficient but keeps the system design simpler, while write-back is an optimization for CPU efficiency that makes the system more complex. From the above we can see that in multi-threaded programs on a multi-core CPU, whenever threads share a variable, its latest value may live in main memory or in some CPU's cache. Cache "coherence" then becomes crucial, because without it we may read a stale value.
Academically, cache coherence imposes two requirements:
- Write Propagation: when the cached data in one CPU core is updated, the update must be propagated to the caches of the other cores;
- Transaction Serialization: reads/writes to a single memory location must be seen by all processors in the same order. Suppose a 4-core CPU caches a variable i in every core, and CPU0 and CPU1 store to their cached copies of i at the same moment (literally the same instant, which is theoretically possible on a multi-core CPU: CPU0 writes i=1, CPU1 writes i=2). In what order should the copies of i in CPU2's and CPU3's caches be updated: 1 then 2, or 2 then 1? Transaction serialization constrains exactly this case and guarantees coherence in such situations. Each coherence protocol, e.g. MESI, specifies through its own mechanism how transactions are serialized; as I understand it, this means there is never a truly parallel pair of writes to i.
There are currently two mainstream mechanisms for handling cache coherence:
- Snooping, based on bus snooping (everything in this article is based on snooping):
First introduced in 1983, snooping is a process where the individual caches monitor address lines for accesses to memory locations that they have cached. The write-invalidate protocols and write-update protocols make use of this mechanism.
The core of bus snooping is that each cache controller continuously watches its own CPU's and the other CPUs' operations on the data it caches, and reacts accordingly.
- Directory-based:
In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.
3. Cache Coherence Protocols
The protocol must implement the basic requirements for coherence. It can be tailor-made for the target system or application.
Various models and protocols have been devised for maintaining coherence, such as MSI, MESI (aka Illinois), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocol. In 2011, ARM Ltd proposed the AMBA 4 ACE[11] for handling coherency in SoCs. The AMBA CHI (Coherent Hub Interface) specification from ARM Ltd, which belongs to the AMBA5 group of specifications, defines the interfaces for the connection of fully coherent processors.
A protocol's job is to set a standard. There are many cache coherence protocols, e.g. MSI, MESI, and MOESI, each with its own strengths and weaknesses; their shared goal is coherence, with each optimizing certain aspects (and those optimizations may in turn introduce new problems).
In what follows we start from the simplest protocol and analyze its strengths and weaknesses step by step, up to the point where MESI emerges. All of the protocols below assume bus snooping.
(figure: several CPUs, each with its own cache and cache controller, attached to a shared bus and main memory)
Shared bus: all memory requests are broadcast onto the bus, where the messages are ordered (in order, providing transaction serialization). This guarantees that every CPU sees exactly the same order of load and store operations.
3.1 The Valid/Invalid Protocol
Let us begin with the simplest protocol, Valid/Invalid (VI), assuming the cache is write-through: a store writes main memory at the same time as the cache.
A few notes:
- every protocol in this article is described as a finite state machine (FSM);
- CPU-side operations are prefixed with Pr; there are reads and writes: PrRd and PrWr;
- bus messages (bus transactions) are prefixed with Bus: BusRd, BusWr, BusRdX, and so on. For simplicity, some messages are not shown in the figures, e.g. BusInv and BusReply. For example, a cache controller that receives a BusRd actually answers with a BusReply, and one that receives a BusInv sets the affected cache line's state to invalid and then answers with an invalidate ack;
- solid lines in the figures are transitions the CPU initiates itself, while dashed lines are the transitions a cache line in a given state takes upon receiving the corresponding bus message.
(figure: VI protocol state machine with Valid and Invalid states)
- starting in the Invalid state, a CPU load (PrRd) generates a BusRd message; once the value is fetched from main memory and placed in the cache, the state becomes Valid;
- in the Valid state, a CPU load (PrRd) generates no bus message and reads the cache directly;
- in the Valid state, a CPU store (PrWr) generates a BusWr message; the cache stays Valid;
- when another cache controller receives the BusWr and sees that someone is writing a value held in its own cache line, it sets that cache line's state to Invalid;
- in the Invalid state, a CPU store (PrWr) generates a BusWr and the state stays Invalid; since the cache is write-through, the store goes straight to main memory.
A concrete example: a dual-core CPU with CPU0 and CPU1, where the variable at address 0xA initially holds the value 2.
① CPU0 loads the value at 0xA, sending BusRd 0xA; the state changes to V and the cached value is set to 2.
② CPU1 loads the value at 0xA, sending BusRd 0xA; the state changes to V and the cached value is set to 2. From now on, when CPU0 or CPU1 reads 0xA again, it reads directly from its cache and no bus message is generated.
③ CPU0 writes 3 to 0xA, sending BusWr; being write-through, it writes the cache and main memory directly. CPU1 receives the BusWr and sets its own state to I.
④ CPU1 loads again, sending BusRd; the state changes to V and it reads the latest value, 3.
From this example we can see that the VI protocol is very simple, yet it guarantees cache coherence. Its drawbacks are just as obvious:
- every store (PrWr) writes main memory, which is inefficient;
- every store (PrWr) sends a bus message (BusWr), wasting bus bandwidth.
3.2 The MSI Protocol
MSI stands for its three states, Modified, Shared, and Invalid:
- I: the cache does not contain the address;
- S: this cache has the address, but so may other caches, hence it can only be read;
- M: only this cache has the address, hence it can be read and written; any other cache that had this address has been invalidated.
I is the invalid state. S means several caches may hold the value at this address, all copies identical and read-only. M means this cache line holds the most recently modified value, and the corresponding cache lines of the other CPUs have been set to I.
(figure: MSI protocol state machine)
In the figure:
- BusRdX (Bus Read Exclusive) means "I get an exclusive copy of this location into my cache" (in order to write): the sender tells the other caches that it wants exclusive ownership of the data in that cache line. When a BusRdX is sent, every other cache line currently in the M or S state transitions to I.
- On receiving a BusRd in the M state (another CPU is reading), the current cache line holds the newest data, so it triggers a BusWB (Bus WriteBack) that writes the latest data back to main memory.
- In the M state, a PrWr writes directly without generating any bus message, which makes MSI somewhat more efficient than VI.
The same example again: a dual-core CPU with CPU0 and CPU1, this time assuming a write-back cache. Initially the variable at address 0xA holds the value 2.
① CPU0 loads, triggering BusRd; the state goes from I to S and the data is 2.
② CPU1 loads, triggering BusRd; the state goes from I to S and the data is 2. Once the value is in the cache, subsequent loads read straight from the cache and trigger no bus transaction.
③ CPU0 stores (PrWr), triggering BusRdX. CPU1's cache controller receives the BusRdX and sets its cache line to I; CPU0's cache controller sets its cache line to M. Main memory is not updated at this point (write-back cache). From here on, CPU0's reads and writes of 0xA are purely local and trigger no bus transactions (unlike the VI protocol).
④ CPU1 stores (PrWr), triggering BusRdX. CPU0, currently in M, receives the BusRdX and triggers a BusWB, writing its cached value back to main memory, which now holds 3; CPU0's cache line state then becomes I. CPU1 updates its cache line to the new value, 10, and its state becomes M.
⑤ CPU0 loads (PrRd), triggering BusRd. CPU1, currently in M, receives the BusRd and triggers a BusWB, writing its cache line's value back to main memory and updating its own state to S.
CPU0 reads the latest value, 10, and its state goes from I to S.
3.3 The MESI Protocol
Consider one scenario under MSI: single-threaded code, or the private data of one thread, i.e. the data is not shared by any other CPU's cache. A read-modify-write sequence then still triggers two bus transactions, BusRd followed by BusRdX. MESI adds an E state (Exclusive, clean) on top of MSI: the data in this cache line matches main memory, and only this cache line holds it, hence "exclusive".
Exclusive: The cache line is present only in the current cache, but is clean - it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
(figure: MESI protocol state machine)
From the MESI state transition diagram we can see:
- in the I state, on a PrRd, the line enters E if there are no other sharers, and S otherwise;
- when there are no other sharers, a read enters E and a write (PrWr) enters M without triggering any further bus transaction. For single-threaded code or thread-private data this saves a bus transaction (and since the figures omit some bus messages, the real saving is more than one), improving on MSI.
The paper "Memory Barriers: a Hardware View for Software Hackers" describes the bus messages that correspond to the MESI protocol, including:
Read: The “read” message contains the physical address of the cache line to be read.
Read Response: The “read response” message contains the data requested by an earlier “read” message. This “read response” message might be supplied either by memory or by one of the other caches. For example, if one of the caches has the desired data in “modified” state, that cache must supply the “read response” message.
These correspond to BusRd and BusReply above, except that this article never shows the BusReply messages.
Invalidate: The “invalidate” message contains the physical address of the cache line to be invalidated. All other caches must remove the corresponding data from their caches and respond.
Invalidate Acknowledge: A CPU receiving an “invalidate” message must respond with an “invalidate acknowledge” message after removing the specified data from its cache.
Invalidate is the BusInv message, and Invalidate Acknowledge is the reply to a BusInv; both replies can be lumped under BusReply.
Read Invalidate: The “read invalidate” message contains the physical address of the cache line to be read, while at the same time directing other caches to remove the data. Hence, it is a combination of a “read” and an “invalidate”, as indicated by its name. A “read invalidate” message requires both a “read response” and a set of “invalidate acknowledge” messages in reply.
This corresponds to the BusRdX message; it is effectively a combination of Read and Invalidate.
Writeback: The “writeback” message contains both the address and the data to be written back to memory (and perhaps “snooped” into other CPUs’ caches along the way). This message permits caches to eject lines in the “modified” state as needed to make room for other data.
This corresponds to the BusWB message.
Below, Read, Read Response, Invalidate, Invalidate Acknowledge, and Read Invalidate are used interchangeably with BusRd, BusReply, BusRdX, and so on; they mean the same things. The rest of this article mostly follows the paper, so it uses the paper's terminology.
4. Further Optimizations on Top of MESI
Under MESI, suppose a cache line is in the S state and the CPU executes a store (PrWr). A BusRdX is sent on the bus; every other cache controller that receives it sets its copy of the cache line to I and then replies with a BusReply (invalidate ack), and only after receiving those BusReply (invalidate ack) messages does the issuing controller perform the write. To the CPU this wait is very long and very inefficient; it is known as a "write stall".
Moreover, for CPU1, setting a cache line to I can itself take a while. CPU1 may be performing intensive reads and writes on that cache at the time, which can take long. And if CPU1 receives a large number of Invalidate messages at once (covering several of its cache lines), it must process them in some order, and a given cache line may end up being handled rather late.
One reason that invalidate acknowledge messages can take so long is that they must ensure that the corresponding cache line is actually invalidated, and this invalidation can be delayed if the cache is busy, for example, if the CPU is intensively loading and storing data, all of which resides in the cache. In addition, if a large number of invalidate messages arrive in a short time period, a given CPU might fall behind in processing them, thus possibly stalling all the other CPUs.
Store buffers and invalidate queues were introduced to solve the two problems above. Before store buffers and invalidate queues, the VI/MSI/MESI protocols by themselves all guaranteed cache coherence; they only differed in how efficient they were in certain scenarios.
4.1 Store Buffers and Store Forwarding
(figure: a store buffer inserted between each CPU and its cache, as in Figure 5 of the paper)
One way to prevent this unnecessary stalling of writes is to add “store buffers” between each CPU and its cache, as shown in Figure 5. With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.
As the figure shows, a small intermediate buffer, the "store buffer", sits between the CPU and the cache (in this figure the data flow through the store buffer is one-way: the CPU can only write into it, not read from it). When CPU0 needs to store, it first writes x into the store buffer and then goes on with other work, writing the data into the cache when appropriate (after receiving CPU1's invalidate ack for x). A concrete example: a and b are both initially 0, a is in CPU1's cache (E), and b is in CPU0's cache (E).
a = 1;
b = a + 1;
assert(b == 2);
Assume the code above runs on CPU0. One possible execution order is:
- CPU0 starts executing a = 1;
- CPU0 looks for a in its cache and misses; since this is a store, CPU0 sends a "Read Invalidate" message;
- CPU0 places a = 1 into its store buffer;
- CPU1 receives the "Read Invalidate" message, replies with a Read Response (a = 0), and sets the state of its cache line for a to I;
- CPU0 receives CPU1's Read Response and sets a in its own cache to 0;
- CPU0 executes b = a + 1, reads a = 0 from the cache, computes b = 1, and because b is in CPU0's cache in state E, writes 1 straight into the cache;
- only now does CPU0 receive CPU1's invalidate ack, and only now does it move a's value from the store buffer into its cache;
- CPU0 executes the assert; since the cached b is 1, the assert fails.
The code fails because CPU0 holds two copies of a: one in the store buffer and one in the cache. "Store forwarding" solves this in hardware: on every load, the CPU consults the store buffer as well as the cache, guaranteeing correct reads. The architecture diagram becomes the following; note the direction of the data-flow arrows.
(figure: store buffers with store forwarding; loads now consult both the store buffer and the cache)
A store buffer is used when writing to an invalid cache line. As the write will proceed anyway, the CPU issues a read-invalid message (hence the cache line in question and all other CPUs' cache lines that store that memory address are invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives in the cache.
A direct consequence of the store buffer's existence is that when a CPU commits a write, that write is not immediately written in the cache. Therefore, whenever a CPU needs to read a cache line, it first scans its own store buffer for the existence of the same line, as there is a possibility that the same line was written by the same CPU before but hasn't yet been written in the cache (the preceding write is still waiting in the store buffer). Note that while a CPU can read its own previous writes in its store buffer, other CPUs cannot see those writes until they are flushed to the cache - a CPU cannot scan the store buffer of other CPUs.
When the CPU reads from a cache line, it first checks its store buffer for a matching entry, because a preceding write may not have completed and the value in the cache may be stale.
4.2 Invalidate Queue
Consider the following scenario: a is in CPU1's cache and b in CPU0's cache, both initially 0; CPU0 executes foo() and CPU1 executes bar().
void foo(void)
{
a = 1;
b = 1;
}
void bar(void)
{
while(b == 0) continue;
assert(a == 1);
}
- CPU 0 executes a = 1, misses in its cache, sends a read invalidate message, and places a = 1 in its store buffer;
- CPU 1 executes while(b == 0), reads b, misses in its cache, and sends a read message;
- CPU 0 executes b = 1; since b is already in its cache (necessarily in state E or M), it modifies the cached b to 1 directly;
- CPU 0 receives the read message for b and replies with a Read Response (b = 1), setting the line's state to S;
- CPU 1 receives the Read Response (b = 1) and updates its cached b to 1; it now reads b as 1, so while(b == 0) exits;
- CPU 1 executes assert(a == 1); a is still 0 in CPU 1's cache, so the assert fails;
- only now does CPU 1 receive CPU 0's read invalidate message; it sends back its cached a = 0 (Read Response), sets its cache line for a to I, and of course also sends an invalidate ack;
- CPU 0 receives the a = 0 message and writes it to its cache; on receiving the invalidate ack, it writes a = 1 from the store buffer into the cache.
The problem above can be corrected with memory-barrier instructions. Here are some of the barrier APIs in Linux:
- smp_mb(): All memory accesses before the smp_mb() will be visible to all cores within the SMP system before any accesses after the smp_mb().
- smp_rmb(): Like smp_mb(), but only guarantees ordering between read accesses.
- smp_wmb(): Like smp_mb(), but only guarantees ordering between write accesses.
All reads and writes executed before an smp_mb() are visible after it; smp_rmb and smp_wmb cover only the read side and only the write side, respectively. An smp_mb() effectively "flushes" the store buffer: the CPU waits until the data previously queued there has been dealt with. There are two possible implementations: stall until everything buffered in the store buffer has been processed; or keep placing the stores issued after the smp_mb() into the store buffer as well, queued behind the earlier entries (earlier data is processed first, later additions last).
Adding the barrier to the code CPU 0 executes (assuming the latter implementation):
void foo(void)
{
a = 1;
smp_mb();
b = 1;
}
One possible execution flow:
- CPU 0 executes a = 1, misses in its cache, sends a read invalidate message, and places a = 1 in its store buffer;
- CPU 1 executes while(b == 0), reads b, misses in its cache, and sends a read message;
- CPU 0 executes smp_mb(), which marks the a = 1 entry in its store buffer;
- CPU 0 executes b = 1; b starts out in CPU 0's cache, so it could normally be written directly, but because the store buffer contains a marked entry, b = 1 is appended to the end of the store buffer as well, unmarked;
- CPU 0 receives CPU 1's read message (for b), sends a Read Response (b = 0), and sets the corresponding cache line's state to S;
- CPU 1 receives the Read Response (b = 0), writes b = 0 into its cache, sets the line's state to S, and the while loop continues;
- CPU 1 receives the read invalidate message, sends its cached a = 0 to CPU 0 (Read Response, a = 0), and sets its cache line for a to I;
- CPU 0 receives the Read Response (a = 0); combining it with the a = 1 in the store buffer, it sets a's cache line to E and updates the cached a to 1. With a's store-buffer entry processed, it can move on to b = 1; but b's cache line is now in state S, so it cannot write directly (CPU1 holds b's cache line in S) and must first send an Invalidate message to invalidate CPU1's copy of b;
- CPU 1 receives the Invalidate message, replies with an invalidate ack, and sets b's cache line to I;
- CPU 1 executes while(b == 0), reads b, misses in its cache, and sends a read message;
- CPU 0 receives the invalidate ack, sets b's cache line to E, and finally processes the b = 1 in the store buffer, writing it into the cache;
- CPU 0 receives the read message and replies with a Read Response (b = 1), setting b's cache line to S, the value still 1;
- CPU 1 receives the Read Response (b = 1) and the while loop exits;
- CPU 1 executes assert(a == 1); a misses in its cache (it was set to I in step 7), so CPU 1 sends a read message, fetches the latest cached value a = 1 from CPU 0, and the assert succeeds.
This example shows that with smp_mb added, once thread 1 reads b = 1 (and the while loop exits), a = 1 is guaranteed to be visible on CPU 1, achieving the intended synchronization. In other words, smp_mb guarantees that the code before it takes effect before the code after it: if the code after the smp_mb has executed, the code before it must have executed as well.
Store buffers come with a problem of their own, however: their capacity is very limited, and once one fills up we have to wait for some entries to drain to make room for new ones. This becomes far more pronounced when barriers like smp_mb are used, because then the stores after the smp_mb must also go through the store buffer, and a buffered store can only complete after the invalidate acks arrive (regardless of whether there is already a matching store-buffer entry or a cache hit; if the line is in state S, the copies in the other CPUs must first be set to I before the write can proceed). If the other CPUs are slow to reply with invalidate acks, the CPU's execution efficiency suffers badly.
One approach to this problem is to speed up the invalidate ack replies, and the solution is the invalidate queue.
the CPU need not actually invalidate the cache line before sending the acknowledgement. It could instead queue the invalidate message with the understanding that the message will be processed before the CPU sends any further messages regarding that cache line.
On receiving an invalidate message, the CPU places it into its invalidate queue and immediately replies with an invalidate ack. The CPU itself knows the rule perfectly well: before sending any further message concerning a cache line whose data has an entry in the invalidate queue, it must first actually invalidate that cache line (set it to I). This improves the efficiency of both CPUs.
A store barrier will flush the store buffer, ensuring all writes have been applied to that CPU's cache. A read barrier will flush the invalidation queue, thus ensuring that all writes by other CPUs become visible to the flushing CPU.
A store barrier instruction flushes the store buffer, guaranteeing that the data in it is written into the CPU's cache lines. A read barrier flushes the invalidate queue, guaranteeing that the writes other CPUs made to the queued entries become visible to the current CPU.
The CPU cache architecture now looks like the figure below.
(figure: each CPU with a store buffer and an invalidate queue between core, cache, and bus)
Continuing the example above: a is in CPU1's cache and b in CPU0's cache, both initially 0; CPU0 executes foo() and CPU1 executes bar(). With both store buffers and invalidate queues, one possible execution is:
void foo(void)
{
a = 1;
b = 1;
}
void bar(void)
{
while(b == 0) continue;
assert(a == 1);
}
- CPU 0 executes a = 1, misses in its cache, sends a read invalidate message, and places a = 1 in its store buffer;
- CPU 1 executes while(b == 0), reads b, misses in its cache, and sends a read message;
- CPU 0 executes b = 1; since b is in CPU 0's cache (M or E), it modifies the cached b to 1 directly;
- CPU 0 receives the read message, sends a Read Response (b = 1), and sets b's cache line to S;
- CPU 1 receives the read invalidate message, puts a into its invalidate queue, and immediately replies with an invalidate ack. At this point the a in CPU1's cache is still 0 (the line has not yet been set to I);
- CPU 1 receives the Read Response (b = 1) and places it into its cache;
- CPU 1's earlier while(b == 0) now exits, because the b it reads is 1;
- CPU 1 executes assert(a == 1); since the a in CPU1's cache is still 0, the assert fails.
The assert fails in this scenario because the invalidate queue lets CPU 1 read a stale value of a. Memory-barrier instructions again solve the problem:
void foo(void)
{
a = 1;
smp_mb();
b = 1;
}
void bar(void)
{
while(b == 0) continue;
smp_mb();
assert(a == 1);
}
- In foo, the smp_mb() guarantees that if b = 1 has been read, then a = 1 is already visible to the reading thread (the reading CPU); in other words, a = 1 takes effect before b = 1.
- In bar, the smp_mb() marks the entries currently in the invalidate queue: any load after the smp_mb() must wait until those queued invalidations have been processed. So once CPU 1 has read b = 1, assert(a == 1) goes and reads the latest value of a, and the smp_mb() in foo guarantees that value is 1, so the assert succeeds.
Moreover, foo involves only stores and bar only loads, so the smp_mb() calls above can be replaced with the comparatively "lighter-weight" smp_wmb() and smp_rmb():
void foo(void)
{
a = 1;
smp_wmb();
b = 1;
}
void bar(void)
{
while(b == 0) continue;
smp_rmb();
assert(a == 1);
}
smp_rmb() guarantees that all the LOAD operations specified before the barrier will appear to happen before all the LOAD operations specified after the barrier with respect to the other components of the system.
smp_wmb() guarantees that all the STORE operations specified before the barrier will appear to happen before all the STORE operations specified after the barrier with respect to the other components of the system.
The analysis above shows that in the pursuit of CPU performance, once store buffers and invalidate queues are added, cache coherence is broken even with MESI in place, and we have to restore ordering by other means, such as memory-barrier instructions or functions, i.e. synchronization done by the programmer at the software level. The next article will cover the facilities C11/C++11 provide for coherent multi-threaded programming, including atomic types, atomic operations, memory orders, and memory barriers.