Bluespec technical deep dive

Bluespec technical deep dive

Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 15, 2017 6.175 L22-1 SC and caches Caches present a similar problem as store buffers stores in one cache will not be visible to other caches automatically Cache problem is solved differently caches are kept coherent P P Cache

Cache Memory How to build coherent caches is the topic of this lecture November 13, 2017 6.175 L21-2 Cache-coherence problem P1 A 100 P2 200 cache-1 A 100 cache-2 Processor-Memory Interconnect A 100

200 memory Suppose P1 updates A to 200. write-back: memory and P2 have stale values write-through: P2 has a stale value Do these stale values matter for programming? Yes, if we want to implement SC or, in fact, any reasonable memory model November 15, 2017 6.175 L22-3 Shared Memory Systems P L1 P L1 P L1 L2

P L1 P L1 P L1 L2 Interconnect M Modern systems often have hierarchical caches Each cache has exactly one parent but can have zero or more children Logically only a parent and its children can communicate directly Inclusion property is maintained between a parent and its children, i.e., Because usually a Li a Li+1 L >> L November 15, 2017 6.175 i+1

i L22-4 Cache-Coherent Memory req res ... req res Monolithic Memory A monolithic memory processes one request at a time; it can be viewed as processing requests instantaneously A memory with hierarchy of caches is said to be coherent, if functionally it behaves like the monolithic memory November 15, 2017 6.175 L22-5 Maintaining Coherence

In a coherent memory all loads and stores can be placed in a global order multiple copies of an address in various caches can cause this property to be violated This property can be ensured if: Only one cache at a time has the write permission for an address No cache can have a stale copy of the data after a write to the address has been performed cache coherence protocols are used to maintain coherence November 15, 2017 6.175 L22-6 Cache Coherence Protocols Write request: the address is invalidated in all other caches before

the write is performed Read request: if a dirty copy is found in some other cache then that value is written back to the memory and supplied to the reader. Alternatively the dirty value can be forwarded directly to the reader Such protocols are called Invalidation-based November 15, 2017 6.175 L22-7 State and actions needed to maintain Cache Coherence Each line in each cache maintains MSI state: I - cache doesnt contain the address S- cache has the address but so may other S caches; hence it can only be read M- only this cache has the address; hence it can be read and written I

M store write-back Action on a read miss (i.e., Cache state is I): If some other cache has the address in state M then write back the dirty data to Memory and set its state to S Read the value from Memory and set the state to S Action on a write miss (i.e., Cache state is I or S): Invalidate the address in other caches; in case some cache has the address in state M then write back the dirty data Read the value from Memory if necessary and set the state to M November 15, 2017 How do we know the state of other caches? 6.175

L22-8 Protocols are distributed! Fundamental assumption A processor or cache can only examine or set its own state The state of other caches is inferred or set by sending request and response messages Each parent cache maintains information about each of its child cache in a directory November 15, 2017 Directory information is conservative, e.g., if the directory say that the child cache c has a cache-line in state S, then cache c may have the address in either S or I state but not in M state Sometimes the state of a cache line is transient because it has requested a change. Directory also contains information about outstanding messages 6.175

L22-9 Directory State Encoding Two-level (L1, M) system P P [S,a] L1 L1 P P P [S,a] L1 L1 Interconnect Ma Addr V Tag M/S dir

directory for a <[S, no],I,[S, no],I> Data Block -L1 has no directory -M has no need for MSI for each child <[(M|S|I), (No | Yes)]> Childs state November 15, 2017 Waiting for downgrade response 6.175 L22-10 November 15, 2017 6.175 lo a d

va li d Upgrade: A cache miss causes transition from a lower state to a higher state Downgrade: A write-back or invalidation causes a transition from a higher state to a lower state in I S sh flu re sto The states M, S, I can be thought of as an order M>S>I at e State ordering to develop

protocols store M write-back L22-11 Message passing an abstract view P p2m L1 PP P m2p c2m interconnect m2c m2p c2m m2c

in p2m PP L1 out PP m Each cache has 2 pairs of queues (c2m, m2c) to communicate with the memory (p2m, m2p) to communicate with the processor Message format: Req/Resp address state FIFO message passing between each (srcdst) pair except a Req cannot block a Resp Messages in one srcdst path cannot block messages in another srcdst path November 15, 2017 6.175 L22-12

Consequences of distributed protocol In the blocking-cache protocol we presented in L15, a cache could go to sleep after it issued a request for a missing line A cache may receive an invalidation request at any time from other caches (via its parent); such requests cannot be ignored otherwise the system will deadlock none of the requests may be able to complete A difficult part of the protocol design is to determine which request can arrive in a given state November 15, 2017 6.175 L22-13 Processing misses: Requests and Responses Cache Cache 3,5,7 1,5,8

3 1 5 2 6 2,6 2,4 Memory November 15, 2017 3,5,7 1,5,8 4 1 2 3 4 5 6 7 8

Up-req send (cache) Up-req proc, Up resp send (memory) Up-resp proc (cache) Dn-req send (memory) Dn-req proc, Dn resp send (cache) Dn-resp proc (memory) Dn-req proc, drop (cache) Voluntary Dn-resp (cache) 6.175 L22-14 CC protocol for blocking caches Extension to the cache rules for Blocking L1 design discussed in lecture L15 Code is somewhat simplified by assuming that cache-line size = one word syntax is full of errors November 15, 2017 6.175 L22-15

Req method hit processing method Action req(MemReq r) if(mshr == Ready); let a = r.addr; P let hit = contains(state, a); if(hit) begin p2m m2p let slot = getSlot(state, a); c2m let x = dataArray[slot]; PP L1 m2c if(r.op == Ld) hitQ.enq(x); else // it is store if (isStateM(state[slot]) dataArray[slot] <=; else begin missReq <= r; mshr <= SendFillReq; missSlot <= slot; end end else begin missReq <= r; mshr <= StartMiss; end // (1) endmethod November 15, 2017 6.175 L22-16

Start-miss and Send-fill rules Rdy -> StrtMiss -> SndFillReq -> WaitFillResp -> Resp -> Rdy rule startMiss(mshr == StartMiss); let slot = findVictimSlot(state, missReq.addr); if(!isStateI(state[slot])) begin // write-back (Evacuate) let a = getAddr(state[slot]); let d = (isStateM(state[slot])? dataArray[slot]: -); state[slot] <= (I, _); c2m.enq(m, a, I, d>); end mshr <= SendFillReq; missSlot <= slot; endrule rule sendFillReq (mshr == SendFillReq); let upg = (missReq.op == Ld)? S : M; c2m.enq(m, missReq.addr, upg, - >); mshr <= WaitFillResp; endrule // (1) November 15, 2017 6.175 L22-17 Wait-fill rule and Proc Resp rule Rdy -> StrtMiss -> SndFillReq -> WaitFillResp -> Resp -> Rdy rule waitFillResp ((mshr == WaitFillResp) &&& (m2c.first matches c, .a, .cs, .d>)); let slot = missSlot; dataArray[slot] <=

(missReq.op == Ld)? d :; state[slot] <= (cs, a); m2c.deq; mshr <= Resp; endrule // (3) rule sendProc(mshr == Resp); if(missReq.op == Ld) begin c2p.enq(dataArray[slot]); end mshr <= Ready; endrule November 15, 2017 6.175 L22-18 Parent Responds rule parentResp (c2m.first matches m,.a,.y,.*>); let slot = getSlot(state, a); // in a 2-level // system a has to be present in the memory let statea = state[slot]; if((ic, isCompatible(statea.dir[i],y)) && (statea.waitc[c]=No)) begin let d =(statea.dir[c]=I)? dataArray[slot]: -); m2c.enq(c, a, y, d>); state[slot].dir[c]:=y; IsCompatible(M, M) = False c2m.deq; IsCompatible(M, S) = False

end IsCompatible(S, M) = False endrule All other cases = True November 15, 2017 6.175 L22-19 Parent (Downgrade) Requests rule dwn (c2m.first matches m,.a,.y,.*>); let slot = getSlot(state, a); let statea = state[slot]; if (findChild2Dwn(statea) matches (Valid .i)) begin state[slot].waitc[i] <= Yes; m2c.enq(i, a, (y==M?I:S), ? >); end; Endrule // (4) This rule will execute as long some child cache is not compatible with the incoming request November 15, 2017 6.175 L22-20

Parent receives Response rule dwnRsp (c2m.first matches m, .a, .y, .data>); c2m.deq; let slot = getSlot(state, a); let statea = state[slot]; if(statea.dir[c]=M) dataArray[slot]<=data; state[slot].dir[c]<=y; state[slot].waitc[c]<=No; endrule // (6) November 15, 2017 6.175 L22-21 Child Responds rule dng ((mshr != Resp) &&& m2c.first matches c,.a,.y,.*>); let slot = getSlot(state,a); if(getCacheState(state[slot])>y) begin let d = (isStateM(state[slot])? dataArray[slot]: -); c2m.enq(m, a, y, d>); state[slot] <= (y,a); end // the address has already been downgraded m2c.deq;

endrule // (5) and (7) November 15, 2017 6.175 L22-22 Child Voluntarily downgrades rule startMiss(mshr == Ready); let slot = findVictimSlot(state); if(!isStateI(state[slot])) begin // write-back (Evacuate) let a = getAddr(state[slot]); let d = (isStateM(state[slot])? dataArray[slot]: -); state[slot] <= (I, _); c2m.enq(m, a, I, d>); end endrule // (8) Rules 1 to 8 are complete - cover all possibilities and cannot deadlock or violate cache invariants November 15, 2017 6.175 L22-23

Invariants for a CC-protocol design Directory state is always a conservative estimate of a childs state E.g., if directory thinks that a child cache is in S state then the cache has to be in either I or S state For every request there is a corresponding response, though sometimes it is generated even before the request is processed Communication system has to ensure that responses cannot be blocked by requests a request cannot overtake a response for the same address At every merger point for requests, we will assume fair arbitration to avoid starvation November 15, 2017 6.175 L22-24

Recently Viewed Presentations

  • History and Setting up Projects - Kennesaw State University

    History and Setting up Projects - Kennesaw State University

    Need scaffolding. What is OpenGL? "Open" (source) Graphics Language. ... Vertex Processing. Clipping/Assembly. Rasterization. Fragment Processing. Fragment Processing. Works on a pixel/fragment basis. Determining the color of each pixel. We write pixel shaders for this.
  • Pauls Third Journey Paul wrote Romans from Corinth

    Pauls Third Journey Paul wrote Romans from Corinth

    in mighty signs and wonders, by the power of the Spirit of God, so that from Jerusalem and round about to Illyricum . I have fully preached the gospel of Christ. Rom 15:20 . And so I have made it...
  • ADA Compliant Lecture PowerPoint

    ADA Compliant Lecture PowerPoint

    Forbidden toy remained highly attractive. No change in attitude. Had sufficient external justification for resisting toy. Aronson and Carlsmith (1963): When the experimenter returned, the children who had received a severe threat continued to rate the forbidden toy as highly...
  • Assignment Agreement.Title IV of the Intergovernmental ...

    Assignment Agreement.Title IV of the Intergovernmental ...

    Fringe Benefits (based on actual costs) The NPC will calculate this. 25. The NPC will submit monthly (quarterly, annually) an invoice to cost center #123 (These are VA's cost center that associate with the specific grant or project covered in...
  • Kinetic-Molecular Theory (KMT)

    Kinetic-Molecular Theory (KMT)

    Kinetic-Molecular Theory (KMT) Basic Principles of KMT Absolute temperature Pressure Diffusion What is KMT? * Based on the research of Robert Boyle (1627 - 1691) A theory that envisions molecules in motion Best describes properties and behaviors of gases *...
  • 2009 PE Update - Mintie

    2009 PE Update - Mintie

    Electronic Statement of Conditions (eSOC) Security Administrator Sites & Buildings: Revised Government Suspension BBI Enhancements After Gov't Suspension Sites & Building Page Emergency Management Overview Is now an accreditation manual chapter All Standards and Elements of Performance from 2008 are...
  • PPD Study Session

    PPD Study Session

    Ventura County SELPAPattern of Strengths and Weaknesses (PSW) Model: Psych TOT Training #4. Kim Charnofsky, Mental Health Facilitator, CVUSD. Jenny Jones, Director, VCOE
  • Chap. 9 Pipeline and Vector Processing

    Chap. 9 Pipeline and Vector Processing

    "A survey of cache coherence schemes for multiprocessors" Tightly coupled system MM CPUs cluster cluster cluster cluster cluster cluster cluster Crossbar- Hierarchies cluster cluster cluster cluster cluster cluster cluster cluster cluster Crossbar Node Node Node Node Node 4 Cluster PU...