Parallel and Distributed Systems

Shared memory architectures
  • Multiple CPUs (or cores)
  • One memory with a global address space
      • May have many modules
      • All CPUs access all memory through the global address space
  • All CPUs can make changes to the shared memory
      • Changes made by one processor are visible to all other processors?
  • Data parallelism or function parallelism?

How to connect CPUs and memory?

  • One large memory: uniform memory access (UMA)
      • All memory is on the same side of the interconnect (mostly a bus)
      • Every memory reference has the same latency
  • Many small memories: non-uniform memory access (NUMA)
      • Local and remote memory
      • Memory latency differs between local and remote references
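
The latency difference between the two organizations can be illustrated with a small sketch. The function name and the latency figures used below are hypothetical and chosen only for illustration; they are not measurements from any machine.

```python
def average_latency(local_fraction, local_ns, remote_ns):
    """Expected latency of one memory reference.

    local_fraction: share of references served by local memory (0.0 to 1.0)
    local_ns, remote_ns: assumed latencies in nanoseconds (hypothetical values)
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # UMA: every reference costs the same, so placement does not matter.
    print(average_latency(0.25, 100, 100))
    # NUMA: the more references go to remote memory, the higher the average.
    print(average_latency(0.5, 100, 300))
```

Under this simple model, UMA is the special case where local and remote latency are equal, while on a NUMA machine the average depends on how well data is placed near the processor that uses it.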

UMA shared memory architecture (mostly bus-based multiprocessors)
  • Many CPUs and memory modules connect to the bus
  • Dominates the server and enterprise market, and is moving down to the desktop
  • Faster processors began to saturate the bus, then bus technology advanced

  • Today, bus-based systems range in size from desktops to large servers (symmetric multiprocessor (SMP) machines)

Front side bus (FSB) bandwidth in Intel systems

Processor                 FSB clock           Transfers per cycle  Bus width  Bandwidth
Pentium D                 133 MHz - 200 MHz   4                    64-bit     4256 MB/s - 6400 MB/s
Pentium Extreme Edition   200 MHz - 266 MHz   4                    64-bit     6400 MB/s - 8512 MB/s
Pentium M                 100 MHz - 133 MHz   4                    64-bit     3200 MB/s - 4256 MB/s
Core Solo                 133 MHz - 166 MHz   4                    64-bit     4256 MB/s - 5312 MB/s
Core Duo                  133 MHz - 166 MHz   4                    64-bit     4256 MB/s - 5312 MB/s
Core 2 Solo               133 MHz - 200 MHz   4                    64-bit     4256 MB/s - 6400 MB/s
Core 2 Duo                133 MHz - 333 MHz   4                    64-bit     4256 MB/s - 10656 MB/s
Core 2 Quad               266 MHz - 333 MHz   4                    64-bit     8512 MB/s - 10656 MB/s
Core 2 Extreme            200 MHz - 400 MHz   4                    64-bit     6400 MB/s - 12800 MB/s

Bandwidth = FSB clock x transfers per cycle x bus width in bytes; for example, 133 MHz x 4 x 8 bytes = 4256 MB/s.

NUMA shared memory architecture
  • Identical processors, but a processor takes a different amount of time to access different parts of the memory.
  • Often made by physically linking SMP machines (Origin 2000, up to 512 processors).
  • The current generation of SMP interconnects (Intel Common System Interface (CSI) and AMD HyperTransport) has this flavor, but the processors are close to each other.

Various SMP hardware organizations

Cache coherence problem

  • Because caches hold copies of memory, different processors may see different values for the same memory location.
  • Processors see different values for u after event 3.
  • With a write-back cache, memory may store the stale data.
  • This happens frequently and is unacceptable to applications.

Bus snoopy cache coherence protocols
  • Memory: centralized, with uniform access time and a bus interconnect.

  • Example: all Intel MP machines, such as diablo

Bus snooping idea
  • Send all requests for data to all processors (through the bus).
  • Processors snoop to see if they have a copy and respond accordingly.
  • Each cache listens to both the CPU and the bus.
  • The state of a cache line may change because of (1) a CPU memory operation or (2) a bus transaction (a remote CPU's memory operation).

  • Requires broadcast, since the caching information is at the processors.
  • A bus is a natural broadcast medium.
  • A bus (centralized medium) also serializes requests.
  • Dominates small-scale machines.

Types of snoopy bus protocols
  • Write invalidate protocols
      • Write to shared data: an invalidate is sent to the bus (all caches snoop and invalidate their copies).
  • Write broadcast protocols (typically write-through)
      • Write to shared data: broadcast on the bus; processors snoop and update any copies.

An example snoopy protocol (MSI)
  • Invalidation protocol, write-back cache
  • Each block of memory is in one state:

      • Clean in all caches and up to date in memory (shared)
      • Dirty in exactly one cache (exclusive)
      • Not in any cache
  • Each cache block is in one state:
      • Shared: the block can be read
      • Exclusive (the M, or modified, state in MSI): the cache has the only copy, which is writable and dirty
      • Invalid: the block contains no data
  • Read misses cause all caches to snoop the bus (bus transaction).
  • A write to a shared block is treated as a miss (needs a bus transaction).

[Figure: MSI protocol state machine for CPU requests]
[Figure: MSI protocol state machine for bus requests]
[Figure: MSI protocol state machine (combined)]
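
The MSI state machines can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the protocol of any particular machine: each cache holds whole values rather than multi-word blocks, the bus transaction names BusRd and BusRdX are borrowed from common textbook usage, and the class names and the values 5 and 7 are made up for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class Bus:
    """Shared bus: owns memory and broadcasts every transaction to the other caches."""

    def __init__(self):
        self.caches = []
        self.memory = {}
        self.transactions = []  # log of (kind, source cache name, address)

    def broadcast(self, kind, source, addr):
        self.transactions.append((kind, source.name, addr))
        for cache in self.caches:
            if cache is not source:
                cache.snoop(kind, addr)


class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}  # address -> [State, value]
        bus.caches.append(self)

    def state(self, addr):
        return self.lines[addr][0] if addr in self.lines else State.INVALID

    def read(self, addr):
        # CPU read: a miss puts BusRd on the bus, then loads the block as Shared.
        if self.state(addr) is State.INVALID:
            self.bus.broadcast("BusRd", self, addr)
            self.lines[addr] = [State.SHARED, self.bus.memory.get(addr, 0)]
        return self.lines[addr][1]

    def write(self, addr, value):
        # CPU write: a write to a Shared or Invalid block is treated as a miss.
        if self.state(addr) is not State.MODIFIED:
            self.bus.broadcast("BusRdX", self, addr)
        self.lines[addr] = [State.MODIFIED, value]

    def snoop(self, kind, addr):
        # Bus side: react to a transaction issued by another cache.
        current = self.state(addr)
        if current is State.INVALID:
            return
        if current is State.MODIFIED:
            self.bus.memory[addr] = self.lines[addr][1]  # write back the dirty copy
        self.lines[addr][0] = State.SHARED if kind == "BusRd" else State.INVALID


def demo():
    bus = Bus()
    p1, p2, p3 = Cache("P1", bus), Cache("P2", bus), Cache("P3", bus)
    bus.memory["u"] = 5
    p1.read("u")      # P1 caches u = 5
    p3.read("u")      # P3 caches u = 5
    p3.write("u", 7)  # P3 invalidates P1's copy and holds u = 7 in Modified
    return p1.read("u"), p2.read("u"), bus.memory["u"]


if __name__ == "__main__":
    print(demo())
```

In this sketch demo() returns (7, 7, 7): the invalidation forces P1 to miss on its next read, and the snooped BusRd makes P3 write the dirty value back before P1 and P2 load it, so no processor observes the stale value.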

Some snooping cache variations
  • Basic protocol
      • Three states: MSI
      • Can be optimized by refining the states to reduce bus transactions in some cases
  • Berkeley protocol
      • Five states: M is split into owned, exclusive, and owned shared
  • Illinois protocol (five states)
  • MESI protocol (four states)
      • M is split into modified and exclusive
      • Used by Intel MP systems

Multiple levels of caches
  • Most processors today have on-chip L1 and L2 caches.
  • Transactions on the L1 cache are not visible to the bus (L1 would need a separate snooper for coherence, which would be expensive).
  • Typical solution:
      • Maintain the inclusion property on the L1 and L2 caches so that all bus transactions relevant to L1 are also relevant to L2; it is then sufficient for only the L2 controller to snoop the bus.
      • Propagate transactions for coherence through the hierarchy.

Large shared memory multiprocessors
  • The interconnection network is usually not a bus.
  • There is no broadcast medium, so caches cannot snoop.
  • A different kind of cache coherence protocol is needed.

Basic idea
  • Use an idea similar to the snoopy bus
  • Snoopy bus with the MSI protocol
      • A cache line has three states (M, S, and I)
      • Whenever we need a cache coherence operation, we tell the bus (the central authority)
  • CC protocol for large SMPs
      • A cache line has three states
      • Whenever we need a cache coherence operation, we tell the central authority, which
          • serializes the accesses
          • performs the cache coherence operations using point-to-point communication
          • needs to know who has a cache copy; this information is stored in the directory

Cache coherence for large SMPs
  • Use a directory entry for each cache line to track the state of every block in the cache
  • Can also track the state of all memory blocks
      • Directory size = O(memory size)
  • Need to use a distributed directory
      • A centralized directory becomes the bottleneck
      • Who is the central authority for a given cache line?
  • Typically called cc-NUMA multiprocessors

cc-NUMA multiprocessors

Directory based cache coherence protocols
  • Similar to the snoopy protocol: three states
      • Shared: one or more processors have the data, memory is up to date
      • Uncached: not valid in any cache
      • Exclusive: one processor has the data, memory is out of date
  • The directory must track:
      • Cache state
      • Which processors have the data when it is in the shared state
          • Bit vector, 1 if a particular processor has a copy
          • Id and bit vector combination
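
A directory entry that tracks the three states and a sharer bit vector might look like the following. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, processors are numbered from 0, and a Python integer stands in for the fixed-width bit vector a hardware directory would use.

```python
from dataclasses import dataclass
from enum import Enum


class DirState(Enum):
    UNCACHED = "uncached"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


@dataclass
class DirectoryEntry:
    """Directory state for one memory block (hypothetical layout)."""
    state: DirState = DirState.UNCACHED
    sharers: int = 0  # bit i is 1 if processor i has a copy

    def add_sharer(self, proc):
        # A processor obtained a read-only copy.
        self.sharers |= 1 << proc
        self.state = DirState.SHARED

    def set_owner(self, proc):
        # A processor obtained the only, writable copy.
        self.sharers = 1 << proc
        self.state = DirState.EXCLUSIVE

    def remove(self, proc):
        # A processor dropped or lost its copy.
        self.sharers &= ~(1 << proc)
        if self.sharers == 0:
            self.state = DirState.UNCACHED

    def holders(self):
        # Decode the bit vector into a list of processor ids.
        return [i for i in range(self.sharers.bit_length()) if (self.sharers >> i) & 1]


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.add_sharer(0)
    entry.add_sharer(3)
    print(entry.state, bin(entry.sharers), entry.holders())
```

With one bit per processor, the entry grows with the number of processors, which is one reason the directory is distributed along with the memory it describes.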

Directory based cache coherence protocols
  • No bus, and we do not want to broadcast
  • Typically three processors are involved:
      • Local node: where a request originates
      • Home node: where the memory location of an address resides (this is the central authority for the page)
      • Remote node: has a copy of a cache block (exclusive or shared)
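
The division of work between the local, home, and remote nodes can be sketched as a home node that reacts to misses by sending point-to-point messages. This is a simplified illustration under stated assumptions: the class name and the message names fetch, fetch_invalidate, and data_reply are hypothetical, messages are only logged rather than delivered, and acknowledgements are omitted.

```python
class HomeNode:
    """Home node for a set of addresses: keeps the directory and sends messages."""

    def __init__(self):
        self.directory = {}  # address -> {"state": str, "holders": set of node ids}
        self.sent = []       # log of (message, destination node, address)

    def entry(self, addr):
        return self.directory.setdefault(addr, {"state": "uncached", "holders": set()})

    def read_miss(self, requester, addr):
        e = self.entry(addr)
        if e["state"] == "exclusive":
            # Memory is out of date: ask the owner to write the block back.
            owner = next(iter(e["holders"]))
            self.sent.append(("fetch", owner, addr))
        e["holders"].add(requester)
        e["state"] = "shared"
        self.sent.append(("data_reply", requester, addr))

    def write_miss(self, requester, addr):
        e = self.entry(addr)
        if e["state"] == "shared":
            # Invalidate every remote copy, one point-to-point message each.
            for node in sorted(e["holders"] - {requester}):
                self.sent.append(("invalidate", node, addr))
        elif e["state"] == "exclusive":
            owner = next(iter(e["holders"]))
            if owner != requester:
                self.sent.append(("fetch_invalidate", owner, addr))
        e["holders"] = {requester}
        e["state"] = "exclusive"
        self.sent.append(("data_reply", requester, addr))


if __name__ == "__main__":
    home = HomeNode()
    for proc in ("P1", "P2", "P3"):
        home.read_miss(proc, "A")  # three remote nodes share the line
    home.sent.clear()
    home.write_miss("L", "A")      # local node L wants to write
    print(home.sent)
    print(home.directory["A"])
```

Because every request for an address goes through its home node, the home node serializes accesses to that address in the way the bus does in a snoopy system, but only the nodes recorded in the directory receive messages.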

Directory protocol messages example

Directory based CC protocol in action
  1. Local node (L): WriteMiss(P, A) to the home node
  2. Home node: the cache line is in the shared state at processors P1, P2, P3
  3. Home node to P1, P2, P3: invalidate(P, A)
  4. Home node: the cache line is in the exclusive state at processor L

Summary
  • Shared memory architectures
      • UMA and NUMA
      • Bus based systems and interconnect based systems
  • Cache coherence problem
  • Cache coherence protocols
      • Snoopy bus
      • Directory based
