Transactional Memory Implementation Lecture 1 COS597C, Fall 2010

Transactional Memory  Implementation Lecture 1 COS597C, Fall 2010

Transactional Memory Implementation Lecture 1 COS597C, Fall 2010 Princeton University Arun Raman 1 Module Outline Lecture 1 (THIS LECTURE) Transactional Memory System Taxonomy Software Transactional Memory Hardware Accelerated STMs Lecture 2 (Thursday) Hardware based TMs

Hybrid TMs When are transactions a good (bad) idea Assignment Compare lock-based and transaction-based implementations of a concurrent data structure 2 Resources [0] TL2 and STAMP Benchmark Suite --- For ASSIGNMENT http://stamp.stanford.edu/ [1] Transactional Memory, Concepts, Implementations, and Opportunities Christos Kozyrakis http://csl.stanford.edu/~christos/publications/2008.tm_tutorial.acaces.pdf [2] Transactional Memory - Larus and Rajwar Chapter 4: Hardware Transactional Memory

Lot of the material in this lecture is sourced from here [3] McRT-STM: A High Performance Software Transactional Memory System for a Multi-core Runtime http://portal.acm.org/ft_gateway.cfm?id=1123001&type=pdf&CFID=109418483&CFTOKEN=70706278 [4] Architectural Support for Software Transactional Memory (Hardware Accelerated STM) - Saha et al. http://www.computer.org/portal/web/csdl/doi/10.1109/MICRO.2006.9 [5] Signature-Accelerated STM (SigTM) http://portal.acm.org/citation.cfm?id=1250673 [6] Case Study: Early Experience with a Commercial Hardware Transactional Memory Implementation [ASPLOS '09] http://delivery.acm.org/10.1145/1510000/1508263/p157-dice.pdf?key1=1508263&key2=2655478821&coll=GUIDE&dl=GUIDE&CFID=1113 87689&CFTOKEN=28498606

[7] The Transactional Memory / Garbage Collection Analogy - Dan Grossman http://portal.acm.org/ft_gateway.cfm?id=1297080&type=pdf&coll=GUIDE&dl=GUIDE&CFID=111387606&CFTOKEN=67012128 [8] Subtleties of Transactional Memory Atomicity Semantics - Blundell et al. http://www.cis.upenn.edu/acg/papers/cal06_atomic_semantics.pdf 3 Lecture Outline 1. Recap of Transactions 2. Transactional Memory System Taxonomy 3. Software Transactional Memory 4. Hardware Accelerated STM 4 Lecture Outline

1. Recap of Transactions 2. Transactional Memory System Taxonomy 3. Software Transactional Memory 4. Hardware Accelerated STMs 5 Parallel Programming 1. Find independent tasks in the algorithm 2. Map tasks to execution units (e.g. threads) 3. Define and implement synchronization among tasks 1. Avoid races and deadlocks, address memory model issues, 4. Compose parallel tasks 5. Recover from errors

6. Ensure scalability 7. Manage locality 8. 6 Parallel Programming 1. Find independent tasks in the algorithm 2. Map tasks to execution units (e.g. threads) 3. Define and implement synchronization among tasks 1. Avoid races and deadlocks, address memory model issues, Transactional Memory 4. Compose parallel tasks 5. Recover from errors

6. Ensure scalability 7. Manage locality 8. 7 Transactional Programming void deposit(account, amount) { lock(account); int t = bank.get(account); t = t + amount; bank.put(account, t); unlock(account); }

void deposit(account, amount) { atomic { int t = bank.get(account); t = t + amount; bank.put(account, t); } } 1. Declarative Synchronization What, not How 2. System implements Synchronization transparently 8 Transactional Memory Memory Transaction - An atomic and isolated sequence of

memory accesses Transactional Memory Provides transactions for threads running in a shared address space 9 Transactional Memory - Atomicity Atomicity On transaction commit, all memory updates appear to take effect at once; on transaction abort, none of the memory updates appear to take effect Thread 1 void deposit(account, amount) { atomic { int t = bank.get(account); RD t = t + amount;

bank.put(account, t); WR } } RD A : 0 Thread 2 CONFLICT RD A : 0 WR A : 10 ABORT

WR A : 5 COMMIT 10 Transactional Memory - Isolation Isolation No other code can observe updates before commit Programmer only needs to identify operation sequence that should appear to execute atomically to other, concurrent threads 11 Transactional Memory - Serializability

Serializability Result of executing concurrent transactions on a data structure must be identical to a result in which these transactions executed serially. 12 Some advantages of TM 1. Ease of use (declarative) 2. Composability 3. Expected performance of fine-grained locking 13 Composability : Locks void transfer(A, B, amount) {

synchronized(A) { synchronized(B) { withdraw(A, amount); deposit(B, amount); } } } void transfer(B, A, amount) { synchronized(B) { synchronized(A) { withdraw(B, amount); deposit(A, amount); } }

} 1. Fine grained locking Can lead to deadlock 2. Need some global locking discipline now 14 Composability : Locks void transfer(A, B, amount) { synchronized(bank) { withdraw(A, amount); deposit(B, amount); } }

void transfer(B, A, amount) { synchronized(bank) { withdraw(B, amount); deposit(A, amount); } } 1. Fine grained locking Can lead to deadlock 2. Coarse grained locking No concurrency 15 Composability : Transactions void transfer(A, B, amount) { atomic {

withdraw(A, amount); deposit(B, amount); } } void transfer(B, A, amount) { atomic { withdraw(B, amount); deposit(A, amount); } } 1. Serialization for transfer(A,B,100) and transfer(B,A,100) 2. Concurrency for transfer(A,B,100) and transfer(C,D,100)

16 Some issues with TM 1. I/O and unrecoverable actions 2. Atomicity violations are still possible 3. Interaction with non-transactional code 17 Atomicity Violation Thread 1 atomic { ptr = A;

} Thread 2 atomic { ptr = NULL; } atomic { B = ptr->field; } 18 Interaction with non-transactional code

Thread 1 lock_acquire(lock); obj.x = 1; if (obj.x != 1) fireMissiles(); lock_release(lock); Thread 2 obj.x = 2; 19 Interaction with non-transactional code Thread 1 atomic {

obj.x = 1; if (obj.x != 1) fireMissiles(); } Thread 2 obj.x = 2; 20 Interaction with non-transactional code Thread 1 atomic { obj.x = 1; if (obj.x != 1)

fireMissiles(); } Thread 2 obj.x = 2; Weak Isolation Transactions are serializable only against other transactions Strong Isolation Transactions are serializable against all memory accesses (Non-transactional LD/ST are 1instruction TXs) 21 Nested Transactions void transfer(A, B, amount) {

atomic { withdraw(A, amount); deposit(B, amount); } } void deposit(account, amount) { atomic { int t = bank.get(account); t = t + amount; bank.put(account, t); } } Semantics of Nested Transactions

Flattened Closed Nested Open Nested 22 Nested Transactions - Flattened int x = 1; atomic { x = 2; atomic flatten { x = 3; abort; } }

23 Nested Transactions - Closed int x = 1; atomic { x = 2; atomic closed { x = 3; abort; } } 24

Nested Transactions - Open int x = 1; atomic { x = 2; atomic open { x = 3; } abort; } 25 Nested Transactions Open Use Case int counter = 1; atomic {

atomic open { counter++; } } 26 Nested Transactions Nested Parallelism! Intelligent Speculation for Pipelined Multithreading Neil Vachharajani, PhD Thesis, Princeton University 27

Transactional Programming - Summary 1. Transactions do not generate parallelism 2. Transactions target performance of fine-grained locking @ effort of coarse-grained locking 3. Various constructs studied previously (atomic, retry, orelse,) 4. Different semantics (Weak/Strong Isolation, Nesting) 28 Lecture Outline 1. Recap of Transactions 2. Transactional Memory System Taxonomy 3. Software Transactional Memory 4. Hardware Accelerated STMs

29 TM Implementation Data Versioning Eager Versioning Lazy Versioning Conflict Detection and Resolution Pessimistic Concurrency Control Optimistic Concurrency Control Conflict Detection Granularity Object Granularity Word Granularity Cache line Granularity 30

TM Implementation Data Versioning Eager Versioning Lazy Versioning Conflict Detection and Resolution Pessimistic Concurrency Control Optimistic Concurrency Control Conflict Detection Granularity Object Granularity Word Granularity Cache line Granularity 31 Data Versioning

Eager Versioning (Direct Update) Lazy Versioning (Deferred Update) 32 TM Implementation Data Versioning Eager Versioning Lazy Versioning Conflict Detection and Resolution Pessimistic Concurrency Control Optimistic Concurrency Control Conflict Detection Granularity Object Granularity

Word Granularity Cache line Granularity 33 Conflict Detection and Resolution - Pessimistic Conflict with Stall Conflict with Abort Time No Conflict 34

Conflict Detection and Resolution - Optimistic Conflict with Abort Conflict with Commit Time No Conflict 35 TM Implementation Data Versioning Eager Versioning Lazy Versioning

Conflict Detection and Resolution Pessimistic Concurrency Control Optimistic Concurrency Control Conflict Detection Granularity Object Granularity Word Granularity Cache line Granularity 36 Examples Hardware TM Stanford TCC: Lazy + Optimistic Intel VTM: Lazy + Pessimistic Wisconsin LogTM: Eager + Pessimistic UHTM

SpHT Software TM Sun TL2: Lazy + Optimistic (R/W) Intel STM: Eager + Optimistic (R)/Pessimistic (W) MS OSTM: Lazy + Optimistic (R)/Pessimistic (W) Draco STM STMLite DSTM Can find many more at http://www.dolcera.com/wiki/index.php?title=Transactional_memory 37 Lecture Outline 1. Recap of Transactions 2. Transactional Memory System Taxonomy 3. Software Transactional Memory (Intel McRT-STM)

4. Hardware Accelerated STMs 38 Software Transactional Memory (STM) atomic { a.x = t1 a.y = t2 if (a.z == 0) { a.x = 0 a.z = t3 } }

tmTXBegin() tmWr(&a.x, t1) tmWr(&a.y, t2) if (tmRd(&a.z) != 0) { tmWr(&a.x, 0) tmWr(&a.z, t3) } tmTXCommit() 39 Intel McRT-STM Strong or Weak Isolation

Weak Transaction Granularity Word or Object Lazy or Eager Versioning Eager Concurrency Control Optimistic read, Pessimistic Write

Nested Transaction Closed 40 McRT-STM Runtime Data Structures Transaction Descriptor (per thread) Used for conflict detection, commit, abort, Includes read set, write set, undo log or write buffer Transaction Record (per datum) Pointer-sized record guarding shared datum Tracks transactional state of datum Shared: Read-only access by multiple readers Value is odd (low bit is 1)

Exclusive: Write-only access by single owner Value is aligned pointer to owning transactions descriptor 41 McRT-STM: Example T1 atomic { t = foo.x; bar.x = t; t = foo.y; bar.y = t; }

Class Foo { int x; int y; }; Foo bar, foo; T2 atomic { t1 = bar.x; t2 = bar.y; } T1 copies foo into bar T2 reads bar, but should not see intermediate values 42

McRT-STM: Example T1 stmStart(); t = stmRd(foo.x); stmWr(bar.x,t); t = stmRd(foo.y); stmWr(bar.y,t); stmCommit(); T2 stmStart(); t1 = stmRd(bar.x); t2 = stmRd(bar.y);

stmCommit(); T1 copies foo into bar T2 reads bar, but should not see intermediate values 43 McRT-STM Operations STM read (Optimistic) Direct read of memory location (eager versioning) Validate read data Check if unlocked and data version <= local timestamp If not, validate all data in read set for consistency validate() {for in transactions read set, if (*txnrec != ver) abort();}

Insert in read set Return value STM write (Pessimistic) Validate data Check if unlocked and data version <= local timestamp Acquire lock Insert in write set Create undo log entry Write data in place (eager versioning) 44 McRT-STM: Example foo Commit

T1 stmStart(); t = stmRd(foo.x); stmWr(bar.x,t); t = stmRd(foo.y); stmWr(bar.y,t); stmCommit; 3 hdr x=9 y=7 7 5 T1

hdr xx==09 yy==07 T2 waits bar Abort T2 stmStart(); t1 = stmRd(bar.x); t2 = stmRd(bar.y); stmCommit();

Reads Reads Writes should read [0, 0] or should read [9,7] Undo 45 Lecture Outline 1. Recap of Transactions 2. Transactional Memory System Taxonomy 3. Software Transactional Memory (Intel McRT-STM) 4. Hardware Accelerated STMs (HASTM) 46

Hardware Support? 47 Types of Hardware Support Hardware-accelerated STM systems (HASTM, SigTM, ) Hardware-based TM systems (TCC, LTM, VTM, LogTM, ) Hybrid TM systems (Sun Rock) 48 Hardware Support? 1. 1.8x 5.6x slowdown over sequential 2. Most time goes in read barriers and validation

49 Hardware Support? Architectural Support for Software Transactional Memory Saha et al. 50 Hardware-accelerated STM (HASTM) Hardware Support Per-thread mark bits at granularity of cache lines New register (mark counter) Used to build fast filters to speed up read barriers

ISA Extensions loadSetMark(addr) Loads value at addr and sets mark bit associated with addr loadResetMark(addr) Loads value at addr and clears mark bit associated with addr loadTestMark(addr) Loads the value at addr and sets carry flag to value of mark bit resetMarkAll() Clears all mark bits in cache and increments mark counter resetMarkCounter() Resets mark counter readMarkCounter() Reads mark counter value 51 HASTM Cache line transitions

52 McRT-STM Operations stmRd = stmRdBar + Rd stmRdBar(TxnRec* rec) { void* recval = *rec; void* txndesc = getTxnDesc(); if (recval == txndesc) return; if (isVersion(recVal) == false) recval = handleContention(rec); logRead(rec,recval); } 53

McRT-STM Operations stmWr = stmWrBar + LogOldValue + Wr stmWrBar(TxnRec* rec) { void* recval = *rec; void* txndesc = getTxnDesc(); i f (recval == txndesc) return; if (isVersion(recVal) == false) recval = handleContention(rec); while (CAS(rec,recval,txndesc) == false) recval = handleContention(rec); logWrite(rec,recval); } 54 McRT-STM Operations

stmRd = stmRdBar + Rd mov eax, [rec] /* load TxRec */ cmp eax, txndesc /* do I own exclusive */ jeq done test eax, #versionmask /* is a version no. */ jz contention mov ecx, [txndesc + rdsetlog] /*get log ptr*/ test ecx, #overflowmask jz overflow add ecx, 8 /*inc log ptr*/ mov [txndesc + rdsetlog], ecx mov [ecx 8], rec /* logging */ mov [ecx 4], eax /* logging */

done: Manage Contention Manage Overflow Fast Path Slow Path 55 HASTM Operations stmRd = stmRdBar + Rd

loadTestMark eax, [rec] /* check 1st access */ jnae done loadSetMark eax, [rec] test eax, #versionmask /* is a version no. */ jz contention mov ecx, [txndesc + rdsetlog] /*get log ptr*/ test ecx, #overflowmask jz overflow add ecx, 8 /*inc log ptr*/ mov [txndesc + rdsetlog], ecx mov [ecx 8], rec /* logging */ mov [ecx 4], eax /* logging */ done:

Manage Contention Manage Overflow Fast Path Slow Path 56 HASTM Operations validate validate() { markCount = readMarkCounter();

resetMarkAll(); if (markCount == 0) /*no snoop or eviction*/ return; /* perform full read set validation */ for in transactions read set if (*txnrec != ver) abort(); } 57 HASTM Performance Improvement 58 HASTM Performance Improvement Multicore

Binary Search Tree B-Tree 59 HASTM - Issues Insufficient Cache Capacity Mark bits are only an acceleration mechanism Cache evictions, cause mark bits to be lost revert to slow software validation Weak Isolation 60

Other Slides 61 Intel McRT-STM Lazy/Eager Eager 62 Other Transactional Programming Primitives 1. User-triggered abort: abort

void transfer(A,B,amount) atomic{ try{ work(); } catch(error1) {fix_code();} catch(error2) {abort;} } 2. Conditional synchronization: retry Object blockingDequeue(q) atomic{ if(q.isEmpty()) retry; return dequeue(); }

3. Composing code sequences: orelse atomic{q1.blockingDequeue(); }orelse{q2.blockingDequeue(); } 63

Recently Viewed Presentations

  • 2016 COA Session 1 Slides - Critical Path Institute

    2016 COA Session 1 Slides - Critical Path Institute

    Federal Food, Drug & Cosmetic Act (FD&C Act) Prescription drug promotion . must … Not be false or misleading. Have fair balance. Be consistent with the approved product labeling. Include claims substantiated by adequate and well-controlled clinical studies. The law...
  • Tapyba - WordPress.com

    Tapyba - WordPress.com

    Kautynės. Siužetinis piešinys iš Morella la Vella Ispanijoje. 8000-5000 m. pr. Kr. Egipto tapyboje susiformuoja pirmasis tapybos kanonas - tiek žmogų, tiek daiktus vaizduoti iš atpažįstamiausio rakurso. Hesyro portretas, XXVIII a. pab. - XXVII a. pr. prieš Kr.
  • Electrostatics - Tennessee State University

    Electrostatics - Tennessee State University

    In the line representation: flux is the number of field lines crossing over a given surface Since the field line density is proportional to electric field, the number of field lines should be electric field integrate over the surface Gauss's...
  • Statewide Quality Advisory Committee (SQAC) Meeting

    Statewide Quality Advisory Committee (SQAC) Meeting

    Contractual settlement for risk-bearing providers (~June) At present, there is no easy way to collect outcomes data from provider organizations, so payers have developed various mechanisms which vary by: Patient population
  • Nomenclature - Madison County School District

    Nomenclature - Madison County School District

    Define cation and anion. Write the electron configurations of 5 elements of your choosing. What is a molecule? Predicting Oxidation Numbers. Electrons involved in reaction are the outer and highest energy electrons (Valence electrons) ... Nomenclature Last modified by:
  • Chapter 4: Factoring Polynomials

    Chapter 4: Factoring Polynomials

    The first step in factoring a polynomial is to find the GCF of all its terms. Then we write the polynomial as a product by . factoring out. the GCF from all the terms. The remaining factors in each term...
  • Welcome DVT Prophylaxis in Orthopaedic Surgery Symposium, American

    Welcome DVT Prophylaxis in Orthopaedic Surgery Symposium, American

    Welcome DVT Prophylaxis in Orthopaedic Surgery Symposium, American Academy of Orthopedic Surgeons, Annual Meeting, San Diego, California Educational Objectives Learn how to explain current guidelines issued by national professional organizations and colleges, such as the AAOS, ACCP, and ASA, mandating...
  • Protein Structure - sta.uwi.edu

    Protein Structure - sta.uwi.edu

    Introduction. Anthocyanins are water-soluble vacuolar pigments. Occur in all tissues of higher plants, eg. leaves, stems, roots, flowers, fruits