How Emerging Memory Technologies Will Have You Rethinking Algorithm Design
Phillip B. Gibbons, Carnegie Mellon University
PODC 2016 Keynote Talk, July 28, 2016

50 Years of Algorithms
Research has focused on settings in which reads and writes to memory have equal cost. But what if they have very DIFFERENT costs? How would that impact algorithm design?

Key Take-Aways
- Main memory will be persistent and asymmetric: reads much cheaper than writes.
- Very little work to date on asymmetric memory. (Not quite: space complexity, CRQW, RMRs, Flash.)
- Highlights of our results to date:
  - Models: the (M,ω)-ARAM, with parallel and block variants
  - Asymmetric memory is not like symmetric memory
  - New techniques for old problems

Emerging Memory Technologies
Motivation: DRAM (today's main memory) is volatile, DRAM energy cost is significant (~35% of datacenter energy), and DRAM density (bits/area) is limited.
Promising candidates:
- Phase-Change Memory (PCM)
- Spin-Torque Transfer Magnetic RAM (STT-RAM)
- Memristor-based Resistive RAM (ReRAM)
- Conductive-Bridging RAM (CBRAM)
- 3D XPoint
Key properties: persistent, significantly lower energy, can be higher density; read latencies approaching DRAM; byte-addressable.

Another Key Property: Writes More Costly than Reads
In these emerging memory technologies, bits are stored as states of the given material:
- No energy to retain a state.
- Small energy to read a state (low current for a short duration).
- Large energy to change a state (high current for a long duration).
Writes therefore incur higher energy costs, higher latency, and lower per-DIMM bandwidth (due to the power envelope).

Cost Examples from the Literature (Speculative)
- PCM: writes 15x slower, 15x less bandwidth, 10x more energy
- PCM L3 cache: writes up to 40x slower, 17x more energy
- STT-RAM cell: writes 71x slower, 1000x more energy at the material level
- ReRAM DIMM: writes 117x slower, 125x more energy
- CBRAM: writes 50x more energy
Actual costs are a well-kept secret by vendors.
Sources: [KPMWEVNBPA14] [DJX09] [XDJX11] [GZDCH13]

Are We There Yet?
3D XPoint will first come out in SSD form factor. No date announced; the expectation is 2017.

Later it will come out in DIMM form factor, i.e., as main memory with loads/stores on the memory bus. No date announced; perhaps 2018. Given its energy, density, and persistence advantages, it is projected to become the dominant main memory.
In the near future: main memory will be persistent and asymmetric.

Write-Efficient Algorithm Design
Goal: design write-efficient algorithms (also called write-limited or write-avoiding). Fewer writes mean lower energy and faster execution.
How we model the asymmetry: in asymmetric memory, writes are ω times more costly than reads.

Warm-up: Write-Efficient Sorting
How does one sort n elements using O(n log n) instructions (reads) but only O(n) writes?
- Swap-based sorting (e.g., quicksort, heapsort) does O(n log n) writes.
- Mergesort requires Θ(n) writes in each of its Θ(log n) rounds.
Solution: insert each key in random order into a binary search tree; an in-order tree traversal yields the sorted array.
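A runnable sketch of this warm-up (plain Python; `write_efficient_sort` and its dict-based node store are illustrative, not from the talk). The random insertion order keeps the unbalanced BST's expected depth O(log n), so reads total O(n log n) in expectation while writes stay O(n):

```python
import random

def write_efficient_sort(keys):
    """Sort by inserting keys in random order into an (unbalanced) BST.

    Each insertion allocates one node and updates one child pointer:
    O(1) writes per key, so O(n) writes total. Walking down the tree
    performs reads only."""
    order = list(keys)
    random.shuffle(order)              # random insertion order
    tree = {}                          # node id -> [key, left, right]
    root = None
    for k in order:
        if root is None:
            root = len(tree)
            tree[root] = [k, None, None]
            continue
        cur = root
        while True:                    # walk down: reads only
            key = tree[cur][0]
            child = 1 if k < key else 2
            nxt = tree[cur][child]
            if nxt is None:
                node = len(tree)
                tree[node] = [k, None, None]   # one write: new node
                tree[cur][child] = node        # one write: parent pointer
                break
            cur = nxt
    out, stack, cur = [], [], root     # iterative in-order traversal
    while stack or cur is not None:
        while cur is not None:
            stack.append(cur)
            cur = tree[cur][1]
        cur = stack.pop()
        out.append(tree[cur][0])
        cur = tree[cur][2]
    return out
```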

Asymmetric Read-Write Costs: Prior Work (1)
Space complexity classes such as L: the input can be read repeatedly, but only a limited amount of working space is allowed. What's missing: this doesn't charge for the number of writes; it is OK to write at every step. Similarly, streaming algorithms have limited space.

Asymmetric Read-Write Costs: Prior Work (2)
Reducing writes to contended shared-memory variables: multiprocessor cache coherence serializes writes, but reads can occur in parallel. Examples: the Concurrent-Read Queue-Write (CRQW) model [GMR98]; contention in asynchronous shared-memory algorithms [DHW97]; etc. What's missing: the cost of writes to even uncontended variables; it is OK to write at every step to disjoint variables (disjoint cache lines).

Asymmetric Read-Write Costs: Prior Work (3)
Remote Memory References (RMRs) [YA95]: only charge for remote memory references, i.e., references that require an interconnect traversal. In cache-coherent multiprocessors, only charge for: a read(x) that gets its value from a write(x) by another process, and a write(x) that invalidates a copy of x at another process. Thus, it is writes that make things costly. What's missing: this doesn't charge for the number of writes.

Asymmetric Read-Write Costs: Prior Work (4)
NAND Flash. This work focused on the asymmetric granularity of writes (must erase large blocks) and the asymmetric endurance of writes [GT05, EGMP14]. But there is no granularity issue for emerging NVM (byte-addressable for both reads and writes), and individual cell endurance is not a big issue for emerging NVM: it can be handled by system software.

Key Take-Aways (You Are Here)
- Main memory will be persistent and asymmetric: reads much cheaper than writes.
- [You are here] Very little work to date on asymmetric memory. (Not quite: space complexity, CRQW, RMRs, Flash.)
- Highlights of our results to date:
  - Models: the (M,ω)-ARAM, with parallel and block variants
  - Asymmetric memory is not like symmetric memory
  - New techniques for old problems

The (M,ω)-Asymmetric RAM (ARAM) [BFGGS16]
Comprised of:
- a processor executing RAM instructions on w-bit words,
- a symmetric memory of M words (unit cost for reads and writes), and
- an asymmetric memory of unbounded size, with unit read cost and write cost ω.
ARAM cost Q(n): the number of reads from asymmetric memory plus ω times the number of writes to it.
Time T(n) = Q(n) + the number of instructions.

Write-Efficient Algorithms
For comparison sorting; search trees and priority queues; 2D convex hull and triangulation; and BFS, DFS, topological sort, biconnected components (bi-CC), and strongly connected components (SCC), the number of writes can be reduced while leaving the reads essentially unchanged. For some of these problems the reduction is trivial; for others it is a significant reduction, and M can be O(1).

Reduction in Writes Depends on M and the Input Size
For single-source shortest paths (SSSP) and minimum spanning tree, the ARAM cost of the new algorithms improves on the classic algorithms as a function of M.
SSSP: a "Phased Dijkstra" that runs in phases and keeps a truncated priority queue in the symmetric memory.
Write-efficient bookkeeping is often challenging.
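The ARAM cost measure can be made concrete with a toy instrumented memory (the `AsymmetricMemory` class and `swap_sort_cost` below are hypothetical illustrations, not from the talk). It tallies Q(n) = #reads + ω·#writes for a swap-based sort, showing how the write term grows with ω:

```python
class AsymmetricMemory:
    """Toy asymmetric memory: reads cost 1, writes cost omega."""
    def __init__(self, data, omega):
        self.data = list(data)
        self.omega = omega
        self.cost = 0                      # accumulated ARAM cost Q

    def read(self, i):
        self.cost += 1
        return self.data[i]

    def write(self, i, v):
        self.cost += self.omega
        self.data[i] = v

def swap_sort_cost(values, omega):
    """Selection sort done entirely in asymmetric memory:
    O(n^2) reads and exactly 2n writes (one swap per position)."""
    mem = AsymmetricMemory(values, omega)
    n = len(mem.data)
    for i in range(n):
        m = i
        for j in range(i + 1, n):
            if mem.read(j) < mem.read(m):
                m = j
        a, b = mem.read(i), mem.read(m)
        mem.write(i, b)                    # each swap: two omega-cost writes
        mem.write(m, a)
    return mem.data, mem.cost
```

Raising ω from 1 to 10 leaves the read count untouched and inflates only the 2n write charges, which is exactly the gap a write-efficient algorithm attacks.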

No (Significant) Improvement with Cheaper Reads
For sorting networks and the Fast Fourier Transform, and for the diamond DAG (à la LCS, edit distance), the classic algorithms are essentially optimal: we prove matching ARAM lower bounds. The FFT lower bound uses a new technique that generalizes [HK81]. This opens a gap between comparison sorting and sorting networks; there is no such gap in the classic RAM setting, PRAM, etc.

An Example of a Diamond DAG: Longest Common Subsequence (LCS)
[Figure, two slides: the standard LCS dynamic-programming table for two short strings over {A,C,G,T}, first empty and then filled in. Each cell depends on its left, upper, and upper-left neighbors, forming a diamond DAG; the final cell holds the LCS length.]

Proof Sketch of the Diamond DAG Lower Bound
An n×n diamond requires n storage to compute [CS76]. Hence computing any sub-diamond of side proportional to M, given only M words of symmetric memory, requires Ω(M) writes to the asymmetric memory. Tiling the n×n DAG with such sub-DAGs yields Θ((n/M)^2) tiles, for Ω(n^2/M) writes in total and thus Ω(ωn^2/M) cost.
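For reference, the LCS dynamic program over this diamond DAG, as a minimal Python sketch: naively, every one of the ~n^2 cells is both computed and written, which is the behavior the diamond-DAG lower bound charges for.

```python
def lcs_length(a, b):
    """Classic O(|a|*|b|) dynamic program over the diamond DAG.
    Cell (i, j) depends on cells (i-1, j), (i, j-1), and (i-1, j-1)."""
    prev = [0] * (len(b) + 1)          # row i-1 of the DP table
    for ca in a:
        cur = [0] * (len(b) + 1)       # row i: every cell gets written
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]
```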

Asymmetric Memory Is Not Like Symmetric Memory
The diamond-DAG bound assumes the DAG rule: compute a node only after all of its inputs are ready. By breaking this rule, the LCS cost can be reduced below the diamond-DAG lower bound, via a new "path sketch" technique. In the classic RAM setting there is no gap between the diamond DAG and LCS/edit distance.

Asymmetric Shared Memory
The (M,ω)-Asymmetric PRAM (machine model) [BFGGS15]: P processors, each with a local symmetric memory of size M, plus an unbounded asymmetric shared memory with write cost ω.
The Asymmetric Nested-Parallel (NP) Model [BBFGGMS16]: processor-oblivious, and provably good with work-stealing schedulers.

Reduce on the Asymmetric PRAM Model

    Reduce(list L, function F, identity I) {
        if (L.length == 0) { return I; }
        if (L.length == 1) { return L[0]; }
        L1, L2 = split(L);
        R1 = Reduce(L1, F, I);
        R2 = Reduce(L2, F, I);
        return F(R1, R2);
    }

Assume O(1) work for F; each write costs ω; split takes O(1) work. Naively this gives O(ωn) work and O(ω log n) span. That analysis is too conservative for two reasons: all intermediate results are written to shared memory, and the computation must be explicitly scheduled on the processors.

Asymmetric Nested-Parallel (NP) Model: Fork-Join
A parent task forks children and is suspended until the children join; the computation forms a fork-join tree under the root task.

Asymmetric NP Model: Memory Model
- Stack memory: stack-allocated, symmetric (unit cost per instruction), limited size.
- Heap memory: heap-allocated, asymmetric, infinite size.
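A runnable sequential stand-in for the Reduce pseudocode above (Python; `reduce_fj` is an illustrative name). The `(lo, hi)` indices model each task's constant stack memory, and intermediate results flow back through joins rather than being written to the heap:

```python
def reduce_fj(L, F, identity):
    """Fork-join style reduce. Each task is identified only by the
    (lo, hi) range it owns -- constant 'stack memory' per task --
    and children's results are combined at the join, so nothing
    intermediate needs to be written to large (heap) memory."""
    def task(lo, hi):
        if hi - lo == 0:
            return identity
        if hi - lo == 1:
            return L[lo]
        mid = (lo + hi) // 2       # split(L)
        r1 = task(lo, mid)         # fork child 1
        r2 = task(mid, hi)         # fork child 2
        return F(r1, r2)           # join: combine children's results
    return task(0, len(L))
```

A real work-stealing runtime would execute the two forked calls in parallel; the sequential recursion here preserves the same fork-join dependence structure.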

Key features: an algorithmic cost model that is processor-oblivious.

Asymmetric NP Model: Stack Memory
[Figure: the fork-join tree; internal tasks have constant-size stack memory, leaf tasks may have larger stacks, and a task may access its own stack and those of its ancestors.]

Asymmetric NP Model: Work-Stealing Issues
Thieves need access to the stack memory of a stolen task. Good news: non-leaf stacks are of constant size.
- Approach 1: write out all stack memory at every fork. But then we pay ω at every fork!
- Approach 2: write stack memory only on a steal. Challenge: we need to limit the number of stacks written per steal.

Write Stack Memory Only on Steal: Problem Scenario
[Figure: thread 1 starts at the root and is now executing G; threads 2 and 3 steal along the path from the root, so a single steal may force a chain of ancestor stacks to be written out.]

Asymmetric NP Model: Work-Stealing Theorem
A computation with binary branching factor on the Asymmetric NP model can be simulated on the (M,ω)-Asymmetric PRAM machine model in expected time bounded in terms of: P = the number of processors, W = the work, D = the span, the nesting depth, the leaf stack memory size, and M.

Reduce: Asymmetric NP Model

    Reduce(list L, function F, identity I) {
        if (L.length == 0) { return I; }
        if (L.length == 1) { return L[0]; }
        L1, L2 = split(L);
        R1 = Reduce(L1, F, I);
        R2 = Reduce(L2, F, I);
        return F(R1, R2);
    }

Assume O(1) work for F, and minimize writes to the large memory. The children are forked tasks; each task stores only its list's start and end, and only the final answer is written. Intermediate results are NOT written to the large memory, so the work is O(n + ω); the scheduler handles inter-processor communication and its cost.

Write-Efficient Shared-Memory Algorithms [BBFGGMS16]
We obtain work (W) and span (D) bounds with reduced writes for: reduce, ordered filter, list contraction, tree contraction, minimum spanning tree, 2D convex hull (output-sensitive), and BFS tree. (In the bounds, some terms involve the output size, the graph diameter, and the inverse Ackermann function; some bounds hold with high probability or in expectation.)

Tree Contraction: Current Methods
Rake leaves and compress chains.
[Figure: raking a leaf folds its value into its parent; compressing a chain merges unary nodes.]
Each rake or compress operation costs a write. The total number of rakes and compresses is Θ(n), so the work is Θ(ωn).

The span is O(ω log n).

Tree Contraction: High-Level Approach
Assume that M is small relative to n.
- Partition the tree into components sized to the symmetric-memory budget.
- Sequentially contract each component.
- Use a classic parallel algorithm to contract the resulting, much smaller, tree.

Tree Contraction: Classic Partitioning
Follow the Euler tour, generate subtree sizes, find the m-critical points, and partition the tree.
[Figure: an example tree annotated with subtree sizes.]
This requires a write for each node.

Tree Contraction: Write-Efficient Partitioning
- Mark each node independently with a suitably chosen probability.
- Traverse the Euler tour from each marked node and mark nodes at fixed intervals along the tour.
- Mark the highest node on each path between marked nodes.
- Each marked node starts a new component.

Tree Contraction: Contractability of Partitions
[Figure: an example partition, distinguishing unmarked nodes from nodes marked in each of the steps above.]

Tree Contraction: A New Approach
Assume that M is small relative to n. Partition the tree into components, sequentially contract each component, and use a classic algorithm to contract the resulting smaller tree.

This yields write-efficient work and span bounds for tree contraction.

The Asymmetric External Memory (AEM) Model [BFGGS15]
A CPU with a small symmetric memory of size M, an unbounded asymmetric large-memory, and block transfers of size B. The AEM has two memory-transfer instructions:
- read transfer: load a block from the large-memory (cost 1);
- write transfer: write a block to the large-memory (cost ω).
The complexity of an algorithm on the AEM model (its I/O complexity) is measured by the total cost of these transfers.

Sorting Algorithms on the Asymmetric EM Model
Sorting n records in the AEM model has I/O complexity Θ(ω·(n/B)·log_{ωM/B}(n/B)), which can be achieved by multi-way mergesort, sample sort, or heap sort based on buffer trees. There is a matching lower bound [Sitchinava16]. There is no asymptotic advantage from cheaper reads whenever ω is at most polynomial in M/B for a constant exponent c. Depressing, because so many problems can't beat an EM sorting lower bound.
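A small sketch of why multi-way merging saves write transfers (Python, using the standard library's `heapq.merge`; the function name and round counter are illustrative): with fan-in k there are ceil(log_k r) merge rounds over r initial runs instead of ceil(log_2 r), and each round writes every record exactly once.

```python
import heapq

def multiway_mergesort(records, k):
    """Bottom-up k-way mergesort. Each round k-way-merges groups of
    runs, writing every record once, so the number of write rounds
    drops from ceil(log_2 of #runs) to ceil(log_k of #runs)."""
    runs = [[x] for x in records] or [[]]
    rounds = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + k]))   # one k-way merge per group
                for i in range(0, len(runs), k)]
        rounds += 1                                 # one full write pass
    return runs[0], rounds
```

With asymmetric write cost ω, shrinking the number of rounds directly shrinks the dominant ω-charged term, which is why a larger fan-in (reflected in the ωM/B base of the logarithm above) pays off.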

Key Take-Aways (You Are Here)
- Main memory will be persistent and asymmetric: reads much cheaper than writes.
- Very little work to date on asymmetric memory. (Not quite: space complexity, CRQW, RMRs, Flash.)
- Highlights of our results to date:
  - Models: the (M,ω)-ARAM, with parallel and block variants
  - Asymmetric memory is not like symmetric memory
  - [You are here] New techniques for old problems

Thanks to Collaborators
Naama Ben-David, Guy Blelloch, Jeremy Fineman, Yan Gu, Charles McGuffey, Julian Shun.
(Credit to Yan and Charlie for some of these slides.)

& Sponsors

National Science Foundation
Natural Sciences and Engineering Research Council of Canada
Miller Institute for Basic Research in Science at UC Berkeley
Intel (via the ISTC for Cloud Computing and the new ISTC for Visual Cloud)

References (in order of first appearance)

[KPMWEVNBPA14] Ioannis Koltsidas, Roman Pletka, Peter Mueller, Thomas Weigold, Evangelos Eleftheriou, Maria Varsamou, Athina Ntalla, Elina Bougioukou, Aspasia Palli, and Theodore Antonakopoulos. PSS: A Prototype Storage Subsystem based on PCM. NVMW, 2014.
[DJX09] Xiangyu Dong, Norman P. Jouppi, and Yuan Xie. PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM. ACM ICCAD, 2009.
[XDJX11] Cong Xu, Xiangyu Dong, Norman P. Jouppi, and Yuan Xie. Design implications of memristor-based RRAM cross-point structures. IEEE DATE, 2011.
[GZDCH13] Nad Gilbert, Yanqing Zhang, John Dinh, Benton Calhoun, and Shane Hollmer. A 0.6V 8 pJ/write non-volatile CBRAM macro embedded in a body sensor node for ultra low energy applications. IEEE VLSIC, 2013.
[GMR98] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms. SIAM J. on Computing 28(2), 1998.
[DHW97] Cynthia Dwork, Maurice Herlihy, and Orli Waarts. Contention in Shared Memory Algorithms. J. ACM 44(6), 1997.
[YA95] Jae-Heon Yang and James H. Anderson. A Fast, Scalable Mutual Exclusion Algorithm. Distributed Computing 9(1), 1995.
[GT05] Eran Gal and Sivan Toledo. Algorithms and data structures for flash memories. ACM Computing Surveys 37(2), 2005.
[EGMP14] David Eppstein, Michael T. Goodrich, Michael Mitzenmacher, and Pawel Pszona. Wear minimization for cuckoo hashing: How not to throw a lot of eggs into one basket. SEA, 2014.
[BFGGS16] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, and Julian Shun. Efficient Algorithms with Asymmetric Read and Write Costs. ESA, 2016.
[HK81] Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. ACM STOC, 1981.
[CS76] Stephen Cook and Ravi Sethi. Storage requirements for deterministic polynomial time recognizable languages. JCSS 13(1), 1976.

References (cont.)

[BFGGS15] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, and Julian Shun. Sorting with Asymmetric Read and Write Costs. ACM SPAA, 2015.
[BBFGGMS16] Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Charles McGuffey, and Julian Shun. Parallel Algorithms for Asymmetric Read-Write Costs. ACM SPAA, 2016.
[Sitchinava16] Nodari Sitchinava, personal communication, June 2016.

Some additional related work:
[CDGKKSS16] Erin Carson, James Demmel, Laura Grigori, Nicholas Knight, Penporn Koanantakool, Oded Schwartz, and Harsha Vardhan Simhadri. Write-Avoiding Algorithms. IEEE IPDPS, 2016.
[CGN11] Shimin Chen, Phillip B. Gibbons, and Suman Nath. Rethinking Database Algorithms for Phase Change Memory. CIDR, 2011.
[Viglas14] Stratis D. Viglas. Write-limited sorts and joins for persistent memory. VLDB 7(5), 2014.