The Kokkos C++ Performance Portability EcoSystem Unclassified Unlimited

Unclassified Unlimited Release

C. R. Trott, D. Sunderland, N. Ellingwood, D. Ibanez, S. Bova, J. Miles, V. Dang
David S. Hollman
Sandia National Laboratories/CA

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.

SAND2019-3723 PE

[Figure: applications, libraries and frameworks built on Kokkos, and the platforms Kokkos targets]
  Applications, Libraries, Frameworks:
    SNL NALU: Wind Turbine CFD
    UT Uintah: Combustion
    SNL LAMMPS: Molecular Dynamics
    ORNL Raptor: Large Eddy Sim
  Platforms:
    ORNL Summit: IBM Power9 / NVIDIA Volta
    LANL/SNL Trinity: Intel Haswell / Intel KNL
    ANL Aurora21: Intel Xeon CPUs + Intel Xe GPUs
    SNL Astra: ARM Architecture

Goals For Performance Portability
  • One coherent approach to low level HPC performance portability needs
      - Parallel Execution
      - Data Structures and Management
      - Math Kernels
      - Tools
  • Limit cognitive overload
      - Orthogonalization of concerns
      - Most of the time no explicit reference to backends (e.g. CUDA or OpenMP)
  • Off ramp via standards integration to limit scope
      - Invest into C++ standards work to make Kokkos a sliding window of advanced capabilities

Kokkos EcoSystem

Kokkos Development Team
  • Dedicated team with a number of staff working most of their time on Kokkos
  • Main development team at Sandia in CCR; Sandia apps are customers
  • Kokkos Core: C.R. Trott, D. Sunderland, N. Ellingwood, D. Ibanez, S. Bova, J. Miles, D. Hollman, V. Dang
      - soon: H. Finkel, N. Liber, D. Lebrun-Grandie, A. Prokopenko
      - former: H.C. Edwards, D. Labreche, G. Mackey
  • Kokkos Kernels: S. Rajamanickam, N. Ellingwood, K. Kim, C.R. Trott, V. Dang, L. Berger
  • Kokkos Tools: S. Hammond, C.R. Trott, D. Ibanez, S. Moore
  • Kokkos Support: C.R. Trott, G. Shipman, G. Lopez, G. Womeldorff
      - former: H.C. Edwards, D. Labreche, Fernanda Foertter

Kokkos Core Abstractions
  Data Structures
    • Memory Spaces (Where): HBM, DDR, Non-Volatile, Scratch
    • Memory Layouts: Row/Column-Major, Tiled, Strided
    • Memory Traits (How): Streaming, Atomic, Restrict
  Parallel Execution
    • Execution Spaces (Where): CPU, GPU, Executor Mechanism
    • Execution Patterns: parallel_for/reduce/scan, task-spawn
    • Execution Policies (How): Range, Team, Task-Graph

Patterns and Policy
  • Reduce cognitive overload by reusing the same code structure:

      Parallel_Pattern( ExecutionPolicy, FunctionObject [, ReductionArgs] )

      // Basic parallel for:
      parallel_for( N, Lambda );
      // Parallel for with dynamic scheduling:
      parallel_for( RangePolicy<Schedule<Dynamic>>(0,N), Lambda );
      // Parallel reduce with teams:
      parallel_reduce( TeamPolicy<>(N,AUTO), Lambda, Reducer );
      // Parallel scan with a nested policy:
      parallel_scan( ThreadVectorRange(team_handle,N), Lambda );
      // Restriction pattern equivalent to #pragma omp single:
      single( PerTeam(team_handle), Lambda );
      // Task spawn:
      task_spawn( TaskTeam(scheduler, dependency), Task );

  • Orthogonalize further via the require mechanism to customize an execution policy:

      auto exec_policy_low_latency = require( exec_policy, KernelProperty::HintLightWeight );

Kokkos Core Capabilities (Concept: Example)

  Parallel Loops:
      parallel_for( N, KOKKOS_LAMBDA (int i) { ...BODY... });

  Parallel Reduction:
      parallel_reduce( RangePolicy<>(0,N), KOKKOS_LAMBDA (int i, double& upd) {
        ...BODY...
        upd += ...
      }, Sum<>(result));

  Tightly Nested Loops:
      parallel_for( MDRangePolicy<Rank<3>>({0,0,0},{N1,N2,N3},{T1,T2,T3}),
        KOKKOS_LAMBDA (int i, int j, int k) { ...BODY... });

  Non-Tightly Nested Loops:
      parallel_for( TeamPolicy<Schedule<Dynamic>>( N, TS ), KOKKOS_LAMBDA (Team team) {
        ...COMMON CODE 1...
        parallel_for( TeamThreadRange( team, M(N) ), [&] (int j) {
          ...INNER BODY...
        });
        ...COMMON CODE 2...
      });

  Task Dag:
      task_spawn( TaskTeam( scheduler, priority ), KOKKOS_LAMBDA (Team team) { ...BODY... });

  Data Allocation:
      View<double**, Layout, MemSpace> a(A,N,M);

  Data Transfer:
      deep_copy(a,b);

  Atomics:
      atomic_add(&a[i],5.0);
      View<double*,MemoryTraits<AtomicAccess>> a(); a(i)+=5.0;

  Exec Spaces:
      Serial, Threads, OpenMP, Cuda, HPX (experimental), ROCm (experimental)
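The shared Parallel_Pattern( ExecutionPolicy, FunctionObject [, ReductionArgs] ) shape from the slides above can be illustrated without Kokkos. What follows is only a minimal serial sketch in standard C++: the namespace sketch, its RangePolicy struct and the helper sum_of_squares are hypothetical stand-ins, not the Kokkos API, and the loops run sequentially where a real backend would dispatch them in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical policy: a half-open index range [begin, end).
struct RangePolicy {
  std::size_t begin;
  std::size_t end;
};

// Pattern(policy, functor): call the functor once per index in the range.
template <class Functor>
void parallel_for(RangePolicy policy, Functor f) {
  assert(policy.begin <= policy.end);
  for (std::size_t i = policy.begin; i < policy.end; ++i) f(i);
}

// Convenience overload mirroring parallel_for(N, Lambda).
template <class Functor>
void parallel_for(std::size_t n, Functor f) {
  parallel_for(RangePolicy{0, n}, f);
}

// Same shape with a trailing reduction argument that receives the result.
template <class Functor, class T>
void parallel_reduce(RangePolicy policy, Functor f, T& result) {
  assert(policy.begin <= policy.end);
  T acc{};
  for (std::size_t i = policy.begin; i < policy.end; ++i) f(i, acc);
  result = acc;
}

// Example use: fill x[i] = i, then reduce the sum of x[i] * x[i].
inline double sum_of_squares(std::size_t n) {
  std::vector<double> x(n);
  parallel_for(n, [&](std::size_t i) { x[i] = static_cast<double>(i); });
  double result = 0.0;
  parallel_reduce(RangePolicy{0, n},
                  [&](std::size_t i, double& upd) { upd += x[i] * x[i]; },
                  result);
  return result;
}

}  // namespace sketch
```

In this sketch the loop body never mentions how iterations are scheduled; only the policy argument would change to select a different iteration strategy, which is the orthogonalization the slides describe.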

More Kokkos Capabilities
  MemoryPool, Reducers, DualView, parallel_scan, ScatterView, OffsetView, StaticWorkGraph, LayoutRight, sort, kokkos_malloc, LayoutLeft, kokkos_free, Bitset, Vector, ScratchSpace, RandomPool, UnorderedMap, LayoutStrided, ProfilingHooks

Kokkos Kernels
  • BLAS, Sparse and Graph Kernels on top of Kokkos and its View abstraction
  • Scalar type agnostic, e.g. works for any type with math operators
  • Layout and Memory Space aware
  • Can call vendor libraries when available
  • Views have all their size and stride information => interface is simpler

      // BLAS
      int M,N,K,LDA,LDB,LDC; double alpha, beta; double *A, *B, *C;
      dgemm('N','N',M,N,K,alpha,A,LDA,B,LDB,beta,C,LDC);

      // Kokkos Kernels
      double alpha, beta; View<double**> A,B,C;
      gemm('N','N',alpha,A,B,beta,C);

  • Interface to call Kokkos Kernels at the teams level (e.g. in each CUDA block)

      parallel_for("NestedBLAS", TeamPolicy<>(N,AUTO),
        KOKKOS_LAMBDA (const team_handle_t& team_handle) {
          // Allocate A, x and y in scratch memory (e.g. CUDA shared memory)
          // Call BLAS using parallelism in this team (e.g. CUDA block)
          gemv(team_handle,'N',alpha,A,x,beta,y);
      });

Kokkos-Tools Profiling & Debugging

  • Performance tuning requires insight, but tools are different on each platform
  • KokkosTools: provide a common set of basic tools + hooks for 3rd party tools
  • One common issue: abstraction layers obfuscate profiler output
      - Kokkos hooks for passing names on
      - Provide Kernel, Allocation and Region names
  • No need to recompile
      - Uses runtime hooks
      - Set via env variable

Improved Fine Grained Tasking
  • Generalization of the TaskScheduler abstraction to allow the user to be generic with respect to scheduling strategy and queue
  • Implementation of new queues and scheduling strategies:
      - Single shared LIFO Queue (this was the old implementation)
      - Multiple shared LIFO Queues with LIFO work stealing
      - Chase-Lev minimal contention LIFO with tail (FIFO) stealing
      - Potentially more
  • Reorganization of Task, Future, TaskQueue data structures to accommodate flexible requirements from the TaskScheduler
      - For instance, some scheduling strategies require additional storage in the Task
  Questions: David Hollman

[Figure: Fibonacci 30 (V100); y-axis: Million Tasks per Second (0 to 7); bars: Old Single Queue, Multi Queue, New Single Queue, Chase-Lev MQ]

Kokkos Remote Spaces: PGAS Support

  • PGAS models may become more viable for HPC with both changes in network architectures and the emergence of super-node architectures
      - Example DGX2: first super-node, 300GB/s per GPU link

[Figure: DGX2 with 16 V100 GPUs connected through NVSwitch]

  • Idea: add new memory spaces which return data handles with shmem semantics to Kokkos View

      View<double**[3], LayoutLeft, NVShmemSpace> a(A,N,M);

    Operator a(i,j,k) returns:

      template<>
      struct NVShmemElement<double> {
        NVShmemElement(int pe_, double* ptr_) : pe(pe_), ptr(ptr_) {}
        int pe;
        double* ptr;
        void operator=(double val) { shmem_double_p(ptr, val, pe); }
      };
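The data-handle idea above can be tried out on a single machine. The sketch below is an assumption-laden illustration in standard C++, not the Kokkos Remote Spaces implementation: FakeShmem is a hypothetical stand-in for a SHMEM runtime (one local buffer per PE), RemoteElement plays the role of NVShmemElement, and RemoteVector is a hypothetical block-distributed 1-D view whose operator() returns such a handle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a SHMEM runtime: every PE owns one local buffer.
struct FakeShmem {
  std::vector<std::vector<double>> heap;  // heap[pe][offset]

  FakeShmem(int num_pes, std::size_t local_len)
      : heap(static_cast<std::size_t>(num_pes),
             std::vector<double>(local_len, 0.0)) {}

  // One-sided write into the buffer of PE `pe`.
  void put(std::size_t offset, double val, int pe) {
    heap[static_cast<std::size_t>(pe)][offset] = val;
  }
  // One-sided read from the buffer of PE `pe`.
  double get(std::size_t offset, int pe) const {
    return heap[static_cast<std::size_t>(pe)][offset];
  }
};

// Data handle: assignment becomes a put, reading becomes a get.
class RemoteElement {
 public:
  RemoteElement(FakeShmem& rt, int pe, std::size_t offset)
      : rt_(rt), pe_(pe), offset_(offset) {}
  void operator=(double val) { rt_.put(offset_, val, pe_); }
  operator double() const { return rt_.get(offset_, pe_); }

 private:
  FakeShmem& rt_;
  int pe_;
  std::size_t offset_;
};

// Hypothetical 1-D view: global indices are block-distributed over the PEs.
class RemoteVector {
 public:
  RemoteVector(FakeShmem& rt, std::size_t local_len)
      : rt_(rt), local_len_(local_len) {}

  RemoteElement operator()(std::size_t global_i) const {
    assert(global_i / local_len_ < rt_.heap.size());
    const int pe = static_cast<int>(global_i / local_len_);
    return RemoteElement(rt_, pe, global_i % local_len_);
  }

 private:
  FakeShmem& rt_;
  std::size_t local_len_;
};

}  // namespace sketch
```

With 4 PEs of 8 elements each, writing v(19) = 2.5 lands at offset 3 on PE 2. User code keeps the familiar a(i) = value syntax while the handle decides whether the access is local or remote.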

PGAS Performance Evaluation: miniFE
  • Test Problem: CG-Solve
      - Using the miniFE problem N^3
      - Compare to optimized CUDA
      - MPI version is using overlapping
      - DGX2 4 GPU workstation
  • Dominated by SpMV (Sparse Matrix Vector Multiply)
      - Make Vector distributed, and store global indices in Matrix
  • 3 Variants
      - Full use of SHMEM
      - Inline functions by ptr mapping: store 16 pointers in the View
      - Explicit by-rank indexing: make vector 2D, encode rank in column index
  • Warning: I don't think this is a viable thing in the next couple years for most of our apps!!

[Figure: CGSolve Performance; y-axis: Throughput (0 to 6000); x-axis: 100^3, 200^3, 400^3; series: MPI, SHMEM, SHMEM-Inline, SHMEM-Index]

Kokkos Based Projects
  • Production Code Running Real Analysis Today
      - We got about 12 or so.
  • Production Code or Library committed to using Kokkos and actively porting
      - Somewhere around 30
  • Packages In Large Collections (e.g. Tpetra, MueLu in Trilinos) committed to using Kokkos and actively porting
      - Somewhere around 50
  • Counting also proxy-apps and projects which are evaluating Kokkos (e.g. projects who attended boot camps and trainings)
      - Estimate 80-120 packages.

Kokkos Users

Uintah
  • System wide many task framework from University of Utah led by Martin Berzins
  • Multiple applications for combustion/radiation simulation
  • Structured AMR Mesh calculations
  • Prior code existed for CPUs and GPUs
  • Kokkos unifies implementation
  • Improved performance due to constraints in Kokkos which encourage better coding practices
  Questions: Dan Sunderland

[Figure: Reverse Monte Carlo Ray Tracing, 64^3 cells; y-axis: Time per Timestep [s] (0 to 16); labels: CPU, GPU, KNL; Original, Kokkos]

LAMMPS
  Questions: Stan Moore
  • Widely used Molecular Dynamics Simulations package
  • Focused on Material Physics
  • Over 500 physics modules
  • Kokkos covers growing subset of those
  • REAX is an important but very complex potential
      - USER-REAXC (Vanilla) more than 10,000 LOC
      - Kokkos version ~6,000 LOC

  • LJ in comparison: 200 LOC
  • Used for shock simulations

[Figure: LAMMPS Architecture Comparison, two panels: "Example in.reaxc.tatb / 196k atoms / 100 steps" and "Example in.reaxc.tatb / 24k atoms / 100 steps"; y-axis: Time [s]; bars: Vanilla, Kokkos]

Alexa
  Questions: Dan Ibanez
  • Portably performant shock hydrodynamics application
  • Solving multi-material problems for internal Sandia users
  • Uses tetrahedral mesh adaptation
  • All operations are Kokkos-parallel
  • Test case: metal foil expanding due to resistive heating from electrical current.

[Figure: Best Threaded Times, Single-Rank; y-axis: Time in s (0 to 120); bars: Intel KNL, NVIDIA K40, NVIDIA K80, NVIDIA P100, Intel Xeon E7 4870, Intel KNC]

SPARC
  Courtesy of: Micah Howard
  • Goal: solve aerodynamics problems for Sandia (transonic and hypersonic) on leadership class supercomputers
  • Solves compressible Navier-Stokes equations
  • Perfect and reacting gas models
  • Laminar and RANS turbulence models -> hybrid RANS-LES
  • Primary discretization is cell-centered finite volume
  • Research on high-order finite difference and discontinuous Galerkin discretizations
  • Structured and unstructured grids
  • 4 Sierra nodes (16x V100) equivalent to ~40 Trinity nodes (80x Haswell 16c CPU)

Aligning Kokkos with the C++ Standard
  • Long term goal: move capabilities from Kokkos into the ISO standard
  • Concentrate on facilities we really need to optimize with compiler

[Figure: cycle between Kokkos, C++ Standard and Kokkos Legacy; arrows: "Propose for C++", "C++ Backport: back port to compilers we got", "Move accepted features to legacy support", "Implemented legacy capabilities in terms of new C++ features"]

C++ Features in the Works
  • First success: atomic_ref in C++20
      - Provides atomics with all capabilities of atomics in Kokkos
      - atomic_ref(a[i])+=5.0; instead of atomic_add(&a[i],5.0);
  • Next thing: Kokkos::View => std::mdspan
      - Provides customization points which allow all things we can do with Kokkos::View
      - Better design of internals though! => Easier to write custom layouts.
      - Also: arbitrary rank (until compiler crashes) and mixed compile/runtime ranks
      - We hope it will land early in the cycle for C++23 (i.e. early in 2020)
  • Also C++23: Executors and Basic Linear Algebra (just began design work)

Towards C++23 Executors
  • C++ standard is moving towards more asynchronicity with Executors
      - Dispatch of parallel work consumes and returns a new kind of future
  • Aligning Kokkos with this development means:
      - Introduction of Execution space instances (CUDA streams work already)

          DefaultExecutionSpace spaces[2];
          partition( DefaultExecutionSpace(), 2, spaces );
          // f1 and f2 are executed simultaneously
          parallel_for( RangePolicy<>(spaces[0], 0, N), f1 );
          parallel_for( RangePolicy<>(spaces[1], 0, N), f2 );
          // wait for all work to finish
          fence();

      - Patterns return futures and Execution Policies consume them

          auto fut_1  = parallel_for( RangePolicy<>(Funct1, 0, N), f1 );
          auto fut_2a = parallel_for( RangePolicy<>(Funct2a, fut_1, 0, N), f2a );
          auto fut_2b = parallel_for( RangePolicy<>(Funct2b, fut_1, 0, N), f2b );
          auto fut_3  = parallel_for( RangePolicy<>(Funct3, all(fut_2a, fut_2b), 0, N), f3 );
          fence(fut_3);

[Figure: dependency graph f1 -> {f2a, f2b} -> f3]
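The dependency graph f1 -> {f2a, f2b} -> f3 can be approximated today with standard C++ futures. This is only a sketch of the dependency structure, assuming std::async as a stand-in for the future-returning patterns proposed on the slide; run_graph and the loop bodies are hypothetical, and each stage is a plain serial loop instead of a parallel kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

namespace sketch {

// Runs f1 -> {f2a, f2b} -> f3 over n elements and returns the value f3 computes.
inline double run_graph(std::size_t n) {
  std::vector<double> x(n, 0.0), a(n, 0.0), b(n, 0.0);

  // f1: no dependencies. A shared_future lets two successors wait on it.
  std::shared_future<void> fut_1 =
      std::async(std::launch::async, [&x, n] {
        for (std::size_t i = 0; i < n; ++i) x[i] = static_cast<double>(i);
      }).share();

  // f2a and f2b: both depend on f1 and may run concurrently with each other.
  std::future<void> fut_2a =
      std::async(std::launch::async, [&x, &a, n, fut_1] {
        fut_1.wait();
        for (std::size_t i = 0; i < n; ++i) a[i] = 2.0 * x[i];
      });
  std::future<void> fut_2b =
      std::async(std::launch::async, [&x, &b, n, fut_1] {
        fut_1.wait();
        for (std::size_t i = 0; i < n; ++i) b[i] = x[i] + 1.0;
      });

  // f3: depends on both f2a and f2b, like all(fut_2a, fut_2b) on the slide.
  std::future<double> fut_3 =
      std::async(std::launch::async, [&a, &b, &fut_2a, &fut_2b, n] {
        fut_2a.wait();
        fut_2b.wait();
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += a[i] + b[i];
        return sum;
      });

  assert(fut_3.valid());
  return fut_3.get();  // plays the role of fence(fut_3)
}

}  // namespace sketch
```

Here each successor blocks inside its own task until its predecessor finishes. The design shown on the slide instead hands the future to the execution policy, so the runtime can schedule the dependent work without occupying a thread while it waits.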

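The atomic_ref feature mentioned under "C++ Features in the Works" can be tried in standard C++. The sketch below assumes a C++20 standard library for std::atomic_ref; on older toolchains it falls back to a single mutex, which is my own assumption and not something the slides describe. The names atomic_add (chosen to echo the Kokkos call) and scatter_add are hypothetical helpers written for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

namespace sketch {

// Adds v to x atomically while x remains an ordinary double in memory.
inline void atomic_add(double& x, double v) {
#if defined(__cpp_lib_atomic_ref)
  // C++20: wrap the existing object for the duration of one atomic update.
  std::atomic_ref<double> ref(x);
  ref += v;
#else
  // Assumed fallback for pre-C++20 libraries: serialize updates with one lock.
  static std::mutex guard;
  std::lock_guard<std::mutex> lock(guard);
  x += v;
#endif
}

// Scatter-add: bins[idx[k]] += w for every k; repeated indices are expected,
// which is why each update goes through atomic_add.
inline void scatter_add(std::vector<double>& bins,
                        const std::vector<std::size_t>& idx, double w) {
  for (std::size_t k = 0; k < idx.size(); ++k) {
    assert(idx[k] < bins.size());
    atomic_add(bins[idx[k]], w);
  }
}

}  // namespace sketch
```

Because the array itself stays a plain std::vector<double>, code that does not need atomics can keep reading and writing it directly, and only the contended updates pay for atomic access. This mirrors the atomic_add(&a[i],5.0) usage from the capabilities slide.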