Matthias Blumrich, Ted Jiang, and Larry Dennison November 13 ...

EXPLOITING IDLE RESOURCES IN A HIGH-RADIX SWITCH FOR SUPPLEMENTAL STORAGE
Matthias Blumrich, Ted Jiang, and Larry Dennison
November 13, 2018

SUPPLEMENTAL STORAGE
Why is it useful? To enable capabilities in the network: end-to-end retransmission, congestion control, order enforcement, in-network collectives, deadlock recovery, etc.
Where can it be found? Existing, unused buffer memory
How can it be accessed? Existing excess internal switch bandwidth

WHY?
End-to-End Retransmission
Store copies of packets in the first-hop switch until they are acknowledged.
Delete when acknowledged (common case) or retransmit if lost.
Supplemental storage usage pattern: store, then delete
Great for network reliability! But other uses too.

Explicit Congestion Notification (ECN) enhancement
When network congestion is detected, send throttling commands back to problem senders.
Our idea: temporarily store congestion-causing packets to allow others to proceed.
Supplemental storage usage pattern: store and retrieve
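The two usage patterns named on the WHY slide (store then delete, and store and retrieve) can be sketched in a few lines. This is only an illustration under stated assumptions: the class `StashStore`, its method names, and the packet-count capacity are hypothetical and do not come from the slides or the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StashStore:
    """Hypothetical model of supplemental storage, sized in packets."""
    capacity: int
    _packets: Dict[int, bytes] = field(default_factory=dict)

    def store(self, packet_id: int, payload: bytes) -> bool:
        """Keep a packet; return False when the storage is exhausted."""
        if packet_id not in self._packets and len(self._packets) >= self.capacity:
            return False
        self._packets[packet_id] = payload
        return True

    def read(self, packet_id: int) -> Optional[bytes]:
        """Read a stored copy without removing it (e.g. to retransmit a lost packet)."""
        return self._packets.get(packet_id)

    def delete(self, packet_id: int) -> None:
        """Store-then-delete pattern: drop the copy once it is acknowledged."""
        self._packets.pop(packet_id, None)

    def retrieve(self, packet_id: int) -> Optional[bytes]:
        """Store-and-retrieve pattern: remove the packet and hand it back."""
        return self._packets.pop(packet_id, None)

    def __len__(self) -> int:
        return len(self._packets)


# End-to-end retransmission: store a copy, delete it on acknowledgement.
stash = StashStore(capacity=2)
stash.store(1, b"payload-1")
stash.delete(1)               # acknowledgement arrived (the common case)

# ECN enhancement: divert a congestion-causing packet, then retrieve it later.
stash.store(2, b"payload-2")
diverted = stash.retrieve(2)  # packet resumes its original route
```

In this sketch the retransmission case never reads the payload back in the common case, while the ECN case always does, which is the difference between the two patterns the slide calls out.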

WHERE?
[Figure: a short link between a port of Switch A and a port of Switch B. On each port, the input cut-through buffer and the output retransmission buffer each have a region marked "Unused!".]
Buffers are sized for bandwidth × round-trip time (RTT) of the longest link.
Short links only use a small amount of buffering because storage is relative to actual link RTT.

HOW?
Exploit excess internal bandwidth on existing port-to-port datapaths.
We studied a tiled switch:
Regular architecture that scales to many ports
Provides any input-to-output permutation
Other switch architectures should work too.
[Figure: tiled switch with Inputs 1-12 and Outputs 1-12.]
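The buffer sizing rule from the WHERE slide (bandwidth × RTT of the longest link) can be made concrete with a small calculation. This is a sketch under assumptions that are not on the slides: a 100 Gb/s link, 5 ns/m propagation delay, and an RTT made up purely of cable propagation with no SerDes or switch latency. The 1 m, 5 m, and 100 m lengths are the link classes the deck lists for its dragonfly network.

```python
# Assumptions (not from the slides): link rate, propagation delay, and an RTT
# that consists only of cable propagation.
LINK_GBPS = 100.0
PROPAGATION_NS_PER_M = 5.0


def rtt_ns(length_m: float) -> float:
    """Round-trip propagation time of a cable, in nanoseconds."""
    return 2.0 * length_m * PROPAGATION_NS_PER_M


def buffer_bits(length_m: float) -> float:
    """Buffering that covers one RTT at full rate (Gb/s x ns = bits)."""
    return LINK_GBPS * rtt_ns(length_m)


def unused_fraction(link_m: float, longest_m: float) -> float:
    """Share of a buffer sized for the longest link that a shorter link leaves idle."""
    return 1.0 - buffer_bits(link_m) / buffer_bits(longest_m)


for length in (1.0, 5.0, 100.0):
    share = unused_fraction(length, 100.0)
    print(f"{length:5.0f} m link: {share:.0%} of the buffer unused")
```

Under these assumptions the idle share depends only on the ratio of cable lengths, so a 1 m link leaves about 99% of a 100 m-sized buffer unused and a 5 m link about 95%. Real links add fixed latencies that would lower these shares somewhat.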

STASHING
Every port donates its unused memory to a common pool of supplemental storage.
Any port that needs additional storage can read/write the common pool.
Example: ports connected to endpoints storing packets for end-to-end retransmission
Example: congested ports diverting packets until ECN kicks in

ARCHITECTURE

BASELINE TILED SWITCH
Each input drives a single, multidrop row bus.
Each tile drives point-to-point column buses to outputs.
Packets proceed in stages between buffers.
Stashing takes advantage of the extra bandwidth due to row bus multicast and many column buses.
[Figure: tiled switch with Inputs 1-12 on row buses and Outputs 1-12 on column buses.]

BASELINE SWITCH DATAPATH
Tile: crossbar to interleave all inputs of a row to all outputs of a column
Large port buffers (3 virtual channels)
Small internal VC buffers
[Figure: baseline datapath. Labels: In 1, In 2, ..., cut-through buffers, row bus, row buffers, tile, column buffers, column bus, Out 4, ..., retransmission buffers.]

STASHING SWITCH DATAPATH
Two internal VCs are added for storage (S) and retrieval (R).
Port buffer memory is virtually partitioned and unused portions are combined for stashing storage.
Extra read and write datapaths are added to the port memories.
[Figure: stashing datapath. Labels: In 1, cut-through buffers, row bus, S and R VCs, tile, column bus, Out 1, retransmission buffers, and "Stashing (Unused)" regions in the port buffers.]

STORING END-TO-END PACKETS
Copy for stashing is written together with the normal VC (multicast).
Note: storage and normal VC can be in the same column (tile) or different.
Stashing copy requires its own column bus transfer.
End-to-end summary:
No additional row bandwidth because of multicast
Double the column bandwidth because stashing can be at any port
[Figure: three animation steps on the stashing datapath, from In 1 to Out 6.]

STORING ECN PACKETS
Packets are diverted to any port that has stashing space.

Only the stashing VC is used (no copy).
[Figure: stashing datapath, from In 1 to Out 6.]

RETRIEVING ECN PACKETS
Packets are returned to their original routes.
ECN summary:
Two full switch traversals double the row and column bandwidth
But only for stashing traffic
[Figure: stashing datapath, from In 6 to Out 3.]

EXPERIMENTS

DRAGONFLY NETWORK
[Figure: dragonfly network of groups (G) and switches (S).]

Scalable and cost-efficient topology due to a high degree of link sharing.
Network consists of fully-connected groups of fully-connected switches.

DRAGONFLY STASHING
Like most large-scale topologies, the dragonfly has asymmetric link lengths.

Link Type        Length                Ports per Switch   Port Buffers Unused
Endpoint         Very short (< 1 m)    25%                99%
Within group     Short (< 5 m)         50%                95%
Between groups   Long (up to 100 m)    25%                None

On a general-purpose switch, buffers sized for the long links.
72% of all port buffer memory is unused and available for stashing, because 75% of the ports are connected to short links.

EXPERIMENTAL METHOD
Simulated a 3080-node dragonfly network using 20-port switches.
Each switch is a 4x4 array of 5x5 tiles, with 30% internal clock speedup.
Used SST/Macro (system) together with Booksim (switches) to simulate:
Synthetic uniform-random traffic
Six MPI application traces: BIGFFT, AMG, MultiGrid, Fill Boundary, AMR, and MiniFE
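Two numbers on these slides can be cross-checked from the 25% / 50% / 25% port split. The sketch below first weights the per-link unused fractions from the table, then sizes a dragonfly built from 20-port switches. The second part rests on an assumption the slides do not state: that the network is a maximum-size dragonfly, with every group fully connected internally and one global link between each pair of groups. The names `p`, `a`, `h`, and `g` are this sketch's own labels.

```python
# Port shares and unused-buffer fractions from the DRAGONFLY STASHING table.
LINKS = {
    "endpoint":       {"port_share": 0.25, "unused": 0.99},
    "within group":   {"port_share": 0.50, "unused": 0.95},
    "between groups": {"port_share": 0.25, "unused": 0.00},
}

# Every port has the same buffer size, so the total is a port-weighted average.
total_unused = sum(link["port_share"] * link["unused"] for link in LINKS.values())
print(f"unused port buffer memory: {total_unused:.2%}")

# Dragonfly size for a 20-port switch with the same port split.
RADIX = 20
p = RADIX // 4            # endpoint ports per switch
local_ports = RADIX // 2  # ports to other switches in the same group
h = RADIX // 4            # global ports per switch

a = local_ports + 1       # switches per group (fully connected, assumed)
g = a * h + 1             # groups (one global link per group pair, assumed)
switches = a * g
nodes = p * switches

print(f"switches per group: {a}, groups: {g}, switches: {switches}, nodes: {nodes}")
```

The weighted sum comes to 72.25%, which the slide rounds to 72%. The assumed maximum-size configuration gives 11 switches per group, 56 groups, 616 switches, and 3080 nodes, which matches the simulated network size.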

END-TO-END RETRANSMISSION
Modeled stashing of packets when sent, and deletion of stashed copies when acknowledged.
Did not model actual retransmissions because they are rare and not time-critical.
Goal: make sure the common store/delete operations do no harm
Goal: determine whether there is sensitivity to the total amount of stashing storage
Compared four models:
Baseline tiled switch
Stashing switch
Stashing switch with only 50% of the stashing buffers available
Stashing switch with only 25% of the stashing buffers available

END-TO-END RETRANSMISSION
Application runtime normalized to the baseline tiled switch
Stashing primarily does no harm across applications.
Sometimes stashing improves the runtime (probably by reducing congestion).
[Figure: bar chart of normalized runtime (axis 0.5 to 1.1) for BIGFFT, FillBoundary, AMR, MiniFE, MultiGrid, and AMG. Series: Baseline, Stash 100% Cap.]

END-TO-END RETRANSMISSION SENSITIVITY
Added switches with reduced stashing capacity.
Apps that send the most data slow down when stashing memory is exhausted.
[Figure: bar chart of normalized runtime (axis 0.5 to 1.4) for BIGFFT, FillBoundary, AMR, MiniFE, MultiGrid, and AMG. Series: Baseline, Stash 100% Cap., Stash 50% Cap., Stash 25% Cap.]

END-TO-END RETRANSMISSION
Uniform-random latency and throughput
[Figure: two plots against offered load (flits/s/node, 0 to 1): network latency (us, 0 to 3) and accepted throughput (flits/s/node, 0 to 1). Series: Baseline, Stash 100% Cap., Stash 50% Cap., Stash 25% Cap.]

Stashing does no harm, except when buffering is reduced to 25% of the total.

ECN TRANSIENT ABSORPTION
Simulated pairs of applications:
Well-behaved victim injects at 40% bandwidth to many destinations
Aggressor injects at full bandwidth to a small number of destinations
Measured the performance of the victim only.
Goal: determine if head-of-line blocking suffered by the victim is reduced
Goal: determine if the worst-case latency of the victim is reduced

ECN TRANSIENT ABSORPTION
[Figure: victim average latency (ms, 0.5 to 0.8) against time (ms, 0 to 100), with "Aggressor starts" and "ECN succeeds" marked. Series: Baseline, Stash 100% Cap., Stash 50% Cap.]
Stashing almost completely eliminates excess victim latency.

ECN TRANSIENT TAIL LATENCY
Cumulative distribution of victim packet latencies
[Figure: fraction of packets (log scale, 10^0 down to 10^-6) against network latency (ms, 0 to 10). Series: Baseline w/o Aggressor, Baseline, Stash 100% Cap., Stash 50% Cap.]
Stashing reduces the worst-case (tail) latency to just 3x the baseline without aggressor.

CONCLUSIONS
We believe that many interesting network capabilities are enabled by additional switch memory, such as end-to-end retransmission and congestion mitigation.
Stashing provides access to existing, unused switch memory via existing unused bandwidth.
This initial evaluation indicates that stashing is worth further exploration.
For details, please read the paper!
