Internetworking - SIU

Internetworking - SIU

ECE 526 Network Processing Systems Design Lecture 3-6: Implementation Models A rather small set of key concepts is enough. Only by learning the essence of each topic, and by carrying along the least amount of mental baggage at each step, will the student emerge with a good overall understanding of the subject! - Carver Mead and Lynn Conway Ning Weng ECE 526

2 Goals of Models Before we can play with improvements, we must know the rules of the game. Algorithmics uses four separate areas: protocols, architecture, OS and algorithms Question: How can a logic designer understand protocols and how can an algorithm designer understand hardware Answer: Use simple models that have explanatory and predictive power

Leave details to the specialist Ning Weng ECE 526 3 Outline of Models

Networking protocols Hardware Architecture Operating systems Ning Weng ECE 526 4 Transport Models

Presents illusion of two shared queues in each direction despite unreliable network using seq # and retransmission Connection between local queues named ports at IP addresses Ning Weng ECE 526 5 Routing Protocols

Forwarding: router determines next hop/router based on forwarding table to send a packet. Speed-critical. Routing: routers work together to populate/build forwarding table RIP (Exchange distance estimates) OSPF (Link state) BGP (Path vector, based on policy) Ning Weng

ECE 526 6 Abstract Model of Protocols Interfaces: used by applications or higher level protocols Packet (message) format: IP,TCP, UDP Typical protocol operations Data copy: from application to kernel; from input link to output link via switch

Demultiplexing Looking up state: longest prefix match Set timers and receive alarms Ning Weng ECE 526 7

Measures Throughput: number of jobs per second Focus of industry, ISPs wish to maximize Latency: time to complete a job E.g. time for a packet to go from one end to another (RTT) Ideal interactive time 100ms (today 250 msec across US)

Ning Weng ECE 526 8 Performance facts Backbone link speeds: Optical carrier (OC)-n n x 51.8Mbits OC48 ~ 2.5 Gbps, OC192~10Gbps

TCP and web dominance : Web (70%) and TCP (90%) Small transfers: 50% of accessed files < 50Kbytes Poor locality: 1M concurrent type of packets in backbone. 5 packets to same destination Latencies large, e.g. 100 ms for wide area Wire speed forwarding half of packets 40 bytes, many 1500 bytes Takes 8 ns to send 40 bytes at 40 Gbps Ning Weng

ECE 526 9 Case study: iSCSI Large data centers connect disks and computers with a SAN (Storage area network) to enable disk sharing. Expensive today. Single SCSI command can READ: 10Mbytes from a remote disk iSCSI over TCP/IP: must implement TCP in hardware, add iSCSI header with length. SCSI messages can be issued out of order, but TCP does not allow!!! Case study: the protocol features affect greatly on application

performance Ning Weng ECE 526 10 Models Networking (protocols)

Hardware Architecture Operating systems Ning Weng ECE 526 11 Hardware Combinatorial logic From transistors to gates Timing and power High level building blocks (standard cells)

Memories Registers, SRAM, DRAM Fancy memory tricks (page mode, interleaved) Ning Weng ECE 526 12 Combinatorial logic

A function from digital inputs to outputs Wires, transistors, resistors, capacitors Gates: AND, NAND, NOT, etc. (60 ps) More complex blocks: adders, multiplexers Different designs for same Boolean function offer different trade-offs. Time for the output to stabilize Number of transistors Power dissipation P=CV2F

Ning Weng ECE 526 13 Priority encoder timing Input I and outputs O are N-bit vectors such that O[j] = 1 iff I[j] = 1 and I[k] = 0, for all k

Ning Weng ECE 526 14 Priority encoder timing Design 2: Every output bit requires the AND of compliment of first j-1 bits Construct recursively P[j] = P[j-1]. ~I[j] Using N 2-input AND gates in series O(N) transistors but takes O(N) Summary: Design 1 fast and fat; Design 2 slow and lean

Design 3: Tradeoff design. Idea: use balanced binary tree to compute P[j] and combine partial results. Ning Weng ECE 526 15 Reduction using standard cells Build four input mux from 2 input muxes. Ning Weng

ECE 526 16 Crossbar scheduler PPE: like a priority encoder but with rotating priority. Find first 1 beyond P Arises in switch arbitration Design 1: Use a barrel shifter to shift I to the left by P bits. Apply PE Shift back by P bits.

Two shifts + PE ~ 3 log N gate delays. Each barrel shift using tree of 2-muxes ~ log (N) Ning Weng ECE 526 17 Crossbar scheduler Smarter scheme due to Gupta/McKeown First check if any bit set at or after P If yes, apply PE on bits after P If no, apply PE on bits before P Used in Tiny Tera and Abrizio

Tested using TI libraries 2x fast and 3x less area For a 32 port router. Ning Weng ECE 526 18 Memories Router forwarding done using logic but packets

stored in memories. Memory are significant bottlenecks Slower than logic delays Different subsystems require different types 200ms worth of packet buffering (DRAM) 1Mbyte of fast SRAM for forwarding tables Need different models Ning Weng ECE 526 19

Registers Need to store bit such that it stays indefinitely (absence of writes/power failures) Register is ordered collection of flip-flops Access from logic to a register ~ 0.5-1ns Ning Weng ECE 526 20 SRAM

N registers addressed by log N bits Addresses. Self refreshing. Needs a decoder to translate A to unary so add decode delay. On chip SRAM 1-2ns, Off-chip 5-10ns. Ning Weng ECE 526 21 DRAM SRAM flip-flop needs 5 transistors. DRAM needs one transistor plus capacitor

(so more dense) Leakage fixed by refresh. Slower because output not driven by the power supply as in SRAM 40-60ns Ning Weng ECE 526 22 Memories Page mode DRAM (latency 40-60ns) Direct decoding hard: use divide and conquer

Decode in stages, row first and column next Reduces decoder from O(N) to O(N) gates Accesses within rows do not require Row address strobe. Ning Weng ECE 526 23 Interleaved DRAMs

Increase DRAM throughput (same latency) While Bank 1 works, send addresses to Bank 2, etc. SDRAM uses 2 banks, RDRAM uses 16 banks. If consecutive accesses to different banks, bandwidth increased by a factor of B Ning Weng ECE 526

24 Example: Pipelined flow id lookup Lookup 1M 96-bit packet ID for accounting using binary tree in 128nsec (OC-48) SRAM infeasible. DRAM too slow (20 accesses) Use interleaved memory + pipeline

Direct RDRAM runs at 800MHz, read access 60ns 216 too small. So use RAMBUS page mode to retrieve two 96-bit keys + three 20bit pointers in one 256 bit access. Ning Weng ECE 526 25 Pin count limitations matter Example: Router with five 10GBps links Overall buffering - 200ms * 50Gbps Want to use DRAM

Bw required 2 * 50 = 100 Gbits/sec + overhead ~ 200 Gbps Use single RDRAM with 16 banks Peak bw of 1.6 Gbytes or 13Gbps Accessing each RDRAM requires 64 pins for data, 25 for address/control ~90. 200 Gbps memory requires 16 RDRAMs ~ 1440 pins Chips have pin limitations (typically <1000pins)

Thus requires at least one more chip. Ning Weng ECE 526 26 Real world: chip design process Box architect partitions functions between chips. Design teams write specifications, logic designers write RTL using Verilog. Synthesize gates to generate circuits. Timing checks to change design. Finally chip tapes out, manufactured, yield inspected. Easy to add features before RTL, second spins are

expensive. Ning Weng ECE 526 27 Next Lecture Networking (protocols) Hardware Device architecture Operating systems Ning Weng

ECE 526 28 Architecture Models Optimizing network performance requires optimizing all data paths. Through the internals of source node, sink node and routers End node architectures optimized for general computation. Router architectures for Internet communication. End Node Architecture

Main memory (1GB, 50-80nsec, L2 cache, on chip SRAM registers) Direct mapped caches Cache lines, spatial locality, page mode Memory mapped I/O, DMA, I/O bus versus processor bus. Alternative endnode architecture

Packets from network Mem2. Processor Reads Mem1. Switch can alternate the accesses E.g., infiniband as a replacement for PCI What a Router Looks Like Router Architectures Major tasks:

Lookup Classification Switching Queuing Less time-critical Header verification Checksum

Route computation Generic Router Architecture Slide courtesy of Nick McKeown Generic Router Architecture Network Processors General purposes processors optimized for network traffic Motivated by unpredictable router tasks.

Often just multiple processors that work on different packets at the same time. Intel IXP has 6 processors with 5.6ns clock. Use multithreading to quickly switch contexts IXP has 4 contexts. Special purpose instructions for address lookup and other functions. Why NPUs seem like a good idea

What makes a CPU appealing for a PC. Flexibility: Supports many applications Time to market: Allows quick introduction of new applications. Future proof: Supports as-yet unthought of applications No-one would consider using fixed function ASICs for a PC Why NPUs seem like a good idea What makes a NPU appealing

Flexibility: Protocols and standards change. Time to market: Saves 18months building an ASIC. Code re-use. Future proof: New protocols emerge. Less risk: Bugs more easily fixed in s/w. Surely no-one would consider using fixed function ASICs for new networking equipment? Network Processors: Load-balancing Incoming packets dispatched to: 1. Idle processor, or 2. Processor dedicated to packets in this flow (to prevent mis-sequencing), or

3. Special-purpose processor for flow, e.g. security, transcoding, applicationlevel processing. Network Processors: Pipelining Processing broken down into (hopefully balanced) steps, Each processor performs one step of processing. Models

Networking (protocols) Hardware Device architecture Operating systems Operating System Routers usually have a lightweight OS Different from traditional Oses End-node performance dependent on OS Abstractions Uninterrupted Computation: No Interrupts Infinite Memory: just an illusion. Simple I/O: Avoid dealing directly with devices (simple writes/reads)

Uninterrupted communication Underlying mechanisms Context switching Scheduling Protection Flavors of process - increasing complexity Interrupt handlers, threads, processes

Receiver livelock in BSD Keep processing packets only to discard because apps never run. Latency depends on process Interrupts (10us), context switch (10-100us) Infinite memory Via virtual memory

Page mapping (avoids finding contiguous locations). Demand paging (use more space than memory) Slow DRAM lookup avoided with fast TLB Protection by allowing only OS to modify page tables. Simple I/O using system calls

For abstraction alone, I/O could be libraries. For security, I/O handled by device drivers. System calls, trap to kernel protection levels. More expensive function call, because of privilege escalation. Next class Implementation principles. Read chapter 3 and 4. Quiz1

Recently Viewed Presentations

  • Animation Special Effects Free Template Series Brainy Betty.

    Animation Special Effects Free Template Series Brainy Betty.

    Animated floating petals (Difficult) Tip: For best results with the animation effects on this slide, choose a picture with an object that is made up of multiple parts, like the flower in this example.
  • CAMPBELL BIOLOGY TENTH EDITION Reece  Urry  Cain  Wasserman


    Animal phyla vary greatly in morphology, from simple sponges that lack tissues and symmetry to complex vertebrates. Members of different animal phyla have similar developmental genes, but the number of miRNAs (microRNAs) varies considerably. These small RNAs are involved in...
  • XXX International Cosmic Ray Conference, July 2007, Mrida,

    XXX International Cosmic Ray Conference, July 2007, Mrida,

    a=0.26 GeV / mwe b=0.36 10-3 / mwe (1 mwe = 1/0.917 m of ice) The number of Cherenkov photons generated by a bare muon is number of channels ... The reconstructed parameter is number of photons per unit length...
  • Leading-Edge Technologies in K-12 and Post-Secondary ...

    Leading-Edge Technologies in K-12 and Post-Secondary ...

    The Story of "O" (as in Open Source) Phillip Long MIT Thursday, May 13th, 2004 [email protected] How many open source developers does it take to change a light bulb? 17 to agree about the license 17 to argue about the...
  • The Age of Absolutism - Weebly

    The Age of Absolutism - Weebly

    The Age of Absolutism. Late 1500s to the late 1700s. Chapter 16. Learning Targets. I can define what an Absolute Monarch is. I can define the concept of Divine Right. I can describe at least two advantages and disadvantages of...
  • Crisis Communications -

    Crisis Communications -

    Step Two: Developing The Plan 4 Identify the Emergency Operations Center (EOC). 5 Identify the Media Information Center (MIC). 6 Train employees. Step Three: Response This is the stage in which the crisis plan is executed. Like any other plan,...
  • Proposals for Change in Medical Negligence Cases - Pre Action ...

    Proposals for Change in Medical Negligence Cases - Pre Action ...

    Liability as Employer. Liability as Occupier. Negligence. wrongdoing and fault. duty of care. breach of duty of care. injury caused by breach of duty of care. Duty of Care. Common Law Duty of Care. Statutory Duty of Care. Common Law...
  • Professional curosity

    Professional curosity

    How to deal with disguised compliance . Focus on the needs, voice and 'lived experience' of the child, young person or adult. Avoid being encouraged to focus to extensively on the needs and presentation of the adults or carers -...