# Computer Arithmetic, Part 6

Part VI Function Evaluation Parts Chapters 1. 2. 3. 4. Numbers and Arithmetic Representing Signed Numbers Redundant Number Systems Residue Number Systems 5. 6. 7. 8. Basic Addition and Counting Carry-Look ahead Adders Variations in Fast Adders Multioperand Addition III. Multiplication 9. 10. 11. 12. Basic Multiplication Schemes High-Radix Multipliers Tree and Array Multipliers Variations in Multipliers IV. Division 13. 14. 15. 16. Basic Division Schemes High-Radix Dividers Variations in Dividers Division by Convergence

17. 18. 19. 20. Floating-Point Reperesentations Floating-Point Operations Errors and Error Control Precise and Certifiable Arithmetic 21. 22. 23. 24. Square-Rooting Methods The CORDIC Algorithms Variations in Function Evaluation Arithmetic by Table Lookup 25. 26. 27. 28. 28. High-Throughput Arithmetic Low-Power Arithmetic Fault-Tolerant Arithmetic Past, Present, and Future Reconfigurable Arithmetic Elementary Operations I. Number Representation II. Addition / Subtraction V. Real Arithmetic VI. Function Evaluation VII. Implementation Topics Appendix: Past, Present, and Future May 2015

Computer Arithmetic, Function Evaluation Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010, ISBN 978-0-19-532848-6). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. Behrooz Parhami Edition Released Revised Revised Revised Revised First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 June 2007 May 2008 May 2009 May 2011 May 2012 Second May 2015

May 2010 May 2015 Computer Arithmetic, Function Evaluation Slide 2 VI Function Evaluation Learn hardware algorithms for evaluating useful functions Divisionlike square-rooting algorithms Evaluating sin x, tanh x, ln x, . . . by series expansion Function evaluation via convergence computation Use of tables: the ultimate in simplicity and flexibility Topics in This Part Chapter 21 Square-Rooting Methods Chapter 22 The CORDIC Algorithms Chapter 23 Variation in Function Evaluation Chapter 24 Arithmetic by Table Lookup May 2015 Computer Arithmetic, Function Evaluation Slide 3 May 2015 Computer Arithmetic, Function Evaluation Slide 4 21 Square-Rooting Methods Chapter Goals Learning algorithms and implementations for both digit-at-a-time and convergence square-rooting Chapter Highlights Square-rooting part of IEEE 754 standard Digit-recurrence (divisionlike) algorithms Convergence or iterative schemes Square-rooting not special case of division May 2015 Computer Arithmetic, Function Evaluation Slide 5

Square-Rooting Methods: Topics Topics in This Chapter 21.1 The Pencil-and-Paper Algorithm 21.2 Restoring Shift / Subtract Algorithm 21.3 Binary Nonrestoring Algorithm 21.4 High-Radix Square-Rooting 21.5 Square-Rooting by Convergence 21.6 Fast Hardware Square-Rooters May 2015 Computer Arithmetic, Function Evaluation Slide 6 21.1 The Pencil-and-Paper Algorithm Notation for our discussion of division algorithms: z q s Radicand Square root Remainder, z q 2 z2k1z2k2 . . . z3z2z1z0 qk1qk2 . . . q1q0 sksk1sk2 . . . s1s0 Remainder range, 0 s 2q (k + 1 digits) Justification: s 2q + 1 would lead to z = q 2 + s (q + 1)2 q Root z q 3 (q (0) 0 q 3 ) q 2 (q (1) 0 q 2 ) q 1 (q (2) 0 q 1 ) q 0 (q (3) 0 q 0 ) Radicand 26 24 22 20

s Subtracted bit-matrix Remainder Fig. 21.3 Binary square-rooting in dot notation. May 2015 Computer Arithmetic, Function Evaluation Slide 7 Example of Decimal Square-Rooting Check: 3082 + 377 = 94,864 + 377 = 95,241 Root digit Partial root q2 q1 q0 q q(0) = 0 9 5 2 4 1 z q2 = 3 q(1) = 3 sixty plus q1 9 0 5 2 6q1 q1 52 q1 = 0 q(2) = 30 0 0 5 2 4 1 60q0 q0 5241 q0 = 8 q(3) = 308 4 8 6 4

0 3 7 7 s = (377)ten q = (308)ten Fig. 21.1 Extracting the square root of a decimal integer using the pencil-and-paper algorithm. May 2015 Computer Arithmetic, Function Evaluation Slide 8 Square-Rooting as Division with Unknown Divisor q3 q2 q1 q0 q3 q2 q1 q0 z7 z6 z5 z4 z3 z2 z1 z0 q3 depends only on z7 z6 q3 q2 q1 q0 q3 q2 q1 q0 0 q3 in radix 2 Justification: For 0, the square of (q3 + )r3 is q32r6 + (2q3 + )r6, leading to a change in z7 z6 z7 z6 z5 z4 z3 z2 z1 z0 z7 z6 z5 z4 z7 z6 z5 z4 z3 z2 z7 z6 z5 z4 z3 z2 z1 z0 Similarly, q2 depends only on z7 z6 z5 z4, and so on May 2015 Computer Arithmetic, Function Evaluation Slide 9 Root Digit Selection Rule The root thus far is denoted by q (i) = (qk1qk2 . . . qki)ten Attaching the next digit qki1, partial root becomes q (i+1) = 10 q (i) + qki1 The square of q (i+1) is 100(q (i))2 + 20 q (i) qki1 + (qki1)2 100(q (i))2 = (10 q (i))2 subtracted from partial remainder in previous steps

Must subtract (10(2 q (i)) + qki1) qki1 to get the new partial remainder More generally, in radix r, must subtract (r (2 q (i)) + qki1) qki1 In radix 2, must subtract (4 q (i) + qki1) qki1, which is 4 q (i) + 1 for qki1 = 1, and 0 otherwise Thus, we use (qk1qk2 . . . qkI 0 1)two in a trial subtraction May 2015 Computer Arithmetic, Function Evaluation Slide 10 Example of Binary Square-Rooting Root digit Check: 102 + 18 = 118 = (0111 0110)two q3 q2 q1 q0 0 11 10 11 0 q Partial root q(0) = 0 01? Yes q3 = 1 q(1) = 1 101? No q2 = 0

q(2) = 10 1001? Yes q1 = 1 q(3) = 101 q0 = 0 q(4) = 1010 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 10101? No 1 0 0 1 0 s = (18)ten q = (1010)two = (10)ten Fig. 21.2 Extracting the square root of a binary integer using the pencil-and-paper algorithm. May 2015 Computer Arithmetic, Function Evaluation Slide 11 21.2 Restoring Shift / Subtract Algorithm Consistent with

q Rootthe IEEE 754 floating-point standard, we formulate Radicand our algorithms for a radicand in the range z (q (0) 0 q 3 ) 2 6 3 z < 4 1 qq (after possible 1-bit shift for an odd exponent) (q (1) 0 q ) 2 4 Subtracted 2 2 q 1 (q (2) 0 q 1 ) 2 2 bit-matrix (3) 0z 0 Radicand q 01 (q q 0< ) 24 s Fig. 21.3 1q<2 0s<4 Remainder Square root Remainder z1z0 . z1z2 . . . zl 1 . q1q2 . . . ql s1 s0 . s1 s2 . . . sl Binary square-rooting is defined by the recurrence s (j) = 2s (j1) qj(2q (j1) + 2j qj) with s (0) = z 1, q (0) = 1, s (j) = s where q (j) is the root up to its (j)th digit; thus q = q (l) To choose the next root digit qj {0, 1}, subtract from 2s (j1) the value 2q (j1) + 2j = (1 q1(j1) . q2(j1) . . . qj+1(j1) 0 1)two A negative trial difference means qj = 0

May 2015 Computer Arithmetic, Function Evaluation Slide 12 Finding the Sq. Root of z = 1.110110 via the Restoring Algorithm Fig. 21.4 Example of sequential binary square-rooting using the restoring algorithm. May 2015 ================================ z (radicand = 118/64) 0 1.1 1 0 1 1 0 ================================ s(0) = z 1 0 0 0.1 1 0 1 1 0 2s(0) 0 0 1.1 0 1 1 0 0 1 [2 1.)+2 ] 1 0.1 s(1) 1 1 1.0 0 1 1 0 0 s(1) = 2s(0) Restore 0 0 1 . 1 0 1 1 0 0 2s(1) 0 1 1.0 1 1 0 0 0 [2 1.0)+22] 1 0.0 1 s(2) 0 0 1.0 0 1 0 0 0 (2) 2s 0 1 0.0 1 0 0 0 0 [2 1.01)+23] 1 0.1 0 1

s(3) 1 1 1.1 0 1 0 0 0 s(3) = 2s(2) Restore 0 1 0 . 0 1 0 0 0 0 2s(3) 1 0 0.1 0 0 0 0 0 [2 1.010)+24] 1 0.1 0 0 1 s(4) 0 0 1.1 1 1 1 0 0 (4) 2s 0 1 1.1 1 1 0 0 0 [2 1.0101)+25] 1 0.1 0 1 0 1 s(5) 0 0 1.0 0 1 1 1 0 Root digit Partial root q0 = 1 1. q1 = 0 1.0 q2 = 1 1.01 q3 = 0 1.010 q4 = 1 1.0101 q5 = 1 1.01011

2s(5) 0 1 0.0 1 1 1 0 0 [21.01011)+26] 1 0.1 0 1 1 0 1 s(6) 1 1 1.1 0 1 1 1 1 q6 = 0 1.010110 (6) (5) s = 2s Restore 0 1 0 . 0 1 1 1 0 0 s (remainder = 156/64) 0 . 0 0 0 0 1 0 0 1q1 1 0 0 7 = 1, so round up q (root = 86/64) 1.0 1 0 1 1 0 Computer Arithmetic, Function Evaluation Slide 13 ================================ Hardware for Restoring Square-Rooting Trial Difference Shift Trial difference qkj Quotient q k Partial remainder s(j) (initial value z) MSB of 2s (j1) Load Partial Remainder Shift Divisor d

MSB of 2s(j1) k 0 Mux Load Quotient digit selector 1 k cout Put z 1 here at the outset Adder cin q j Square-Root 1 ll+2 +2 Fig. 13.5 Shift/subtract sequential restoring divider (for comparison). Complement cout Fig. 21.5 Sequential shift/subtract restoring square-rooter. May 2015 (l(l+2)-bit + 2)

Select Root Digit sub cin adder l+2 Computer Arithmetic, Function Evaluation Slide 14 Rounding the Square Root In fractional square-rooting, the remainder is not needed To round the result, we can produce an extra digit ql1: Truncate for ql1 = 0, round up for ql1 = 1 Midway case, ql1 = 1 followed by all 0s, impossible (Prob. 21.11) Example: In Fig. 21.4, we had (01.110110)two = (1.010110)two2 + (10.011100)/64 An extra iteration produces q7 = 1 So the root is rounded up to q = (1.010111)two = 87/64 The rounded-up value is closer to the root than the truncated version Original: Rounded: May 2015 118/64 = (86/64)2 + 156/(64)2 118/64 = (87/64)2 17/(64)2 Computer Arithmetic, Function Evaluation Slide 15 21.3 Binary Nonrestoring Algorithm As in nonrestoring division, nonrestoring square-rooting implies: Root digits in {1, 1} On-the-fly conversion to binary Possible final correction The case qj = 1 (nonnegative partial remainder), is handled as in the restoring algorithm; i.e., it leads to the trial subtraction of qj [2q (j1) + 2j qj ] = 2q (j1) + 2j For qj = 1, we must subtract qj [2q (j1) + 2j qj ] = [2q (j1) 2j ] Slight complication, compared with nonrestoring division which is equivalent to adding 2q (j1) 2j

This term cannot be formed by concatenation May 2015 Computer Arithmetic, Function Evaluation Slide 16 Finding the Sq. Root of z = 1.110110 via the Nonrestoring Algorithm Fig. 21.6 Example of nonrestoring binary square-rooting. May 2015 ================================ z (radicand = 118/64) 0 1.1 1 0 1 1 0 ================================ s(0) = z 1 0 0 0.1 1 0 1 1 0 2s(0) 0 0 1.1 0 1 1 0 0 1 [2 1.)+2 ] 1 0.1 s(1) 1 1 1.0 0 1 1 0 0 2s(1) 1 1 0.0 1 1 0 0 0 2 +[2 1.1)2 ] 1 0.1 1 s(2) 0 0 1.0 0 1 0 0 0 2s(2) 0 1 0.0 1 0 0 0 0 3

[2 1.01)+2 ] 1 0.1 0 1 s(3) 1 1 1.1 0 1 0 0 0 2s(3) 1 1 1.0 1 0 0 0 0 +[2 1.011)24] 1 0.1 0 1 1 s(4) 0 0 1.1 1 1 1 0 0 (4) 2s 0 1 1.1 1 1 0 0 0 [2 1.0101)+25] 1 0.1 0 1 0 1 s(5) 0 0 1.0 0 1 1 1 0 (5) 2s 0 1 0.0 1 1 1 0 0 [21.01011)+26] 1 0.1 0 1 1 0 1 s(6) 1 1 1.1 0 1 1 1 1 +[21.01011)26] 1 0.1 0 1 1 0 1 s(6) Corrected 0 1 0.0 1 1 1 0 0 s (remainder = 156/64) 0.0 0 0 0 1 0 0 q (binary) 1.0 1 0 1 1 1 q (corrected binary) 1.0 1 0 1 1 0 ================================ Computer Arithmetic, Function Evaluation Root digit Partial root q0 = 1

q1 = 1 1. 1.1 q2 = 1 1.01 q3 = 1 1.011 q4 = 1 1.0101 q5 = 1 1.01011 q6 = 1 1.010111 Negative; Correct (17/64) 1 1 1 0 0 (156/64) (156/642) (87/64) (86/64) Slide 17 Some Details for Nonrestoring Square-Rooting Depending on the sign of the partial remainder, add: (positive) (negative) Add 2q (j1) + 2j Sub. 2q (j1) 2j Concatenate 01 to the end of q (j1) Cannot be formed by concatenation

Solution: We keep q (j1) and q (j1) 2j+1 in registers Q (partial root) and Q* (diminished partial root), respectively. Then: qj = 1 qj = 1 Subtract Add 2q (j1) + 2j 2q (j1) 2j formed by shifting Q 01 formed by shifting Q*11 Updating rules for Q and Q* registers: qj = 1 qj = 1 Q := Q 1 Q := Q*1 Q* := Q 0 Q* := Q*0 Additional rule for SRT-like algorithm that allow qj = 0 as well: qj = 0 May 2015 Q := Q 0 Q* := Q*1 Computer Arithmetic, Function Evaluation Slide 18 21.4 High-Radix Square-Rooting Basic recurrence for fractional radix-r square-rooting: s (j) = rs (j1) qj(2 q (j1) + r j qj) As in radix-2 nonrestoring algorithm, we can use two registers Q and Q* to hold q (j1) and its diminished version q (j1) r j+1, respectively, suitably updating them in each step Fig. 21.3

May 2015 q Root z q 3 (q (0) 0 q 3 ) q 2 (q (1) 0 q 2 ) q 1 (q (2) 0 q 1 ) q 0 (q (3) 0 q 0 ) Radicand 26 24 22 20 Subtracted bit-matrix s Radix-4 square-rooting Remainder in dot notation Computer Arithmetic, Function Evaluation Slide 19 An Implementation of Radix-4 Square-Rooting r = 4, root digit set [2, 2] s (j) = rs (j1) qj(2 q (j1) + r j qj) Q* holds q (j1) 4j+1 = q (j1) 22j+2. Then, one of the following values must be subtracted from, or added to, the shifted partial remainder rs (j1) qj = 2 qj = 1 qj = 1 qj = 2 Subtract 4q (j1) + 22j+2 double-shift Q 010 Subtract 2q (j1) + 22j shift Q 001 Add

2q (j1) 22j shift Q*111 Add 4q (j1) 22j+2 double-shift Q*110 Updating rules for Q and Q* registers: qj = 2 qj = 1 qj = 0 qj = 1 qj = 2 Q := Q 10 Q := Q 01 Q := Q 00 Q := Q*11 Q := Q*10 Q* := Q 01 Q* := Q 00 Q* := Q*11 Q* := Q*10 Q* := Q*01 Note that the root is obtained in binary form (no conversion needed!) May 2015 Computer Arithmetic, Function Evaluation Slide 20 Keeping the Partial Remainder in Carry-Save Form As in fast division, root digit selection can be based on a few bits of the shifted partial remainder 4s (j1) and of the partial root q (j1) This would allow us to keep s in carry-save form One extra bit of each component of s (sum and carry) must be examined Can use the same lookup table for quotient digit and root digit selection To see how, compare recurrences for radix-4 division and square-rooting: Division:

s (j) = 4s (j1) qj d Square-rooting: s (j) = 4s (j1) qj (2 q (j1) + 4 j qj) To keep magnitudes of partial remainders for division and square-rooting comparable, we can perform radix-4 square-rooting using the digit set {1, , 0 , , 1} Can convert from the digit set above to the digit set [2, 2], or directly to binary, with no extra computation May 2015 Computer Arithmetic, Function Evaluation Slide 21 21.5 Square-Rooting by Convergence Newton-Raphson method f(x) Choose f(x) = x2 z with a root at x = z x (i+1) = x (i) f(x (i)) / f (x (i)) x x (i+1) = 0.5(x (i) + z / x (i)) Each iteration: division, addition, 1-bit shift Convergence is quadratic z z For 0.5 z < 1, a good starting approximation is (1 + z)/2 This approximation needs no arithmetic The error is 0 at z = 1 and has a max of 6.07% at z = 0.5 The hardware approximation method of Schwarz and Flynn, using the tree circuit of a fast multiplier, can provide a much better approximation (e.g., to 16 bits, needing only two iterations for 64 bits of precision) May 2015 Computer Arithmetic, Function Evaluation Slide 22 Initial Approximation Using Table Lookup Table-lookup can yield a better starting estimate x (0) for z For example, with an initial estimate accurate to within 28, three iterations suffice to increase the accuracy of the root to 64 bits x (i+1) = 0.5(x (i) + z / x (i)) Example 21.1: Compute the square root of z = (2.4)ten

x (0) read out from table x (1) = 0.5(x (0) + 2.4 / x (0)) x (2) = 0.5(x (1) + 2.4 / x (1)) x (3) = 0.5(x (2) + 2.4 / x (2)) = = = = 1.5 1.550 000 000 1.549 193 548 1.549 193 338 accurate to 101 accurate to 102 accurate to 104 accurate to 108 Check: (1.549 193 338)2 = 2.399 999 999 May 2015 Computer Arithmetic, Function Evaluation Slide 23 Convergence Square-Rooting without Division Rewrite the square-root recurrence as: x (i+1) = 0.5(x (i) + z / x (i)) x (i+1) = x (i) + 0.5 (1/x (i))(z (x (i))2) = x (i) + 0.5(x (i))(z (x (i))2) where (x (i)) is an approximation to 1/x (i) obtained by a simple circuit or read out from a table Because of the approximation used in lieu of the exact value of 1/x (i), convergence rate will be less than quadratic Alternative: Use the recurrence above, but find the reciprocal iteratively; thus interlacing the two computations Using the function f(y) = 1/y x to compute 1/x, we get: x (i+1) = 0.5(x (i) + z y (i)) y (i+1) = y (i) (2 x (i) y (i)) 3 multiplications, 2 additions, and a 1-bit shift per iteration Convergence is less than quadratic but better than linear May 2015

Computer Arithmetic, Function Evaluation Slide 24 Example for Division-Free Square-Rooting x (i+1) = 0.5(x (i) + z y (i)) y (i+1) = y (i) (2 x (i) y (i)) x converges to z y converges to 1/z Example 21.2: Compute 1.4, beginning with x (0) = y (0) = 1 x (1) = 0.5(x (0) + 1.4 y (0)) = y (1) = y (0) (2 x (0) y (0)) = x (2) = 0.5(x (1) + 1.4 y (1)) = y (2) = y (1) (2 x (1) y (1)) = x (3) = 0.5(x (2) + 1.4 y (2)) = y (3) = y (2) (2 x (2) y (2)) = x (4) = 0.5(x (3) + 1.4 y (3)) = y (4) = y (3) (2 x (3) y (3)) = x (5) = 0.5(x (4) + 1.4 y (4)) = y (5) = y (4) (2 x (4) y (4)) = x (6) = 0.5(x (5) + 1.4 y (5)) = 1.4 Check: (1.183 860 512)2 = May 2015 1.200 000 000 1.000 000 000 1.300 000 000 0.800 000 000 1.210 000 000 0.768 000 000 1.142 600 000 0.822 312 960 1.146 919 072 0.872 001 394

1.183 860 512 1.401 525 712 Computer Arithmetic, Function Evaluation Slide 25 Another Division-Free Convergence Scheme Based on computing 1/z, which is then multiplied by z to obtain z The function f(x) = 1/x2 z has a root at x = 1/z (f (x) = 2/x3) x (i+1) = 0.5 x (i) (3 z (x (i))2) Quadratic convergence 3 multiplications, 1 addition, and a 1-bit shift per iteration Example 21.3: Compute the square root of z = (.5678)ten x (0) read out from table x (1) = 0.5x (0) (3 0.5678 (x (0))2) x (2) = 0.5x (1) (3 0.5678 (x (1))2) z z x (2) = = = = 1.3 1.326 271 700 1.327 095 128 0.753 524 613 Cray 2 supercomputer used this method. Initially, instead of x (0), the two values 1.5 x (0) and 0.5(x (0))3 are read out from a table, requiring only 1 multiplication in the first iteration. The value x (1) thus obtained is accurate to within half the machine precision, so only one other iteration is needed (in all, 5 multiplications, 2 additions, 2 shifts) May 2015 Computer Arithmetic, Function Evaluation Slide 26 21.6 Fast Hardware Square-Rooters Combinational hardware square-rooter serve two purposes:

1. Approximation to start up or speed up convergence methods 2. Replace digit recurrence or convergence methods altogether z 1.5 2 z z 1 + z/4 z 7/8 + z/4 1 z 17/24 + z/3 Best linear approx. 0 More subranges Better approx in each May 2015 0 Subrange 1 Subrange 2 1 + (z 1)/2 1 + z/4 1 Fig. 21.7 2 z 3 4 Plot of the function z for 1 z < 4. Computer Arithmetic, Function Evaluation Slide 27

Nonrestoring Array Square-Rooters z1 Array squarerooters can be derived from the dot-notation representation in much the same way as array dividers 0 q 1 z2 Cell 1 1 z3 0 XOR z4 FA 1 q2 z5 1 0 q3 z6 z 7 0 z 8 1

q4 q Root z Radicand s6 s7 s1 s2 s3 s4 s5 (0) 6 q 3 (q 0 q 3 ) 2 Radicand z = 1 .z 2z 3z 4z 5z 6z 7z 8z q 2 (q (1) 0 q 2 ) 2 4 Subtracted Root q = 1 .q 2array q 3q 4q square-rooter Fig. Nonrestoring q 1 (q (2) 0 q 1 ) 21.8 2 2 bit-matrix Remainder s = 1 .s 2s 3s 4s 5s 6s 7s 8s q 0 (q (3) 0q0) 2 built of0 controlled add/subtract cells incorporating s May 2015 Remainder full adders (FAs) and XOR gates. Computer Arithmetic, Function Evaluation Slide 28 s8 Understanding the Array Square-Rooter Design

z1 0 q 1 z2 Cell 1 1 z3 0 XOR z4 FA 1 q2 z5 1 0 z6 q3 z 7 0 z 8 1 q4 s1 s2 s3

s4 s5 s6 s7 s8 Radicand z = 1 .z 2z 3z 4z 5z 6z 7z 8z Partial root, transferred diagonally from row to row, is appended with: Root q = 1 .q 2q 3q 4q Remainder = 1 .s 2 s 6sroot s 01 if the last root digit was 1;swith 11s 3 ifsthe last was 0 4s 5 7s 8digit May 2015 Computer Arithmetic, Function Evaluation Slide 29 Nonrestoring Array Square-Rooter in Action z 1 0 0 0 1 q 1 1

0 1 q4 1 z 3 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 Cell 1 1 0 q3 1

1 0 q2 0 0 0 z 1 2 1 s 1 1 1 1 1 1 0 1 0 0 1 1 s 2 0 1 1 1 s 3 0

z 5 0 0 0 FA 1 0 0 XOR z 4 1 s 4 0 1 1 0 0 z 6 1 1 1 0 1 0 z 7 0 1 1 1

0 s 5 z 8 0 1 0 1 s 6 1 1 0 s 7 0 1 1 s 8 .z1 z z5 z z7 z 2 z3 z 4Note 6 that 8the answer is Check: 118/256 =Radicand (10/16)2zq+== (3/256)? Root .q1 q q q 2 3 4 Remainder s = due .s1 s s s s approximate (to within 1 ulp)

to s3 there being 2 4 s5 6 s7 no 8 final correction May 2015 Computer Arithmetic, Function Evaluation Slide 30 Digit-at-a-Time Version of the Previous Example ================================ In this example, z is of that in z = 118/256 . 0 1 1 1 0 1 1 0 Fig. 21.6. Subtraction (addition) ================================ uses the term 2q + 2i (2q 2i). s (0) = z 0 0 . 0 1 1 1 0 1 1 0 2s (0) 0 0 0 . 1 1 1 0 1 1 0 Root Partial 1 (2q + 2 ) 1 1 . 1 digit root s (1) 0 0 . 0 1 1 0 1 1 0 q1 = 1 q = .1 (1) 2s 0 0 0 . 1 1 0 1 1 0 2 (2q + 2 ) 1 0 . 1 1 s (2) 1 1 . 1 0 0 1 1 0 q2 = 0 q = .10 2s (2)

1 1 1 . 0 0 1 1 0 3 +(2q 2 ) 0 0 . 1 1 1 s (3) 0 0 . 0 0 0 1 0 q3 = 1 q =.101 2s (3) 0 0 0 . 0 0 1 0 4 (2q + 2 ) 1 0 . 1 0 1 1 s (4) 1 0 . 1 1 0 1 q4 = 0 q = .1010 ================================= May 2015 Computer Arithmetic, Function Evaluation Slide 31 Square Rooting Is Not a Special Case of Division a x Multiplier p=ax z d x Multiplier But, direct realization of squarer leads to simpler and faster circuit p = x x = x2 z Divider

Divider q=z/d q = z / q = z1/2 May 2015 Multiplier, with both inputs connected to same value, becomes a squarer Divider cant be used as square-rooter via feedback connection Direct square-rooter realization does not lead to simpler or faster circuit Computer Arithmetic, Function Evaluation Slide 32 22 The CORDIC Algorithms Chapter Goals Learning a useful convergence method for evaluating trigonometric and other functions Chapter Highlights Basic CORDIC idea: rotate a vector with end point at (x,y) = (1,0) by the angle z to put its end point at (cos z, sin z) Other functions evaluated similarly Complexity comparable to division May 2015 Computer Arithmetic, Function Evaluation Slide 33 The CORDIC Algorithms: Topics Topics in This Chapter 22.1 Rotations and Pseudorotations 22.2 Basic CORDIC Iterations 22.3 CORDIC Hardware 22.4 Generalized CORDIC 22.5 Using the CORDIC Method 22.6 An Algebraic Formulation May 2015 Computer Arithmetic, Function Evaluation

Slide 34 22.1 Rotations and Pseudorotations Evaluation of trigonometric, hyperbolic, and other common functions, such as log and exp, is needed in many computations It comes as a surprise to most people that such elementary functions can be evaluated in time that is comparable to division time or a fairly small multiple of it Some groups advocate including these functions in IEEE 754, thus requiring that they be evaluated exactly, except for the final rounding Progress has been made- toward such properly rounded elementary functions, but the cost of achieving this goal is still prohibitive CORDIC is a low-cost method that achieves the reasonable accuracy of about 1 ulp, but does not guarantee proper rounding May 2015 Computer Arithmetic, Function Evaluation Slide 35 Key Ideas on which CORDIC Is Based (cos z, sin z) (1, y) 1 z tan y (1, 0) start at (1, 0) rotate by z get cos z, sin z start at (1, y) rotate until y = 0 rotation amount is tan1y COordinate Rotation DIgital Computer used this method in the1950s; modern electronic calculators also use it If we have a computationally efficient way of rotating a vector,

we can evaluate cos, sin, and tan1 functions - Rotation by an arbitrary angle is difficult, so we: Perform psuedorotations that require simpler operations Use special angles to synthesize the desired angle z z = (1) + (2) + . . . + (m) May 2015 Computer Arithmetic, Function Evaluation Slide 36 Rotating a Vector (x (i), y (i)) by the Angle (i) x(i+1) = x(i) cos (i) y(i) sin (i) = (x(i) y(i) tan (i)) / (1 + tan2 (i))1/2 y(i+1) = y(i) cos (i) + x(i) sin (i) = (y(i) + x(i) tan (i)) / (1 + tan2 (i))1/2 2 1/2 Recall that cos = 1 / (1 + tan ) (i+1) (i) (i) z =z E(i+1) y E(i+1) y (i+1)

Rotation Our strategy: Eliminate the terms (1 + tan2 (i))1/2 and choose the angles (i)) so that tan (i) is a power of 2; need two shift-adds Pseudorotation R (i+1) y (i) E(i) (i) O May 2015 Fig. 22.1 A pseudorotation step in CORDIC R (i) x (i+1) x (i) x Computer Arithmetic, Function Evaluation Slide 37 Pseudorotating a Vector (x (i), y (i)) by the Angle (i) Pseudorotation: Whereas a real rotation does not change the length R(i) of the vector, a pseudorotation step increases its length to: x(i+1) = x(i) y(i) tan (i) y(i+1) = y(i) + x(i) tan (i) z(i+1) = z(i) (i) R(i+1) = R(i) / cos (i) = R(i) (1 + tan2 (i))1/2 E(i+1) y E(i+1)

y (i+1) Rotation Pseudorotation R (i+1) y (i) E(i) (i) O May 2015 Fig. 22.1 A pseudorotation step in CORDIC R (i) x (i+1) x (i) x Computer Arithmetic, Function Evaluation Slide 38 A Sequence of Rotations or Pseudorotations x(m) = x cos((i)) y sin((i)) y(m) = y cos((i)) + x sin((i)) z(m) = z ((i)) After m real rotations by (1), (2) , . . . , (m) , given x(0) = x, y(0) = y, and z(0) = z x(m) = K(x cos((i)) y sin((i))) y(m) = K(y cos((i)) + x sin((i))) z(m) = z ((i)) After m pseudorotations by (1), (2) , . . . , (m) , given x(0) = x, y(0) = y, and z(0) = z where K = (1 + tan2 (i))1/2 is a constant if angles of rotation are always the same, differing

only in sign or direction Question: Can we find a set of angles so that any angle can be synthesized from all of them with appropriate signs? May 2015 Computer Arithmetic, Function Evaluation (3) (2) (1) Slide 39 22.2 Basic CORDIC Iterations x(i+1) = x(i) di y(i) 2i y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di tan 1 2i CORDIC iteration: In step i, we pseudorotate by an angle whose tangent is di 2i (the angle e(i) is fixed, only direction di is to be picked) = z(i) di e(i) (i) (i) e in degrees e in radians i (approximate) (precise) 0 45.0 0.785 398 163 1 26.6 0.463 647 609 2 14.0 0.244 978 663 3 7.1 0.124 354 994 4

3.6 0.062 418 810 5 1.8 0.031 239 833 6 0.9 0.015 623 728 7 0.4 0.007 812 341 8 0.2 0.003 906 230 9 0.1 0.001 953 123 May 2015 Table 22.1 Value of the function e(i) = tan 1 2i, in degrees and radians, for 0 i 9 Example: 30 angle 30.0 45.0 26.6 + 14.0 7.1 + 3.6 + 1.8 0.9 + 0.4 0.2 + 0.1 = 30.1 Computer Arithmetic, Function Evaluation Slide 40 Choosing the Angles to Force z to Zero x(i+1) = x(i) di y(i) 2i x(0),y (0) y y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di tan 1 2i x(2),y (2) 30 45

14 x(10) = z di e +26.6 x(3),y (3) i z(i) di e(i) = z(i+1) x(1),y (1) +30.0 Fig. 22.2 The first three 0 +30.0 45.0 = 15.0 of 10 pseudorotations 1 15.0 + 26.6 = +11.6 leading from (x(0), y(0)) to 2 +11.6 14.0 = 2.4 (10) (x , 0) in rotating by +30. 3 2.4 + 7.1 = +4.7 4 +4.7 3.6 = +1.1 5 +1.1 1.8 = 0.7 Table 22.2 Choosing the 6 0.7 +

0.9 = +0.2 signs of the rotation angles 7 +0.2 0.4 = 0.2 in order to force z to 0 8 0.2 + 0.2 = +0.0 9 +0.0 0.1 = 0.1 May 2015 Computer Arithmetic, Function Evaluation Slide 41 (i) (i) x Why Any Angle Can Be Formed from Our List Analogy: Paying a certain amount while using all currency denominations (in positive or negative direction) exactly once; red values are fictitious. \$20 \$10 \$5 \$3 \$2 \$1 \$.50 \$.25 \$.20 \$.10 \$.05 \$.03 \$.02 \$.01 Example: Pay \$12.50 \$20 \$10 + \$5 \$3 + \$2 \$1 \$.50 + \$.25 \$.20 \$.10 + \$.05 + \$.03 \$.02 \$.01 Convergence is possible as long as each denomination is no greater than the sum of all denominations that follow it. Domain of convergence: \$42.16 to +\$42.16 We can guarantee convergence with actual denominations if we allow multiple steps at some values: \$20 \$10 \$5 \$2 \$2 \$1 \$.50 \$.25 \$.10 \$.10 \$.05 \$.01 \$.01 \$.01 \$.01 Example: Pay \$12.50 \$20 \$10 + \$5 \$2 \$2 + \$1 + \$.50+\$.25\$.10\$.10\$.05+\$.01\$.01+ \$.01\$.01 We will see later that in hyperbolic CORDIC, convergence is guaranteed only if certain angles are used twice.

May 2015 Computer Arithmetic, Function Evaluation Slide 42 Angle Recoding The selection of angles during pseudorotations can be viewed as recoding the angle in a specific number system For example, an angle of 30 is recoded as the following digit string, with each digit being 1 or 1: 45.0 1 26.6 1 14.0 1 7.1 1 3.6 1 1.8 1 0.9 1 0.4 1 0.2 1 The money-exchange analogy also lends itself to this recoding view For example, a payment of \$12.50 is recoded as: \$20 \$10 \$5 \$3 \$2 \$1 \$.50 \$.25 \$.20 \$.10 \$.05 \$.03 \$.02 \$.01 1 1 May 2015 1 1

1 1 1 1 1 1 1 Computer Arithmetic, Function Evaluation 1 1 1 Slide 43 0.1 1 Using CORDIC in Rotation Mode x (i+1) = x di y 2 (i) (i) y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di tan 1 2i = z(i) di e(i) x y(m) z(m) (m) i Make z converge to 0 by choosing

di = sign(z(i)) For k bits of precision in results, k CORDIC iterations are needed, because tan 1 2i 2I for large i 0 = K(x cos z y sin z) = K(y cos z + x sin z) =0 0 where K = 1.646 760 258 121 . . . Start with x = 1/K = 0.607 252 935 . . . and y = 0 to find cos z and sin z Convergence of z to 0 is possible because each of the angles in our list is more than half the previous one or, equivalently, each is less than the sum of all the angles that follow it Domain of convergence is 99.7 z 99.7, where 99.7 is the sum of all the angles in our list; the domain contains [/2, /2] radians May 2015 Computer Arithmetic, Function Evaluation Slide 44 Using CORDIC in Vectoring Mode x(i+1) = x(i) di y(i) 2i y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di tan 1 2i Make y converge to 0 by choosing di = sign(x(i)y(i)) x(m) = K(x2 + y2)1/2 y(m) = 0 z(m) = z + tan 1(y / x) 0 where K = 1.646 760 258 121 . . . = z(i) di e(i) For k bits of precision in results, k CORDIC iterations are needed, because tan 1 2i 2I for large i Start with

x = 1 and z = 0 to find tan 1 y Even though the computation above always converges, one can use the relationship tan 1(1/y ) = /2 tan 1y to limit the range of fixed-point numbers encountered Other trig functions: tan z obtained from sin z and cos z via division; inverse sine and cosine (sin 1 z and cos 1 z) discussed later May 2015 Computer Arithmetic, Function Evaluation Slide 45 22.3 CORDIC Hardware x(i+1) = x(i) di y(i) 2i x Shift Shift y z Lookup Table y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di tan 1 2i (i) (i) If very is = zhigh dspeed i e not needed (as in a calculator), a single adder and one shifter would suffice k table entries for k bits of precision Fig. 22.3 Hardware elements needed for the CORDIC method. May 2015

Computer Arithmetic, Function Evaluation Slide 46 22.4 Generalized CORDIC x(i+1) = x(i) di y(i) 2i y(i+1) = y(i) + di x(i) 2i y z(i+1) = z(i) di e(i) =1 =0 B Circular rotations (basic CORDIC) e(i) = tan 1 2i Linear rotations e(i) = 2i A F E W V O x U C = 1 Hyperbolic rotations e(i) = tanh 1 2i D =1 =0 = 1 Fig. 22.4 Circular, linear, and hyperbolic CORDIC. May 2015 Computer Arithmetic, Function Evaluation

Slide 47 22.5 Using the CORDIC Method x(i+1) = x(i) di y(i) 2i y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) 0, d1}i e(i) {1, di {1, 1} K = 1.646 760 258 121 ... 1/K = .607 252 935 009 ... K' = .828 159 360 960 2 ... 1/K' = 1.207 497 067 763 ... Fig. 22.5 Summary of generalized CORDIC algorithms. May 2015 Mode =1 Circular e (i) = tan12 i =0 Linear e (i)= 2 i = 1 Hyperbolic Rotation: d i = sign(z (i) ), z (i) 0 x y z C O R D I C K(x cos z y sin z) K(y cos z + x sin z)

0 y z C O R D I C K x 2 + y2 0 z + tan1(y/x) For cos & sin, set x = 1/K, y = 0 tan z = sin z / cos z sin1w = tan1[w / 1 w2 ] x x x y + xz 0 y z y z C O R D I C C O R D I C

x 0 z + y/x For multiplication, set y = 0 For division, set z = 0 x x y z C O R D I C K' (x cosh z y sinh z) K' (y cosh z + x sinh z) 0 tanh z = sinh z / cosh z exp(z) = sinh z + cosh z w t = exp(t ln w) Note x For tan1, set x = 1, z = 0 cos 1w = tan1[ 1 w 2 / w] For cosh & sinh, set x = 1/K', y = 0 e (i) = tanh12i Vectoring: d i = sign(x(i) y(i) ), y (i) 0 y z C O

R D I C K' x2 y 2 0 z + tanh1(y/x) For tanh1, set x = 1, z = 0 ln w = 2 tanh1 |(w 1)/(w + 1)| w = (w + 1/4)2 (w 1/4)2 cosh1 w = ln(w + 1 w 2 ) sinh1 w = ln(w + 1 + w2 ) In executing the iterations for = 1, steps 4, 13, 40, 121, . . . , j , 3j + 1, . . . must be repeated. These repetitions are incorporated in the constant K' below. Computer Arithmetic, Function Evaluation Slide 48 CORDIC Speedup Methods x(i+1) = x(i) di y(i) 2i y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di e(i) x(k) = x(k/2) y(k/2) z(k/2) y(k) = y(i) + x(k/2) z(k/2) z(k) = z(k/2) z(k/2) di {2, 1, 1, 2} or {2, 1, 0, 1, 2} May 2015 Skipping some rotations Must keep track of expansion via the recurrence: (K(i+1))2 = (K(i))2 (1 22i) This additional work makes variable-factor CORDIC less cost-effective than constant-factor CORDIC Early termination Do the first k/2 iterations as usual, then combine the remaining k/2 into a single multiplicative step: For very small z, we have tan1 z z tan z Expansion factor not an issue because contribution of the ignored terms is provably less than ulp High-radix CORDIC The hardware for the radix-4 version of CORDIC is quite similar to Fig. 22.3 Computer Arithmetic, Function Evaluation

Slide 49 22.6 An Algebraic Formulation Because cos z + j sin z = e jz where j = 1 cos z and sin z can be computed via evaluating the complex exponential function e jz This leads to an alternate derivation of CORDIC iterations Details in the text May 2015 Computer Arithmetic, Function Evaluation Slide 50 23 Variations in Function Evaluation Chapter Goals Learning alternate computation methods (convergence and otherwise) for some functions computable through CORDIC Chapter Highlights Reasons for needing alternate methods: Achieve higher performance or precision Allow speed/cost tradeoffs Optimizations, fit to diverse technologies May 2015 Computer Arithmetic, Function Evaluation Slide 51 Variations in Function Evaluation: Topics Topics in This Chapter 23.1 Normalization and Range Reduction 23.2 Computing Logarithms 23.3 Exponentiation 23.4 Division and Square-Rooting, Again 23.5 Use of Approximating Functions 23.6 Merged Arithmetic May 2015

Computer Arithmetic, Function Evaluation Slide 52 23.1 Normalization and Range Reduction u (i+1) = f(u (i), v (i)) v (i+1) = g(u (i), v (i)) Constant Desired function u (i+1) = f(u (i), v (i), w (i)) v (i+1) = g(u (i), v (i), w (i)) w (i+1) = h(u (i), v (i), w (i)) Guide the iteration such that one of the values converges to a constant (usually 0 or 1); this is known as normalization The other value then converges to the desired function Additive normalization: Normalize u via addition of terms to it Multiplicative normalization: Normalize u via multiplication of terms Additive normalization is more desirable, unless the multiplicative terms are of the form 1 2a (shift-add) or multiplication leads to much faster convergence compared with addition May 2015 Computer Arithmetic, Function Evaluation Slide 53 Convergence Methods You Already Know x (i+1) = x di y 2 (i) (i) y(i+1) = y(i) + di x(i) 2i z(i+1) = z(i) di e(i) i CORDIC Example of additive normalization Force y or z to 0 by adding terms to it

Force d to 1 by multiplying terms with it d (i+1) = d (i) (2 d (i)) Set d (0) = d; iterate until d (m) 1 z (i+1) = z (i) (2 d (i)) Set z (0) = z; obtain z/d = q z (m) Division by repeated multiplications Example of multiplicative normalization May 2015 Computer Arithmetic, Function Evaluation Slide 54 Range Reduction 2 3/2 /2 0 /2 3/2 2 CORDICs conv. domain 99.7 to 99.7 cos(z ) = cos z Adding to the argument flips the function sign cos(2j + z) = cos z Subtracting multiples of 2 from the argument does not

change the function value Must be careful: A slight error in the value of is amplified when a large multiple of 2 is added to, or subtracted from, the argument Example: Compute cos(1.125 247) Additive range reduction: see the CORDIC example above Multiplicative range reduction: applicable to the log function, e.g. May 2015 Computer Arithmetic, Function Evaluation Slide 55 23.2 Computing Logarithms di {1, 0, 1} x (i+1) = x (i) c (i) = x (i) (1 + di 2i) Force x (m) to 1 y (i+1) = y (i) ln c (i) = y (i) ln(1 + di 2i) y (m) converges to y + ln x 0 Read out from table Why does this multiplicative normalization method work? x (m) = x c (i) 1 c (i) 1/x y (m) = y ln c (i) = y ln (c (i)) = y ln(1/x) y + ln x Convergence domain: 1/(1 + 2i) x 1/(1 2i) or 0.21 x 3.45 Number of iterations: k, for k bits of precision; for large i, ln(1 2i) 2i Use directly for x [1, 2). For x = 2q s, we have: ln x = q ln 2 + ln s = 0.693 147 180 q + ln s May 2015 Computer Arithmetic, Function Evaluation Radix-4 version can be devised Slide 56

Computing Binary Logarithms via Squaring For x [1, 2), log2 x is a fractional number y = (. y1y2y3 . . . yl)two x = 2y = 2 (. y1y2y3 . . . yl)two l)two x 2 = 22y = 2 (y1. y2y3 . . . y y1 = 1 iff x 2 2 Once y1 has been determined, if y1 = 0, we are back at the original situation; otherwise, divide both sides of the equation above by 2 to get: x 2/2 = 2 (1 . y2y3 . . . yl)two Generalization to base b: x = b (. y1y2y3 . . . yl)two y1 = 1 iff x 2 b /2 = 2 (. y2y3 . . . yl)two value 2 iff this bit is 1 log2 x Fig. 23.1 Hardware elements needed for computing log2 x. May 2015 Computer Arithmetic, Function Evaluation Square r Initialized to x Radix Point Shift 0 1

Slide 57 23.3 Exponentiation Computing ex Read out from table x (i+1) = x (i) ln c (i) = x (i) ln(1 + di 2i) Force x (m) to 0 y (i+1) = y (i) c (i) = y (i) (1 + di 2i) y (m) converges to y ex 1 di {1, 0, 1} Why does this additive normalization method work? x (m) = x ln c (i) 0 ln c (i) x y (m) = y c (i) = y exp(ln c (i)) = y exp( ln c (i)) y ex Convergence domain: ln (1 2i) x ln (1 + 2i) or 1.24 x 1.56 Number of iterations: k, for k bits of precision; for large i, ln(1 2i) 2i Can eliminate half the iterations because ln(1 + ) = 2/2 + 3/3 . . . for 2 < ulp and we may write y (k) = y (k/2) (1 + x (k/2)) May 2015 Computer Arithmetic, Function Evaluation Radix-4 version can be devised Slide 58 General Exponentiation, or Computing xy x y = (e ln x) y = e y ln x So, compute natural log, multiply, exponentiate Method is prone to inaccuracies When y is an integer, we can exponentiate by repeated multiplication

(need to consider only positive y; for negative y, compute reciprocal) In particular, when y is a constant, the methods used are reminiscent of multiplication by constants (Section 9.5) Example: x 25 = ((((x)2x)2)2)2x [4 squarings and 2 multiplications] Noting that 25 = (1 1 0 0 1)two, leads to a general procedure Computing x y, when y is an unsigned integer Initialize the partial result to 1 Scan the binary representation of y, starting at its MSB, and repeat If the current bit is 1, multiply the partial result by x If the current bit is 0, do not change the partial result May 2015 Computer Arithmetic, Function Evaluation Slide 59 Square the partial result before the next step (if any) Faster Exponentiation via Recoding Example: x 31 = ((((x)2x)2x)2x)2x [4 squarings and 4 multiplications] Note that 31 = (1 1 1 1 1)two = (1 0 0 0 01)two x 31 = (((((x)2)2)2)2)2 / x [5 squarings and 1 division] Computing x y, when y is an integer encoded in BSD format Initialize the partial result to 1 Scan the binary representation of y, starting at its MSB, and repeat If the current digit is 1, multiply the partial result by x If the current digit is 0, do not change the partial result If the current digit is 1, divide the partial result by x Square the partial result before the next step (if any) Radix-4 example: 31 = (1 1 1 1 1)two = (1 0 0 0 01)two = (2 0 1)four x 31 = (((x2)4)4 / x May 2015

[Can you formulate the general procedure?] Computer Arithmetic, Function Evaluation Slide 60 23.4 Division and Square-Rooting, Again Computing q = z / d s (i+1) = s (i) (i) d q (i+1) = q (i) + (i) In digit-recurrence division, (i) is the next quotient digit and the addition for q turns into concatenation; more generally, (i) can be any estimate for the difference between the partial quotient q (i) and the final quotient q Because s (i) becomes successively smaller as it converges to 0, scaled versions of the recurrences above are usually preferred. In the following, s (i) stands for s (i) r i and q (i) for q (i) r i : s (i+1) = rs (i) (i) d Set s (0) = z and keep s (i) bounded q (i+1) = rq (i) + (i) Set q (0) = 0 and find q * = q (m) r m In the scaled version, (i) is an estimate for r (r im q q (i)) = r (r i q * - q (i)), where q * = r m q represents the true quotient May 2015 Computer Arithmetic, Function Evaluation Slide 61 Square-Rooting via Multiplicative Normalization Idea: If z is multiplied by a sequence of values (c (i))2, chosen so that the product z (c (i))2 converges to 1, then z c (i) converges to z x (i+1) = x (i) (1 + di 2i)2 = x (i) (1 + 2di 2i + di2 22i) x (0) = z, x (m) 1 y (i+1) = y (i) (1 + di 2i) y (0) = z, y (m) z What remains is to devise a scheme for choosing di values in {1, 0, 1} di = 1 for x (i) < 1 = 1 2i

di = 1 for x (i) > 1 + = 1 + 2i To avoid the need for comparison with a different constant in each step, a scaled version of the first recurrence is used in which u (i) = 2i (x (i) 1): u (i+1) = 2(u (i) + 2di) + 2i+1(2di u (i) + di2) + 22i+1di2 u (i) u (0) = z 1, u (m) 0 y (i+1) = y (i) (1 + di 2i) y (0) = z, y (m) z Radix-4 version can be devised: Digit set [2, 2] or {1, , 0, , 1} May 2015 Computer Arithmetic, Function Evaluation Slide 62 Square-Rooting via Additive Normalization Idea: If a sequence of values c (i) can be obtained such that z (c (i))2 converges to 0, then c (i) converges to z x (i+1) = z (y (i+1))2 = z (y (i) + c (i))2 = x (i) + 2di y (i) 2i di2 22i x (0) = z, x (m) 0 y (i+1) = y (i) + c (i) = y (i) di 2i y (0) = 0, y (m) z What remains is to devise a scheme for choosing di values in {1, 0, 1} di = 1 for x (i) < = 2i di = 1 for x (i) > + = + 2i To avoid the need for comparison with a different constant in each step, a scaled version of the first recurrence may be used in which u (i) = 2i x (i): u (i+1) = 2(u (i) + 2di y (i) di2 2i ) u (0) = z , u (i) bounded y (i+1) = y (i) di 2i y (0) = 0, y (m) z Radix-4 version can be devised: Digit set [2, 2] or {1, , 0, , 1} May 2015 Computer Arithmetic, Function Evaluation

Slide 63 23.5 Use of Approximating Functions Convert the problem of evaluating the function f to that of function g approximating f, perhaps with a few pre- and postprocessing operations Approximating polynomials need only additions and multiplications Polynomial approximations can be derived from various schemes The Taylor-series expansion of f(x) about x = a is f(x) = j=0 to f (j) (a) (x a) j / j! The error due to omitting terms of degree > m is: f (m+1) (a + (x a)) (x a)m+1 / (m + 1)! 0<<1 Setting a = 0 yields the Maclaurin-series expansion f(x) = j=0 to f (j) (0) x j / j! and its corresponding error bound: f (x) xm+1 / (m + 1)! Computer Arithmetic, Function Evaluation0 < < 1 (m+1) May 2015 Efficiency in computation can be gained via Horners method and incremental evaluation Slide 64 Some Polynomial Approximations (Table 23.1) Func Polynomial approximation Conditions 1/x 1 + y + y2 + y3 + .

ex 1 + x /1! + x 2/2! + x 3/3! + . . . + x i /i ! + . ln x y y 2/2 y 3/3 y 4/4 . . . y i /i ln x 2 [z + z 3/3 + z 5/5 + . sin x x x 3/3! + x 5/5! x 7/7! + . . . + (1)i x2i+1/(2i + 1)! + . . . cos x 1 x 2/2! + x 4/4! x 6/6! + tan1 x x x 3/3 + x 5/5 x 7/7 + . . . + (1)i x2i+1/(2i + 1) + . . . sinh x x + x 3/3! + x 5/5! + x 7/7! + . . . + x2i+1/(2i + 1)! + cosh x 1 + x 2/2! + x 4/4! + x 6/6! + . . . + x2i /(2i )! + May 2015

. . + yi + . . . 0 < x < 2, y = 1 x . . . . + z 2i+1/(2i + 1) + . . . . 0 < x 2, y = 1 x . . . . . x > 0, z = x1x+1 ] + (1)i x2i /(2i )! + . . . 1 < x < 1 . . . . . . Computer Arithmetic, Function Evaluation Slide 65 tanh1x x + x 3/3 + x 5/5 + x 7/7 + . . . + x2i+1/(2i + 1) + . . . 1 < x < 1 Function Evaluation via Divide-and-Conquer Let x in [0, 4) be the (l + 2)-bit significand of a floating-point number or its shifted version. Divide x into two chunks x H and x L: x = x H + 2t x L 0 xH < 4

t + 2 bits 0 xL < 1 l t bits t bits x H in [0, 4) x L in [0, 1) The Taylor-series expansion of f(x) about x = x H is f(x) = j=0 to f (j) (x H) (2t x L) j / j! A linear approximation is obtained by taking only the first two terms f(x) f (x H) + 2t x L f (x H) If t is not too large, f and/or f (and other derivatives of f, if needed) can be evaluated via table lookup May 2015 Computer Arithmetic, Function Evaluation Slide 66 Approximation by the Ratio of Two Polynomials Example, yielding good results for many elementary functions f(x) a(5)x5 + a(4)x4 + a(3)x3 + a(2)x2 + a(1)x + a(0) b(5)x5 + b(4)x4 + b(3)x3 + b(2)x2 + b(1)x + b(0) Using Horners method, such a rational approximation needs 10 multiplications, 10 additions, and 1 division May 2015 Computer Arithmetic, Function Evaluation Slide 67 23.6 Merged Arithmetic Our methods thus far rely on word-level building-block operations such as addition, multiplication, shifting, . . . Sometimes, we can compute a function of interest directly without breaking it down into conventional operations Example: merged arithmetic for inner product computation

z = z (0) + x (1) y (1) + x (2) y (2) + x (3) y (3) z(0) x(1) y(1) x(2) y(2) Fig. 23.2 Merged-arithmetic computation of x(3) y(3) an inner product followed by accumulation. May 2015 Computer Arithmetic, Function Evaluation Slide 68 Example of Merged Arithmetic Implementation z(0) (0) (1) (1) (2) (2) (3) (3) x(1) y(1) z=z +x y +x y +x y Fig. 23.2 x(2) y(2) Fig. 23.3 Tabular representation x(3) y(3) of the dot matrix for inner-product computation and its reduction.

1 4 7 10 13 10 7 4 16 FAs 2 4 6 8 8 6 4 2 10 FAs + 1 HA 3 4 4 6 6 3 3 1 9 FAs 1 2 3 4 4 3 2 1 1 4 FAs + 1 HA 1 3 2 3 3 2 1 1 1 3 FAs + 2 HAs 2

2 2 2 2 1 1 1 1 5-bit CPA Slide 69 May 2015 Computer Arithmetic, Function Evaluation Example: Inner product computation Another Merged Arithmetic Example Approximation of reciprocal (1/x) and reciprocal square root (1/x) functions with 29-30 bits of precision, so that a long floating-point result can be obtained with just one iteration at the end [Pine02] Doubleprecision significand 9 bits 1. 24 bits u 19 bits v w f(x) = c + bv + av 2 16 bits c Table b Table Table 30 bits 20 bits

12 bits a Radix-4 Booth 2 adds Squarer Radix-4 Booth Partial products gen 2 mults 1 square Partial products gen Multioperand adder Comparable to a multiplier 30 bits, carry-save May 2015 Computer Arithmetic, Function Evaluation Slide 70 24 Arithmetic by Table Lookup Chapter Goals Learning table lookup techniques for flexible and dense VLSI realization of arithmetic functions Chapter Highlights We have used tables to simplify or speedup q digit selection, convergence methods, . . . Now come tables as primary computational mechanisms (as stars, not supporting cast) May 2015 Computer Arithmetic, Function Evaluation Slide 71

Arithmetic by Table Lookup: Topics Topics in This Chapter 24.1 Direct and Indirect Table Lookup 24.2 Binary-to-Unary Reduction 24.3 Tables in Bit-Serial Arithmetic 24.4 Interpolating Memory 24.5 Piecewise Lookup Tables 24.6 Multipartite Table Methods May 2015 Computer Arithmetic, Function Evaluation Slide 72 24.1 Direct and Indirect Table Lookup Operand(s) u bits 2u by v table Operand(s) Preu bits processing logic . . . Smaller table(s) ... Result(s) v bits Postprocessing logic Fig. 24.1 Direct table lookup versus table-lookup with pre- and post-processing. May 2015 Computer Arithmetic, Function Evaluation

Result(s) v bits Slide 73 Tables in Supporting and Primary Roles Tables are used in two ways: In supporting role, as in initial estimate for division As main computing mechanism Boundary between two uses is fuzzy Pure logic Hybrid solutions Pure tabular Previously, we started with the goal of designing logic circuits for particular arithmetic computations and ended up using tables to facilitate or speed up certain steps Here, we aim for a tabular implementation and end up using peripheral logic circuits to reduce the table size Some solutions can be derived starting at either endpoint May 2015 Computer Arithmetic, Function Evaluation Slide 74 24.2 Binary-to-Unary Reduction Strategy: Reduce the table size by using an auxiliary unary function to evaluate a desired binary function Example 1: Addition/subtraction in a logarithmic number system; i.e., finding Lz = log(x y), given Lx and Ly Solution: Let = Ly Lx Ly Lz = log(x y) = log(x (1 y/x)) Lx = Ly Lx Preprocess = log x + log(1 y/x) = Lx + log(1 log 1)

Lx + +() May 2015 Lx + () Computer Arithmetic, Function Evaluation + table table Postprocess Lz Slide 75 Another Example of Binary-to-Unary Reduction Example 2: Multiplication via squaring, xy = (x + y)2/4 (x y)2/4 Simplification and implementation details If x and y are k bits wide, x + y and x y are k + 1 bits wide, leading to two tables of size 2k+1 2k (total table size = 2k+3 k bits) (x y)/2 x y = (x y)/2 + /2 x+y Preprocess (two adds) x y Square table Square table {0, 1} is the LSB (x + y)2/4 (x y)2/4 = [ (x + y)/2 + /2] 2 [ (x y)/2 + /2] 2 = (x + y)/2 2 (x y)/2 2 + y Pre-process: Table lookup: Post-process:

Can be realized with one adder and one table Postprocess (add) xy Fig. 24.2 compute x + y and x y; drop their LSBs consult two squaring table(s) of size 2k (2k 1) carry-save adder, followed by carry-propagate adder (table size after simplification = 2k+1 (2k 1) 2k+2 k bits) May 2015 Computer Arithmetic, Function Evaluation Slide 76 24.3 Tables in Bit-Serial Arithmetic From Memory Specified by 16-bit addresses a b Specified by 2-bit address c 3 bits specify a flag and a value to conditionalize the operation (64 Kb) 8-bit opcode (f truth table) f opcode Carry bit

for addition 0 0 0 1 0 1 1 1 0 1 2 3 4 5 6 7 Mux g op8-bit opcode code (g truth table) Fig. 24.3 Bit-serial ALU with two tables implemented as multiplexers. May 2015 Flags f(a, b, c) 0 1 0 1 0 1 0 1 0 1 2 3 4 5 6

7 Specified by 2-bit address Mux Sum bit for addition g(a, b, c) Replaces a in memory To Memory Used in Connection Machine 2, an MPP introduced in 1987 Computer Arithmetic, Function Evaluation Slide 77 Other Table-Based Bit-Serial Arithmetic Examples See Section 4.3: Conversion from binary/decimal to RNS Evaluation of linear expressions (assume unsigned values) 0 xi z = ax + by = a xi 2i + b yi 2i b Address a yi a+b 4-entry table Sum Carry CS residual May 2015 Data

k / = (axi+ byi) 2 k / LSB k1 / CSA k1 k1 k1 / / / i zi xk1 x.k2 .. x2 x1 x0 Table Modular accumulator ... x mod m Fig. 24.4 Bit-serial evaluation of z = ax + by. Computer Arithmetic, Function Evaluation Slide 78 24.4 Interpolating Memory Linear interpolation: Computing f(x), x [xlo, xhi], from f(xlo) and f(xhi) f (x) = f (xlo) + x xlo xhi xlo

[ f (xhi) f (xlo) ] 4 adds, 1 divide, 1 multiply If the xlo and xhi endpoints are consecutive multiples of a power of 2, the division and two of the additions become trivial 1 Example: Evaluating log2 x for x [1, 2) f(xlo) = log2 1 = 0, f(xhi) = log2 2 = 1; thus: log2 x x 1 = Fractional part of x An improved linear interpolation formula log2 x May 2015 0 1 ln 2 ln(ln 2) 1 + (x 1) = 0.043 036 + x 2 ln 2 Computer Arithmetic, Function Evaluation 2 Slide 79 Hardware Linear Interpolation Scheme a Improved linear approximation a + b x b Initial linear approximation f(x) Add x x lo Multiply x

x x hi f(x) Fig. 24.5 Linear interpolation for computing f(x) and its hardware realization. May 2015 Computer Arithmetic, Function Evaluation Slide 80 Linear Interpolation with Four Subintervals 4-entry tables (i) (i) 2-bit address a + b x i=0 i=1 i=3 i=2 f(x) Multiply Add x Table 24.1 Approximating log2 x for x in [1, 2) using linear interpolation within 4 subintervals. May 2015 b (i) /4

4x x x min a(i) x x max Fig. 24.6 Linear interpolation for computing f(x) using 4 subintervals. f(x) i xlo xhi a (i) b (i)/4 Max error 0 1.00 1.25 0.004 487 0.321 928 0.004 487 1 1.25 1.50 0.324 924 0.263 034 0.002 996 2 1.50 1.75 0.587 105 0.222 392 0.002 142 3 1.75 2.00 0.808 962 0.192 645 0.001 607 Computer Arithmetic, Function Evaluation Slide 81 Tradeoffs in Cost, Speed, and Accuracy 1 10

Fig. 24.7 Maximum absolute error in computing log2 x as a function of number h of address bits for the tables with linear, quadratic (second-degree), and cubic (third-degree) interpolations [Noet89]. 2 Worst-case absolute error 10 3 10 4 10 Linear 5 10 Secondorder 6 10 Thirdorder 7 10 8

10 9 10 0 2 4 6 8 10 Number of bits (h) May 2015 Computer Arithmetic, Function Evaluation Slide 82 Interpolation with Nonuniform Intervals One way to use interpolation with nonuniform intervals to successively divide ranges and subranges of interest into 2 parts, with finer divisions used where the function exhibits greater curvature (nonlinearity) In this way, a number of leading bits can be used to decide which subrange is applicable The [0,1) range divided into 4 nonuniform intervals .0xx .10x .110 .111 0 May 2015 Computer Arithmetic, Function Evaluation 1

Slide 83 24.5 Piecewise Lookup Tables To compute a function of a short (single) IEEE floating-point number: Divide the 26-bit significand x (2 whole + 24 fractional bits) into 4 sections x = t + u + v + w = t + 26u + 212v + 218w 2 3 t u v w where u, v, w are 6-bit fractions in [0, 1) and t, with up to 8 bits, is in [0, 4) Taylor polynomial for f(x): f(x) = i=0 to f (i) (t + u) (2v + 3w)i / i ! Ignore terms smaller than 5 = 230 Use 4 additions to form these terms Read 5 values of f from tables Read this last term from a table f(x) f(t + u) + (/2) [f(t + u + v) f(t + u v)] Perform 6-operand addition + (2/2) [f(t + u + w) f(t + u w)] + 4 [(v 2/2) f (2)(t) Computer (v 3/6)Arithmetic, f (3)(t)] Function Evaluation May 2015 Slide 84 Modular Reduction, or Computing z mod p Divide the argument z into a (b g)-bit upper part (x) and a g-bit lower part (y), where x ends with g zeros z

b-bit input bg g d d Table 1 Table 2 vH d vL d Adder d+1 p (x + y) mod p = (x mod p + y mod p) mod p Adder d+1 Fig. 24.8a Two-table modular reduction scheme based on divide-and-conquer. May 2015 d Sign d-bit output Computer Arithmetic, Function Evaluation d + Mux z mod p

Slide 85 Another Two-Table Modular Reduction Scheme Divide the argument z into a (b h)-bit upper part (x) and an h-bit lower part (y), where x ends with h zeros b-bit input z bh h Table 1 d*h h Adder Explanation to be added Fig. 24.8b Modular reduction based on successive refinement. May 2015 d* d* d-bit output Computer Arithmetic, Function Evaluation v d* d Table 2 m* z mod p

Slide 86 24.6 Multipartite Table Methods k-bit input x kab b a x2 k-bit output y v(x0, x2) Subintervals v Table Add x1 f(x) Common Slope u Table x0 f(x) An interval u(x0, x1) (a) Hardware realization (b) Linear approximation Divide the domain of interest into 2a intervals, each of which is further divided into 2b smaller subintervals Fig. 24.9 The bipartite table method. The trick: Use linear interpolation

with an initial value determined for each subinterval and a common slope for each larger interval Total table size is 2a+b + 2kb, in lieu of 2k; width of table entries has been ignored in this comparison May 2015 x Computer Arithmetic, Function Evaluation Slide 87 Generalizing to Tripartite and Higher-Order Tables Two-part tables have been generalized to multipart (3-part, 4-part, . . . ) tables Source of figure: www.ens-lyon.fr/LIP/Arenaire/Ware/Multipartite/ May 2015 Computer Arithmetic, Function Evaluation Slide 88

## Recently Viewed Presentations

• Engaging Students Who are At Risk Through Addressing Gaps in Academic Skill and Accelerate Learning Practice 3 looks at understanding and addressing any academic weakness and skill gaps that system-involved youth may have and going beyond simple remediation by providing...
• Grant v Mitie Property (2009) M. cKinley v . SoS. Defence (2004) RBS v Goudie (2003) City of Edinburgh v Dickson (2009) Other cases on internet use. Employers should tell employees: Circumstances of monitoring. When. What. How. How used. Limit...
• Application of circular-arc graphs to the traffic light phasing problem Feasible green light assignment A feasible green light assignment consist of an assignment of an arc of the circle to each traffic stream so that only compatible streams are allowed...
• A penny has JUST been released from the hand of a mischievous youth leaning over the edge of the observation deck of the Empire State Building. The penny has fallen for long enough that it is at "terminal velocity" A...
• Everyone knows that Google uses an algorithm that ranks websites based on many factors including inbound links, page rank, relevant text, anchor tags, etc. However, no one except perhaps Larry Page and Sergey Brin, the founders of Google, know the...
• Chapter 9- Aims, Goals, Objectives Objections to behavioral objectives they only emphasize the teaching of facts at he expense of more complicated intellectual behaviors they place a sameness on the curriculum, assuming all must master identical material and do so...
• Conclusion: routing changes occur over a wide range of time scales, i.e., from minutes to days Routing Symmetry Sources of Routing Asymmetry Link cost metrics "hot potato" routing problem due to the competing providers. "cold potato" Routing Symmetry Analysis of...
• Learning English as a Foreign Language in Developing Regions Matthew Kam* | Divya Ramachandran* Jane Chiu† | Anand Raghavan* University of California at Berkeley, USA