A novel quotient prediction for floating-point division

At present, floating-point operations are used as add-on functions in critical embedded systems,

such as physics, aerospace system, nuclear simulation, image and digital signal processing, automatic

control system and optimal control and financial, etc. However, floating-point division is slower than

floating-point multiplication. To solve this problem, many existing works try to reduce the required

number

of iterations, which exploit large Look Up Table (LUT) resource to achieve approximate mantissa of a

quotient. In this paper, we propose a novel prediction algorithm to achieve an optimal quotient by

predicting certain bits in a dividend and a divisor, which reduces the required LUT resource. Therefore,

the final quotient is achieved by accumulating all predicted quotients using our proposed prediction

algorithm. The experimental results show that only 3 to 5 iterations are required to obtain the final

quotient in a floating-point division computation. In addition, our proposed design takes up 0.84% to

3.28% (1732 LUTs to 6798 LUTs) and 5.04% to 10.08% (1916 (ALUT) to 3832 (ALUT)) when ported to

Xilinx Virtex-5 and Altera Stratix-III FPGAs, respectively. Furthermore, our proposed design allows

users to track remainders and to set customized thresholds of these remainders to be compatible with a

specific application.

Download

Trang 1

Trang 2

Trang 3

Trang 4

Trang 5

Trang 6

Trang 7

Trang 8

Trang 9

Trang 10

Tải về để xem bản đầy đủ

14 trang xuanhieu 9500

Download

Bạn đang xem 10 trang mẫu của tài liệu "A novel quotient prediction for floating-point division", để tải tài liệu gốc về máy hãy click vào nút Download ở trên

Tóm tắt nội dung tài liệu: A novel quotient prediction for floating-point division

esults are 
© 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh 
 A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION 43 
obtained under two cases: (1) left significant 5-bit of the dividend and the divisor ; (2) left 
significant 7-bit of the dividend and the divisor ;. Table 1 highlights the differences among 
frequency, the number of slices and LUTs. In addition, three formats i.e. single precision, double 
precision and double extended precision, are used in these implementations. 
 From Table 1, when the length of mantissa, exponent and LUT size increases, the occupied area 
becomes larger and the frequency decreases. However, the increasing degree of area is insignificant, 
because the proposed design occupies 139 to 591 slices (1732 to 7687 LUTs). 
 Table 2 shows the frequency, the required number of clock cycles for one iteration as well as the 
occupied slices and LUTs of our proposed designs on XC5VLX330 FPGA. The results are obtained with 
pairs of the dividend and the divisor using different remainders. For example, mantissa = 1, 
exponent = -5, -10, -15, - 20 and the pre-set remainder = 0.003125, 0.00097656, 0.000030518, 
0.0000009536. 
 Table 2: Latencies of 5-bit and 7-bit of the dividend and the divisor for prediction on XC5VLX330 
 Remainder Frequency No.Iterations Clock Area Area 
 Exponents (MHz) (average) cycles Slices LUTs 
 P5 P7 P5 P7 P5 P7 P5 P7 P5 P7 
 - 5 121 112 4 3 5 5 139 591 1729 7687 
 - 10 121 110 5 4 5 5 139 591 1729 7688 
 - 15 120 111 5 4 5 5 141 595 1735 7695 
 - 20 120 110 5 4 5 5 141 595 1736 7695 
 P5: 5-bit prediction; P7: 7-bit prediction 
 Figure 4: The number of iterations to reach different quotient (pairs of dividend and divisor are randomly selected) 
 (a) Q=0.161247; (b) Q= 5.377; (c) Q= 11.059639; (d) Q= 11.615; (e) Q= 13.94482421875; (f) Q= 26.18; (g) Q= 
 48.485; (h) Q= 82.05; (i) Q= 176.992; (m) Q= 185.852416; (n) Q= 189.255876608; (k) Q= 378.55. 
 © 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh 
44 A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION 
 Figure 4 shows the relationship between the different obtained quotient values and the number of 
iteration. In Figure 4 from (a) to (l), we randomly choose the pairs of the dividend and the divisor 
based on test vector sets as shown in Table 2. It is obvious that the computed quotient value is close to the 
required quotient value in the first iteration. It could reach the optimal condition after the second iteration 
and remains stable for the rest iterations. 
 From Figure 4 and Table 2, we can draw a conclusion that approximately 3 to 5 iterations on the 
average are needed to obtain the final quotient. A significant speedup of convergence is achieved in the 
first two iterations of the computation. The speed of this convergent slows down or remains stable from 
the third iteration or the fourth iteration onwards. For example, the results of 10 divided by 3 can be 
3.3333, 3.333333, or 3.3333333, which depends on the required precision. If the dividend is larger than 
the divisor, the speed of convergence to obtain the final quotient is faster. In the case that the pre-set 
reminder is small, the number of iterations to reach the stable state is large, which results in the longer 
computational time. 
 Table 3 shows the implementation results of our proposed algorithms on XC5VLX330 and 
EP3SE50F484C2 FPGAs, respectively. In this particular test, the dividend is 0.100001111.27, the divisor 
is 0.10010001.25, the required remainder mantissa is 1, and the required remainder‟s exponent is -5. The 
results show that the proposed design takes up 0.84% to 3.28% (1732 LUTs to 6798 LUTs) and 5.04% to 
10.08% (1916 (ALUT) to 3832 (ALUT)) on XC5VLX330 and EP3SE50F484C2 FPGAs, respectively. 
Table 3: The implementation results of the floating-point division on XC5VLX330 and EP3SE50F484C2 
 Single precision Double precision Double extended 
Platforms Virtex-5 Stratix III Virtex-5 Stratix III Virtex-5 Stratix III 
Frequency (MHz) 193 179 162 160 162 158 
Number of Slices 139 1916 301 2805 548 3832 
 (ALUT) (ALUT) (ALUT) 
 (0.07%) (5.04%) (0.15%) (7.38%) (0.26%) (10.8%) 
Number of LUTs 1732 1916 3728 2805 6798 3832 
 (ALUT) (ALUT) (ALUT) 
 (0.84%) (5.04%) (1.80%) (7.38%) (3.28%) (10.08%) 
i (iterations) 5 5 5 5 5 5 
Total time (ns) 155 196 182 230 217 263 
 Table 4 shows a comparison between our proposed algorithm and existing floating-point divisions 
for double precision. It is quite hard to make a fair comparison due to different algorithms and different 
platforms used in the existing works. Therefore, we focus on the comparison of number of iterations. The 
maximum numbers of iterations in [5] and [25] with the non-restoring algorithm and digit-current 
algorithm are 29 and 55. Compared to them, 13.8% and 54% reduction on maximum number of iterations 
has been achieved by our algorithm. The maximum number of iterations used in [7] in with Newton 
Raphson method is 31. Compare with this, our proposed algorithm reduces 19.4% number of iterations. 
The maximum number of iterations in [8] with SRT method is 40. Our proposed algorithm only requires 
25 iterations. In the worst case, the maximum number of iterations of our proposed algorithm is 20% 
larger than the one in [17] and the same in [6]. This also proves that our proposed algorithm is able to 
overcome shortcomings of the SRT method, which can efficiently reduce the number of iterations and 
computational latency. The proposed design takes up 139 to 548 slices on Xilinx Virtex-5 FPGA, which 
is only 50% to 74% of the designs in [5] and [25] and reduces number of iterations with [26] using CR 
algorithm. 
© 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh 
 A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION 45 
 Table 4: Latency comparisons between this work and previous works 
Works Algorith- Platform Area Total Cycle Latency 
 ms of Time (ns) (ns) 
 Iteratio
 -nns 
P.Echeverri‟a NRA Virtex-4 742 (Slices)** 24 ~ 29 3.6 ~ 2.9 85.7 ~ 
[5] 114.4 
M. Schulte [6] GA K7 (FPU) - 14 ~ 26 - - 
 GA GS1 (FPU) - 11 ~ 25 - - 
P. Soderquist [7] SRT-8/16 - - 8 ~ 15 7.14 57 ~ 107 
 NRM PA7200(FP - 14 ~ 19 7.14 107 
 U) 
 NRM PA8000(FP - 31 5 155 
 U) 
S. Oberman [8] SRT – 1 SPECfp92 - > 40 - - 
 SRT- SPECfp92 - 4 ~ 40 - - 
 2/4/8 
T. Lang [17] SRT - 10 CMOS std - 20 1 20 
 (90-nm) 
M. Baesler [25] DRA Virtex-5 55 ~ 261 1 ~ 55 153.7 ~ 6.8 153.7 ~ 374 
 (Slices)* 
Björn Liebig CR XC5VFX20 - 10 ~ 57 5 - 
[26] 0T-1 
This work PQC Virtex-5 139 (Slices)* 5 ~ 25 6.2 ~ 5.2 155 ~ 217 
 548 (Slices)** 
 Stratix III 1916 (ALUTs)* 5 ~ 25 7.8 ~ 6.2 196 ~ 263 
 3832 
 (ALUTs)** 
 -: not supported; *: Single precision; **: Double extended 
 NRA: Non-restoring Algorithm; GA: Goldschmidt Algorithm 
 SRT: The high-radix Sweeney, Robertson and Tocher Algorithm 
 NRM: Newton-Raphson Method; DRA: Digit-Recurrent Algorithm 
 PQC: Predicting the quotient‟s coefficient by LUT 
5.2 CONCLUSIONS 
 The floating-point division is the most complicated computations among four floating-point 
operations, such as addition, subtraction, multiplication and division. In order to reduce the required 
number of iterations, we focus on the acceleration to obtain the final quotient using the prediction of the 
quotient at each iteration. In our proposed algorithm, only 3 to 5 iterations are needed in order to reach 
final quotient in the floating-point division computation. The major advantage of our algorithm is that it is 
independent on different formats of floating-point number. Moreover, our proposed design utilizes FMA 
function, which has the advantages of obtaining the remainder easily, avoiding ”Normalize” step, and 
reducing effort in coding. The experimental results show that the proposed design only occupies 0.07% to 
0.26% (139 slices to 548 slices) and 5.04% to 10.08% (1916 (ALUT) to 3832 (ALUT)) on Virtex-5 and 
Stratix-III, respectively. Furthermore, our proposed design reduces the maximum number of iterations to 
obtain the final quotient, with 26% to 50% reduction of the occupied area compared to the state-of-the-art 
works. 
 © 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh 
46 A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION 
REFERENCES 
[1] I. Koren, Computer Arithmetic Algorithms, AK Peters Ltd, 2002. 
[2] J.-M. Muller, N. Brisebarre, F.-D. Dinechin, C.-P. Jeannerod, V. Lefevre, G. Melquiond, N. Revol, D. Stehle, 
and S. Torres, Handbook of FloatingPoint Arithmetic, Boston-Basel-Berlin, United States, 2009. 
[3] T. Lang and M. D.Ercegovac, Digital Arithmetic, Morgan Kaufmann Publishers, 2004. 
[4]  2013. 
[5] P. Echeverr´ia and M. L´ opez-Vallejo, Customizing floating-point units for fpgas: Area-performance-standard 
trade-offs, Microprocessors and Microsystems, Available: www.elsevier.com/locate/micpro, vol. 35, pp. 535–546, 
2011. 
[6] M. Schulte, D. Tan, and C. Lemonds, Floating-point division algorithms for an x86 microprocessor with a 
rectangular multiplier, in Computer Design, 2007. ICCD 2007. 25th International Conference on, October 2007, 
pp.304–310 
[7] P. Soderquist and M. Leeser, Division and square root: choosing the right implementation, IEEE Micro, vol. 17, 
no. 4, pp. 56 –66, July/August 1997. 
[8] S. Oberman and M. Flynn, Design issues in division and other floating-point operations, IEEE Transactions on 
Computers, vol. 46, no. 2, pp. 154 –161, February 1997. 
[9] S. Obermann and M. Flynn, Division algorithms and implementations, IEEE Transactions on Computers, vol. 
46, no. 8, pp. 833 –854, August 1997. 
[10] J.-A. Pineiro and J. Bruguera, High-speed double-precision computation of reciprocal, division, square root, 
and inverse square root, IEEE Transactions on Computers, vol. 51, no. 12, pp. 1377 – 1388, December 2002. 
[11] D. Wong and M. Flynn, Fast division using accurate quotient approximations to reduce the number of 
iterations, IEEE Transactions on Computers, vol. 36, pp. 850–863, 1992. 
[12] A. Amin and W. Shinwari, High-radix multiplier-dividers: Theory, design, and hardware, IEEE Transactions on 
Computers, vol. 59, no. 8, pp. 1009–1022, August 2010. 
[13] T.-J. Kwon and J. Draper, Floating-point division and square root using a taylor-series expansion algorithm, 
Microelectronics Journal, Available: www.elsevier.com/locate/mejo, vol. 40, pp. 1601–1605, 2009. 
[14] N. Brisebarre, J.-M. Muller, and S. K. Raina, Accelerating correctly rounded floating-point division when the 
divisor is known in advance, IEEE Transactions on Computers, vol. 53, no. 8, pp. 1069 – 1072, August 2004. 
[15] X. Wang and B. Nelson, Tradeoffs of designing floating-point division and square root on virtex fpgas, in 
Field-Programmable Custom Computing Machines, 2003. FCCM 2003. 11th Annual IEEE Symposium on, 2003, 
pp. 195–203. 
[16] B. P. H. Nikmehr and C. Limb, A novel implementation of radix-4 foating-point division/square-root using 
comparison multiples, Computers and Electrical Engineering, Available: www.elsevier.com/locate/compeleceng, 
vol. 36, pp. 850–863, 2010. 
[17] T. Lang and A. Nannarelli, A radix-10 digit-recurrence division unit: Algorithm and architecture, IEEE 
Transactions on Computers, vol. 56, no. 6, pp.727 –739, June 2007. 
[18] W. Liu and A. Nannarelli, Power efficient division and square root unit, IEEE Transactions on Computers, vol. 
61, no. 8, pp. 1059 –1070, August 2012. 
© 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh 
 A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION 47 
[19] K. Quinn., The newton raphson algorithm for function optimization. Department of Political Science and The 
Center for Statistics and the Social Sciences, pp. 364–384, October 2001. 
[20] A. Y. S. Lam and V. O. K. Li, Chemical Reaction Optimization: a tutorial, 2012. 
[21] T. Pham, Y. Wang, and R. Li, A variable-latency floating-point division in association with predicted quotient 
and fixed remainder, in Circuits and Systems (MWSCAS), 2013 IEEE 56th International Midwest Symposium on, 
2013, pp. 1240–1245. 
[22] A. Amaricai, M. Vladutiu, and O. Boncalo, Design issues and implementations for floating-point divide - add 
fused, Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 57, no. 4, pp. 295 –299, April 2010. 
[23] S. Boldo and J.-M. Muller, Exact and approximated error of the fma, IEEE Transactions on Computers, vol. 60, 
no. 2, pp. 157 –164, February 2011. 
[24] L. Huang, S. Ma, L. Shen, Z. Wang, and N. Xiao, Low-cost binary128 floating-point fma unit design with simd 
support, IEEE Transactions on Computers, vol. 61, no. 5, pp. 745 –751, May 2012. 
[25] M. Baesler, S. Voigt, and T. Teufel, Fpga implementations of radix-10 digit recurrence fixed-point and 
floating-point dividers, in Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, 
December 2011, pp. 13 –19. 
[26] Björn Liebig, Andreas Koch, Low-Latency Double-Precision Floating-PointDivision for FPGAs, 2014 
International Conference on Field-Programmable Technology (FPT), pp. 25 – 32. 
 MỘT CẢI TIẾN CHO SỰ ƯỚC LƯỢNG THƯƠNG SỐ CHO PHÉP TOÁN CHIA 
 SỐ DẤU CHẤM ĐỘNG 
Tóm tắt. Ngày nay, những phép tính số dấu chấm động được sử dụng như những hàm bổ trợ trong các hệ 
thống nhúng tư duy ứng dụng trong các lĩnh vực như vật lý, hệ thống hàng không vũ trụ, mô phỏng hạt 
nhân, xử lý tín hiệu hình ảnh và kỹ thuật số, hệ thống điều khiển tự động và điều khiển tối ưu và tài chính, 
v.v. Tuy nhiên, phép toán chia số dấu chấm động chậm hơn so với phép toán nhân số dấu chấm động. Để 
giải quyết vấn đề này, đã có nhiều nghiên cứu để giảm số lượng vòng lặp cần thiết để ra thương số bằng 
việc dùng tài nguyên bảng tra (LUT) để đạt tới xấp sỉ gần giá trị thương số. Trong bài báo này, chúng tôi 
đề xuất một thuật toán ước lượng cải tiến để đạt đến một thương số tối ưu bằng sự tiên đoán những bit 
nhất định trong số chia và số bị chia, giảm tài nguyên LUT cần thiết. Do đó, thương số cuối cùng đạt 
được bằng cách tích lũy tất cả các thương số tiên đoán của giải thuật tiên đoán của chúng tôi đề xuất. Kết 
quả thực nghiệm cho thấy chỉ cần từ 3 đến 5 vòng lặp để có được thương số cuối cùng trong phép chia số 
dấu chấm động. Thêm nữa, thiết kế được đề xuất chiếm 0.84% đến 3.28% (1732 LUTs đến 6798 LUTs) 
và 5.04% đến 10.08% (1916 (ALUT) đến 3832 (ALUT)) khi được cài đặt trên chip Xilinx Virtex-5 và 
Altera Stratix-III FPGAs tương ứng. Hơn nữa, thiết kế đề xuất cho phép những người sử dụng theo dõi 
phần dư để đặt ngưỡng tùy chỉnh cho số dư tương thích với những ứng dụng chuyên biệt của người sử 
dụng. 
Từ khóa. Số dấu chấm động, phép toán chia số dấu chấm động, đơn vị xử lý số dấu chấm động (FPU), 
FPGA, bảng tra (LUT), hệ thống nhúng. 
 Ngày nhận bài:08/08/2019 
 Ngày chấp nhận đăng:25/10/2019 
 © 2019 Trường Đại học Công nghiệp thành phố Hồ Chí Minh

File đính kèm:

a_novel_quotient_prediction_for_floating_point_division.pdf