LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code
Folks,
I'm pitting LW 5.0 on Mac Intel against ACL 8.0 in a final round of
pre-purchase checks.
I have some heavy-duty floating-point computations. The C function
looks like this:
#include <math.h>

double digamma(double x)
{
    double p;
    x = x + 6;
    p = 1 / (x * x);
    p = (((0.004166666666667 * p - 0.003968253986254) * p +
          0.008333333333333) * p - 0.083333333333333) * p;
    p = p + log(x) - 0.5 / x - 1 / (x - 1) - 1 / (x - 2) - 1 / (x - 3)
          - 1 / (x - 4) - 1 / (x - 5) - 1 / (x - 6);
    return p;
}
GCC generates about 200 assembler instructions for the code above.
This is the Lisp code I came up with, with help from Duane Rettig (Franz):
(defun digamma/setq-x (x)
  (declare (:explain :types :inlining :variables)
           (optimize (speed 3) (safety 0) (debug 0))
           ((double-float (0.0d0) *) x))
  (incf x 6d0)
  (let* ((p (/ 1d0 (* x x)))
         (logx (log x))
         (one 1d0))
    (declare (double-float one p))
    (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                              0.003968253986254d0))
                        0.008333333333333d0)
                     p)
                  0.083333333333333d0)
               p))
    (+ p (- logx
            (/ 0.5d0 x)
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))))))
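This compiles to a listing that ends at offset 361 [1]. The left-hand
column there is a byte offset, so that is about 360 bytes of code in
94 instructions. I tweaked the same function for LispWorks like this: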
(defun digamma/setq-x (x)
  (declare (optimize (speed 3) (safety 0) (debug 0) (float 0))
           (type (double-float (0.0d0) *) x))
  (incf x 6d0)
  (let* ((p (/ 1d0 (* x x)))
         (logx (log x))
         (one 1d0))
    (declare (type (double-float one p)))
    (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                              0.003968253986254d0))
                        0.008333333333333d0)
                     p)
                  0.083333333333333d0)
               p))
    (+ p (- logx
            (/ 0.5d0 x)
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))
            (/ one (setq x (- x one)))))))
I'm not sure whether the abbreviated form
(declare ((double-float (0.0d0) *) x)) is just as valid in LW.
Changing it to (declare (type (double-float (0.0d0) *) x)) grows the
listing from offset 541 to offset 781 (these are byte offsets, not
instruction counts), and I start seeing calls to
SYSTEM::RAW-FAST-BOX-DOUBLE in the assembler listing, which (I
suppose) is an improvement [2].
Does the disassembly comment below say that this is a generic version
of the function?
; #<Function SYSTEM::*%+$ANY-STUB 2010FFFA>
It seems that LW is faster (stats below) despite almost 2x the number
of instructions. Am I correct in this assessment? I'm no assembler
expert but would this be because of better pipelining, etc.?
Did I get the declare above right? Did I miss any optimization
annotations for LW?
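One thing that may be worth checking in the LW version: in (declare (type (double-float one p))) the whole list (double-float one p) sits in the type-specifier position, so no variables are actually named and ONE and P may end up undeclared. That would be consistent with the generic *%*$ANY-STUB calls in listing [2]. A minimal sketch with the standard (type double-float var ...) form follows. The name digamma/typed and the local names y and logy are hypothetical, the terms are written out as in the C version instead of using setq, and nothing here has been measured on LW 5.0.

```lisp
;; Sketch only, not the code that produced the listings in this post.
;; (float 0) is a LispWorks-specific optimize quality, hence the #+lispworks guard.
(defun digamma/typed (x)
  (declare (optimize (speed 3) (safety 0) (debug 0) #+lispworks (float 0))
           (type (double-float (0d0) *) x))
  (let* ((y (+ x 6d0))              ; x > 0, so y >= 6 and (log y) is real
         (p (/ 1d0 (* y y)))
         (logy (log y)))
    ;; Standard form: the type specifier first, then the variable names.
    (declare (type (double-float 6d0 *) y)
             (type double-float p logy))
    ;; Same Horner-style polynomial as the C version.
    (setq p (* (- (* (+ (* (- (* 0.004166666666667d0 p)
                              0.003968253986254d0)
                           p)
                        0.008333333333333d0)
                     p)
                  0.083333333333333d0)
               p))
    (- (+ p logy)
       (/ 0.5d0 y)
       (/ 1d0 (- y 1d0))
       (/ 1d0 (- y 2d0))
       (/ 1d0 (- y 3d0))
       (/ 1d0 (- y 4d0))
       (/ 1d0 (- y 5d0))
       (/ 1d0 (- y 6d0)))))
```

As a sanity check, (digamma/typed 1d0) should come out close to -0.5772, the negated Euler-Mascheroni constant.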
Last but not least, the only way I was able to arrive at the code
above was by using (declare (:explain :types :inlining :variables))
in ACL 8.0. This tells me what is being boxed and what is not, what
can be inlined, what is being held in registers, etc.
I understand there's no such help provided by LW 5.0. How do you
folks approach iterative optimization without any feedback from the
compiler?
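Absent an :explain-style report, one low-tech feedback loop is to scan the output of the standard DISASSEMBLE for the tell-tale names that appear in listing [2]. The helper below is a rough sketch of that idea. The name disassembly-mentions is hypothetical, and the marker strings are implementation-specific, so the ones in the usage note are simply copied from the LW listing.

```lisp
;; Hypothetical helper: report which marker strings occur in a function's
;; disassembly. DISASSEMBLE prints to *standard-output* in an
;; implementation-dependent format, so this is a text search, nothing more.
(defun disassembly-mentions (fn &rest needles)
  "Return the subset of NEEDLES (strings) found in the disassembly of FN."
  (let ((text (with-output-to-string (*standard-output*)
                (disassemble fn))))
    (loop for needle in needles
          when (search needle text :test #'char-equal)
            collect needle)))
```

For example, (disassembly-mentions 'digamma/setq-x "ANY-STUB" "RAW-FAST-BOX-DOUBLE") returns whichever of the two markers still shows up after a change to the declarations.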
The stats for LW 5.0 are:
(time (loop for i from 10000 to 1000000 do (digamma/setq-x i)))
Timing the evaluation of (LOOP FOR I FROM 10000 TO 1000000 DO
(DIGAMMA/SETQ-X 10000))
User time = 3.275
System time = 0.015
Elapsed time = 4.593
Allocation = 403946972 bytes
2 Page faults
Calls to %EVAL 14850061
The stats for ACL:
; cpu time (non-gc) 5,470 msec user, 10 msec system
; cpu time (gc) 520 msec user, 0 msec system
; cpu time (total) 5,990 msec user, 10 msec system
; real time 6,015 msec
; space allocation:
; 13,860,170 cons cells, 213,841,600 other bytes, 0 static bytes
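A caveat on both sets of numbers: the timed loop passes the integer i to a function that declares x as a double-float under (safety 0), and the "Calls to %EVAL" figure suggests that LW evaluated the loop form in the interpreter. A sketch of a driver that avoids both issues follows. The name bench-digamma and its parameters are hypothetical, and it needs to be compiled before timing.

```lisp
;; Hypothetical benchmark driver, not from the original measurements.
;; If typed at a listener, run (compile 'bench-digamma) before timing.
(defun bench-digamma (fn start end)
  "Call FN on every integer from START to END, converted to a double-float.
Returns the sum of the results so the calls cannot be optimized away."
  (declare (optimize (speed 3))
           (type function fn)
           (type fixnum start end))
  (let ((sum 0d0))
    (declare (type double-float sum))
    (loop for i of-type fixnum from start to end
          do (incf sum (the double-float (funcall fn (float i 1d0)))))
    sum))
```

The timing call then becomes (time (bench-digamma #'digamma/setq-x 10000 1000000)), which keeps the loop overhead and the argument conversion the same in both implementations.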
[1] Disassembly for ACL 8.0:
0: pushl ebp
1: movl ebp,esp
3: pushl esi
4: subl esp,$68
7: movsd xmm5,[eax-10]
12: movl ebx,[esi+18] ; 6.0d0
15: movsd xmm4,[ebx-10]
20: addsd xmm5,xmm4
24: movl ebx,[esi+22] ; 1.0d0
27: movsd xmm4,[ebx-10]
32: movsd xmm3,xmm5
36: mulsd xmm3,xmm5
40: divsd xmm4,xmm3
44: movsd [ebp-32],xmm4
49: movsd [ebp-40],xmm5
54: movsd xmm7,xmm5
58: xorl ecx,ecx
60: call *[edi+531] ; SYS::NEW-DOUBLE-FLOAT
66: movl ebx,[esi+26] ; LOG
69: movb cl,$1
71: call *edi
73: movsd xmm5,[eax-10]
78: movl ebx,[esi+22] ; 1.0d0
81: movsd xmm4,[ebx-10]
86: movl ebx,[esi+30] ; 0.004166666666667d0
89: movsd xmm3,[ebx-10]
94: movsd xmm2,[ebp-32]
99: mulsd xmm3,xmm2
103: movl ebx,[esi+34] ; 0.003968253986254d0
106: movsd [ebp-48],xmm5
111: movsd xmm5,[ebx-10]
116: movsd xmm6,xmm3
120: subsd xmm6,xmm5
124: movsd xmm5,xmm6
128: mulsd xmm5,xmm2
132: movl ebx,[esi+38] ; 0.008333333333333d0
135: movsd xmm3,[ebx-10]
140: addsd xmm5,xmm3
144: mulsd xmm5,xmm2
148: movl ebx,[esi+42] ; 0.083333333333333d0
151: movsd xmm3,[ebx-10]
156: subsd xmm5,xmm3
160: mulsd xmm2,xmm5
164: movl ebx,[esi+46] ; 0.5d0
167: movsd xmm5,[ebx-10]
172: movsd xmm3,[ebp-40]
177: divsd xmm5,xmm3
181: movsd [ebp-32],xmm2
186: movsd xmm2,[ebp-48]
191: movsd xmm6,xmm2
195: subsd xmm6,xmm5
199: movsd xmm5,xmm6
203: subsd xmm3,xmm4
207: movsd xmm2,xmm3
211: movsd xmm6,xmm4
215: divsd xmm6,xmm3
219: movsd xmm3,xmm6
223: subsd xmm5,xmm3
227: subsd xmm2,xmm4
231: movsd xmm3,xmm2
235: movsd xmm6,xmm4
239: divsd xmm6,xmm2
243: movsd xmm2,xmm6
247: subsd xmm5,xmm2
251: subsd xmm3,xmm4
255: movsd xmm2,xmm3
259: movsd xmm6,xmm4
263: divsd xmm6,xmm3
267: movsd xmm3,xmm6
271: subsd xmm5,xmm3
275: subsd xmm2,xmm4
279: movsd xmm3,xmm2
283: movsd xmm6,xmm4
287: divsd xmm6,xmm2
291: movsd xmm2,xmm6
295: subsd xmm5,xmm2
299: subsd xmm3,xmm4
303: movsd xmm2,xmm3
307: movsd xmm6,xmm4
311: divsd xmm6,xmm3
315: movsd xmm3,xmm6
319: subsd xmm5,xmm3
323: subsd xmm2,xmm4
327: divsd xmm4,xmm2
331: subsd xmm5,xmm4
335: movsd xmm4,[ebp-32]
340: addsd xmm5,xmm4
344: movsd xmm7,xmm5
348: xorl ecx,ecx
350: call *[edi+531] ; SYS::NEW-DOUBLE-FLOAT
356: clc
357: leave
358: movl esi,[ebp-4]
361: ret
[2] Disassembly for LW 5.0:
200A56EA:
0: 55 push ebp
1: 89E5 move ebp, esp
3: 83EC14 sub esp, 14
6: C7042486140000 move [esp], 1486
13: 50 push eax
14: 89C7 move edi, eax
16: DD4705 fldl [edi+5]
19: DD5DF8 fstpl [ebp-8]
22: 660F126DF8 movlpd xmm5, [ebp-8]
27: 660F123550560A20 movlpd xmm6, [200A5650] ; 6.0D0
35: F20F58F5 addsd xmm6, xmm5
39: 660F1375F8 movlpd [ebp-8], xmm6
44: 660F1275F8 movlpd xmm6, [ebp-8]
49: 660F126DF8 movlpd xmm5, [ebp-8]
54: F20F59EE mulsd xmm5, xmm6
58: 660F123560560A20 movlpd xmm6, [200A5660] ; 1.0D0
66: F20F5EF5 divsd xmm6, xmm5
70: 83EC0C sub esp, C
73: C70424860C0000 move [esp], C86
80: 660F13742404 movlpd [esp+4], xmm6
86: B501 moveb ch, 1
88: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
94: 8945E8 move [ebp-18], eax
97: 660F127DF8 movlpd xmm7, [ebp-8]
102: 660F137DF0 movlpd [ebp-10], xmm7
107: D9ED fldln2
109: DD45F0 fldl [ebp-10]
112: D9F1 fyl2x
114: DD5DF0 fstpl [ebp-10]
117: FF75E8 push [ebp-18]
120: B8FBEB0820 move eax, 2008EBFB ; 0.004166666666667D0
125: E84EA90600 call 201100BA ; #<Function SYSTEM::*%*$ANY-STUB 201100BA>
130: 50 push eax
131: B8EBEB0820 move eax, 2008EBEB ; 0.003968253986254D0
136: E8E3A80600 call 2011005A ; #<Function SYSTEM::*%-$ANY-STUB 2011005A>
141: 8B5DE8 move ebx, [ebp-18]
144: 0BD8 or ebx, eax
146: F6C303 testb bl, 3
149: 0F852A020000 jne L4
155: 8B5DE8 move ebx, [ebp-18]
158: C1FB02 sar ebx, 2
161: 89C7 move edi, eax
163: 0FAFFB imul edi, ebx
166: 0F8019020000 jo L4
L1: 172: 57 push edi
173: B8DBEB0820 move eax, 2008EBDB ; 0.008333333333333D0
178: E859A80600 call 2010FFFA ; #<Function SYSTEM::*%+$ANY-STUB 2010FFFA>
183: 89C3 move ebx, eax
185: 0B5DE8 or ebx, [ebp-18]
188: F6C303 testb bl, 3
191: 0F850F020000 jne L5
197: 89C3 move ebx, eax
199: C1FB02 sar ebx, 2
202: 89DF move edi, ebx
204: 0FAF7DE8 imul edi, [ebp-18]
208: 0F80FE010000 jo L5
L2: 214: 57 push edi
215: B8CBEB0820 move eax, 2008EBCB ; 0.083333333333333D0
220: E88FA80600 call 2011005A ; #<Function SYSTEM::*%-$ANY-STUB 2011005A>
225: 89C3 move ebx, eax
227: 0B5DE8 or ebx, [ebp-18]
230: F6C303 testb bl, 3
233: 0F85F5010000 jne L6
239: 89C3 move ebx, eax
241: C1FB02 sar ebx, 2
244: 89DF move edi, ebx
246: 0FAF7DE8 imul edi, [ebp-18]
250: 0F80E4010000 jo L6
L3: 256: 897DE8 move [ebp-18], edi
259: 660F1275F0 movlpd xmm6, [ebp-10]
264: 83EC0C sub esp, C
267: C70424860C0000 move [esp], C86
274: 660F13742404 movlpd [esp+4], xmm6
280: B501 moveb ch, 1
282: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
288: 50 push eax
289: 660F1275F8 movlpd xmm6, [ebp-8]
294: 660F122DC0EB0820 movlpd xmm5, [2008EBC0] ; 0.5D0
302: F20F5EEE divsd xmm5, xmm6
306: 83EC0C sub esp, C
309: C70424860C0000 move [esp], C86
316: 660F136C2404 movlpd [esp+4], xmm5
322: B501 moveb ch, 1
324: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
330: 50 push eax
331: 660F1275F8 movlpd xmm6, [ebp-8]
336: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
344: F20F5CF5 subsd xmm6, xmm5
348: 660F1375F8 movlpd [ebp-8], xmm6
353: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
361: F20F5EEE divsd xmm5, xmm6
365: 83EC0C sub esp, C
368: C70424860C0000 move [esp], C86
375: 660F136C2404 movlpd [esp+4], xmm5
381: B501 moveb ch, 1
383: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
389: 50 push eax
390: 660F1275F8 movlpd xmm6, [ebp-8]
395: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
403: F20F5CF5 subsd xmm6, xmm5
407: 660F1375F8 movlpd [ebp-8], xmm6
412: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
420: F20F5EEE divsd xmm5, xmm6
424: 83EC0C sub esp, C
427: C70424860C0000 move [esp], C86
434: 660F136C2404 movlpd [esp+4], xmm5
440: B501 moveb ch, 1
442: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
448: 50 push eax
449: 660F1275F8 movlpd xmm6, [ebp-8]
454: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
462: F20F5CF5 subsd xmm6, xmm5
466: 660F1375F8 movlpd [ebp-8], xmm6
471: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
479: F20F5EEE divsd xmm5, xmm6
483: 83EC0C sub esp, C
486: C70424860C0000 move [esp], C86
493: 660F136C2404 movlpd [esp+4], xmm5
499: B501 moveb ch, 1
501: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
507: 50 push eax
508: 660F1275F8 movlpd xmm6, [ebp-8]
513: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
521: F20F5CF5 subsd xmm6, xmm5
525: 660F1375F8 movlpd [ebp-8], xmm6
530: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
538: F20F5EEE divsd xmm5, xmm6
542: 83EC0C sub esp, C
545: C70424860C0000 move [esp], C86
552: 660F136C2404 movlpd [esp+4], xmm5
558: B501 moveb ch, 1
560: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
566: 50 push eax
567: 660F1275F8 movlpd xmm6, [ebp-8]
572: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
580: F20F5CF5 subsd xmm6, xmm5
584: 660F1375F8 movlpd [ebp-8], xmm6
589: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
597: F20F5EEE divsd xmm5, xmm6
601: 83EC0C sub esp, C
604: C70424860C0000 move [esp], C86
611: 660F136C2404 movlpd [esp+4], xmm5
617: B501 moveb ch, 1
619: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
625: 50 push eax
626: 660F1275F8 movlpd xmm6, [ebp-8]
631: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
639: F20F5CF5 subsd xmm6, xmm5
643: 660F122D60560A20 movlpd xmm5, [200A5660] ; 1.0D0
651: F20F5EEE divsd xmm5, xmm6
655: 83EC0C sub esp, C
658: C70424860C0000 move [esp], C86
665: 660F136C2404 movlpd [esp+4], xmm5
671: B501 moveb ch, 1
673: FF1584FC1220 call [2012FC84] ; SYSTEM::RAW-FAST-BOX-DOUBLE
679: B508 moveb ch, 8
681: FF15E4D11220 call [2012D1E4] ; -
687: 8B5DE8 move ebx, [ebp-18]
690: 0BD8 or ebx, eax
692: F6C303 testb bl, 3
695: 753B jne L7
697: 8B7DE8 move edi, [ebp-18]
700: 03F8 add edi, eax
702: 7034 jo L7
704: FD std
705: 89F8 move eax, edi
707: C9 leave
708: C3 ret
L4: 709: FF75E8 push [ebp-18]
712: E803A70600 call 201100BA ; #<Function SYSTEM::*%*$ANY-STUB 201100BA>
717: 89C7 move edi, eax
719: E9D8FDFFFF jmp L1
L5: 724: 50 push eax
725: 8B45E8 move eax, [ebp-18]
728: E8F3A60600 call 201100BA ; #<Function SYSTEM::*%*$ANY-STUB 201100BA>
733: 89C7 move edi, eax
735: E9F2FDFFFF jmp L2
L6: 740: 50 push eax
741: 8B45E8 move eax, [ebp-18]
744: E8E3A60600 call 201100BA ; #<Function SYSTEM::*%*$ANY-STUB 201100BA>
749: 89C7 move edi, eax
751: E90CFEFFFF jmp L3
L7: 756: 8B7DE8 move edi, [ebp-18]
759: 83EC04 sub esp, 4
762: 8B7500 move esi, [ebp]
765: 8975FC move [ebp-4], esi
768: 83ED04 sub ebp, 4
771: 8B7508 move esi, [ebp+8]
774: 897504 move [ebp+4], esi
777: 897D08 move [ebp+8], edi
780: C9 leave
781: E9FEA50600 jmp 2010FFFA ; #<Function SYSTEM::*%+$ANY-STUB 2010FFFA>
--
http://whylisp.com