RE: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

" It seems that LW is faster (stats below) despite almost 2x the number of
instructions."

Optimizing speed often requires increasing space requirements; algorithms
101. Unrolling loops for instance means lots more instructions, but you
don't have to keep checking the loop termination conditions, hence, faster
code.

1) decide what you are optimizing for

2) measure that directly, not indirectly.

i.e., if you are optimizing speed, time your code, don't count
instructions. 

And if you are really doing something heavy duty for which you need
assembler <cough> C, then use the FFI to get to C. You don't buy a
maserati to compete in a monster truck tournament. 

Bradford W. Miller
Cognitive/Computer Scientist/Engineer
Cognitive Systems Group Lead

Raytheon IDS
1847 West Main Road 
Portsmouth, RI 02871-1087

(401) 842-3578 (v)


-----Original Message-----
From: owner-lisp-hug@lispworks.com [mailto:owner-lisp-hug@lispworks.com]
On Behalf Of joel@whylisp.com
Sent: Thursday, October 19, 2006 6:53 AM
To: lisp-hug@lispworks.com
Subject: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code


Folks,

I'm pitting LW 5.0 on Mac Intel against ACL 8.0 in a final round of  
pre-purchase checks.

I have some heavy-duty floating-point computations. The C function  
looks like this:

double digamma(double x)
{
     double p;
     x=x+6;
     p=1/(x*x);
     p=(((0.004166666666667*p-0.003968253986254)*p+
	0.008333333333333)*p-0.083333333333333)*p;
     p=p+log(x)-0.5/x-1/(x-1)-1/(x-2)-1/(x-3)-1/(x-4)-1/(x-5)-1/(x-6);
     return p;
}

GCC generates about 200 assembler instructions for the code above.

This is the Lisp code I came up with using Duane Rettig's (Franz) help:

(defun digamma/setq-x (x)
   (declare (:explain :types :inlining :variables)
            (optimize (speed 3) (safety 0) (debug 0))
            ((double-float (0.0d0) *) x))
   (incf x 6d0)
   (let* ((p (/ 1d0 (* x x)))
          (logx (log x))
          (one 1d0))
     (declare (double-float one p))
     (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                 0.003968253986254d0))
                         0.008333333333333d0) p)
                   0.083333333333333d0) p))
     (+ p (- logx
             (/ 0.5d0 x)
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))))
     ))

This generates 361 instructions [1]. I tweaked the same function for  
LispWorks like this:

(defun digamma/setq-x (x)
   (declare (optimize (speed 3) (safety 0) (debug 0) (float 0))
            (type (double-float (0.0d0) *) x))
   (incf x 6d0)
   (let* ((p (/ 1d0 (* x x)))
          (logx (log x))
          (one 1d0))
     (declare (type (double-float one p)))
     (setq p (* (- (* (+ (* p (- (* p 0.004166666666667d0)
                                 0.003968253986254d0))
                         0.008333333333333d0) p)
                   0.083333333333333d0) p))
     (+ p (- logx
             (/ 0.5d0 x)
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))
             (/ one (setq x (- x one)))))
     ))

I'm not sure if (declare ((double-float (0.0d0) *) x)) is just as  
valid for LW. Making it (declare (type (double-float (0.0d0) *) x))  
increases instruction count to 781 from 541 and I start seeing calls  
to SYSTEM::RAW-FAST-BOX-DOUBLE in the assembler listing which (I  
suppose) is an improvement [2].

Does the disassembly comment below say that this is a generic version  
of the function?

; #<Function SYSTEM::*%+$ANY-STUB 2010FFFA>

It seems that LW is faster (stats below) despite almost 2x the number  
of instructions. Am I correct in this assessment? I'm no assembler  
expert but would this be because of better pipelining, etc.?

Did I get the declare above right? Did I miss any optimization  
annotations for LW?

Last but not least, the only way I was able to arrive at the code  
above is by using   (declare (:explain :types :inlining :variables))  
in ACL 8.0. This tells me what is being boxed and what is not, what  
can be inlined, what is being held in registers, etc.

I understand there's no such help provided by LW 5.0. How do you  
folks approach iterative optimization without any feedback from the  
compiler?

The stats for LW 5.0 are:

(time (loop for i from 10000 to 1000000 do (digamma/setq-x i)))
Timing the evaluation of (LOOP FOR I FROM 10000 TO 1000000 DO  
(DIGAMMA/SETQ-X 10000))

User time    =        3.275
System time  =        0.015
Elapsed time =        4.593
Allocation   = 403946972 bytes
2 Page faults
Calls to %EVAL    14850061

The stats for ACL:

; cpu time (non-gc) 5,470 msec user, 10 msec system
; cpu time (gc)     520 msec user, 0 msec system
; cpu time (total)  5,990 msec user, 10 msec system
; real time  6,015 msec
; space allocation:
;  13,860,170 cons cells, 213,841,600 other bytes, 0 static bytes

[1] ACL 8.0 disasssemble

    0: pushl	ebp
    1: movl	ebp,esp
    3: pushl	esi
    4: subl	esp,$68
    7: movsd	xmm5,[eax-10]
   12: movl	ebx,[esi+18]	;	6.0d0
   15: movsd	xmm4,[ebx-10]
   20: addsd	xmm5,xmm4
   24: movl	ebx,[esi+22]	;	1.0d0
   27: movsd	xmm4,[ebx-10]
   32: movsd	xmm3,xmm5
   36: mulsd	xmm3,xmm5
   40: divsd	xmm4,xmm3
   44: movsd	[ebp-32],xmm4
   49: movsd	[ebp-40],xmm5
   54: movsd	xmm7,xmm5
   58: xorl	ecx,ecx
   60: call	*[edi+531]	;	SYS::NEW-DOUBLE-FLOAT
   66: movl	ebx,[esi+26]	;	LOG
   69: movb	cl,$1
   71: call	*edi
   73: movsd	xmm5,[eax-10]
   78: movl	ebx,[esi+22]	;	1.0d0
   81: movsd	xmm4,[ebx-10]
   86: movl	ebx,[esi+30]	;	0.004166666666667d0
   89: movsd	xmm3,[ebx-10]
   94: movsd	xmm2,[ebp-32]
   99: mulsd	xmm3,xmm2
103: movl	ebx,[esi+34]	;	0.003968253986254d0
106: movsd	[ebp-48],xmm5
111: movsd	xmm5,[ebx-10]
116: movsd	xmm6,xmm3
120: subsd	xmm6,xmm5
124: movsd	xmm5,xmm6
128: mulsd	xmm5,xmm2
132: movl	ebx,[esi+38]	;	0.008333333333333d0
135: movsd	xmm3,[ebx-10]
140: addsd	xmm5,xmm3
144: mulsd	xmm5,xmm2
148: movl	ebx,[esi+42]	;	0.083333333333333d0
151: movsd	xmm3,[ebx-10]
156: subsd	xmm5,xmm3
160: mulsd	xmm2,xmm5
164: movl	ebx,[esi+46]	;	0.5d0
167: movsd	xmm5,[ebx-10]
172: movsd	xmm3,[ebp-40]
177: divsd	xmm5,xmm3
181: movsd	[ebp-32],xmm2
186: movsd	xmm2,[ebp-48]
191: movsd	xmm6,xmm2
195: subsd	xmm6,xmm5
199: movsd	xmm5,xmm6
203: subsd	xmm3,xmm4
207: movsd	xmm2,xmm3
211: movsd	xmm6,xmm4
215: divsd	xmm6,xmm3
219: movsd	xmm3,xmm6
223: subsd	xmm5,xmm3
227: subsd	xmm2,xmm4
231: movsd	xmm3,xmm2
235: movsd	xmm6,xmm4
239: divsd	xmm6,xmm2
243: movsd	xmm2,xmm6
247: subsd	xmm5,xmm2
251: subsd	xmm3,xmm4
255: movsd	xmm2,xmm3
259: movsd	xmm6,xmm4
263: divsd	xmm6,xmm3
267: movsd	xmm3,xmm6
271: subsd	xmm5,xmm3
275: subsd	xmm2,xmm4
279: movsd	xmm3,xmm2
283: movsd	xmm6,xmm4
287: divsd	xmm6,xmm2
291: movsd	xmm2,xmm6
295: subsd	xmm5,xmm2
299: subsd	xmm3,xmm4
303: movsd	xmm2,xmm3
307: movsd	xmm6,xmm4
311: divsd	xmm6,xmm3
315: movsd	xmm3,xmm6
319: subsd	xmm5,xmm3
323: subsd	xmm2,xmm4
327: divsd	xmm4,xmm2
331: subsd	xmm5,xmm4
335: movsd	xmm4,[ebp-32]
340: addsd	xmm5,xmm4
344: movsd	xmm7,xmm5
348: xorl	ecx,ecx
350: call	*[edi+531]	;	SYS::NEW-DOUBLE-FLOAT
356: clc
357: leave
358: movl	esi,[ebp-4]
361: ret

[2] Disassembly for LW 5.0:

200A56EA:
        0:      55               push  ebp
        1:      89E5             move  ebp, esp
        3:      83EC14           sub   esp, 14
        6:      C7042486140000   move  [esp], 1486
       13:      50               push  eax
       14:      89C7             move  edi, eax
       16:      DD4705           fldl  [edi+5]
       19:      DD5DF8           fstpl [ebp-8]
       22:      660F126DF8       movlpd xmm5, [ebp-8]
       27:      660F123550560A20 movlpd xmm6, [200A5650]  ; 6.0D0
       35:      F20F58F5         addsd xmm6, xmm5
       39:      660F1375F8       movlpd [ebp-8], xmm6
       44:      660F1275F8       movlpd xmm6, [ebp-8]
       49:      660F126DF8       movlpd xmm5, [ebp-8]
       54:      F20F59EE         mulsd xmm5, xmm6
       58:      660F123560560A20 movlpd xmm6, [200A5660]  ; 1.0D0
       66:      F20F5EF5         divsd xmm6, xmm5
       70:      83EC0C           sub   esp, C
       73:      C70424860C0000   move  [esp], C86
       80:      660F13742404     movlpd [esp+4], xmm6
       86:      B501             moveb ch, 1
       88:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
       94:      8945E8           move  [ebp-18], eax
       97:      660F127DF8       movlpd xmm7, [ebp-8]
      102:      660F137DF0       movlpd [ebp-10], xmm7
      107:      D9ED             fldln2
      109:      DD45F0           fldl  [ebp-10]
      112:      D9F1             fyl2x
      114:      DD5DF0           fstpl [ebp-10]
      117:      FF75E8           push  [ebp-18]
      120:      B8FBEB0820       move  eax, 2008EBFB    ;  
0.004166666666667D0
      125:      E84EA90600       call  201100BA         ; #<Function  
SYSTEM::*%*$ANY-STUB 201100BA>
      130:      50               push  eax
      131:      B8EBEB0820       move  eax, 2008EBEB    ;  
0.003968253986254D0
      136:      E8E3A80600       call  2011005A         ; #<Function  
SYSTEM::*%-$ANY-STUB 2011005A>
      141:      8B5DE8           move  ebx, [ebp-18]
      144:      0BD8             or    ebx, eax
      146:      F6C303           testb bl, 3
      149:      0F852A020000     jne   L4
      155:      8B5DE8           move  ebx, [ebp-18]
      158:      C1FB02           sar   ebx, 2
      161:      89C7             move  edi, eax
      163:      0FAFFB           imul  edi, ebx
      166:      0F8019020000     jo    L4
L1:  172:      57               push  edi
      173:      B8DBEB0820       move  eax, 2008EBDB    ;  
0.008333333333333D0
      178:      E859A80600       call  2010FFFA         ; #<Function  
SYSTEM::*%+$ANY-STUB 2010FFFA>
      183:      89C3             move  ebx, eax
      185:      0B5DE8           or    ebx, [ebp-18]
      188:      F6C303           testb bl, 3
      191:      0F850F020000     jne   L5
      197:      89C3             move  ebx, eax
      199:      C1FB02           sar   ebx, 2
      202:      89DF             move  edi, ebx
      204:      0FAF7DE8         imul  edi, [ebp-18]
      208:      0F80FE010000     jo    L5
L2:  214:      57               push  edi
      215:      B8CBEB0820       move  eax, 2008EBCB    ;  
0.083333333333333D0
      220:      E88FA80600       call  2011005A         ; #<Function  
SYSTEM::*%-$ANY-STUB 2011005A>
      225:      89C3             move  ebx, eax
      227:      0B5DE8           or    ebx, [ebp-18]
      230:      F6C303           testb bl, 3
      233:      0F85F5010000     jne   L6
      239:      89C3             move  ebx, eax
      241:      C1FB02           sar   ebx, 2
      244:      89DF             move  edi, ebx
      246:      0FAF7DE8         imul  edi, [ebp-18]
      250:      0F80E4010000     jo    L6
L3:  256:      897DE8           move  [ebp-18], edi
      259:      660F1275F0       movlpd xmm6, [ebp-10]
      264:      83EC0C           sub   esp, C
      267:      C70424860C0000   move  [esp], C86
      274:      660F13742404     movlpd [esp+4], xmm6
      280:      B501             moveb ch, 1
      282:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      288:      50               push  eax
      289:      660F1275F8       movlpd xmm6, [ebp-8]
      294:      660F122DC0EB0820 movlpd xmm5, [2008EBC0]  ; 0.5D0
      302:      F20F5EEE         divsd xmm5, xmm6
      306:      83EC0C           sub   esp, C
      309:      C70424860C0000   move  [esp], C86
      316:      660F136C2404     movlpd [esp+4], xmm5
      322:      B501             moveb ch, 1
      324:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      330:      50               push  eax
      331:      660F1275F8       movlpd xmm6, [ebp-8]
      336:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      344:      F20F5CF5         subsd xmm6, xmm5
      348:      660F1375F8       movlpd [ebp-8], xmm6
      353:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      361:      F20F5EEE         divsd xmm5, xmm6
      365:      83EC0C           sub   esp, C
      368:      C70424860C0000   move  [esp], C86
      375:      660F136C2404     movlpd [esp+4], xmm5
      381:      B501             moveb ch, 1
      383:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      389:      50               push  eax
      390:      660F1275F8       movlpd xmm6, [ebp-8]
      395:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      403:      F20F5CF5         subsd xmm6, xmm5
      407:      660F1375F8       movlpd [ebp-8], xmm6
      412:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      420:      F20F5EEE         divsd xmm5, xmm6
      424:      83EC0C           sub   esp, C
      427:      C70424860C0000   move  [esp], C86
      434:      660F136C2404     movlpd [esp+4], xmm5
      440:      B501             moveb ch, 1
      442:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      448:      50               push  eax
      449:      660F1275F8       movlpd xmm6, [ebp-8]
      454:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      462:      F20F5CF5         subsd xmm6, xmm5
      466:      660F1375F8       movlpd [ebp-8], xmm6
      471:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      479:      F20F5EEE         divsd xmm5, xmm6
      483:      83EC0C           sub   esp, C
      486:      C70424860C0000   move  [esp], C86
      493:      660F136C2404     movlpd [esp+4], xmm5
      499:      B501             moveb ch, 1
      501:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      507:      50               push  eax
      508:      660F1275F8       movlpd xmm6, [ebp-8]
      513:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      521:      F20F5CF5         subsd xmm6, xmm5
      525:      660F1375F8       movlpd [ebp-8], xmm6
      530:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      538:      F20F5EEE         divsd xmm5, xmm6
      542:      83EC0C           sub   esp, C
      545:      C70424860C0000   move  [esp], C86
      552:      660F136C2404     movlpd [esp+4], xmm5
      558:      B501             moveb ch, 1
      560:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      566:      50               push  eax
      567:      660F1275F8       movlpd xmm6, [ebp-8]
      572:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      580:      F20F5CF5         subsd xmm6, xmm5
      584:      660F1375F8       movlpd [ebp-8], xmm6
      589:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      597:      F20F5EEE         divsd xmm5, xmm6
      601:      83EC0C           sub   esp, C
      604:      C70424860C0000   move  [esp], C86
      611:      660F136C2404     movlpd [esp+4], xmm5
      617:      B501             moveb ch, 1
      619:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      625:      50               push  eax
      626:      660F1275F8       movlpd xmm6, [ebp-8]
      631:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      639:      F20F5CF5         subsd xmm6, xmm5
      643:      660F122D60560A20 movlpd xmm5, [200A5660]  ; 1.0D0
      651:      F20F5EEE         divsd xmm5, xmm6
      655:      83EC0C           sub   esp, C
      658:      C70424860C0000   move  [esp], C86
      665:      660F136C2404     movlpd [esp+4], xmm5
      671:      B501             moveb ch, 1
      673:      FF1584FC1220     call  [2012FC84]       ; SYSTEM::RAW- 
FAST-BOX-DOUBLE
      679:      B508             moveb ch, 8
      681:      FF15E4D11220     call  [2012D1E4]       ; -
      687:      8B5DE8           move  ebx, [ebp-18]
      690:      0BD8             or    ebx, eax
      692:      F6C303           testb bl, 3
      695:      753B             jne   L7
      697:      8B7DE8           move  edi, [ebp-18]
      700:      03F8             add   edi, eax
      702:      7034             jo    L7
      704:      FD               std
      705:      89F8             move  eax, edi
      707:      C9               leave
      708:      C3               ret
L4:  709:      FF75E8           push  [ebp-18]
      712:      E803A70600       call  201100BA         ; #<Function  
SYSTEM::*%*$ANY-STUB 201100BA>
      717:      89C7             move  edi, eax
      719:      E9D8FDFFFF       jmp   L1
L5:  724:      50               push  eax
      725:      8B45E8           move  eax, [ebp-18]
      728:      E8F3A60600       call  201100BA         ; #<Function  
SYSTEM::*%*$ANY-STUB 201100BA>
      733:      89C7             move  edi, eax
      735:      E9F2FDFFFF       jmp   L2
L6:  740:      50               push  eax
      741:      8B45E8           move  eax, [ebp-18]
      744:      E8E3A60600       call  201100BA         ; #<Function  
SYSTEM::*%*$ANY-STUB 201100BA>
      749:      89C7             move  edi, eax
      751:      E90CFEFFFF       jmp   L3
L7:  756:      8B7DE8           move  edi, [ebp-18]
      759:      83EC04           sub   esp, 4
      762:      8B7500           move  esi, [ebp]
      765:      8975FC           move  [ebp-4], esi
      768:      83ED04           sub   ebp, 4
      771:      8B7508           move  esi, [ebp+8]
      774:      897504           move  [ebp+4], esi
      777:      897D08           move  [ebp+8], edi
      780:      C9               leave
      781:      E9FEA50600       jmp   2010FFFA         ; #<Function  
SYSTEM::*%+$ANY-STUB 2010FFFA>

--
http://whylisp.com
RE: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code

Re: LW 5.0 vs ACL 8.0: Optimizing heavy floating-point code