Improving numeric 32bit performance
Hi all.

My application is spending 2/3 of its time running a Blowfish implementation I wrote. The speed disparity between CMUCL/SBCL and LispWorks is just terrible; here are two test runs:

    ;; SBCL 0.8.16
    (time (blowfish::test-speed :ntimes 1000000))
    Evaluation took:
      0.9 seconds of real time
      0.85687 seconds of user run time
      0.001 seconds of system run time
      0 page faults and
      20,472 bytes consed.

    XOS> (time (blowfish::test-speed :ntimes 10000))
    Timing the evaluation of (BLOWFISH::TEST-SPEED :NTIMES 10000)

    user time    =      3.666
    system time  =      0.002
    Elapsed time =   0:00:04
    Allocation   = 90623496 bytes standard / 22 bytes conses
    0 Page faults
    Calls to %EVAL    35

Note that the second run does 100 times fewer iterations and still takes about 4 times as long, i.e. 3.66/0.9 * 100 = roughly 400 times slower.

Now, all of that is because SBCL and CMUCL know how to do an unboxed 32-bit logxor operation. E.g. this declaration in CMUCL

    #+cmu
    (define-compiler-macro fast-32bit-xor (a b)
      `(ext:truly-the unsigned-byte-32 (kernel:32bit-logical-xor ,a ,b)))

helps tremendously. This is the best I could come up with in LW 4.3:

    #-(or cmu sbcl)
    (declaim (inline fast-32bit-xor))
    #-(or cmu sbcl)
    (defun fast-32bit-xor (a b)
      (declare (type unsigned-byte-32 a b)
               (optimize (speed 3) (safety 0) (space 0) (debug 0)))
      (the unsigned-byte-32 (logxor a b)))

This still doesn't do native 32-bit ops, and thus I'm reduced to operating on bignums (compare the roughly 90 MB allocated in the second run with the 20 KB consed under SBCL).

Does anyone know of any "under the hood" traps that can be used to optimize this? Is there a way, for example, to insert direct x86 assembly code into a Lisp function or some such?

Thanks,
--ap
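
For reference, one portable fallback that is sometimes used when 32-bit values do not fit in a fixnum is to carry each word as two 16-bit halves, so that every logxor stays within fixnum range. The sketch below only illustrates that idea. The names u16, split-word, join-word and xor-halves are hypothetical and not part of the Blowfish code above, and whether this actually avoids bignum allocation in LW 4.3 is an assumption that has not been measured here.

```lisp
;; Hypothetical helpers (illustration only, not from the original code).
(deftype u16 () '(unsigned-byte 16))

(defun split-word (w)
  "Split a 32-bit word W into its high and low 16-bit halves."
  (values (ldb (byte 16 16) w)
          (ldb (byte 16 0) w)))

(defun join-word (hi lo)
  "Recombine two 16-bit halves into one 32-bit word."
  (dpb hi (byte 16 16) lo))

(declaim (inline xor-halves))
(defun xor-halves (a-hi a-lo b-hi b-lo)
  "XOR two 32-bit words, each given as 16-bit halves.
Returns the high and low halves of the result as two values."
  (declare (type u16 a-hi a-lo b-hi b-lo)
           (optimize (speed 3)))
  ;; Each operand and each result fits in 16 bits, so no
  ;; intermediate value can exceed fixnum range.
  (values (logxor a-hi b-hi)
          (logxor a-lo b-lo)))
```

The benefit would only show up if the inner loop keeps its state in halves throughout; splitting and joining around every single xor would reintroduce the 32-bit intermediate values.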