Improving numeric 32bit performance


Hi all.

My application is spending 2/3 of its time running
a blowfish implementation I wrote.

The speed disparity between cmucl/sbcl and lispworks
is just terrible; here are two test runs:

  ;; SBCL 0.8.16
  (time (blowfish::test-speed :ntimes 1000000))

  Evaluation took:
    0.9 seconds of real time
    0.85687 seconds of user run time
    0.001 seconds of system run time
    0 page faults and
    20,472 bytes consed.



  ;; LispWorks 4.3
  XOS> (time (blowfish::test-speed :ntimes 10000))
  Timing the evaluation of (BLOWFISH::TEST-SPEED :NTIMES 10000)

  user time    =      3.666
  system time  =      0.002
  Elapsed time =   0:00:04
  Allocation   = 90623496 bytes standard / 22 bytes conses
  0 Page faults
  Calls to %EVAL    35

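i.e. (3.666 / 0.9) * 100 = roughly 400 times slower per call; the
factor of 100 is there because the lispworks run used 100 times
fewer iterations.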

Now, all of that is because sbcl and cmucl know how
to do an unboxed 32bit logxor operation.

e.g. this compiler macro in CMUCL

  #+cmu
  (define-compiler-macro fast-32bit-xor (a b)
    `(ext:truly-the unsigned-byte-32 (kernel:32bit-logical-xor ,a ,b)))

helps tremendously.
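
For context, here is roughly where such an operator sits in Blowfish:
the round function F combines four S-box lookups with two additions
mod 2^32 and one xor. This is a portable sketch only, not the code
that was timed above; the names xor32, add32 and blowfish-f, and
passing the S-boxes as four separate arguments, are assumptions made
for illustration.

```lisp
(deftype u32 () '(unsigned-byte 32))

(declaim (inline xor32 add32))

(defun xor32 (a b)
  ;; The xor of two 32-bit values always fits in 32 bits.
  (declare (type u32 a b))
  (logxor a b))

(defun add32 (a b)
  ;; Addition mod 2^32: keep only the low 32 bits of the sum.
  (declare (type u32 a b))
  (ldb (byte 32 0) (+ a b)))

(defun blowfish-f (x s0 s1 s2 s3)
  ;; F(x) = ((S0[a] + S1[b]) xor S2[c]) + S3[d], where a..d are the
  ;; four bytes of x, most significant first.
  (declare (type u32 x)
           (type (simple-array (unsigned-byte 32) (256)) s0 s1 s2 s3))
  (add32 (xor32 (add32 (aref s0 (ldb (byte 8 24) x))
                       (aref s1 (ldb (byte 8 16) x)))
                (aref s2 (ldb (byte 8 8) x)))
         (aref s3 (ldb (byte 8 0) x))))
```

Whether the intermediate values stay unboxed is entirely up to the
compiler, which is the whole problem discussed here.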

This is the best I could come up with for LW 4.3:

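  ;; unsigned-byte-32 is assumed to be defined elsewhere, e.g.
  ;; (deftype unsigned-byte-32 () '(unsigned-byte 32))
  #-(or cmu sbcl)
  (declaim (inline fast-32bit-xor))
  #-(or cmu sbcl)
  (defun fast-32bit-xor (a b)
    (declare (type unsigned-byte-32 a b)
             (optimize (speed 3) (safety 0) (space 0) (debug 0)))
    (the unsigned-byte-32
      (logxor a b)))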

This still doesn't do native 32 bit ops, and thus I'm reduced
to operating on bignums.
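
One portable way around the boxing is to carry each 32-bit word as
two 16-bit halves, so that every intermediate value is a fixnum. This
is only a sketch and has not been measured on LW; it assumes that
17-bit intermediates are fixnums, and all the names below are made up
for illustration.

```lisp
(deftype half-word () '(unsigned-byte 16))

(declaim (inline xor-halves add-halves))

(defun xor-halves (ahi alo bhi blo)
  ;; 32-bit xor done as two independent 16-bit xors.
  (declare (type half-word ahi alo bhi blo))
  (values (logxor ahi bhi) (logxor alo blo)))

(defun add-halves (ahi alo bhi blo)
  ;; 32-bit addition mod 2^32 on halves. The sums need at most 17
  ;; bits (assumed to fit in a fixnum); the carry out of the low
  ;; half is bit 16 of the low sum.
  (declare (type half-word ahi alo bhi blo))
  (let* ((lo (+ alo blo))
         (hi (+ ahi bhi (ash lo -16))))
    (values (logand hi #xFFFF) (logand lo #xFFFF))))

(defun split-word (w)
  ;; Split a 32-bit integer into (values high-half low-half).
  (values (ldb (byte 16 16) w) (ldb (byte 16 0) w)))

(defun join-halves (hi lo)
  ;; Rebuild the 32-bit integer; this may cons a bignum, so it is
  ;; best kept out of the inner loop.
  (logior (ash hi 16) lo))
```

The cost is twice as many operations and twice as many S-box lookups
(or S-boxes stored as halves), so it only pays off if the bignum
arithmetic really is the bottleneck.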

Does anyone know of any "under the hood" traps that can be
used to optimize this?   Is there a way, for example, to
insert direct x86 assembly code into a lisp function or some such?

Thanks,
			--ap

Re: Improving numeric 32bit performance

Alain.Picard@memetrics.com writes:

> Does anyone know of any "under the hood" traps that can be
> used to optimize this?   Is there a way, for example, to
> insert direct x86 assembly code into a lisp function or some such?

I suggest you ask Xanalys directly. They will know best.

Regards
Friedrich

Re: Improving numeric 32bit performance

Alain.Picard@memetrics.com writes:

> My application is spending 2/3 of its time running
> a blowfish implementation I wrote.

Familiar - my application is spending a lot of cpu doing blowfish
(my own implementation) as well :-(, but not as much as yours
(I only encrypt/decrypt small amounts of data).

> Does anyone know of any "under the hood" traps that can be
> used to optimize this?   Is there a way, for example, to
> insert direct x86 assembly code into a lisp function or some such?

Not that I know of. But please tell Xanalys that you need 32 bit
unboxed integers, they need our feedback to be able to set priorities
for later releases.

(In addition to 32 bit unboxed integers, I'd love to have some
 more built-in bignum arithmetic as well: A built-in and efficient
 implementation of modular exponentiation would make
 RSA /so/ much faster...)
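
For reference, the operation meant here is b^e mod m, computed
without ever forming b^e. A minimal plain-Lisp sketch using binary
square-and-multiply follows; the name expt-mod is made up for this
example and is not a claim about any implementation's built-in.

```lisp
(defun expt-mod (base exponent modulus)
  "Return BASE^EXPONENT mod MODULUS by binary square-and-multiply."
  (declare (type (integer 0) base exponent)
           (type (integer 1) modulus))
  (let ((result (mod 1 modulus))   ; 1, or 0 when MODULUS is 1
        (b (mod base modulus)))
    (loop while (plusp exponent)
          do (when (oddp exponent)
               ;; This bit of the exponent is set: fold B in.
               (setf result (mod (* result b) modulus)))
             ;; Square B and move on to the next bit.
             (setf b (mod (* b b) modulus)
                   exponent (ash exponent -1)))
    result))
```

For example, (expt-mod 4 13 497) returns 445. Every intermediate
product is reduced at once, so no value grows beyond the square of
the modulus, but each step still goes through generic bignum
multiply and mod.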
-- 
  (espen)

Re: Improving numeric 32bit performance

* Alain Picard wrote:
> Does anyone know of any "under the hood" traps that can be
> used to optimize this?   Is there a way, for example, to
> insert direct x86 assembly code into a lisp function or some such?

Foreign call a C implementation?

--tim

Re: Improving numeric 32bit performance

Tim Bradshaw writes:
 > * Alain Picard wrote:
 > > Does anyone know of any "under the hood" traps that can be
 > > used to optimize this?   Is there a way, for example, to
 > > insert direct x86 assembly code into a lisp function or some such?
 > 
 > Foreign call a C implementation?

Well, yes.  But that's basically admitting defeat.  From my
viewpoint, it just makes deployment more painful, adding yet
another dependency.  Still, that _is_ what I will do if I
have to; it's not like I'm not already including a gazillion
C libs... (GD, postgres, matlab, etc etc...)

Just splicing in a few lines of assembler would have been ideal.