Lisp Realtime Speedups
Hi All,
I have (not surprisingly) had several inquiries about how this speedup was achieved. So I'll answer here for the benefit of everyone.
First of all, this speedup was measured under LispWorks for Macintosh (LWM) 5.1.1 on an iMac running Tiger (10.4.11) with a 2 GHz Intel Core 2 Duo processor and 2 GB RAM. I have also now verified the speedup on systems as old as a PowerPC iMac, which runs about 3x slower, and on newer systems running Leopard. All of them show a substantial speedup.
By far the single most important speedup actually had little to do with the calling conventions and array transfer between Lisp and the external vDSP library. It turns out that that portion of the FFT never took very much time. We did speed that up by about 2x-3x, but the bulk of the roughly 670x speedup has to do with data organization in memory.
A truly general-purpose signal processor needs to be able to perform complex-valued convolutions for quite a variety of reasons. The most natural approach would be to use COMPLEX single-precision or double-precision values. That turns out to be a real speed killer, because type COMPLEX is not a basic type for LispWorks (and probably not for any other Lisp either). Complex values will not be held in registers, nor stored in unboxed arrays; all collections of type COMPLEX appear to use boxed elements. That is a huge time eater.
So, while I have never been much of a fan of the MATLAB approach of storing real and imaginary components in separate arrays, that is precisely what is needed here. Some external libraries may be comfortable with a packed (re, im) pair representation of complex values; that appears to be the case in some of the Intel libraries (IPP, MKL). But it so happens that the native vDSP library on OS X also wants to see FFT data in segregated form.
Hence, underneath, instead of using direct COMPLEX representations, arrays of them, and operations on them (e.g., (* c (conjugate c)) to get the power), it is better to keep real and imaginary components in separate vectors and arrays and to perform the complex arithmetic yourself when needed -- use (+ (* r r) (* i i)) to get the power, use (atan i r) instead of (phase c), and so on.
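To make the split representation concrete, here is a minimal sketch (my own illustration, not code from this post) of computing the power from separate real and imaginary vectors of unboxed single-floats:

```lisp
;; Sketch: power spectrum from split (re, im) vectors.
;; Element type SINGLE-FLOAT keeps the data unboxed in LispWorks
;; (and in most other serious implementations).
(defun power-spectrum (re im)
  (let* ((n   (length re))
         (pwr (make-array n :element-type 'single-float)))
    (declare (type (simple-array single-float (*)) re im pwr))
    (dotimes (k n pwr)
      (let ((r (aref re k))
            (i (aref im k)))
        ;; (+ (* r r) (* i i)) in place of (* c (conjugate c))
        (setf (aref pwr k) (+ (* r r) (* i i)))))))
```

Phases go the same way: (atan (aref im k) (aref re k)) in place of (phase c), never touching a boxed COMPLEX value.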
By making that simple change, we brought our data manipulation time down by a huge factor.
There is also a tremendous speedup from using Martin's suggestion about :LISP-SIMPLE-1D-ARRAY for passing unboxed Lisp vectors to the outside world. Martin didn't specifically indicate that vectors of floats (single and double precision) would be stored as safely unboxed linear arrays, but they are. You can see that for yourself by using HCL:FIND-OBJECT-SIZE and trying vectors of varying size and element type. This may be a non-portable expectation, but it is quite reasonable to expect any serious implementation to store simple vectors of floats in unboxed form -- all the more so when you allocate them in :STATIC for direct passing to the outside world.
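A hypothetical FLI binding along these lines might look as follows. The choice of vDSP_vadd and the exact FLI type syntax are my illustration from memory, not the author's code, and should be checked against the LispWorks FLI documentation:

```lisp
;; Sketch: passing unboxed float vectors straight to vDSP via
;; :LISP-SIMPLE-1D-ARRAY.  vDSP_vadd computes C[k] = A[k] + B[k].
(fli:define-foreign-function (vdsp-vadd "vDSP_vadd")
    ((a  (:lisp-simple-1d-array :float))  ; input vector A
     (ia :long)                           ; stride through A
     (b  (:lisp-simple-1d-array :float))  ; input vector B
     (ib :long)                           ; stride through B
     (c  (:lisp-simple-1d-array :float))  ; output vector C
     (ic :long)                           ; stride through C
     (n  :unsigned-long))                 ; element count
  :result-type :void)

;; The vectors should be unboxed and, ideally, statically allocated:
;;   (make-array 1024 :element-type 'single-float :allocation :static)
```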
Another thing that helped was to forgo generality. My old code was willing to take numeric vectors in any form and coerce them to single- or double-precision floats, as requested, before sending them to the outside libraries. By requiring correctly typed values ahead of time, and enforcing that in your own code, you save a tremendous amount of runtime. The old, more general way constructed a C-compatible dynamic FLI array, copied one element at a time from a general input Lisp vector, coercing each to the correct float type before storing it into the FLI array, and then reversed the whole procedure on return. That kills a huge amount of time.
By adhering to strictly typed data, you can use FLI:REPLACE-FOREIGN-ARRAY to copy entire vectors to and fro -- hence another huge speedup.
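A sketch of that whole-vector copy-in/copy-out pattern (the wrapper and its names are mine, invented for illustration; check the keyword arguments against the LispWorks documentation):

```lisp
;; Sketch: bulk copy between a typed Lisp vector and a foreign array,
;; with no per-element coercion in Lisp code.
(defun call-with-foreign-copy (lisp-vec fn)
  (let* ((n   (length lisp-vec))
         (buf (fli:allocate-foreign-object :type :double :nelems n)))
    (unwind-protect
        (progn
          ;; whole-vector copy into the foreign buffer
          (fli:replace-foreign-array buf lisp-vec :end1 n)
          (funcall fn buf n)
          ;; and whole-vector copy back out
          (fli:replace-foreign-array lisp-vec buf :end2 n))
      (fli:free-foreign-object buf))))
```

The point is that the copying is one library call in each direction, rather than a Lisp loop coercing element by element.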
The final speedup came from simple caching of temporary buffers. When I need a scratch buffer of 4K doubles, I allocate one if it isn't already cached, then hold on to it for possible future needs. I put a limit of 4K double complex on the largest buffer I'm willing to cache. Anything larger (much rarer in my code) is forced to allocate and GC its own scratch buffers. By far the most common sizes for my stuff are 1024 or 4096 elements.
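The caching scheme can be sketched like this (my own minimal version, with illustrative names; the author's code doubtless differs):

```lisp
;; Sketch: reuse scratch buffers up to a size cap; anything larger
;; is freshly allocated and left for the GC.
(defparameter *max-cached-elements* 4096)
(defvar *scratch-buffers* (make-hash-table))

(defun get-scratch (nelems)
  (if (> nelems *max-cached-elements*)
      ;; rare oversized request: fresh buffer, GC'd after use
      (make-array nelems :element-type 'double-float)
      ;; common sizes (1024, 4096): allocate once, then reuse forever
      (or (gethash nelems *scratch-buffers*)
          (setf (gethash nelems *scratch-buffers*)
                (make-array nelems :element-type 'double-float)))))
```

After the first call at a given size, requests for the common sizes allocate nothing at all.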
Together with the unboxed representation of complex-valued arrays, this buffer caching saves a tremendous amount of allocation. The result is that my routines, after the first call, see no page misses at all, and the amount of memory allocated drops to around 10-50 bytes per call, down from about 180-400 KB per call in the old code.
Take all of these factors together -- use segregated complex representations, use strictly typed arrays and vectors, use :STATIC array allocation, use FLI:REPLACE-FOREIGN-ARRAY, avoid memory allocation and boxing, and avoid page misses -- and together they produce a speedup over the old code of more than 670x. That is extremely impressive.
And so, now being able to perform 1024-element complex-valued FFTs at 100 kHz, there is really nothing standing in the way of some serious realtime audio processing with nearly arbitrary FIR filtering and convolutions. It is very impressive to me to see Lisp compete head-on with hand-crafted C and DSP assembly. At this point, I have no interest in reverting to such tedious practices when I can avoid it, and using Lisp directly is now a very real possibility.
I didn't set out to compete against C or DSP assembly code. But to my surprise, the code generated by LWM is very impressive, provided you take some responsibility for sensible, fast computing practices. Very nice!
Dr. David McClain
Chief Technical Officer
Refined Audiometrics Laboratory
4391 N. Camino Ferreo
Tucson, AZ 85750
phone: 1.520.390.3995
Re: Lisp Realtime Speedups
On Fri, 14 Nov 2008 18:55:32 -0000, David McClain
<dbm@refined-audiometrics.com> wrote:
Hi David,
This post never made much sense to me at the time, even though it seemed to talk about very interesting stuff, so I decided to hold onto it and come back to it later.
So, how did it work out in the end? Did you stick with Lisp for doing heavy DSP processing, or did you have to go back to the low level? Did you find any other cool ways to speed up your audio processing even more, or just have even more fun?
_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html
Re: Lisp Realtime Speedups
Actually, using Lisp as a C code generator seems to be one of the cool ways to do things. I came across a Lisp library a couple of days ago that generates GPU kernel source for CUDA; one would be able to do some massive DSP processing at teraflop scale this way. It still seems to be a problem, as of now, to perform general-purpose GPU processing from inside Lisp easily, but that might not be the case for long.
Thanks for your replies, David, they always stand out and give plenty of
food for thought.
On Tue, 14 Aug 2012 00:03:35 +0100, David McClain
<dbm@refined-audiometrics.com> wrote:
> BTW... I have lately had numerous cases where I either had to write C
> for signal processing, or some custom DSP language encoded in C (e.g.,
> Tensilica).
>
> In those cases I have taken to writing a simple cross compiler from high
> level abstract Lisp notation to the needed low level C code. Along the
> way, my Lisp compilers perform code sequence optimization to give the
> shortest possible sequence of C code to perform needed computations.
>
> It is really simple to do this in Lisp.
>
> ----------------- (from an earlier letter to Usha Millar of LW)
> --------------------------
>
> Here is an example from my daily routine, where I need to write some DSP
> code for an integer-based DSP which uses fractional arithmetic
> encodings. It does not have floating point, but each 24-bit integer can
> represent a number with some (possibly varying) number of bits that
> represent its fractional value.
>
> For example, Pi in Q.21 notation is 6588397 (21-bits of fractional
> value, 2 bits of integer portion, and 1-bit of sign, for a total of 24
> bits)
>
> So, as you could imagine, writing code for elaborate arithmetic routines
> in such a system can quickly become a nightmare of book-keeping -- who
> has how many bits, what happens when I multiply two of them together,
> how many bits of shifting are needed, and in which direction, if I need
> to add two dissimilar fractional values, etc.
>
> So Lisp to the rescue with a simple meta-compiler. I declare the vars
> I'm about to use, their names, and how many bits of fraction each has,
> then I can simply write a Lisp expression for the arithmetic, and Lisp
> keeps track of all the bit precisions for me, performs shifting as
> needed (or issues error messages if I mess up!), and then it generates
> C code for the DSP compiler.
>
> It was most direct to just use 3-address encoding in Lisp, and let the
> follow-on C compiler optimize away any unnecessary loads and stores.
>
> Example, from the computation of Log2(x) using a minimax polynomial
> approximation, takes a Q.23 value in, and produces a Q.18 value out.
> Here's a portion of the routine, shown in comments as a Horner's
> encoding of the nested multiply-adds, then in Lisp with my simple
> meta-compiler, and then the C results (F23_t is just an int24_t
> designated so the human reader knows it is a Q.nn fractional value.):
>
> (let ((a (make-frac-opnd
> :name "a"
> :nbits 23))
> (b2 (fconst 23 -0.08621210823155004))
> (b1 (fconst 23 0.49506024005351723))
> (b0 (fconst 23 -0.41378789176844955))
> (b (make-frac-opnd
> :name "b"
> :nbits 23))
> (tmp23 (make-frac-opnd
> :name "tmp"
> :nbits 23))
> (nz18 (make-frac-opnd
> :name "nz"
> :nbits 18))
> (c18 (make-frac-opnd
> :name "c"
> :nbits 18)))
> ;; example from computation of Log2(x) with MiniMax polynomial
> ;; approximation:
> ;; c = (a * ((a * -0.08621210823155004) + 0.49506024005351723))
> ;;     + -0.41378789176844955 + nz
> ;; for a = Q23
> ;; nz = Q18
> ;; c = Q18
> (femit #>.end
> // a is Q.23
> // nz is Q.18
> .end)
> (fdecls tmp23 c18)
> (fadd
> (fadj
> (fadd
> (fmul
> a
> (fadd
> (fmul a b2 tmp23)
> b1 tmp23)
> tmp23)
> b0 tmp23)
> c18)
> nz18 c18))
>
> ==>
>
> // a is Q.23
> // nz is Q.18
>
> F23_t tmp; // tmp is Q.23
> F23_t c; // c is Q.18
> tmp = ((F23_t)((int64_t)a * (int64_t)((F23_t)(-723200))) >> 23);
> tmp = tmp + ((F23_t)(4152866));
> tmp = ((F23_t)((int64_t)a * (int64_t)tmp) >> 23);
> tmp = tmp + ((F23_t)(-3471104));
> c = (tmp >> 5);
> c = c + nz;
On Aug 13, 2012, at 15:22, Yuri Davidovsky wrote:
>
>>
>> On Fri, 14 Nov 2008 18:55:32 -0000, David McClain
>> <dbm@refined-audiometrics.com> wrote:
>>
>>> Hi All,
>>>
>>> I have (not surprisingly) had several inquiries about how this
>>> speedup was achieved. So I'll answer here for the benefit of everyone.
>>>
>>> First of all, this speedup was measured under LWM 5.1.1 on an iMac
>>> running Tiger (10.4.11) and a 2 GHz Intel Core 2 Duo processor with 2
>>> GB RAM. I have also now verified the speedup on systems as old as a
>>> PowerPC iMac, which runs about 3x slower, and on newer systems
>>> running Leopard. All of them show substantial speedup.
>>>
>>> By far the single most important speedup actually had little to do
>>> with the calling conventions and array transfer between Lisp and the
>>> external vDSP library. It turns out that that portion of the FFT
>>> never did take very much time. We did speed that up by about 2x-3x,
>>> but the most important speedup of the 670x, or so, has to do with
>>> data organization in memory.
>>>
>>> A truly general purpose signal processor needs to be able to perform
>>> complex-valued convolutions for quite a variety of reasons. The most
>>> natural approach would be to use COMPLEX single-precision or double-
>>> precision values. That turns out to be a real speed killer because
>>> type COMPLEX is not a basic type for Lispworks (and probably not for
>>> any other Lisp either). A complex value will not be held in
>>> registers, and they will not be stored in unboxed arrays. All
>>> collections of type complex appear to use boxed elements. That is a
>>> huge time eater.
>>>
>>> So, while I have never been much of a fan of the MatLab approach of
>>> storing real and imaginary components in separate arrays, that is
>>> precise what is needed here. Some external library may be comfortable
>>> with a packed (re, im) pairs representation of complex values. That
>>> appears to be the case in some of the Intel libraries (PPL, MKL). But
>>> it so happens that the native vDSP library on OS/X also wants to see
>>> FFT data in segregated form.
>>>
>>> Hence, underneath, instead of using direct COMPLEX representations,
>>> arrays of such, and operations on such (e.g., (* c (conjugate c)) to
>>> get the power), it is better to keep real and imaginary components in
>>> separate vectors and arrays, and to perform the complex arithmetic,
>>> when needed, by your self -- use (+ (* r r) (* i i)) to get the
>>> power, use (atan i r) instead of (phase c), and so on.
>>>
>>> By making that simple change, we brought our data manipulation time
>>> down by a huge factor.
>>>
>>> There is also a tremendous speedup by using Martin's suggestion
>>> about :LISP-SIMPLE-1D-ARRAY for passing vectors of unboxed arrays to
>>> the outside world. Martin didn't specifically indicate that vectors
>>> of floats (single and double precision) would be stored as safely
>>> unboxed linear arrays, but they are. You can see that for yourself by
>>> using HCL:FIND-OBJECT-SIZE and trying variations of vectors in size
>>> and data type. This may be an non-portable expectation, but is quite
>>> reasonable to expect any serious language to create simple vectors of
>>> floats as unboxed representations. All the more so when you allocate
>>> in :STATIC for direct passing to the outside world.
>>>
>>> Another thing that helped the speedup was to forego generality. My
>>> old code was willing to take numeric vectors in any form and coerce
>>> them to single- or double-precision floats, as requested, before
>>> sending to the outside libraries. By restricting the view to require
>>> correctly typed values ahead of time, and just being sure to enforce
>>> that in your own code, you save a tremendous amount of runtime. The
>>> old, more general, way was constructing a C-compatible dynamic FLI
>>> array, then copying one element at a time from a general input Lisp
>>> vector, coercing to the correct float type, and then storing into the
>>> FLI array, and then reversing the whole procedure on return. That
>>> kills a huge amount of time.
>>>
>>> By adhering to strictly typed data you can use FLI:REPLACE-FOREIGN-
>>> ARRAY to copy entire vectors to and fro. Hence another huge speedup.
>>>
>>> The final speedup came from simple caching of temporary buffers. When
>>> I need a scratch buffer of 4K doubles, I allocate one if not already
>>> cached, then hold on to it for future possible needs. I put a limit
>>> of 4K double complex on the largest buffer I'm willing to cache.
>>> Anything larger (much more rare in my code) will be forced to
>>> allocate and GC the larger scratch buffers. By far the most common
>>> sizes for my stuff are either 1024 or 4096 elements.
>>>
>>> Together with an unboxed representation of complex-valued arrays,
>>> these data caching and unboxed techniques save a tremendous amount of
>>> data allocation. The result is that my routines, after the first
>>> call, see no more page misses at all, and the amount of memory
>>> allocated becomes around 10-50 bytes per call. This is a decrease
>>> from about 180-400KB per call in the old code.
>>>
>>> Take all of these factors together -- use segregated complex types,
>>> use strictly typed arrays and vectors, use :STATIC array allocation,
>>> use FLI:REPLACE-FOREIGN-ARRAY, avoid memory allocation and boxing,
>>> and avoid page misses -- all of these together produce a speed up
>>> over the old code of more than 670x. That is extremely impressive.
>>>
>>> And so now being able to perform 1024 element complex-valued FFT's at
>>> 100 kHz, there is really nothing standing in the way of some serious
>>> realtime audio processing with nearly arbitrary FIR filtering and
>>> convolutions. That is very very impressive to me, to see Lisp compete
>>> head on with hand-crafted C and DSP Assembly. At this point, I have
>>> no interest in reverting to such tedious practices when I can avoid
>>> it. And using Lisp directly is a prominent possibility and likelihood.
>>>
>>> I didn't set out to compete against C or DSP Assembly code. But to my
>>> surprise, the code generated by LWM is very impressive, provided you
>>> take some responsibility for sensible fast computing practices. Very
>>> nice!
>>>
>>> Dr. David McClain
>>> Chief Technical Officer
>>> Refined Audiometrics Laboratory
>>> 4391 N. Camino Ferreo
>>> Tucson, AZ 85750
>>>
>>> email: dbm@refined-audiometrics.com
>>> phone: 1.520.390.3995
>>> web: http://www.refined-audiometrics.com
>>>
>>>
>>>
>>
>> Hi David,
>>
>> this post never made much sense to me at the time even though it seemed
>> to talk about very interesting stuff, so I decided to hold onto it to
>> come back to later.
>>
>> So, how did it work out in the end, did you stick with lisp for doing
>> heavy DSP processing, or did you have to come back to the low level?
>>
>> Did you find any other cool ways to speed your audio processing even
>> more, or just make even more fun?
>>
>> _______________________________________________
>> Lisp Hug - the mailing list for LispWorks users
>> lisp-hug@lispworks.com
>> http://www.lispworks.com/support/lisp-hug.html
>>
>>
>
> Dr. David McClain
> dbm@refined-audiometrics.com