Poor Man's GCD
I have been doing some DSP programming here, and found some surprising results...On OS X Snow Leopard, we have GCD (Grand Central Dispatch) and a very elaborate OS thread level support. But not so on OS X 10.4, or on Windows/7 (AFAIK?). So I decided to try to implement a poor-man's version of GCD for parallelizing portions of code, to run on Windows/7, written entirely in LW.
First the surprise... LW Mailboxes do not appear to have a queue of pending threads awaiting messages. There is a lock implemented in them, and I had always assumed that lock was for preventing multiple writers from stepping on each other. But in fact, that lock also appears to work for multiple readers. You really can allow multiple threads to read the same mailbox, and they are properly serialized in their access. I never see the same message being sent to more than one receiver thread.
I suppose that has much to do with the way each LW thread has a wait function, and as long as the mailbox reads are locked, the effect is much the same as having a queue of pending threads on the mailbox. In fact, the LW scheme probably offers even greater flexibility by not tying up a thread solely to a mailbox.
So, a poor-man's GCD can be constructed by launching a pool of LW threads performing an infinite loop of reading one specific mailbox, evaluating the message, and returning results to a reply-to mailbox included in each message.
In order to avoid deadlock, the thread executing a PAR operation also needs to participate in the pool by repeatedly performing a mailbox read on that same mailbox, and performing whatever code is in a message, and then only when the mailbox empties is it safe to start waiting on replies from the various reply-to mailboxes created with each job enqueue operation.
Parallel operation is made possible with a (PAR clause1 clause2 ... ) macro that bundles up each clause in a lambda closure, and for each clause, it creates a new reply-to mailbox and sends that reply-to along with the closure to the central GCD mailbox. The end of the PAR clause has an implicit JOIN, which makes the calling thread participate as described above, before reaping the results from each of the reply-to mailboxes.
All of this works the same way on both OS X and Windows/7. All 100% Lisp. I do not call on OS X GCD at all.
This works like a champ. But here's another surprise...
Take a simple test like the following:
Make two lists of a zillion integers. Two lists, so to avoid any possible penalty for SMP access to the same memory. Ask each thread to add up all the numbers in one list or the other, and return the list consisting of that sum and the MP::PROCESS-STACK-GROUP of itself (this appears to be a unique, easy way, of identifying different threads). This simple-minded tests crudely simulates some kinds of DSP processing, like FIR filtering.
;; ------------------------------------------
(defvar vals1 (loop for ix from 0 below 10000000 collect ix))
(defvar vals2 (loop for ix from 0 below 10000000 collect (* 2 ix)))
(defun doit (vals)
(list (mp::process-stack-group mp:*current-process*)
(loop repeat 1 collect (loop for v in vals sum v))))
(defun tstpar ()
(par
(doit vals1)
(doit vals2)))
(defun tstser ()
(list
(doit vals1)
(doit vals2)))
(time (tstpar))
(time (tstser))
;; ------------------------------------------
On LW 32-bit, regardless of OS, I usually see a performance *penalty* from parallelized code, compared to single-thread execution. This for all of OS X 10.4, OS X 10.6, Windows/7-64. You can see that PAR really does fire up the available processors to near 100% utilization, but despite that, the overall wall-clock duration is slightly worse than if you just have one thread evaluate the entire block of code, changing PAR into LIST. And there is significant, e.g., 20% System Time overhead in the parallel versions.
Only when I move to LW 64-bit do I actually see a speedup, and darn near linear with number of available cores and threads. And in fact, for my particular DSP-like test, LW-64 is almost 100x faster than LW-32, whether in parallel mode or in serial execution mode. In order to obtain measurable durations in the timings, I had to collect the sums 100 times over inside the DOIT routine. (change (loop repeat 1 collect ...) to (loop repeat 100 collect ....)).
On my 2-core machines, I see almost twice the performance in the parallel code. If I add two more clauses to the tester routines, and with a pool of 4 or 5 threads, then on my 4-core I3 processor I see a near 4x speedup.
So the LW-64 versions appear to have a better impedance match to the underlying 64-bit OS, for both OS X and Windows/7. Something about the LW-32 is very poorly matched and produces penalties for parallel implementations. I don't know if this is because LW-64 has a better threading model than LW-32? or because there is less OS interfacing from a 64-bit code to the OS, than for a simulated (?) 32-bit code.
Dr. David McClain
Chief Technical Officer
Refined Audiometrics Laboratory
4391 N. Camino Ferreo
Tucson, AZ 85750
email: dbm@refined-audiometrics.com
phone: 1.520.390.3995