Re: Interesting outcome?

Followup, comparison against using LW Locking instead of CAS Spin-locks. In nearly every case I have ever tried, the LW Locks are more efficient / faster. I suspect this is because it completely avoids the situation where one thread grabs a lock, another spins waiting for it to release, while the first holder gets context-switched away, and so the lock can’t be released until the holder gets switched back in later. Locks tend to be between 2-3 times faster, despite what would seem like a lot of overhead.

(defun tst (n)

(let ((x (list nil))

(lock (mp:make-lock)))

(labels ((grab (n)

(mp:with-lock (lock)

(setf (car x) n)))

(iter (nn)

(loop for ix from nn to n by 4 do

(grab ix))))

(time

(par

(iter 0)

(iter 1)

(iter 2)

(iter 3))

))))

UM 20 > (tst #N100_000_000)

Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))

User time = 12.240

System time = 8.763

Elapsed time = 8.473

Allocation = 212,000 bytes

1 Page faults

- DM

On Nov 18, 2017, at 09:27, David McClain <dbm@refined-audiometrics.com> wrote:

Upon reading one of John Reppy’s recent papers on Parallel CML, he makes mention of the need to avoid bus traffic generated by CAS instructions during spin-locking. He suggests testing for likely success using fetch / compare before actually issuing the CAS instruction.

I just tried out my own version in Lisp, with / without taking care as he describes, and I see essentially zero difference over tests of 100-million iterations. It takes 55 secs User time, < 1 sec System time, and <20 sec Elapsed time on my Mac, either way. My CPU is clocking out at around 300% utilization, with the CPU meters pegged to the top for the duration.

(defmethod rmw ((cell cons) val-fn) ;; Read-Modify-Write using CAS spin-locking
(declare (function val-fn))
(loop for old = (car cell)
for new = (funcall val-fn old)
until (and ;; (eq old (car cell)) ;; <— comment / uncomment for pre-look
(sys:compare-and-swap (car cell) old new))))

(defun tst (n)
(let ((x (list nil)))
(labels ((grab (n)
(rmw x (constantly n)))
(iter (nn)
(loop for ix from nn to n by 4 do
(grab ix))))
(time
(par ;; get 4 threads running on my 4-core CPU
(iter 0)
(iter 1)
(iter 2)
(iter 3))
))))
;; —————————————————————————————————————

UM 18 > (tst #N100_000_000)
Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))

User time = 55.658
System time = 0.962
Elapsed time = 19.340
Allocation = 4,800,264,008 bytes
396 Page faults
T

So I’m wondering if LWM7.1/64 is already doing the pre-look ahead of a CAS issuance? Or maybe the newer Pentiums have somehow overcome the bus contention? (Mine is a Late 2014 Mac mini with a 2.6 GHz Intel Core i5). If not, then the bus contention must be totally buried by the looping overhead in my test...

- DM