Interesting outcome?
In one of his recent papers on Parallel CML, John Reppy mentions the need to avoid the bus traffic generated by CAS instructions during spin-locking. He suggests testing for likely success with a fetch and compare before actually issuing the CAS instruction.
I just tried out my own version in Lisp, with and without the pre-look he describes, and I see essentially zero difference over tests of 100 million iterations. Either way it takes 55 secs User time, < 1 sec System time, and < 20 sec Elapsed time on my Mac. CPU utilization sits at around 300%, with the CPU meters pegged to the top for the duration.
(defmethod rmw ((cell cons) val-fn) ;; Read-Modify-Write using CAS spin-locking
  (declare (function val-fn))
  (loop for old = (car cell)
        for new = (funcall val-fn old)
        until (and ;; (eq old (car cell)) ;; <-- comment / uncomment for pre-look
                   (sys:compare-and-swap (car cell) old new))))
(defun tst (n)
  (let ((x (list nil)))
    (labels ((grab (n)
               (rmw x (constantly n)))
             (iter (nn)
               (loop for ix from nn to n by 4 do
                     (grab ix))))
      (time
       (par ;; get 4 threads running on my 4-core CPU
         (iter 0)
         (iter 1)
         (iter 2)
         (iter 3))))))
;; ----------------------------------------------------------------
UM 18 > (tst #N100_000_000)
Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))
User time = 55.658
System time = 0.962
Elapsed time = 19.340
Allocation = 4,800,264,008 bytes
396 Page faults
T
So I’m wondering whether LWM7.1/64 is already doing the pre-look before it issues a CAS? Or maybe the newer Intel processors have somehow overcome the bus contention? (Mine is a Late 2014 Mac mini with a 2.6 GHz Intel Core i5.) If neither, then the bus contention must be totally buried by the looping overhead in my test...
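One way to probe that last guess would be to count how often the CAS actually loses a race. The following is only a minimal sketch, assuming the same LispWorks sys:compare-and-swap used in the test above; RMW-COUNTING is a hypothetical name introduced here for illustration, and nothing has been measured with it.

```lisp
;; Hypothetical variant of RMW (name invented for this sketch).
;; Same pre-look + CAS loop, but it returns how many times the
;; loop had to go around again before the swap succeeded.
;; Assumes LispWorks, where sys:compare-and-swap returns true on success.
(defun rmw-counting (cell val-fn)
  (declare (function val-fn))
  (loop with retries fixnum = 0
        for old = (car cell)
        for new = (funcall val-fn old)
        until (and (eq old (car cell))   ; pre-look: cheap read first
                   (sys:compare-and-swap (car cell) old new))
        do (incf retries)                ; pre-look or CAS failed, so retry
        finally (return retries)))
```

Each ITER could sum the values returned by its own calls in a local variable, so the counting itself adds no shared writes. If the totals stay near zero over a full run, the threads are rarely colliding on the cell, which would suggest there is not much contention for the pre-look to save in this particular test.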
- DM