Lisp HUG Maillist Archive

Interesting outcome?

Upon reading one of John Reppy’s recent papers on Parallel CML, he makes mention of the need to avoid bus traffic generated by CAS instructions during spin-locking. He suggests testing for likely success using fetch / compare before actually issuing the CAS instruction.

I just tried out my own version in Lisp, with / without taking care as he describes, and I see essentially zero difference over tests of 100-million iterations. It takes 55 secs User time, < 1 sec System time, and <20 sec Elapsed time on my Mac, either way. My CPU is clocking out at around 300% utilization, with the CPU meters pegged to the top for the duration.

(defmethod rmw ((cell cons) val-fn) ;; Read-Modify-Write using CAS spin-locking
  (declare (function val-fn))
  (loop for old = (car cell)
        for new = (funcall val-fn old)
        until (and ;; (eq old (car cell))  ;; <— comment / uncomment for pre-look
                   (sys:compare-and-swap (car cell) old new))))

(defun tst (n)
  (let ((x (list nil)))
    (labels ((grab (n)
               (rmw x (constantly n)))
             (iter (nn)
               (loop for ix from nn to n by 4 do
                     (grab ix))))
      (time
       (par ;; get 4 threads running on my 4-core CPU
         (iter 0)
         (iter 1)
         (iter 2)
         (iter 3))
       ))))
;; —————————————————————————————————————

UM 18 > (tst #N100_000_000)
Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))

User time    =       55.658
System time  =        0.962
Elapsed time =       19.340
Allocation   = 4,800,264,008 bytes
396 Page faults
T

So I’m wondering if LWM7.1/64 is already doing the pre-look ahead of a CAS issuance? Or maybe the newer Pentiums have somehow overcome the bus contention? (Mine is a Late 2014 Mac mini with a 2.6 GHz Intel Core i5). If not, then the bus contention must be totally buried by the looping overhead in my test...

- DM

Re: Interesting outcome?

Followup, comparison against using LW Locking instead of CAS Spin-locks. In nearly every case I have ever tried, the LW Locks are more efficient / faster. I suspect this is because it completely avoids the situation where one thread grabs a lock, another spins waiting for it to release, while the first holder gets context-switched away, and so the lock can’t be released until the holder gets switched back in later. Locks tend to be between 2-3 times faster, despite what would seem like a lot of overhead.

(defun tst (n)
  (let ((x    (list nil))
        (lock (mp:make-lock)))
    (labels ((grab (n)
               (mp:with-lock (lock)
                 (setf (car x) n)))
             (iter (nn)
               (loop for ix from nn to n by 4 do
                     (grab ix))))
      (time
       (par
         (iter 0)
         (iter 1)
         (iter 2)
         (iter 3))
       ))))


UM 20 > (tst #N100_000_000)
Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))

User time    =       12.240
System time  =        8.763
Elapsed time =        8.473
Allocation   = 212,000 bytes
1 Page faults
T

- DM


On Nov 18, 2017, at 09:27, David McClain <dbm@refined-audiometrics.com> wrote:

Upon reading one of John Reppy’s recent papers on Parallel CML, he makes mention of the need to avoid bus traffic generated by CAS instructions during spin-locking. He suggests testing for likely success using fetch / compare before actually issuing the CAS instruction.

I just tried out my own version in Lisp, with / without taking care as he describes, and I see essentially zero difference over tests of 100-million iterations. It takes 55 secs User time, < 1 sec System time, and <20 sec Elapsed time on my Mac, either way. My CPU is clocking out at around 300% utilization, with the CPU meters pegged to the top for the duration.

(defmethod rmw ((cell cons) val-fn) ;; Read-Modify-Write using CAS spin-locking
  (declare (function val-fn))
  (loop for old = (car cell)
        for new = (funcall val-fn old)
        until (and ;; (eq old (car cell))  ;; <— comment / uncomment for pre-look
                   (sys:compare-and-swap (car cell) old new))))

(defun tst (n)
  (let ((x (list nil)))
    (labels ((grab (n)
               (rmw x (constantly n)))
             (iter (nn)
               (loop for ix from nn to n by 4 do
                     (grab ix))))
      (time
       (par ;; get 4 threads running on my 4-core CPU
         (iter 0)
         (iter 1)
         (iter 2)
         (iter 3))
       ))))
;; —————————————————————————————————————

UM 18 > (tst #N100_000_000)
Timing the evaluation of (PAR (ITER 0) (ITER 1) (ITER 2) (ITER 3))

User time    =       55.658
System time  =        0.962
Elapsed time =       19.340
Allocation   = 4,800,264,008 bytes
396 Page faults
T

So I’m wondering if LWM7.1/64 is already doing the pre-look ahead of a CAS issuance? Or maybe the newer Pentiums have somehow overcome the bus contention? (Mine is a Late 2014 Mac mini with a 2.6 GHz Intel Core i5). If not, then the bus contention must be totally buried by the looping overhead in my test...

- DM

Updated at: 2020-12-10 08:30 UTC