Re: Multiple CAS (MCAS) versus Locks

I found a needed correction to Keir’s code down in the MCAS helper routine.

The convention is that any cells subject to modification by MCAS should always be read by MCAS-READ. Similarly for CCAS / CCAS-READ. The reason for that is the a speculative probe may be under way from another thread, and the exploration will be pushed along by us when we use the proper MCAS-READ and CCAS-READ. That’s what gives his code the wait-free character, in addition to lock-free.

But down in his code for MCAS he performs a CCAS followed by a peek at what was accomplished. He uses direct pointer access to the cell instead of using a CCAS-READ.

I found that while that works… the overall system performance is actually enhanced by adhering to the protocol and using a CCAS-READ instead of a direct pointer access. I think that speed up happens because the CCAS-READ pushes along other speculative MCAS/CCAS operations to negative conclusions, which works in our current MCAS favor. Otherwise, there will be occasional CAS-descriptors in the cell instead of what we had anticipated, making us more likely to fail.

The speed up is about 10-20% by using CCAS-READ instead of direct pointer access.

In the Lisp equivalent, our (REF-VALUE p) operation is equivalent to C’s direct pointer access, *p, while (MAKE-REF x) is equivalent to C’s &x (address-of) operation.

(defun mcas-help (desc)

#F

(declare (mcas-desc desc))

(let ((triples (mcas-desc-triples desc))

(status-ref (mcas-desc-status desc)))

(labels ((decide (desired)

(cas status-ref :undecided desired)

(let ((success (eq :successful (ref-value status-ref))))

(map nil (lambda (triple)

(apply #'patch-up success triple))

triples)

success))

(patch-up (success ref old new)

(cas ref desc (if success new old)))

(try-mcas (ref old new)

(ccas ref old desc status-ref)

(let ((v (ccas-read ref))) ;; ccas-read makes it faster than when using ref-value (!??)

(cond ((and (eq v old)

(eq :undecided (ref-value status-ref)))

;; must have changed beneath us when we

;; looked, then changed back again. So try

;; again. (ABA Update?)

(try-mcas ref old new))

- DM

On Feb 24, 2017, at 16:18, David McClain <dbm@refined-audiometrics.com> wrote:

Here is a good read on lock-free programming:

UCCL Tech Report 579, "Practical lock-freedom" by Keir Fraser, Feb 2004
http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf

I adapted his work on MCAS and CCAS to LW, in these (approx) 100 LOC:

<MCAS-car.lisp>

Using MCAS, I was able to speed up my Lisp implementation of Reppy Channels from 25-50 kRPC/sec to now 357 kRPC/sec, where an RPC is a dual communication exchange from a client thread to a server thread and receiving back the server response. Each RPC exchange now takes roughly 2.8 microseconds on my Mac Mini.

With Reppy Channels, you have the opportunity for thread rendezvous IF you can simultaneously mark both threads with each other’s communication data. If a thread has been previously marked, then no-go. So I needed a dual-CAS instruction. MCAS fits the bill nicely, with room for growth. Just be sure to request multiple-cell CAS with the same object request ordering, to avoid deadlock. I use a sequentially assigned ID in each object to effect my ordering. Address order cannot be relied upon in a GC Lisp heap.

With SMP multiprocessing, the use of locks can be a frequent bottleneck. Presumably the speed up occurs because a blocked Lock causes a process thread switch, whereas, even though the code in MCAS is substantial, it pays to keep the CPU’s attention. A factor of 10x in speed up is significant in my accounting.

The nice thing about Keir’s work is that these implement not only lock-free CAS, but wait-free too, as other threads gaining on the structure can help your MCAS operation along if it is found pending.

Cheers,

- DM

Re: Multiple CAS (MCAS) versus Locks

An implementation of Reppy Channels for Lispworks has been uploaded to GitHub:

https://github.com/dbmcclain/Reppy-Channels (License free)

This is a lock-free / wait-free implementation that provides for composable events, automatic scavenging of dead-end threads (if you use the channel-ref protocol), cleanup-actions on failed rendezvous, and is more than 2x faster than equivalent (where that’s even possible!) lock-bound (e.g., mailbox) implementations.

Unlike the lightweight green threads of CML, our threads are heavy and so we should avoid just popping them off willy nilly in the manner of CML programming. But we can automatically reclaim dead-end threads using Lisp’s GC finalization protocol. Making events and channels is fairly light weight, and so are readily composed on the fly.

The speed benefits, as well as composable events, and wrappers for failed rendezvous paths, argue strongly in favor of Reppy Channels for many purposes where lock-bound implementations had previously been used.

- DM

Re: Multiple CAS (MCAS) versus Locks

And so here is a corrected version of MCAS, including extensions to refer to either the CAR or the CDR of a ref-cell.

And along with it, there is an implementation of a fast lock-free/wait-free FIFO queue. All the threads in the world can simultaneously access and update these queues without need for exclusive lock control.

Re: Multiple CAS (MCAS) versus Locks

All this work on MCAS has me looking for other good applications of it. So this morning I sat down to do a Read/Write Gate. Along the way, I pushed myself for minimalism, and found that MCAS is never needed there…

Here is a sort of minimalist Read/Write Gate (many readers, one writer) that is recursive and SMP ready, using only a single CAS instruction. Any number of RW Gates can be set up as needed. About 100 LOC.

Re: Multiple CAS (MCAS) versus Locks

um… yeah… (red face)

Here is a corrected version…

Re: Multiple CAS (MCAS) versus Locks

Unable to parse email body. Email id is 14270

Re: Multiple CAS (MCAS) versus Locks

Hi Martiin,

Yes indeed… I think this fixes things…

(defun make-rwgate ()

(list 0))

(defun rwg-cas (gate locktype)

(let ((new (case locktype

(:read t)

(:write mp:*current-process*))))

(and (sys:compare-and-swap (cdr gate)

(and (cdr gate)

new)

(sys:atomic-fixnum-incf (car gate)))

))

(defun rwg-release (gate)

(let ((v (cdr gate)))

(when (zerop (sys:atomic-fixnum-decf (car gate)))

(sys:compare-and-swap (cdr gate) v nil))))

On Mar 1, 2017, at 07:44, Martin Simmons <martin@lispworks.com> wrote:

Hi David,

Unfortunately there is a problem with rwg-cas v.s. rwg-release. Suppose the
following sequence of events occurs:

0. Initially *gate* is free, i.e. (nil . 0).

1. Thread A evaluates (rwg-cas *gate* :read) and then (rwg-release *gate*),
  which reaches the start of the (setf (car gate) nil). *gate* is now
  (t . 0).

2. Thread B evaluates (rwg-cas *gate* :read), which succeeds and *gate* is now
  (t . 1).

3. Thread A evaluates the setf and *gate* is now (nil . 1). This is an
  illegal state because it would allow a write lock to be taken even though
  thread B still has a read lock.

--
Martin Simmons
LispWorks Ltd
http://www.lispworks.com/

_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html

Re: Multiple CAS (MCAS) versus Locks

nope… needs an MCAS after all…

On Mar 1, 2017, at 09:08, David McClain <dbm@refined-audiometrics.com> wrote:

Hi Martiin,

Yes indeed… I think this fixes things…

(defun make-rwgate ()
(list 0))

(defun rwg-cas (gate locktype)
(let ((new (case locktype
(:read t)
(:write mp:*current-process*))))
(and (sys:compare-and-swap (cdr gate)
(and (cdr gate)
new)
new)
(sys:atomic-fixnum-incf (car gate)))
))

(defun rwg-release (gate)
(let ((v (cdr gate)))
(when (zerop (sys:atomic-fixnum-decf (car gate)))
(sys:compare-and-swap (cdr gate) v nil))))

On Mar 1, 2017, at 07:44, Martin Simmons <martin@lispworks.com> wrote:

Hi David,

Unfortunately there is a problem with rwg-cas v.s. rwg-release. Suppose the
following sequence of events occurs:

0. Initially *gate* is free, i.e. (nil . 0).

1. Thread A evaluates (rwg-cas *gate* :read) and then (rwg-release *gate*),
  which reaches the start of the (setf (car gate) nil). *gate* is now
  (t . 0).

2. Thread B evaluates (rwg-cas *gate* :read), which succeeds and *gate* is now
  (t . 1).

3. Thread A evaluates the setf and *gate* is now (nil . 1). This is an
  illegal state because it would allow a write lock to be taken even though
  thread B still has a read lock.

--
Martin Simmons
LispWorks Ltd
http://www.lispworks.com/

_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html

Re: Multiple CAS (MCAS) versus Locks

This fixes things… Since we are attempting simultaneous ownership and count modifications, it really does need an MCAS after all.

(defstruct (rwgate

(:constructor %make-rwgate))

hdref

tlref)

(defun make-rwgate ()

(let ((cell (list 0)))

(%make-rwgate

;; pre-allocate read-only refs

:hdref (car-ref cell)

:tlref (cdr-ref cell))))

(defun rwg-cas (gate locktype)

(declare (rwgate gate)

(optimize (speed 3) (safety 0) (float 0))

(let* ((new (case locktype

(:read t)

(:write mp:*current-process*)))

(hdref (rwgate-hdref gate))

(tlref (rwgate-tlref gate))

(hd (mcas-read hdref))

(tl (mcas-read tlref)))

(cond ((mcas `((,hdref ,hd ,(1+ hd))

(,tlref ,(and tl

new) ,new)))

;; we got it

)

((eq new tl)

;; try again - we might have just failed on the count

(rwg-cas gate locktype))

)))

(defun rwg-release (gate)

(declare (rwgate gate)

(optimize (speed 3) (safety 0) (float 0))

(let* ((hdref (rwgate-hdref gate))

(tlref (rwgate-tlref gate))

(hd (mcas-read hdref))

(tl (mcas-read tlref))

(new (when (> hd 1)

tl)))

(unless (mcas `((,hdref ,hd ,(1- hd))

(,tlref ,tl ,new)))

(rwg-release gate))

))

Re: Multiple CAS (MCAS) versus Locks

Heh! I went back 8 years and, looking through my Okeanos OODBMS code, I found that I (er, rather, LispWorks) had already solved this RW-Gate problem with their shared locks… I giant tip of the hat to Martin and company!!

Here it is re-worked from Okeanos with abort cleanup functions as well as timeouts