Lisp HUG Maillist Archive

Speed of various implementation schemes...

Summary: Using a basic STRUCT and direct function calls provides the fastest-performing encapsulation of state and behavior: significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 measurements):

STRUCT        1.0
Jump Vector   1.4
JV Dispatch   2.5
CLOS          5.0
DLAMBDA      33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just out of curiosity, I wanted to see what the performance overhead of DLAMBDA is. And once I saw that, compared to CLOS, I began looking for the fastest possible implementation (apart from coding in assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in FIFO order and provide a fast O(1) append of each additional item as it becomes available. I initially set this up using internal LABELS functions, and it worked very well. It also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0))).

This code offers O(1) FIFO list construction for items from some collection, starting at START, which satisfy the predicate PRED (applied through the key function KEY), collecting a mapped representation (via MAP-FN) of the qualifying items. Pretty simple.
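As a quick illustration, here is a hypothetical call (not from the original post) that collects the squares of the even integers below 10. It assumes COLLECT-> as defined above, with WHEN-LET available from, e.g., the Alexandria library or the LispWorks utilities:

```lisp
;; Hypothetical usage sketch: collect squares of the even integers in [0, 10).
;; Walks x = 0, 1, 2, ... until (>= x 10), keeping the (evenp x) items,
;; each mapped through the squaring function. Results arrive in FIFO order.
(collect-> #'evenp                    ; pred
           0                          ; start
           (lambda (x) (>= x 10))     ; until-fn
           #'1+                       ; succ-fn
           :map-fn (lambda (x) (* x x)))
;; → (0 4 16 36 64)
```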

It is that collection of three functions inside the LABELS form that I want to package up. No client needs to know about the internal workings and state of the list-constructor object. Clients just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and call GET-LIST at the end to get their collected results.

They don’t need to know anything about CELL, nor about the SET-CDR mechanism, which together manage the list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by (CDR CELL). At the end, when the client wants the result, we furnish the head of the accumulated list and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.
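To make the circular-list trick concrete, here is a hand trace of the internal state after each operation (a sketch added for illustration, not part of the original post):

```lisp
;; Hand trace of the tail-cell mechanism (sketch):
;;
;; Initially:          cell = NIL
;; (append-item 'a):   new cons (A . NIL); SET-CDR makes it circular,
;;                     so cell = #1=(A . #1#)     ; cell is the last cons
;; (append-item 'b):   new cons (B . a-cons); SET-CDR splices it in:
;;                     a-cons = (A . b-cons), b-cons = (B . a-cons)
;;                     cell = b-cons              ; (cdr cell) = a-cons = head
;; (get-list):         returns (cdr cell), then SET-CDR of NIL breaks the
;;                     circle: b-cons = (B . NIL), cell = NIL
;;                     => the result is the proper list (A B)
```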

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.
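For readers unfamiliar with DLAMBDA: the form inside MAKE-LIST-BUILDER expands into roughly the following shape (a sketch based on Doug Hoyte’s "Let Over Lambda"; the real macro differs in detail). The CASE-style dispatch is why responder order can matter, and the &REST consing plus APPLY on every call is a plausible first guess at where the overhead comes from:

```lisp
;; Rough sketch of what (dlambda (:append (x) ...) (:get () ...)) expands
;; into, within the lexical scope of CELL and SET-CDR:
(lambda (&rest args)
  (case (car args)
    (:append (apply (lambda (x) (set-cdr (cons x (cdr cell)))) (cdr args)))
    (:get    (apply (lambda () (prog1 (cdr cell) (set-cdr nil))) (cdr args)))))
;; Every call conses a fresh &REST list and funnels through APPLY.
```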

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder — not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get realistic durations, we need to execute the operations a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I execute this bake-off code for 10 groups (10,000,000 conses), I got a median-of-3 measurement of 60.38 sec under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, with nothing much else happening on the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the winds.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups of 1M conses showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 6x faster than DLAMBDA. (That’s close enough, by astronomical standards of accuracy, to call it 10x faster.)

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABELS functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get-list)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is APPEND-ITEM and function 1 is GET-LIST. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.
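One way to tame those magic index numbers while keeping the jump-vector approach would be named constants (a hypothetical sketch, not part of the original post):

```lisp
;; Hypothetical names for the jump-vector slots:
(defconstant +jv-append+ 0)   ; slot of APPEND-ITEM
(defconstant +jv-get+    1)   ; slot of GET-LIST

;; so call sites would read:
;;   (jv lb +jv-append+ ix)
;;   (jv lb +jv-get+)
```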

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (vector #'append-item #'get-list)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


Its timing showed a median of 3 of 2.527 sec. So yes, skipping the lambda dispatch is a speedup. Still… hmmm…
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls relative to CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! Its timing for 100M conses shows a median of 3 of 1.807 sec. Not only is this faster than both DLAMBDA and CLOS, it is even faster than using jump vectors!!!! What a surprise!

- DM









Re: Speed of various implementation schemes...

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It actually took 6+ seconds for 10 groups; scaling to 100 groups would give the stated 60+ seconds. Sorry… The end results remain the same; I just misstated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:











Re: Speed of various implementation schemes...

… and to my shock and horror, it turns out that using the embedded LABELS functions directly is even slower than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


The median of 3 timings for (BAKE-OFF-RAW 100) [100M conses] is 16.432 sec.

I couldn’t believe my eyes. There must be a mistake. I tried BAKE-OFF-LBS (the struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec.

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It took 6+ seconds for 10 groups, and scaling to 100 groups would have been the stated 60+ seconds. Sorry… End results remain the same, just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in LIFO order, and provide a fast O(1) append of each additional item as they become available. I initially set this up using internal LABEL functions, and it worked very well. And it also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0)).

This code offers an O(1) FIFO list construction, for items from some collection START which satisfy some predicate PRED (using key function KEY), and which collects some mapped representation (MAP-FN) of the qualified items. Pretty simple.

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manages a list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by the (CDR CELL). At the end, when they want the result, we need to furnish the head of the accumulated list, and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I execute this bake-off code for 10 groups (10,000,000 conses) I got a median of 3 measurement of 60.38 sec. under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, and nothing much else happening in the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the wind.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups of 1M conses showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 6x faster than DLAMBA. (That’s close enough for Astronomical accuracy to call it 10x faster). 

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABEL functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is the APPEND-ITEM, and function 1 is the GET-LIST. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
     (vector #'append-item #'get)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


It’s timing showed a median of 3 of 2.527s. So yes, skipping the lambda dispatch is a speedup. Still…. hmmm….
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls with CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! It’s timing for 100M conses shows a median of 3 of 1.807s. Not only is this faster than both DLAMBDA and CLOS, but it is even faster than using jump-vectors !!!! What a surprise!

- DM
Re: Speed of various implementation schemes...

Hi David,

A few minor things:

- The STRUCT implementation is faster because, among other reasons, it doesn’t perform any dispatch anymore. Unlike in CLOS, where lb-get and lb-append could now be implemented for other classes as well, you can’t do the same for your STRUCT implementation. This may not be relevant for your purposes, but it does mean the comparison is somewhat apples to oranges.

- For the jump vector versions, you should consider declaring types. simple-vector / svref is potentially faster than aref on an undeclared variable. But more importantly, it can make a difference to tell the compiler that funcall and apply are seeing function objects. In the general case, funcall and apply also work on symbols, so there is usually an implicit test for symbol vs function object. By declaring the type (through (funcall (the function foo) …), or (array function …) and aref lookup) you should be able to squeeze out some additional speed.
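A sketch of what that might look like for the JV2 macro (untested; JV2-TYPED is an invented name):

```lisp
;; Sketch: JV2 with type declarations, per the suggestions above.
;; SVREF on a declared SIMPLE-VECTOR avoids generic AREF dispatch, and
;; (THE FUNCTION ...) lets FUNCALL skip the symbol-vs-function check.
(defmacro jv2-typed (jv ix &rest args)
  `(funcall (the function (svref (the simple-vector ,jv) ,ix)) ,@args))
```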

Pascal

On 11 Feb 2017, at 17:09, David McClain <dbm@refined-audiometrics.com> wrote:

--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.



Re: Speed of various implementation schemes...

Did you try to declare the functions inline?
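(Declaring them inline might look like the following sketch, e.g. for the struct helpers; note that a DECLAIM INLINE must appear before the DEFUNs it is meant to affect, and implementations are free to ignore the request:)

```lisp
;; Sketch: request inlining of the struct helpers.
;; Must precede the definitions of LBS-SET-CDR, LBS-APPEND, LBS-GET.
(declaim (inline lbs-set-cdr lbs-append lbs-get))
```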

Pascal

On 11 Feb 2017, at 17:42, David McClain <dbm@refined-audiometrics.com> wrote:

.. and to my shock and horror, it turns out that using the embedded LABELS functions is even slower than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It actually took 6+ seconds for 10 groups; scaling to 100 groups would have given the stated 60+ seconds. Sorry… The end results remain the same; I just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:




Re: Speed of various implementation schemes...

Hi Pascal,

Excellent points, all! Yes, we forgo the possibility to generalize when we use the STRUCT implementation. And no, I had not yet tried declaring anything INLINE.

As for using Jump Vectors… I consider that an Assembly-language technique, and one that moves in direct contradiction to the abstractions offered by symbolic programming in Lisp. That takes me backward by more than 40 years, and I really have no desire to go there again. I offered it up only as a possible fast implementation, but I wouldn’t be caught dead doing that in the raw. At the very least, I’d have to wrap everything about JVs in some macrology.
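Such macrology could be as simple as the following sketch (DEFINE-JV-OPS is an invented name, mapping symbolic operation names onto vector slots):

```lisp
;; Sketch: hide raw jump-vector indices behind named operations.
;; (define-jv-ops lb-append! lb-get!) defines LB-APPEND! and LB-GET!
;; as macros that index slots 0 and 1 of the jump vector.
(defmacro define-jv-ops (&rest names)
  `(progn
     ,@(loop for name in names
             for ix from 0
             collect `(defmacro ,name (jv &rest args)
                        `(funcall (svref ,jv ,,ix) ,@args)))))
```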

Thanks for your insightful comments!

- DM


On Feb 11, 2017, at 09:46, Pascal Costanza <pc@p-cos.net> wrote:


On 11 Feb 2017, at 17:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in LIFO order, and provide a fast O(1) append of each additional item as they become available. I initially set this up using internal LABEL functions, and it worked very well. And it also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0)).

This code offers an O(1) FIFO list construction, for items from some collection START which satisfy some predicate PRED (using key function KEY), and which collects some mapped representation (MAP-FN) of the qualified items. Pretty simple.

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manages a list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by the (CDR CELL). At the end, when they want the result, we need to furnish the head of the accumulated list, and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I execute this bake-off code for 10 groups (10,000,000 conses) I got a median of 3 measurement of 60.38 sec. under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, and nothing much else happening in the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the wind.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups of 1M conses showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 6x faster than DLAMBA. (That’s close enough for Astronomical accuracy to call it 10x faster). 

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABEL functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is APPEND-ITEM, and function 1 is GET. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.
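The magic indices at the call sites could at least be hidden behind named constants. Here is a hypothetical refinement (an editor's sketch, not from the thread; the names +LB-APPEND+ and +LB-GET+ are invented, and WHEN-LET is replaced by plain LET/WHEN so it runs in bare Common Lisp):

```lisp
;; Editor's sketch: name the jump-vector slots so call sites stay readable.
;; +LB-APPEND+ and +LB-GET+ are invented names, not from the original post.
(defconstant +lb-append+ 0)
(defconstant +lb-get+    1)

(defun make-list-builder-named ()
  ;; Same dispatcher shape as MAKE-LIST-BUILDER-V above.
  (let (cell)
    (labels ((set-cdr (x)
               (let ((xcell (or cell x)))
                 (when xcell
                   (setf cell (setf (cdr xcell) x)))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1 (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get-list)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))))))

;; Call sites now read:
;;   (funcall lb +lb-append+ x)
;;   (funcall lb +lb-get+)
```

This costs nothing at runtime; the constants fold away at compile time, and the call sites no longer carry bare 0s and 1s.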

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
     (vector #'append-item #'get)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


Its timing showed a median of 3 of 2.527 s. So yes, skipping the lambda dispatch is a speedup. Still… hmmm…
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls relative to CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! Its timing for 100M conses shows a median of 3 of 1.807 s. Not only is this faster than both DLAMBDA and CLOS, it is even faster than using jump vectors!!!! What a surprise!

- DM










--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.




Re: Speed of various implementation schemes...

This little exercise did not attempt much in the way of optimization patterns. Beyond my convenient little #F reader macro for “fast”, it shows what can be expected from quick ad-hoc implementations using the various techniques that spring to mind.

I don’t do much in the way of production code any more. If I did, I would expect to devote some time to optimization. But most of my coding is ad hoc, on the fly as I think it up. And this test case offers some simple guidance on how to structure Q&D code for better performance out of the gate.

Probably the few problems really deserving of some speedup here in my lab are those that attempt design optimization using various forms of gradient search, simplex search, and genetic programming search. Those problems have very little in the way of internal data structures, and are mostly dominated by gradients, or the lack thereof. For those cases (e.g., blind deconvolution, antenna analysis, filter design), I probably need better mathematics, not better Lisp code.

- DM


_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html

Re: Speed of various implementation schemes...

I suspected that the reason for the slowdown with the internal LABELS functions might be the fact that new closures are needed, at least once on major entry, because they all reference a free variable CELL.

So here is a version without needing closure creation:

(defun bake-off-raw2 (ngrp)
  #F
  (declare (fixnum ngrp))
  (let (cell)
    (labels ((set-cdr (cell x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (cell x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list (cell)
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (declare (fixnum g))
              (dotimes (ix 1000)
                (declare (fixnum ix))
                (append-item cell ix))
              (get-list cell)))
      )))

And indeed, it is the fastest of all, at 1.4 sec. That’s more like what I would have expected.

So it seems that we are constructing closures more than once, possibly in each iteration of the loop at the call site of APPEND-ITEM and GET-LIST.

Why would the compiler need to do that?

- DM

On Feb 11, 2017, at 09:42, David McClain <dbm@refined-audiometrics.com> wrote:

… and to my shock and horror, it actually happens that using the embedded LABELS functions is even worse than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It took 6+ seconds for 10 groups, and scaling to 100 groups would have been the stated 60+ seconds. Sorry… End results remain the same, just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in FIFO order, and provide a fast O(1) append of each additional item as it becomes available. I initially set this up using internal LABELS functions, and it worked very well. And it also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0))).

This code offers O(1) FIFO list construction for items generated from START by SUCC-FN (stopping when UNTIL-FN is satisfied) which satisfy predicate PRED (applied via key function KEY), collecting the MAP-FN image of each qualifying item. Pretty simple.
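To make the calling convention concrete, here is the same COLLECT-> made standalone (an editor's sketch: WHEN-LET is replaced by plain LET/WHEN so it runs in bare Common Lisp, and the sample call is invented, not from the thread):

```lisp
;; Editor's sketch: COLLECT-> as above, runnable standalone.
(defun collect-> (pred start until-fn succ-fn
                  &key (key #'identity) (map-fn #'identity))
  (let (cell)
    (labels ((set-cdr (x)
               ;; WHEN-LET from the original, spelled out in plain CL
               (let ((xcell (or cell x)))
                 (when xcell
                   (setf cell (setf (cdr xcell) x)))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1 (cdr cell)
                 (set-cdr nil))))
      (do ((x start (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))))))

;; Hypothetical sample call: squares of the even integers in [0, 10).
(collect-> #'evenp 0 (lambda (x) (>= x 10)) #'1+
           :map-fn (lambda (x) (* x x)))
;; => (0 4 16 36 64)
```

Note the result comes back in generation (FIFO) order, which is the whole point of the circular-list trick.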

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manage a list by tracking the tail cell of a circular list. The head of the accumulating list is always (CDR CELL). At the end, when they want the result, we furnish the head of the accumulated list and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.
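To make the circular-list mechanism concrete, here is an editor's trace sketch (WHEN-LET again replaced by plain LET/WHEN so it runs in bare Common Lisp; the state comments describe what each step does):

```lisp
;; Editor's trace sketch of the circular-list mechanism described above.
(let (cell)
  (labels ((set-cdr (x)
             (let ((xcell (or cell x)))
               (when xcell
                 (setf cell (setf (cdr xcell) x)))))
           (append-item (x)
             (set-cdr (cons x (cdr cell))))
           (get-list ()
             (prog1 (cdr cell)
               (set-cdr nil))))
    (append-item 1)   ;; cell -> #1=(1 . #1#), a one-cons circular list
    (append-item 2)   ;; cell -> the new tail cons; (cdr cell) is the head
    (append-item 3)   ;; cell always points at the tail of the circle
    (get-list)))      ;; => (1 2 3); tail cdr nulled, cell reset to NIL
```

Each append splices a fresh cons in front of the head and advances CELL to it, so the tail is always one cdr away from the head; GET-LIST just cuts the circle.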

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND, the most frequently used message, first.
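For readers without Let Over Lambda at hand: DLAMBDA builds a closure that dispatches on its first argument, roughly like the following editor's sketch (hand-written and simplified, not the real macroexpansion; WHEN-LET again replaced by LET/WHEN):

```lisp
;; Editor's sketch of the kind of dispatcher DLAMBDA produces for
;; MAKE-LIST-BUILDER; not Hoyte's actual expansion.
(defun make-list-builder-sketch ()
  (let (cell)
    (flet ((set-cdr (x)
             (let ((xcell (or cell x)))
               (when xcell
                 (setf cell (setf (cdr xcell) x))))))
      (lambda (msg &rest args)
        (case msg
          (:append (set-cdr (cons (first args) (cdr cell))))
          (:get    (prog1 (cdr cell)
                     (set-cdr nil))))))))
```

If the expansion tests message keys sequentially, putting the hot :APPEND responder first saves a comparison on every call, which would explain why the ordering matters for speed.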

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

- DM












Re: Speed of various implementation schemes...

Looking at the disassembly, I see that the labels functions are being closed over on the stack only on major entry, not repeatedly.

However, APPEND-ITEM and GET-LIST, once closed over, also need to call the newly constructed closure for SET-CDR. And that is sitting on the stack as a freshly made closure. So the compiled code for APPEND-ITEM and GET-LIST need to call something that must be determined dynamically each time.

If I make SET-CDR depend on no free variables, then it can be compiled standalone and never needs closing over on the stack. So this version also runs extremely quickly:

(defun bake-off-raw3 (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (cell x) ;; <— no free vars
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

The slowdown happens when there is mutual interaction among LABELS functions and the called functions have free variables.

- DM


Re: Speed of various implementation schemes...

When you declare the local functions inline, the first version actually is the fastest.

(defun bake-off-raw (ngrp)
  (declare (optimize (speed 3) (safety 0) (debug 0) (space 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-item get-list))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

Pascal

On 11 Feb 2017, at 18:58, David McClain <dbm@refined-audiometrics.com> wrote:

Looking at the disassembly, I see that the labels functions are being closed over on the stack, only on major entry. Not repeatedly. 

However, APPEND-ITEM and GET-LIST, once closed over, also need to call the newly constructed closure for SET-CDR. And that is sitting on the stack as a freshly made closure. So the compiled code for APPEND-ITEM and GET-LIST needs to call something that must be determined dynamically each time.

If I make SET-CDR depend on no free variables, then it can be succinctly encapsulated and never needs closing over on the stack. So this version also runs extremely quickly:

(defun bake-off-raw3 (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (cell x) ;; <— no free vars
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

The slowdown happens when there is mutual interaction among LABELS functions, and the called functions have free vars.
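A minimal illustration of the distinction, with hypothetical helpers rather than the benchmark code: in the first function BUMP references the free variable ACC, so it must be a closure and WALK calls it through that dynamically constructed object; in the second, every helper receives its state as an argument, so no closure is needed.

```lisp
;; BUMP closes over the free variable ACC; calls from WALK go
;; through a freshly constructed closure.
(defun sum-with-free-var (lst)
  (let ((acc 0))
    (labels ((bump (x) (incf acc x))      ; ACC is free here
             (walk (l) (dolist (e l) (bump e))))
      (walk lst)
      acc)))

;; ADD takes its state as a parameter, so it has no free variables
;; and can compile to a plain local call.
(defun sum-no-free-var (lst)
  (labels ((add (acc x) (+ acc x))
           (walk (l acc)
             (if l
                 (walk (cdr l) (add acc (car l)))
                 acc)))
    (walk lst 0)))
```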

- DM

On Feb 11, 2017, at 10:28, David McClain <dbm@refined-audiometrics.com> wrote:

I suspected that the reason for the slowdown with the internal LABELS functions might be the fact that new closures are needed, at least once on major entry, because they all reference a free variable CELL.

So here is a version without needing closure creation:

(defun bake-off-raw2 (ngrp)
  #F
  (declare (fixnum ngrp))
  (let (cell)
    (labels ((set-cdr (cell x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (cell x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list (cell)
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (declare (fixnum g))
              (dotimes (ix 1000)
                (declare (fixnum ix))
                (append-item cell ix))
              (get-list cell)))
      )))

And indeed, it is fastest of all, at 1.4 sec. That’s more like what I would have expected.

So it seems that we are constructing closures more than once, possibly in each iteration of the loop at the call site of APPEND-ITEM and GET-LIST.

Why would the compiler need to do that?

- DM

On Feb 11, 2017, at 09:42, David McClain <dbm@refined-audiometrics.com> wrote:

… and to my shock and horror, it actually turns out that using the embedded LABELS functions is even worse than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It took 6+ seconds for 10 groups; scaling to 100 groups would give the stated 60+ seconds. Sorry… The end results remain the same; I just mis-stated the case.

- DM


--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.



Re: Speed of various implementation schemes...

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.
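For the record, here is a small hypothetical example pulling together the declarations discussed in this thread: OPTIMIZE qualities, type declarations, and a local INLINE request for a LABELS function (whether a compiler honors INLINE for local functions is implementation-dependent):

```lisp
(defun sum-squares (xs)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type list xs))
  (labels ((sq (x)
             (declare (type fixnum x))
             (the fixnum (* x x))))
    ;; Ask the compiler to open-code SQ at its call sites.
    (declare (inline sq))
    (let ((acc 0))
      (declare (type fixnum acc))
      (dolist (x xs acc)
        (incf acc (sq x))))))
```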

- DM



Re: Speed of various implementation schemes...

So now, I’m going back over DLAMBDA to see if some slight changes might make it much more acceptable. I certainly do appreciate its style in little cases like this.

- DM


On Feb 11, 2017, at 11:26, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM


On Feb 11, 2017, at 11:25, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM

On Feb 11, 2017, at 11:07, Pascal Costanza <pc@p-cos.net> wrote:

When you declare the local functions inline, the first version actually is the fastest.

(defun bake-off-raw (ngrp)
  (declare (optimize (speed 3) (safety 0) (debug 0) (space 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-time get-list))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

Pascal

On 11 Feb 2017, at 18:58, David McClain <dbm@refined-audiometrics.com> wrote:

Looking at the disassembly, I see that the labels functions are being closed over on the stack, only on major entry. Not repeatedly. 

However, APPEND-ITEM and GET-LIST, once closed over, also need to call the newly constructed closure for SET-CDR. And that is sitting on the stack as a freshly made closure. So the compiled code for APPEND-ITEM and GET-LIST need to call something that must be determined dynamically each time.

If I make SET-CDR depend on no free variables, then it can be succinctly encapsulated and never needs closing over on the stack. So this version also runs extremely quickly:

(defun bake-off-raw3 (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (cell x) ;; <— no free vars
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

The slowdown happens when there is mutual interaction among LABELS functions, and the called functions have free vars.

- DM

On Feb 11, 2017, at 10:28, David McClain <dbm@refined-audiometrics.com> wrote:

I suspected that the reason for the slowdown with the internal LABELS functions might be the fact that new closures are needed, at least once on major entry, because they all reference a free variable CELL.

So here is a version without needing closure creation:

(defun bake-off-raw2 (ngrp)
  #F
  (declare (fixnum ngrp))
  (let (cell)
    (labels ((set-cdr (cell x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (cell x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list (cell)
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (declare (fixnum g))
              (dotimes (ix 1000)
                (declare (fixnum ix))
                (append-item cell ix))
              (get-list cell)))
      )))

And indeed, it is fastest of all, at 1.4 sec. That’s more like what I would have expected.

So it seems that we are constructing closures more than once, possibly in each iteration of the loop at the call site of APPEND-ITEM and GET-LIST.

Why would the compiler need to do that?

- DM

On Feb 11, 2017, at 09:42, David McClain <dbm@refined-audiometrics.com> wrote:

… and to my shock and horror, it actually happens that using the embedded LABELS functions is even worse than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It took 6+ seconds for 10 groups, and scaling to 100 groups would have been the stated 60+ seconds. Sorry… End results remain the same, just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in FIFO order, and provide a fast O(1) append of each additional item as it becomes available. I initially set this up using internal LABELS functions, and it worked very well. And it also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0))).
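For readers who want the same shorthand, a dispatch reader macro in this spirit can be defined as follows. This is a sketch of my own, not the original definition; the name #F and the optimize settings are taken from the description above, and (FLOAT 0) is a LispWorks-specific quality:

```
;; Hypothetical definition of the #F shorthand described above.
;; Reading #F yields the OPTIMIZE declaration form, which can then
;; sit in declaration position at the head of a function body.
(set-dispatch-macro-character #\# #\F
  (lambda (stream subchar arg)
    (declare (ignore stream subchar arg))
    '(declare (optimize (speed 3) (safety 0) (float 0)))))
```

After this, #F at the top of a DEFUN body reads as the declaration, just as in the code shown here.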

This code offers O(1) FIFO list construction for items generated from START (via SUCC-FN) which satisfy some predicate PRED (after applying the key function KEY), collecting some mapped representation (MAP-FN) of each qualifying item. Pretty simple.

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manage a list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by (CDR CELL). At the end, when they want the result, we need to furnish the head of the accumulated list, and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.
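To make that invariant concrete, here is a small illustrative sketch (my own variable names, using a plain FLET and LET instead of the original WHEN-LET) showing how CELL tracks the tail of the circular list as items are appended:

```
;; Sketch of the circular-list invariant (illustrative only).
;; CELL always points at the last cons; (cdr CELL) is the list head.
(let (cell)
  (flet ((set-cdr (x)
           (let ((xcell (or cell x)))
             (when xcell
               (setf cell (setf (cdr xcell) x))))))
    (set-cdr (cons 1 (cdr cell)))   ; list is (1), circularly linked
    (set-cdr (cons 2 (cdr cell)))   ; list is (1 2), still circular
    (prog1 (cdr cell)               ; grab the head of the list...
      (set-cdr nil))))              ; ...and break the circle
;; => (1 2)
```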

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.
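For context, Hoyte’s DLAMBDA (from Let Over Lambda) expands into a single dispatching closure roughly of the following shape. This is a from-memory sketch, not the exact expansion; the point is that the message keys are tested in the order given, which is why the hot :APPEND responder goes first:

```
;; Approximate shape of the expansion of
;;   (dlambda (:append (x) ...) (:get () ...))
(lambda (&rest args)
  (case (car args)
    (:append (apply (lambda (x) (set-cdr (cons x (cdr cell)))) (cdr args)))
    (:get    (apply (lambda () (prog1 (cdr cell) (set-cdr nil))) (cdr args)))))
```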

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I executed this bake-off code for 10 groups (10,000,000 conses), I got a median-of-3 measurement of 60.38 sec under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, with nothing much else happening on the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the winds.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups of 1M conses showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 6x faster than DLAMBDA. (That’s close enough for astronomical accuracy to call it 10x faster.)

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABELS functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is APPEND-ITEM, and function 1 is GET. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
     (vector #'append-item #'get)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


Its timing showed a median of 3 of 2.527 sec. So yes, skipping the lambda dispatch is a speedup. Still… hmmm…
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls relative to CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! Its timing for 100M conses shows a median of 3 of 1.807 sec. Not only is this faster than both DLAMBDA and CLOS, but it is even faster than using jump-vectors !!!! What a surprise!

- DM

--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.






Re: Speed of various implementation schemes...

This version of the jump vector approach seems to give the best performance on my machine:

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  (declare (optimize (speed 3) (space 0) (safety 0) (debug 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-item get))
      (let ((jv (vector #'append-item #'get)))
        (declare (type (simple-array function (2)) jv))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Pascal

On 11 Feb 2017, at 19:28, David McClain <dbm@refined-audiometrics.com> wrote:

So now, I’m going back over DLAMBDA to see if some slight changes might make it much more acceptable. I certainly do appreciate its style in little cases like this.

- DM


On Feb 11, 2017, at 11:26, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM




Re: Speed of various implementation schemes...

Well, if I can modify DLAMBDA to use Jump Vectors, then I will. But if I have to know a cardinal index in order to call a function, then I might as well hang it up and go to work for Apple...

But offhand, just reframing the DLAMBDA like this brings the time down from 60+ sec to 4.5 sec:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil)))
             (reset-list (x)
               (setf cell nil)
               (set-cdr (list x))))
      (declare (inline set-cdr append-item get-list reset-list))
      (let ((alst (list (cons :append #'append-item)
                        (cons :get    #'get-list)
                        (cons :reset  #'reset-list))))
        (lambda (sel &rest args)
          (apply (cdr (assoc sel alst)) args))
        ))))
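A quick usage sketch of this reframed builder (my own illustration; the call style is unchanged from the DLAMBDA version, and the :RESET responder is new here):

```
;; Illustrative use of the alist-dispatch list builder above.
(let ((lb (make-list-builder)))
  (funcall lb :append 1)
  (funcall lb :append 2)
  (funcall lb :get))    ; => (1 2)
```

Note that ASSOC makes each dispatch a linear search over the responders, which is fine for three messages but is one more reason to keep the hottest key first in the alist.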



On Feb 11, 2017, at 11:39, Pascal Costanza <pc@p-cos.net> wrote:

This version of the jump vector approach seems to give the best performance on my machine:

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  (declare (optimize (speed 3) (space 0) (safety 0) (debug 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-item get))
      (let ((jv (vector #'append-item #'get)))
        (declare ((simple-array function (2)) jv))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Pascal

On 11 Feb 2017, at 19:28, David McClain <dbm@refined-audiometrics.com> wrote:

So now, I’m going back over DLAMBDA to see if some slight changes might make it much more acceptable. I certainly do appreciate its style in little cases like this.

- DM


On Feb 11, 2017, at 11:26, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM


On Feb 11, 2017, at 11:07, Pascal Costanza <pc@p-cos.net> wrote:

When you declare the local functions inline, the first version actually is the fastest.

(defun bake-off-raw (ngrp)
  (declare (optimize (speed 3) (safety 0) (debug 0) (space 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-item get-list))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

Pascal

On 11 Feb 2017, at 18:58, David McClain <dbm@refined-audiometrics.com> wrote:

Looking at the disassembly, I see that the labels functions are being closed over on the stack, only on major entry. Not repeatedly. 

However, APPEND-ITEM and GET-LIST, once closed over, also need to call the newly constructed closure for SET-CDR. And that is sitting on the stack as a freshly made closure. So the compiled code for APPEND-ITEM and GET-LIST needs to call something that must be determined dynamically each time.

If I make SET-CDR depend on no free variables, then it can be succinctly encapsulated and never needs closing over on the stack. So this version also runs extremely quickly:

(defun bake-off-raw3 (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (cell x) ;; <— no free vars
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

The slowdown happens when there is mutual interaction among LABELS functions, and the called functions have free vars.
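A tiny self-contained illustration of the allocation issue (CLOSURE-DEMO and BUMP are made-up names for this sketch): a LABELS function that references a free variable must be materialized as a real closure object per activation of the enclosing function, whereas one with no free variables can be compiled once and shared.

```lisp
;; BUMP closes over the free variable N, so returning it forces a
;; fresh closure object to be allocated on each call of CLOSURE-DEMO.
(defun closure-demo ()
  (let ((n 0))
    (labels ((bump () (incf n)))
      #'bump)))

;; Each returned closure carries its own private N:
(let ((f (closure-demo))
      (g (closure-demo)))
  (funcall f)    ; F's N becomes 1
  (funcall f)    ; F's N becomes 2
  (funcall g))   ; G has its own fresh N, which becomes 1
```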

- DM

On Feb 11, 2017, at 10:28, David McClain <dbm@refined-audiometrics.com> wrote:

I suspected that the reason for the slowdown with the internal LABELS functions might be the fact that new closures are needed, at least once on major entry, because they all reference a free variable CELL.

So here is a version without needing closure creation:

(defun bake-off-raw2 (ngrp)
  #F
  (declare (fixnum ngrp))
  (let (cell)
    (labels ((set-cdr (cell x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (cell x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list (cell)
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (declare (fixnum g))
              (dotimes (ix 1000)
                (declare (fixnum ix))
                (append-item cell ix))
              (get-list cell)))
      )))

And indeed, it is fastest of all, at 1.4 sec. That’s more like what I would have expected.

So it seems that we are constructing closures more than once, possibly in each iteration of the loop at the call site of APPEND-ITEM and GET-LIST.

Why would the compiler need to do that?

- DM

On Feb 11, 2017, at 09:42, David McClain <dbm@refined-audiometrics.com> wrote:

… and to my shock and horror, it turns out that using the embedded LABELS functions is even worse than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It actually took 6+ seconds for 10 groups; scaling to 100 groups would give the stated 60+ seconds. Sorry… The end results remain the same, I just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in FIFO order, and provide a fast O(1) append of each additional item as it becomes available. I initially set this up using internal LABELS functions, and it worked very well. It also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place: the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0))).
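(The exact definition of #F is environment-specific; a portable sketch of such a dispatch macro, minus the LispWorks-only FLOAT quality, might look like this:)

```lisp
;; Sketch: make #F read as a high-speed OPTIMIZE declaration.
;; The FLOAT optimize quality is LispWorks-specific and omitted here.
(set-dispatch-macro-character #\# #\F
  (lambda (stream subchar arg)
    (declare (ignore stream subchar arg))
    '(declare (optimize (speed 3) (safety 0)))))
```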

This code offers O(1) FIFO list construction over the sequence of values generated from START by SUCC-FN, stopping when UNTIL-FN is satisfied. Items whose KEY satisfies predicate PRED are collected, each first passed through MAP-FN. Pretty simple.

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manage the list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by (CDR CELL). At the end, when the client wants the result, we furnish the head of the accumulated list, and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.
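The tail-pointer trick can be sketched in isolation, outside any object wrapper (LB-PUSH and LB-FINISH are illustrative names, using plain DEFUNs instead of the closed-over CELL):

```lisp
;; TAIL always points at the last cons of a circular list, and the
;; head of the accumulating list lives in (CDR TAIL).
(defun lb-push (tail x)
  "Append X in O(1). Returns the new tail cell."
  (let ((new (cons x (and tail (cdr tail)))))
    (if tail
        (setf (cdr tail) new)   ; old tail now points at NEW
        (setf (cdr new) new))   ; first item: close the circle
    new))

(defun lb-finish (tail)
  "Return the accumulated items in FIFO order, breaking the circle."
  (when tail
    (prog1 (cdr tail)
      (setf (cdr tail) nil))))

;; Usage:
;; (let ((tail nil))
;;   (dolist (x '(a b c))
;;     (setf tail (lb-push tail x)))
;;   (lb-finish tail))   ; => (A B C)
```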

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.
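For readers without Let Over Lambda at hand, here is a simplified sketch of the kind of dispatching closure a DLAMBDA form expands into (MY-DLAMBDA is my illustrative stand-in, not Hoyte's full macro, which also supports a T default clause):

```lisp
;; A DLAMBDA-style form becomes a closure whose first argument
;; selects a clause via CASE, with the remaining arguments applied
;; to that clause's lambda.
(defmacro my-dlambda (&rest clauses)
  (let ((args (gensym "ARGS")))
    `(lambda (&rest ,args)
       (case (car ,args)
         ,@(mapcar (lambda (clause)
                     (destructuring-bind (sel lambda-list &body body) clause
                       `(,sel (apply (lambda ,lambda-list ,@body)
                                     (cdr ,args)))))
                   clauses)))))

;; Usage:
;; (let ((obj (my-dlambda
;;              (:add (a b) (+ a b))
;;              (:hello () "hi"))))
;;   (funcall obj :add 1 2))   ; => 3
```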

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder: not lists of hundreds of millions of items, but lists of perhaps a thousand items, built over and over again. To get measurable durations, we need to run a few million appends. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I execute this bake-off code for 10 groups (10,000,000 conses), I got a median of 3 measurements of 60.38 sec under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, with nothing much else happening on the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the wind.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups of 1M conses showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 6x faster than DLAMBDA. (That’s close enough for astronomical accuracy to call it 10x faster.)

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABELS functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is APPEND-ITEM, and function 1 is GET. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
     (vector #'append-item #'get)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


Its timing showed a median of 3 of 2.527s. So yes, skipping the lambda dispatch is a speedup. Still…. hmmm….
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls relative to CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! Its timing for 100M conses shows a median of 3 of 1.807s. Not only is this faster than both DLAMBDA and CLOS, but it is even faster than using jump vectors!!!! What a surprise!

- DM

--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.




Re: Speed of various implementation schemes...

Okay!

Here is a rewrite of DLAMBDA that takes our test timings from 60+ sec down to 4.3 sec. That’s a pretty good improvement. It is functionally identical to Doug Hoyte’s version, as verified against a large body of Butterfly code that depends heavily on its proper functioning.

Instead of genning up a CASE expression the way Doug did, we build named LABELS functions and dispatch to them through a GETF plist that pairs the selector symbols with freshly created closures. Startup may take longer than Doug’s version, in order to create these closures, but by the time the object is in use, all of that work has been completed ahead of time.

This code uses Pascal’s suggestion to declare the LABELS functions as INLINE.

;; ----------------------------------------------------------------------

(defun bad-dlambda (&rest args)
  (declare (ignore args))
  (error "Invalid selector in dlambda"))

(defmacro! dlambda (&rest ds)
  (let* ((dsels   (mapcar #'first ds))
         has-default
         (dfnames (mapcar (lambda (sel)
                            (if (eq sel t)
                                (setf has-default g!default)
                              (gensym (string sel))))
                          dsels)))
    `(labels
         ,(mapcar (lambda (dfname clause)
                    `(,dfname ,@(rest clause)))
                  dfnames ds)
       (declare (inline ,@dfnames)) ;; thanks Pascal...
       (let ((,g!plst (list ,@(mapcan (lambda (dsel dfname)
                                        (unless (eq dsel t)
                                          `(',dsel #',dfname)))
                                      dsels dfnames))))
         ,(if has-default
              `(lambda (&rest ,g!args)
                 (if-let (,g!fn (getf ,g!plst (first ,g!args) nil))
                     (apply ,g!fn (rest ,g!args))
                     (apply #',g!default ,g!args)))
            `(lambda (,g!sel &rest ,g!args)
               (apply (getf ,g!plst ,g!sel 'bad-dlambda) ,g!args)))
         ))
    ))

(defmacro dcase (args &rest clauses)
  `(apply (dlambda
            ,@clauses)
          ,args))

#+:LISPWORKS
(editor:setup-indent "DCASE" 1)
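Expanded by hand for the list builder, the plist scheme yields a closure shaped roughly like this (an illustrative sketch; MAKE-LIST-BUILDER-PLIST is a made-up name, and plain LET/WHEN stands in for WHEN-LET so the sketch is self-contained):

```lisp
;; What the plist-based DLAMBDA effectively produces for the list
;; builder: selector keywords paired with closures in a plist,
;; looked up with GETF at call time.
(defun make-list-builder-plist ()
  (let (cell)
    (labels ((set-cdr (x)
               (let ((xcell (or cell x)))
                 (when xcell
                   (setf cell (setf (cdr xcell) x)))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1 (cdr cell)
                 (set-cdr nil))))
      (let ((plst (list :append #'append-item
                        :get    #'get-list)))
        (lambda (sel &rest args)
          (apply (getf plst sel) args))))))

;; Usage:
;; (let ((lb (make-list-builder-plist)))
;;   (funcall lb :append 1)
;;   (funcall lb :append 2)
;;   (funcall lb :get))   ; => (1 2)
```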



On Feb 11, 2017, at 11:42, David McClain <dbm@refined-audiometrics.com> wrote:

Well, if I can modify DLAMBA to use Jump Vectors, then I will. But if I have to know a cardinal index in order to call a function, then I might as well hang it up and go to work for Apple...

But offhand, just reframing the DLAMBDA like this, brings down the time from 60+ sec to 4.5 sec:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil)))
             (reset-list (x)
               (setf cell nil)
               (set-cdr (list x))))
      (declare (inline set-cdr append-item get-list reset-list))
      (let ((alst (list (cons :append #'append-item)
                        (cons :get    #'get-list)
                        (cons :reset  #'reset-list))))
        (lambda (sel &rest args)
          (apply (cdr (assoc sel alst)) args))
        ))))



On Feb 11, 2017, at 11:39, Pascal Costanza <pc@p-cos.net> wrote:

This version of the jump vector approach seems to give the best performance on my machine:

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  (declare (optimize (speed 3) (space 0) (safety 0) (debug 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-item get))
      (let ((jv (vector #'append-item #'get)))
        (declare ((simple-array function (2)) jv))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Pascal

On 11 Feb 2017, at 19:28, David McClain <dbm@refined-audiometrics.com> wrote:

So now, I’m going back over DLAMBDA to see if some slight changes might make it much more acceptable. I certainly do appreciate its style in little cases like this.

- DM


On Feb 11, 2017, at 11:26, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM


On Feb 11, 2017, at 11:25, David McClain <dbm@refined-audiometrics.com> wrote:

Oooh! That’s very interesting!! Thanks so much.

I clearly need to learn to appreciate DECLARE much more than I currently do.

- DM

On Feb 11, 2017, at 11:07, Pascal Costanza <pc@p-cos.net> wrote:

When you declare the local functions inline, the first version actually is the fastest.

(defun bake-off-raw (ngrp)
  (declare (optimize (speed 3) (safety 0) (debug 0) (space 0) (compilation-speed 0)))
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (declare (inline set-cdr append-time get-list))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

Pascal

On 11 Feb 2017, at 18:58, David McClain <dbm@refined-audiometrics.com> wrote:

Looking at the disassembly, I see that the labels functions are being closed over on the stack, only on major entry. Not repeatedly. 

However, APPEND-ITEM and GET-LIST, once closed over, also need to call the newly constructed closure for SET-CDR. And that is sitting on the stack as a freshly made closure. So the compiled code for APPEND-ITEM and GET-LIST need to call something that must be determined dynamically each time.

If I make SET-CDR depend on no free variables, then it can be succinctly encapsulated and never needs closing over on the stack. So this version also runs extremely quickly:

(defun bake-off-raw3 (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (cell x) ;; <— no free vars
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))

The slowdown happens when there is mutual interaction among LABELS functions, and the called functions have free vars.

- DM

On Feb 11, 2017, at 10:28, David McClain <dbm@refined-audiometrics.com> wrote:

I suspected that the reason for the slowdown with the internal LABELS functions might be the fact that new closures are needed, at least once on major entry, because they all reference a free variable CELL.

So here is a version without needing closure creation:

(defun bake-off-raw2 (ngrp)
  #F
  (declare (fixnum ngrp))
  (let (cell)
    (labels ((set-cdr (cell x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (cell x)
               (set-cdr cell (cons x (cdr cell))))
             (get-list (cell)
               (prog1
                   (cdr cell)
                 (set-cdr cell nil))))
      (time (dotimes (g (* 1000 ngrp))
              (declare (fixnum g))
              (dotimes (ix 1000)
                (declare (fixnum ix))
                (append-item cell ix))
              (get-list cell)))
      )))

And indeed, it is fastest of all, at 1.4 sec. That’s more like what I would have expected.

So it seems that we are constructing closures more than once, possibly in each iteration of the loop at the call site of APPEND-ITEM and GET-LIST.

Why would the compiler need to do that?

- DM

On Feb 11, 2017, at 09:42, David McClain <dbm@refined-audiometrics.com> wrote:

.. and to my shock, and horror!, it actually happens that using the embedded LABELS functions is even worse than using CLOS.

(defun bake-off-raw (ngrp)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (time (dotimes (g (* 1000 ngrp))
              (dotimes (ix 1000)
                (append-item ix))
              (get-list)))
      )))


Median of 3 timings for (BAKE-OFF-RAW 100)  [ 100M conses ] is 16.432 sec. 

I couldn’t believe my eyes. There must be a mistake. I tried the BAKE-OFF-LBS (struct version) again to be sure. It still shows 1.8 sec. Then I tried BAKE-OFF-RAW yet again. Still 16 sec. 

What’s going on here?

- DM

On Feb 11, 2017, at 09:29, David McClain <dbm@refined-audiometrics.com> wrote:

ummm… yes, my DLAMBDA timings incorrectly stated that 10 groups took 60+ seconds. It took 6+ seconds for 10 groups, and scaling to 100 groups would have been the stated 60+ seconds. Sorry… End results remain the same, just mis-stated the case.

- DM


On Feb 11, 2017, at 09:09, David McClain <dbm@refined-audiometrics.com> wrote:

Summary: Using basic STRUCT and direct function calls provides the fastest performing encapsulation of state and behavior. Significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than using vectored function dispatch through a jump vector (!!).

Normalized Timings (2 sig figs, median of 3 meas):

STRUCT 1.0
Jump Vector 1.4
JV Dispatch 2.5
CLOS 5.0
DLAMBDA 33
———————

From time to time, I make use of Doug Hoyte’s DLAMBDA to make little embedded-intelligence objects that don’t seem to deserve the greater complexity of CLOS. DLAMBDA has the nice trait that it keeps message code adjacent to object state, rather than spreading it all out the way CLOS does. But I’m also a big fan of CLOS - don’t get me wrong.

But just for curiosity, I wanted to see what the performance overhead on DLAMBDA is. And once I saw that, compared to CLOS, I began looking for what is the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.

As a simple test, I needed a fast list constructor that could keep items in LIFO order, and provide a fast O(1) append of each additional item as they become available. I initially set this up using internal LABEL functions, and it worked very well. And it also turned out to be useful in several other places. So I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.

So here is an example of the original code:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x   start   (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))
        ))))

#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0)).

This code offers an O(1) FIFO list construction, for items from some collection START which satisfy some predicate PRED (using key function KEY), and which collects some mapped representation (MAP-FN) of the qualified items. Pretty simple.

It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results. 

They don’t need to know anything about the CELL, nor the SET-CDR mechanism, which together manages a list by tracking the tail cell of a circular list. The head of the accumulating list is always pointed to by the (CDR CELL). At the end, when they want the result, we need to furnish the head of the accumulated list, and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.

So here is the first version of encapsulation done with DLAMBDA:

(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 )))
      (dlambda
        (:append (x)
         (set-cdr (cons x (cdr cell))))
        (:get ()
         (prog1
             (cdr cell)
           (set-cdr nil)))
        ))))


and here is the corresponding rewrite of COLLECT->:

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x)))
      )))

DLAMBDA is sensitive to the ordering of the message responders, so I placed :APPEND first.

Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:

(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 times of 1000-element lists
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))

When I execute this bake-off code for 10 groups (10,000,000 conses) I got a median of 3 measurement of 60.38 sec. under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, and nothing much else happening in the machine.
———————————————

So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the wind.

(defclass list-builder ()
  ((cell  :accessor lb-cell  :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))
    ))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))
    ))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))
    ))


Its bake-off code looks like this:

(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))


And its median of 3 for 100 groups (100,000,000 conses) came in at 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation. In other words, CLOS is nearly 7x faster than DLAMBDA on raw wall time, and that is while doing 10x the work.

This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!

————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABEL functions?

(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      ;; GET-LIST rather than GET: lexically rebinding the
      ;; COMMON-LISP symbol GET with LABELS is undefined behavior.
      (let ((jv (vector #'append-item #'get-list)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))
        ))))


And here is the bake-off code:

(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))

Yech!… we need to know that function 0 is the APPEND-ITEM, and function 1 is the GET-LIST. But is it faster?

The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.
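If the bare indices bother you, named constants recover most of the readability without touching the speed, since the compiler folds them away. A sketch, where +LB-APPEND+, +LB-GET+, and BAKE-OFF-JV-NAMED are names of my own invention:

;; Hypothetical named indices for the jump vector, purely for
;; readability; constant folding leaves the generated code unchanged.
(defconstant +lb-append+ 0)
(defconstant +lb-get+    1)

(defun bake-off-jv-named (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb +lb-append+ ix))
            (jv lb +lb-get+)))))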

—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?

(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with cell pointing at last cdr, front of list
  ;; always in (cdr cell)
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))
                 ))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      ;; GET-LIST rather than GET: lexically rebinding the
      ;; COMMON-LISP symbol GET with LABELS is undefined behavior.
      (vector #'append-item #'get-list)
     )))

and the bake-off code:

(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))


Its timing showed a median of 3 of 2.527s. So yes, skipping the dispatch lambda is a speedup. Still…. hmmm….
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls with CLOS and DLAMBDA…

(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x))
      )))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))
    ))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))
    ))

and its bake-off code:

(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))
        ))


OMG!! Its timing for 100M conses showed a median of 3 of 1.807s. Not only is this faster than both DLAMBDA and CLOS, it is even faster than using jump vectors!!!! What a surprise!
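For completeness, COLLECT-> rewritten on top of the winning STRUCT version looks just like the DLAMBDA rewrite above, only with direct function calls (a sketch, not separately timed):

(defun collect-> (pred start until-fn succ-fn
                       &key
                       (key    #'identity)
                       (map-fn #'identity))
  #F
  (let ((lb (make-lbs)))
    (do ((x  start  (funcall succ-fn x)))
        ((funcall until-fn x) (lbs-get lb))
      (when (funcall pred (funcall key x))
        (lbs-append lb (funcall map-fn x)))
      )))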

- DM
--
Pascal Costanza
The views expressed in this email are my own, and not those of my employer.
Updated at: 2020-12-10 08:31 UTC