Speed of various implementation schemes...
Summary: a basic STRUCT with direct function calls provides the fastest-performing encapsulation of state and behavior. It is significantly faster than CLOS, hugely faster than DLAMBDA, and even faster than vectored function dispatch through a jump vector (!!).
Normalized Timings (2 sig figs, median of 3 measurements):

  STRUCT        1.0
  Jump Vector   1.4
  JV Dispatch   2.5
  CLOS          5.0
  DLAMBDA     330
———————
But just out of curiosity, I wanted to see what the performance overhead of DLAMBDA is. And once I saw that, compared to CLOS, I went looking for the fastest possible implementation (apart from coding in Assembly language…). Let’s keep this all in Lisp.
As a simple test case, I needed a fast list constructor that keeps items in FIFO order, and provides a fast O(1) append of each additional item as it becomes available. I initially set this up using internal LABELS functions, and it worked very well. It also turned out to be useful in several other places, so I decided to package it up for myself and keep the implementation in one place, instead of replicating it all over the place — the usual reason for packaging up state and behavior into objects.
So here is an example of the original code:
(defun collect-> (pred start until-fn succ-fn
                  &key
                  (key    #'identity)
                  (map-fn #'identity))
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (do ((x start (funcall succ-fn x)))
          ((funcall until-fn x) (get-list))
        (when (funcall pred (funcall key x))
          (append-item (funcall map-fn x)))))))
#F is a reader macro meaning (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0) (FLOAT 0))).
This code offers O(1) FIFO list construction for items generated from START by repeated application of SUCC-FN, stopping when UNTIL-FN is satisfied. It keeps those items which satisfy predicate PRED (applied through key function KEY), and collects the MAP-FN image of each qualifying item. Pretty simple.
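As a quick illustration (my own example, not from the original code base), collecting the squares of the even integers below 10 would look like this:

```lisp
;; Hypothetical usage: squares of the even integers in [0, 10).
(collect-> #'evenp                    ; PRED: keep the even items
           0                          ; START
           (lambda (x) (>= x 10))     ; UNTIL-FN: stop at 10
           #'1+                       ; SUCC-FN: step by 1
           :map-fn (lambda (x) (* x x)))
;; => (0 4 16 36 64), in FIFO (generation) order
```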
It is that collection of 3 functions inside the LABELS clause that I want to package up. No clients need to know about the internal workings and state of the list constructor object. They just need to be able to instantiate such a constructor, call APPEND-ITEM on each value they want collected, and GET-LIST at the end to get their collected results.
They don’t need to know anything about CELL, nor about the SET-CDR mechanism, which together manage the list by tracking the tail cell of a circular list. The head of the accumulating list is always (CDR CELL). At the end, when the result is wanted, we furnish the head of the accumulated list and null out the cdr of the last cons cell. Perfect conditions for an encapsulated object implementation.
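For concreteness, here is a hand trace of the mechanism (my own annotation) for two appends followed by a GET-LIST:

```lisp
;; CELL = NIL
;; (APPEND-ITEM 1): C1 = (CONS 1 (CDR CELL))   ; (CDR NIL) is NIL
;;   SET-CDR: XCELL = (OR NIL C1) = C1
;;            (SETF (CDR C1) C1)               ; C1 = #1=(1 . #1#), circular
;;            CELL = C1
;; (APPEND-ITEM 2): C2 = (CONS 2 (CDR CELL))   ; (CDR CELL) is the head, C1
;;   SET-CDR: (SETF (CDR C1) C2), CELL = C2    ; now C1 -> C2 -> C1 -> ...
;; (GET-LIST): head = (CDR CELL) = C1
;;   SET-CDR NIL: (SETF (CDR C2) NIL), CELL = NIL
;;   => (1 2)
```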
So here is the first version of encapsulation done with DLAMBDA:
(defun make-list-builder ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with CELL pointing at the last cdr; the front of
  ;; the list is always in (cdr CELL).
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x)))))
      (dlambda
        (:append (x)
          (set-cdr (cons x (cdr cell))))
        (:get ()
          (prog1
              (cdr cell)
            (set-cdr nil)))))))
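A client then drives the builder entirely through messages. For example:

```lisp
(let ((lb (make-list-builder)))
  (funcall lb :append 'a)
  (funcall lb :append 'b)
  (funcall lb :append 'c)
  (funcall lb :get))
;; => (A B C)
```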
and here is the corresponding rewrite of COLLECT->:
(defun collect-> (pred start until-fn succ-fn
                  &key
                  (key    #'identity)
                  (map-fn #'identity))
  #F
  (let ((lb (make-list-builder)))
    (do ((x start (funcall succ-fn x)))
        ((funcall until-fn x) (funcall lb :get))
      (when (funcall pred (funcall key x))
        (funcall lb :append (funcall map-fn x))))))
DLAMBDA is sensitive to the ordering of the message responders (they are tried in order), so I placed the frequently called :APPEND first.
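For reference, DLAMBDA comes from Doug Hoyte’s Let Over Lambda, and expands into a closure that dispatches on its first argument. Roughly (this is a simplified sketch, not the exact expansion), the DLAMBDA above becomes:

```lisp
(lambda (&rest args)
  (case (car args)
    (:append (apply (lambda (x)
                      (set-cdr (cons x (cdr cell))))
                    (cdr args)))
    (:get    (apply (lambda ()
                      (prog1
                          (cdr cell)
                        (set-cdr nil)))
                    (cdr args)))))
```

The clauses are tried in order, which is why the :APPEND responder, called a thousand times for every :GET, goes first. The &REST list consed on every call, plus the APPLY, is also a reasonable guess at where much of DLAMBDA’s overhead comes from.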
Okay, so now let’s do some timings. I want to measure realistic applications of the list builder, not making lists of hundreds of millions of items, but perhaps a thousand items, done over and over again. To get some realistic durations, we need to execute a few million times. So here is the template of our timing tests:
(defun bake-off-dlambda (ngrp) ;; 1 grp = 1000 builds of a 1000-element list
  (time (let ((lb (make-list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (funcall lb :append ix))
            (funcall lb :get)))))
When I executed this bake-off code for 10 groups (10,000,000 conses), I got a median-of-3 measurement of 60.38 sec under LWM 7/64 on a Mac Mini (Late 2014) running OS X Sierra 10.12.3, with nothing much else happening on the machine.
———————————————
So let’s look at a CLOS implementation. Sadly, to me, this takes all that encapsulation and scatters it to the wind.
(defclass list-builder ()
  ((cell :accessor lb-cell :initform nil)))

(defmethod lb-set-cdr ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))))

(defmethod lb-append ((lb list-builder) x)
  #F
  (with-accessors ((cell lb-cell)) lb
    (lb-set-cdr lb (cons x (cdr cell)))))

(defmethod lb-get ((lb list-builder))
  #F
  (with-accessors ((cell lb-cell)) lb
    (prog1
        (cdr cell)
      (lb-set-cdr lb nil))))
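Client code now has to know three generic function names instead of one closure. For example:

```lisp
(let ((lb (make-instance 'list-builder)))
  (lb-append lb 'a)
  (lb-append lb 'b)
  (lb-get lb))
;; => (A B)
```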
Its bake-off code looks like this:
(defun bake-off-clos (ngrp)
  (time (let ((lb (make-instance 'list-builder)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lb-append lb ix))
            (lb-get lb)))))
And its median of 3 for 100 groups (100,000,000 conses) showed a timing of 8.952 sec. Note that this was for 10x as many conses as we used when timing the DLAMBDA implementation, yet it took nearly 7x less raw time. Per cons, in other words, CLOS is roughly 67x faster than DLAMBDA. (That’s close enough for Astronomical accuracy to call it two orders of magnitude.)
This is very surprising!! At least to me… and it shows that LW has spent some effort to make CLOS efficient. A tip of the hat to them!
————————
Okay… what’s faster than all this so far? What about a jump vector dispatching to internal LABEL functions?
(defun make-list-builder-v ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with CELL pointing at the last cdr; the front of
  ;; the list is always in (cdr CELL).
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (let ((jv (vector #'append-item #'get-list)))
        (lambda (ix &rest args)
          (apply (aref jv ix) args))))))
And here is the bake-off code:
(defmacro jv (jv ix &rest args)
  `(funcall ,jv ,ix ,@args))

(defun bake-off-jv (ngrp)
  (time (let ((lb (make-list-builder-v)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv lb 0 ix))
            (jv lb 1)))))
Yech!… we need to know that function 0 is the APPEND-ITEM, and function 1 is the GET-LIST. But is it faster?
The median of 3 timings on 100M conses was 4.537s. So yes, it is about 2x faster than CLOS. Unless you have a strong need for speed, stick with CLOS.
—————————————————
Can we do any better than JV? What about skipping the internal dispatch lambda shown above, and just vectoring ourselves through the jump-vector?
(defun make-list-builder-v2 ()
  ;; An O(1) list builder object, in FIFO order. Fast append uses a
  ;; circular list with CELL pointing at the last cdr; the front of
  ;; the list is always in (cdr CELL).
  #F
  (let (cell)
    (labels ((set-cdr (x)
               (when-let (xcell (or cell x))
                 (setf cell (setf (cdr xcell) x))))
             (append-item (x)
               (set-cdr (cons x (cdr cell))))
             (get-list ()
               (prog1
                   (cdr cell)
                 (set-cdr nil))))
      (vector #'append-item #'get-list))))
and the bake-off code:
(defmacro jv2 (jv ix &rest args)
  `(funcall (aref ,jv ,ix) ,@args))

(defun bake-off-jv2 (ngrp)
  (time (let ((lb (make-list-builder-v2)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (jv2 lb 0 ix))
            (jv2 lb 1)))))
Its timing showed a median of 3 of 2.527 sec. So yes, skipping the lambda dispatch is a speedup. Still…. hmmm….
———————————
Well, just for the heck of it, let’s see where a basic STRUCT implementation falls relative to CLOS and DLAMBDA…
(defstruct lbs
  cell)

(defun lbs-set-cdr (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (when-let (xcell (or cell x))
      (setf cell (setf (cdr xcell) x)))))

(defun lbs-append (lbs x)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (lbs-set-cdr lbs (cons x (cdr cell)))))

(defun lbs-get (lbs)
  #F
  (with-accessors ((cell lbs-cell)) lbs
    (prog1
        (cdr cell)
      (lbs-set-cdr lbs nil))))
and its bake-off code:
(defun bake-off-lbs (ngrp)
  (time (let ((lbs (make-lbs)))
          (dotimes (g (* 1000 ngrp))
            (dotimes (ix 1000)
              (lbs-append lbs ix))
            (lbs-get lbs)))))
OMG!! Its timing for 100M conses shows a median of 3 of 1.807 sec. Not only is this faster than both DLAMBDA and CLOS, but it is even faster than using jump-vectors !!!! What a surprise!
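And for completeness, the corresponding rewrite of COLLECT-> over the struct builder, following the same shape as the DLAMBDA rewrite earlier:

```lisp
(defun collect-> (pred start until-fn succ-fn
                  &key
                  (key    #'identity)
                  (map-fn #'identity))
  #F
  (let ((lb (make-lbs)))
    (do ((x start (funcall succ-fn x)))
        ((funcall until-fn x) (lbs-get lb))
      (when (funcall pred (funcall key x))
        (lbs-append lb (funcall map-fn x))))))
```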
- DM