Re: how much clear code should be clear
Several years ago I had to solve a similar problem. I decided to use
Dick Waters's Series package. This allowed me to factor the code into
a nice set of abstractions that was easy to work with. The downside
was a couple of truly horrific procedures that took care of the
mismatch between the variable-length Unicode representations and the
fixed-width representations. In any case, this is what the top-level
code looked like:

    (defun utf-8->ucs-2 (ustring)
      "Convert a ustring to a lw:text-string."
      (collect 'string
        (#m code-char
          (ucs-4->ucs-2
            (utf-8->ucs-4
              (scan 'simple-vector-8b
                    (code-unit-vector ustring)))))))

Which I think is really easy to read.
The definition of ucs-4->ucs-2 was simple as well:

    (defun ucs-4->ucs-2 (ucs-4-series)
      "Series transducer that maps unicode extension characters
    into the unicode replacement character."
      (declare (optimizable-series-function)
               (type (series ucs-4-code) ucs-4-series))
      (map-fn 'ucs-2-code (lambda (code)
                            (declare (type ucs-4-code code))
                            (if (<= code #xFFFF)
                                code
                                #xFFFD)) ;; unicode replacement char
              ucs-4-series))

The hard part was transducing utf-8 to ucs-4. For this I needed a
series macro that allows you to `look ahead' several elements in the
series:

    (defmacro scan-ahead (series bindings default &body body)
      "Binds variables in BINDINGS to the series, the series displaced by one,
    the series displaced by two, etc.  In effect, allows a fixed amount of
    lookahead for the body.  The value of default is used to `pad out' the
    tail of the series and will appear as the value of the last elements
    near the end.  Example:
      (scan-ahead #z(foo bar baz quux) (a b c) 'xyzzy ....)
      will bind A to the series #z(foo bar baz quux)
                B to the series #z(bar baz quux xyzzy) and
                C to the series #z(baz quux xyzzy xyzzy)"
      (let ((n (length bindings)))
        `(MULTIPLE-VALUE-BIND ,bindings
             (CHUNK ,n 1 (CATENATE ,series (SUBSERIES (SERIES ,default) 1 ,n)))
           ,@body)))

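
To make the windowing concrete, here is a minimal sketch of the same
idea on plain lists, without the Series package. LOOKAHEAD-LISTS is a
hypothetical helper written only for illustration; it is not what the
macro expands into.

```lisp
;; Hypothetical plain-list model of what SCAN-AHEAD binds.
;; Not part of the Series package.
(defun lookahead-lists (items n default)
  "Return N lists: ITEMS, ITEMS shifted by one, ITEMS shifted by two, etc.
Each list is padded at the tail with DEFAULT so all have the length of ITEMS."
  (let ((padded (append items
                        (make-list (max 0 (1- n)) :initial-element default)))
        (len (length items)))
    (loop for shift from 0 below n
          collect (subseq padded shift (+ shift len)))))

;; Same data as the docstring example:
(lookahead-lists '(foo bar baz quux) 3 'xyzzy)
;; => ((FOO BAR BAZ QUUX) (BAR BAZ QUUX XYZZY) (BAZ QUUX XYZZY XYZZY))
```

The macro gets the same effect lazily: it appends n-1 copies of the
default and asks CHUNK for overlapping windows of width n with step 1.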
This is used to peek ahead in the utf-8 octet series. Then we need a
transducing macro that takes the multiple series generated above and
produces the output series. It does a small code-walk to destructure a
literal lambda expression. This gives the illusion that it is a simple
function that takes a functional argument:

    (defmacro decode-series (series &key (padding nil) decoder (arity nil))
      "DECODER must be a literal lambda expression with only REQUIRED args.
    The decoder is invoked in turn on each overlapping n-tuple (where n
    is the number of args) in the series.  The decoder should return
    two values: the decoded value and a `skip' count.  The decoded value
    is placed in the output followed by `skip' copies of PADDING."
      (destructure-function-lambda
       arity decoder
       (lambda (bindings docstring decls body)
         (declare (ignore docstring decls))
         (with-unique-names (output-code count)
           `(SCAN-AHEAD ,series ,bindings ,padding
              (MULTIPLE-VALUE-BIND (,output-code ,count)
                  (COLLECTING-FN '(VALUES T FIXNUM)
                                 (LAMBDA () (VALUES ,padding 0))
                                 (LAMBDA (,output-code ,count ,@bindings)
                                   (IF (> ,count 0)
                                       (VALUES ,padding (1- ,count))
                                       (PROGN ,@body)))
                                 ,@bindings)
                ,output-code))))
       (lambda () (error "Literal lambda needed for decode-series."))))

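
The utf-8 decoder itself is not shown in this message. As a rough
sketch of what a decoder in this style could look like, here is a
hypothetical four-octet version written as an ordinary function. The
name DECODE-UTF-8-WINDOW and the choice to return #xFFFD for a bad lead
octet are my assumptions, and it does no validation of continuation
octets or overlong forms.

```lisp
;; Hypothetical decoder in the DECODE-SERIES style (not the original code).
;; Takes a window of four octets and returns two values: the code point
;; and the number of following octets to skip.
(defun decode-utf-8-window (b0 b1 b2 b3)
  (flet ((cont (b) (logand b #x3F)))   ; payload bits of a continuation octet
    (cond ((< b0 #x80)                 ; 0xxxxxxx: one octet
           (values b0 0))
          ((= (logand b0 #xE0) #xC0)   ; 110xxxxx: two octets
           (values (logior (ash (logand b0 #x1F) 6)
                           (cont b1))
                   1))
          ((= (logand b0 #xF0) #xE0)   ; 1110xxxx: three octets
           (values (logior (ash (logand b0 #x0F) 12)
                           (ash (cont b1) 6)
                           (cont b2))
                   2))
          ((= (logand b0 #xF8) #xF0)   ; 11110xxx: four octets
           (values (logior (ash (logand b0 #x07) 18)
                           (ash (cont b1) 12)
                           (ash (cont b2) 6)
                           (cont b3))
                   3))
          (t                           ; stray continuation or invalid lead
           (values #xFFFD 0)))))

(decode-utf-8-window #xE2 #x82 #xAC 0)
;; => 8364, 2   (U+20AC, skip the two continuation octets)
```

Only the octets that the lead octet calls for are ever examined, so the
values padded onto the tail of the series are never touched for
well-formed input.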
Now that's an ugly macro. However, it isn't just useful for decoding unicode.
Example: Decode the %nn parts of a URI.

    (decode-series uri-characters
      :padding #\null
      :decoder (lambda (c0 c1 c2) ;; lookahead by two characters
                 (if (char/= c0 #\%)
                     (values c0 0) ;; if not a %, pass it through
                     ;; Otherwise, decode the following characters,
                     ;; and return that character.  Use #\NULL for next two
                     ;; outputs.
                     (values (code-char (+ (* (digit-char-p c1 16) 16)
                                           (digit-char-p c2 16)))
                             2))))

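
To show the decode-and-skip behavior end to end, here is a minimal
plain-list model that runs without the Series package. DECODE-LIST,
PERCENT-DECODER and URI-DECODE are hypothetical names for this sketch.
It assumes the decoded value should be a character (hence the
CODE-CHAR), uses (code-char 0) as the padding, and does no validation
of malformed escapes.

```lisp
;; Hypothetical eager model of the decode-and-skip loop; the real macro
;; does this lazily over series.
(defun decode-list (items arity padding decoder)
  "Call DECODER on each overlapping ARITY-tuple of ITEMS.  DECODER returns
a value and a skip count; the value is emitted, then `skip' copies of
PADDING are emitted in place of the following positions."
  (let ((padded (append items
                        (make-list (max 0 (1- arity)) :initial-element padding)))
        (skip 0)
        (output '()))
    (loop for i from 0 below (length items)
          do (if (> skip 0)
                 (progn (push padding output)
                        (decf skip))
                 (multiple-value-bind (value count)
                     (apply decoder (subseq padded i (+ i arity)))
                   (push value output)
                   (setf skip count))))
    (nreverse output)))

(defun percent-decoder (c0 c1 c2)
  (if (char/= c0 #\%)
      (values c0 0)
      (values (code-char (+ (* 16 (digit-char-p c1 16))
                            (digit-char-p c2 16)))
              2)))

(defun uri-decode (string)
  ;; Dropping the padding afterwards also drops a genuine %00, which is
  ;; acceptable for a sketch.
  (let ((pad (code-char 0)))
    (coerce (remove pad (decode-list (coerce string 'list) 3 pad
                                     #'percent-decoder))
            'string)))

(uri-decode "a%41b%20c")
;; => "aAb c"
```

The output has one element per input element, so the stages downstream
stay in lockstep with the input and the padding can be filtered out at
the very end.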
The performance was pretty good, if I recall. The series package
code-walks all these forms and turns utf-8->ucs-2 into a huge tagbody
that is quite efficient.
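
As a hand-written illustration of what that fusion buys (this is not
actual Series output, and the function names are made up for the
example), compare two mapping stages composed the naive way with the
same stages collapsed into a single loop:

```lisp
;; Hypothetical stage: same rule as UCS-4->UCS-2 above, on one code.
(defun clamp-to-bmp (code)
  (if (<= code #xFFFF) code #xFFFD))

;; Unfused: each stage builds an intermediate list.
(defun codes->string/unfused (codes)
  (coerce (mapcar #'code-char (mapcar #'clamp-to-bmp codes)) 'string))

;; Fused: one pass, no intermediate sequences.
(defun codes->string/fused (codes)
  (let ((out (make-string (length codes))))
    (loop for code in codes
          for i from 0
          do (setf (char out i) (code-char (clamp-to-bmp code))))
    out))

(codes->string/fused '(72 105))
;; => "Hi"
```

You write the first form and the series package gives you something
with the cost profile of the second.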
On Thu, Dec 15, 2011 at 11:50 AM, Art Obrezan <artobrezan@yahoo.com> wrote:
> 2) how do you keep your lisp code clean and clear?
Abstraction. Always look for a way to separate the problem into simple
building blocks. In this case, the series abstraction was the natural
mechanism.
--
~jrm