RSS Feed

Lisp Project of the Day

parseq

You can support this project by donating at:

Donate using PatreonDonate using Liberapay

Or see the list of project sponsors.

parseqparsingtext-processing

Documentation😀
Docstrings😀
Tests 😀
Examples😀
RepositoryActivity🥺
CI 🥺

With this library, you can write parsers to process strings, lists and binary data!

Let's take a look at one of the examples. It is a parser for the dates from RFC 5322. This format is used in email messages:

Thu, 13 Jul 2017 13:28:03 +0200

Parser consist of rules, combined in different ways. We'll go through the parser's parts one by one.

This simple rule matches one space character:

POFTHEDAY> (parseq:defrule FWS ()
               #\space)

;; It matches if string contains one space
POFTHEDAY> (parseq:parseq 'FWS
                          " ")
#\ 
T

;; But not on string from many spaces:
POFTHEDAY> (parseq:parseq 'FWS
                          "   ")
NIL
NIL

;; And of cause not on some other string
POFTHEDAY> (parseq:parseq 'FWS
                          "foo")
NIL
NIL

The next rule we need is the rule to parse hours, minutes and seconds. These parts have two digits and we'll use rep expression to specify how many digits matches the rule:

POFTHEDAY> (parseq:defrule hour ()
               (rep 2 digit))

POFTHEDAY> (parseq:parseq 'hour
                          "15")
(#\1 #\5)
T

See, this rule returns digits as the list! But to make it useful, we need the integer. Parseq rules support different kinds of transformations. They can are optional and can be specified like this:

;; This transformation will return as the string instead of list:
POFTHEDAY> (parseq:defrule hour ()
               (rep 2 digit)
             (:string))

POFTHEDAY> (parseq:parseq 'hour
                          "15")
"15"
T

;; Now we'll add a transformation from string to integer:
POFTHEDAY> (parseq:defrule hour ()
               (rep 2 digit)
             (:string)
             (:function #'parse-integer))

POFTHEDAY> (parseq:parseq 'hour
                          "15")
15 (4 bits, #xF, #o17, #b1111)
T

We'll define the minute and second rules the same way.

The next rule matches the abbreviated day of the week. It combines other rules or terms using or expression:

POFTHEDAY> (parseq:defrule day-of-week ()
               (or "Mon" "Tue" "Wed"
                   "Thu" "Fri" "Sat"
                   "Sun"))

POFTHEDAY> (parseq:parseq 'day-of-week
                          "Friday")
NIL
NIL

POFTHEDAY> (parseq:parseq 'day-of-week
                          "Fri")
"Fri"
T

;; The same way we define a rule for month abbrefiation
POFTHEDAY> (parseq:defrule month ()
               (or "Jan" "Feb" "Mar" "Apr"
                   "May" "Jun" "Jul" "Aug"
                   "Sep" "Oct" "Nov" "Dec"))

A little bit complex rule is used for matching timezone. Timezone is a string from 4 digits prefixed by plus or minus sign. We'll combine this knowledge using or/and expressions and will use option :string to get results as a single string:

POFTHEDAY> (parseq:defrule zone ()
               (and (or "+" "-")
                    (rep 4 digit))
             (:string))

POFTHEDAY> (parseq:parseq 'zone
                          "0300")
NIL
NIL

POFTHEDAY> (parseq:parseq 'zone
                          "+0300")
"+0300"
T

POFTHEDAY> (parseq:parseq 'zone
                          "-0300")
"-0300"
T

Now let's return to the time of day parsing. According to the RFC, seconds part is optional. Parseq has an expression ? to match optional rules.

Here is how a rule matching the time of day will look like:

POFTHEDAY> (parseq:defrule time-of-day ()
               (and hour
                    ":"
                    minute
                    (? (and ":" second))))

POFTHEDAY> (parseq:parseq 'time-of-day
                          "10:31:05")
(10 ":" 31 (":" 5))
T

To make the rule return only digits we have to use :choose transform. Choose extracts from results by index. You can specify index as an integer or as a list if you need to extract the value from the nested list:

POFTHEDAY> (parseq:defrule time-of-day ()
               (and hour
                    ":"
                    minute
                    (? (and ":" second)))
             (:choose 0 2 '(3 1)))

POFTHEDAY> (parseq:parseq 'time-of-day
                          "10:31:05")
(10 31 5)

;; Seconds are optional because of ? expression:
POFTHEDAY> (parseq:parseq 'time-of-day
                          "10:31")
(10 31 NIL)
T

;; This (:choose 0 2 '(3 1)) is equivalent to:
POFTHEDAY> (let ((r '(10 ":" 31 (":" 5))))
             (list (elt r 0)
                   (elt r 2)
                   (elt (elt r 3)
                        1)))
(10 31 5)

Another interesting transformation rule is :flatten. It is used to "streamline" result having nested structure and used in this rule which matches both time of day and timezone:

;; Without flatten we'll get nested lists:
POFTHEDAY> (parseq:defrule time ()
               (and time-of-day FWS zone)
             (:choose 0 2))

POFTHEDAY> (parseq:parseq 'time
                          "10:31 +0300")
((10 31 NIL) "+0300")

POFTHEDAY> (parseq:defrule time ()
               (and time-of-day FWS zone)
             (:choose 0 2)
             (:flatten))

;; Pay attention, :flatten removes nils:
POFTHEDAY> (parseq:parseq 'time
                          "10:31 +0300")
(10 31 "+0300")
T

Now, knowing how rules are combined and data is transformed, you will be able to read rest rules yourself:

POFTHEDAY> (parseq:defrule day ()
               (and (? FWS)
                    (rep (1 2) digit)
                    FWS)
             (:choose 1)
             (:string)
             (:function #'parse-integer))

POFTHEDAY> (parseq:defrule year ()
               (and FWS
                    (rep 4 digit)
                    FWS)
             (:choose 1)
             (:string)
             (:function #'parse-integer))

POFTHEDAY> (parseq:defrule date ()
               (and day month year))

(parseq:defrule date-time ()
    (and (? (and day-of-week ","))
         date
         time)
  (:choose '(0 0) 1 2)
  (:flatten))

Another cool Parseq's feature is an ability to debug parser execution. Now I'll turn on this debug mode and parse a string:

POFTHEDAY> (parseq:trace-rule 'date-time :recursive t)

POFTHEDAY> (parseq:parseq 'date-time
                          "Thu, 13 Jul 2017 13:28:03 +0200")
1: DATE-TIME 0?
 2: DAY-OF-WEEK 0?
 2: DAY-OF-WEEK 0-3 -> "Thu"
 2: DATE 4?
  3: DAY 4?
   4: FWS 4?
   4: FWS 4-5 -> #\ 
   4: FWS 7?
   4: FWS 7-8 -> #\ 
  3: DAY 4-8 -> 13
  3: MONTH 8?
  3: MONTH 8-11 -> "Jul"
  3: YEAR 11?
   4: FWS 11?
   4: FWS 11-12 -> #\ 
   4: FWS 16?
   4: FWS 16-17 -> #\ 
  3: YEAR 11-17 -> 2017
 2: DATE 4-17 -> (13 "Jul" 2017)
 2: TIME 17?
  3: TIME-OF-DAY 17?
   4: HOUR 17?
   4: HOUR 17-19 -> 13
   4: MINUTE 20?
   4: MINUTE 20-22 -> 28
   4: SECOND 23?
   4: SECOND 23-25 -> 3
  3: TIME-OF-DAY 17-25 -> (13 28 3)
  3: FWS 25?
  3: FWS 25-26 -> #\ 
  3: ZONE 26?
  3: ZONE 26-31 -> "+0200"
 2: TIME 17-31 -> (13 28 3 "+0200")
1: DATE-TIME 0-31 -> ("Thu" 13 "Jul" 2017 13 28 3 "+0200")

("Thu" 13 "Jul" 2017 13 28 3 "+0200")
T

We can improve this parser by using :function transformation to return a local-time:timestamp. First, let's redefine rule for matching the month and make it return the month number:

POFTHEDAY> (parseq:defrule january  () "Jan" (:constant 1))
POFTHEDAY> (parseq:defrule february () "Feb" (:constant 2))
POFTHEDAY> (parseq:defrule march    () "Mar" (:constant 3))
POFTHEDAY> (parseq:defrule april    () "Apr" (:constant 4))
POFTHEDAY> (parseq:defrule may      () "May" (:constant 5))
POFTHEDAY> (parseq:defrule june     () "Jun" (:constant 6))
POFTHEDAY> (parseq:defrule july     () "Jul" (:constant 7))
POFTHEDAY> (parseq:defrule august   () "Aug" (:constant 8))
POFTHEDAY> (parseq:defrule september () "Sep" (:constant 9))
POFTHEDAY> (parseq:defrule october  () "Oct" (:constant 10))
POFTHEDAY> (parseq:defrule november () "Nov" (:constant 11))
POFTHEDAY> (parseq:defrule december () "Dec" (:constant 12))

POFTHEDAY> (parseq:defrule month ()
               (or january february march april
                   may june july august
                   september october november december))

POFTHEDAY> (parseq:parseq 'month "Sep")
9 (4 bits, #x9, #o11, #b1001)
T

Next, we need to reimplement the rule matching a timezone to make it return local-time:timezone.

We'll be using an advanced technique of binding variables to pass value from one rule to another, because I want to store the timezone as a string and to parse it's hour and minute parts simultaneously.

To accomplish this task, we have to divide or timezone matching rule into two. The first rule will match it as a string of sign and four digits. Then it will save the result into an external variable and exit with a nil result to give a chance to execute the second rule:

POFTHEDAY> (parseq:defrule zone-as-str ()
               (and (or #\+ #\-)
                    (rep 4 digit))
             (:string)
             (:external zone-as-str)
             ;; Save the value into a variable:
             (:lambda (z)
               (setf zone-as-str z))
             ;; and just exit:
             (:test (z)
               (declare (ignore z))
               nil))

Now we'll redefine our zone rule to call zone-as-str first and then to parse the same text again, this time as hours and minutes. As the final step, it creates a local-time:timezone object:

POFTHEDAY> (parseq:defrule zone ()
               (or zone-as-str
                   (and (or #\+ #\-)
                        hour
                        minute))
             (:let zone-as-str)
             (:lambda (sign hour minute)
               (local-time::%make-simple-timezone
                zone-as-str
                zone-as-str
                ;; This is an offset in seconds:
                (+ (* (ecase sign
                        (#\+ 1)
                        (#\- -1))
                      hour
                      3600)
                   (* minute 60)))))

;; Here is the execution trace:
POFTHEDAY> (parseq:parseq 'zone
                          "+0300")
1: ZONE 0?
 2: ZONE-AS-STR 0?
 2: ZONE-AS-STR -|
 2: HOUR 1?
 2: HOUR 1-3 -> 3
 2: MINUTE 3?
 2: MINUTE 3-5 -> 0
1: ZONE 0-5 -> #<LOCAL-TIME::TIMEZONE +0300>
#<LOCAL-TIME::TIMEZONE +0300>
T

Now we need to redefine the original date-time rule, to create local-time:timestamp as the result:

POFTHEDAY> (parseq:parseq 'date-time
                          "Thu, 13 Jul 2017 13:28:03 +0200")
("Thu" 13 7 2017 13 28 3 #<LOCAL-TIME::TIMEZONE +0200>)
T

POFTHEDAY> (parseq:defrule date-time ()
               (and (? (and day-of-week ","))
                    date
                    time)
             (:choose '(1 2)                       ; year
                      '(1 1)                       ; month
                      '(1 0)                       ; day
                      '(2 0)                       ; hour
                      '(2 1)                       ; minute
                      '(2 2)                       ; second
                      '(2 3))                      ; timezone
             
             (:lambda (year month day hour minute second timezone)
               (local-time:encode-timestamp
                0             ; nanoseconds
                (or second 0) ; secs are optional
                minute
                hour
                day
                month
                year
                :timezone (or timezone
                              local-time:*default-timezone*))))

POFTHEDAY> (parseq:parseq 'date-time
                          "Thu, 13 Jul 2017 13:28:03 +0200")
@2017-07-13T14:28:03.000000+03:00
T

I've got a different value for the time because local-time prints timestamp in my timezone which is UTC+3.

The cool feature of the Parseq is its ability to work with any data, including binary. This way it can be used to parse binary formats.

As an example of parsing binary data, Parseq includes this parser rules for working with PNG image format:

https://github.com/mrossini-ethz/parseq/blob/master/examples/png.lisp

There are other interesting features. Please, read the docs to learn more.

If you are aware of other parsing libraries which worth to be written about, let me know in the comments.


Brought to you by 40Ants under Creative Commons License