RSS Feed

Lisp Project of the Day

re

You can support this project by donating at:

Donate using PatreonDonate using Liberapay

Or see the list of project sponsors.

retext-processing

This is a small regular expressions engine, made by Jeffrey Massung (@codeninja_blog).

This engine uses Lua's regexp syntax.

By the way, I was interested in why did Lua developers choose such syntax over common Posix/Perl syntax. I found this interesting email from Philippe Lhoste:

http://lua-users.org/lists/lua-l/2001-04/msg00244.html

In short, this alternative syntax was chosen to not support all the features from Perl regexps and to make Lua's implementation short.

Today lua code contains 628 lines with regexp implementation:

https://github.com/lua/lua/blob/e4607523234f16ed9ed0436340b9315377dbfe7f/lstrlib.c#L352-L980

Common Lisp version is about 645 lines but is based on a separate parser library.

I am wondering how does re's performance compare with cl-ppcre? Let's check!

We will test performance on a simple example. It will extract a subreddit's name from the URL:

POFTHEDAY> (cl-ppcre:register-groups-bind
               (subreddit)
               (".*/r/(.*?)/"
                "https://www.reddit.com/r/Common_Lisp/")
             subreddit)
"Common_Lisp"

POFTHEDAY> (first
            (re:match-groups
             (re:match-re #r".*/r/(.-)/"
                          "https://www.reddit.com/r/Common_Lisp/")))
"Common_Lisp"

Here is what I've got running cl-ppcre 1 million times:

POFTHEDAY> (time
            (loop repeat *num-iterations*
                  do (cl-ppcre:register-groups-bind
                         (subreddit)
                         (".*/r/(.*?)/"
                          "https://www.reddit.com/r/Common_Lisp/")
                       subreddit)))
Evaluation took:
  0.731 seconds of real time
  0.615337 seconds of total run time (0.541851 user, 0.073486 system)
  [ Run times consist of 0.061 seconds GC time, and 0.555 seconds non-GC time. ]
  84.13% CPU
  1,614,120,972 processor cycles
  956 page faults
  160,002,992 bytes consed

And here is the result of the same amount of iterations for 're':

POFTHEDAY> (time
            (loop repeat *num-iterations*
                  do (first
                      (re:match-groups
                       (re:match-re #r".*/r/(.-)/"
                                    "https://www.reddit.com/r/Common_Lisp/")))))
Evaluation took:
  14.116 seconds of real time
  14.126191 seconds of total run time (14.056847 user, 0.069344 system)
  [ Run times consist of 0.301 seconds GC time, and 13.826 seconds non-GC time. ]
  100.07% CPU
  31,167,841,926 processor cycles
  2 page faults
  3,824,010,656 bytes consed

As you can see, cl-ppre is 20 times faster and consumes less memory than 're'.

It would be interesting to compare cl-ppcre with other regexp engines. I've found a test suite which already compares many engines:

https://github.com/rust-leipzig/regex-performance

But that is the story for another thread.


Brought to you by 40Ants under Creative Commons License