Today we continue to review web crawling tools from @shinmera.

This review will be short because I'll reuse code from the previous post and improve it by replacing four lines with only one.

The CLSS system lets you use CSS3 selectors to query HTML nodes produced by Plump.

Here is how we can improve our simple Twitter crawler:

POFTHEDAY> (defvar *raw-html*
             (dex:get ""))

POFTHEDAY> (defvar *html* (plump:parse *raw-html*))

           ;; Now I'll replace these lines
           ;; with one clss:select call:
           ;; (remove-if-not (lambda (div)
           ;;                  (str:containsp "tweet-text"
           ;;                                 (plump:attribute div "class")))
           ;;                (plump:get-elements-by-tag-name *html* "p"))
POFTHEDAY> (defparameter *posts*
             (clss:select ".tweet-text" *html*))

POFTHEDAY> (type-of *posts*)

POFTHEDAY> (loop repeat 5
                 for post across *posts*
                 for full-text = (plump:render-text post)
                 for short-text = (str:shorten 40 full-text)
                 do (format t "- ~A~2%" short-text))
- Hi, I'm a #gamedev. My latest project...

- いらっしゃいませ~!

- The logic of

- The AI is extremely rough still, but ...


As a bonus, I want to show you that CLSS even supports pseudo-classes:

POFTHEDAY> (plump:parse "
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>")
#<PLUMP-DOM:ROOT {10031A93A3}>

POFTHEDAY> (clss:select "li:first-child" *)
#(#<PLUMP-DOM:ELEMENT li {100322C883}>)

POFTHEDAY> (plump:serialize * nil)
"<li>First item</li>"
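
Pseudo-classes can also be combined with other selector parts. Here is a minimal sketch, assuming Quicklisp is available and that CLSS handles the class selector, descendant combinator and :last-child the way CSS3 describes them. The markup and the names *doc* and *last-item* are made up for illustration:

```lisp
(ql:quickload '(:plump :clss) :silent t)

;; Hypothetical markup, written without whitespace between the tags.
(defparameter *doc*
  (plump:parse
   "<ul class=\"menu\"><li>First item</li><li>Second item</li><li>Third item</li></ul>"))

;; A class selector, a descendant combinator and a pseudo-class
;; in a single query. The result is a vector of matching nodes.
(defparameter *last-item*
  (clss:select "ul.menu li:last-child" *doc*))

;; Print the text of every matched node.
(loop for node across *last-item*
      do (format t "~A~%" (plump:text node)))
```

If the assumptions hold, the query should match only the last list item.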

Sadly, the documentation only says that it supports almost all CSS3 selectors, but doesn't enumerate them. However, we can find them in the sources:

POFTHEDAY> (rutils:hash-table-keys

There is a clss:define-pseudo-selector macro which allows defining a custom pseudo-selector.
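
To give an idea of how this might look, here is a rough sketch modeled on the example I remember from the CLSS README. The selector name external-link, the "starts with http" test and the sample markup are my own assumptions, not something CLSS ships with:

```lisp
(ql:quickload '(:plump :clss) :silent t)

;; Hypothetical pseudo-selector: matches nodes whose href
;; attribute starts with "http". The body receives the candidate
;; node and returns true when the node should be selected.
(clss:define-pseudo-selector external-link (node)
  (let ((href (plump:attribute node "href")))
    (and href (eql 0 (search "http" href)))))

;; Made-up markup with one relative and one absolute link.
(defparameter *links*
  (clss:select "a:external-link"
               (plump:parse
                "<a href=\"/about\">About</a><a href=\"https://example.com\">Example</a>")))

(loop for link across *links*
      do (format t "~A~%" (plump:attribute link "href")))
```

Once defined, the new pseudo-selector can be used in a selector string just like the built-in ones.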

Tomorrow we'll learn about a more sophisticated tool for web scraping - lQuery.

