Lisp HUG Maillist Archive

Is there a way to have a tcp stream being interpreted as utf-8?

Hello,

I'm creating a tcp connection between two programs, one of which is
written in Lispworks. I'm writing s-expressions on the stream from the
"other" program, and Lispworks READs this and then evals it. This
works fine with one exception: the "other" program uses utf-8 and
Lispworks converts each byte to a character, which causes some
characters to be incorrect. I was looking for something like
:external-format, which deals with this for files, but haven't
succeeded.

Does anyone know how to achieve this?


Thanks,

Kristoffer Kvello


Re: Is there a way to have a tcp stream being interpreted as utf-8?

Hi Kristoffer,

On 15 juin 2011, at 11:32, Kristoffer Kvello wrote:

> I'm creating a tcp connection between two programs, one of which is
> written in Lispworks. I'm writing s-expressions on the stream from the
> "other" program, and Lispworks READs this and then evals it. This
> works fine with one exception: the "other" program uses utf-8 and
> Lispworks converts each byte to a character, which causes some
> characters to be incorrect. I was looking for something like
> :external-format, which deals with this for files, but haven't
> succeeded.
> 
> Does anyone know how to achieve this?

I'm not sure whether this will work, but have you tried calling read-sequence on the input stream first, and then:

(external-format:decode-external-string sequence :utf-8)

This returns a Lisp string with the bytes decoded as UTF-8.  You can then turn the string into a sexp using read-from-string.

The remaining problem with the TCP stream is that you may receive only a part of the sexp if both programs are running on different network locations.  If both are on the localhost, then you're fine.
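
One way to cope with receiving only part of a sexp is to buffer the decoded text until the reader accepts it. The following is a minimal sketch in portable Common Lisp, not something posted in the thread: try-read-form is a hypothetical helper name, and it treats an end-of-file condition from read-from-string as "form not complete yet". It cannot detect a truncated bare atom (for example half of a symbol name), so it only fits protocols where every message is a list.

```lisp
(defun try-read-form (text)
  "Hypothetical helper: return (values FORM T) when TEXT contains a
complete form, or (values NIL NIL) when the form is still incomplete."
  (let ((*read-eval* nil))            ; refuse #. while parsing network input
    (handler-case
        (values (read-from-string text) t)
      (end-of-file () (values nil nil)))))

;; Simulate one message arriving in two chunks.
(let ((buffer "(+ 1 "))
  (try-read-form buffer)                              ; not complete yet
  (try-read-form (concatenate 'string buffer "2)")))  ; now a complete form
```

The caller would keep appending newly decoded text to the buffer and retry until the second return value is true.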


Best Regards,
Camille


Re: Is there a way to have a tcp stream being interpreted as utf-8?

On Wed, Jun 15, 2011 at 12:00 PM, Camille Troillard
<camille@osculator.net> wrote:
> I'm not sure if that would work, but have you tried to first read-sequence on the input stream, then call:
>
> (external-format:decode-external-string sequence :utf-8)

I use the following in certain cases where I have a string that has
been treated as a direct 8-bit encoding, but should have been utf-8.
Supplied under the "Both-Parts Licence": if it breaks (or is already
broken), you get to keep both parts :-)

An alternative might be to use external-format:decode-external-string
(as Camille suggests), or flexi-streams:octets-to-string and
flexi-streams:string-to-octets, but my testing indicates that the code
below is noticeably faster (or, if not "noticeably", then at least
"measurably").

(defun reinterpret-string-as-utf8 (string)
  "Useful when a string has been generated from an external source,
where the source data has been interpreted as an 8-bit (direct)
encoding, e.g. :latin-1, but should have been interpreted as :utf-8."
  (declare (optimize (speed 3) (compilation-speed 0)
                     (debug 1) (safety 1)))
  (let ((output-chars-remaining 0)
        (output-accumulator 0))
    (with-output-to-string (s nil :element-type 'character)
      (flet ((output-byte (byte)
               (if (zerop output-chars-remaining)
                 (if (= (logand byte #x80) 0)
                   (write-char (code-char byte) s)
                   (cond ((= (logand byte #xe0) #xc0)
                          (setf output-accumulator (logand byte #x1f))
                          (setf output-chars-remaining 1))
                         ((= (logand byte #xf0) #xe0)
                          (setf output-accumulator (logand byte #x0f))
                          (setf output-chars-remaining 2))
                         ((= (logand byte #xf8) #xf0)
                          (setf output-accumulator (logand byte #x07))
                          (setf output-chars-remaining 3))
                         ((= (logand byte #xfc) #xf8)
                          (setf output-accumulator (logand byte #x03))
                          (setf output-chars-remaining 4))
                         ((= (logand byte #xfe) #xfc) ; 1111110x lead byte
                          (setf output-accumulator (logand byte #x01))
                          (setf output-chars-remaining 5))
                         (t (error "Invalid UTF-8 byte ~A" byte))))
                 (progn
                   (assert (= (logand byte #xc0) #x80))
                   (setf output-accumulator (logior (ash output-accumulator 6)
                                                    (logand byte #x3f)))
                   (decf output-chars-remaining)
                   (when (zerop output-chars-remaining)
                     (write-char (code-char output-accumulator) s))))))
        (loop for c across string
              do (output-byte (char-code c))
              finally (assert (= output-chars-remaining 0)))))))

#|| ; test forms
(defparameter *test-str*
  (let ((str-1 (external-format:decode-external-string
                (external-format:encode-lisp-string "æøåÆØÅtest" :utf-8)
                :latin-1)))
    (with-output-to-string (s)
      (dotimes (n 1000)
        (write-string str-1 s))
      s)))

(time (values
       (dotimes (n 100)
         (reinterpret-string-as-utf8 *test-str*))))

(time (values
       (dotimes (n 100)
         (external-format:decode-external-string
          (map '(simple-array (unsigned-byte 8) 1)
               #'char-code *test-str*)
          :utf-8))))

(time (values
       (dotimes (n 100)
         (external-format:decode-external-string
          (external-format:encode-lisp-string *test-str* :latin-1)
          :utf-8))))

(time (values
       (dotimes (n 100)
         (flexi-streams:octets-to-string
          (flexi-streams:string-to-octets *test-str*)
          :external-format :utf-8))))

||#
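
The lead-byte masks used in reinterpret-string-as-utf8 can be checked in isolation. A minimal sketch, assuming a hypothetical helper name utf8-continuation-count that is not part of the code above: it maps a lead byte to the number of continuation bytes that must follow. Unlike the function above, it follows RFC 3629 and rejects the obsolete five- and six-byte forms.

```lisp
(defun utf8-continuation-count (byte)
  "Hypothetical helper: return the number of continuation bytes that
follow lead BYTE, or NIL if BYTE cannot start a UTF-8 sequence."
  (cond ((= (logand byte #x80) #x00) 0)   ; 0xxxxxxx: plain ASCII
        ((= (logand byte #xe0) #xc0) 1)   ; 110xxxxx: two-byte sequence
        ((= (logand byte #xf0) #xe0) 2)   ; 1110xxxx: three-byte sequence
        ((= (logand byte #xf8) #xf0) 3)   ; 11110xxx: four-byte sequence
        (t nil)))                         ; continuation byte or invalid lead

;; #xC3 starts the two-byte encoding of many Latin-1 letters,
;; so exactly one continuation byte must follow it.
(utf8-continuation-count #xc3)
```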


Re: Is there a way to have a tcp stream being interpreted as utf-8?

Thanks so much both of you! I got a tip to use Flexistreams, and I've
just concluded that this library solves my problem with almost no need
to modify my code. So I thank you too, Edi!

(I ended up taking the TCP stream (which is binary), creating a
UTF-8 flexi-stream from it, and then simply passing that on.)
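
The wrapping step described above might look roughly like the following. This is a sketch under assumptions, not Kristoffer's actual code: it assumes Quicklisp is available to load flexi-streams, wrap-as-utf8 and read-demo-form are hypothetical names, and an in-memory octet stream stands in for the binary TCP stream so the example runs without a network connection.

```lisp
;; Assumption: Quicklisp is installed and can load flexi-streams.
(ql:quickload :flexi-streams)

(defun wrap-as-utf8 (binary-stream)
  "Hypothetical helper: wrap BINARY-STREAM, whose element type is
(unsigned-byte 8), in a flexi-stream that decodes and encodes UTF-8."
  (flexi-streams:make-flexi-stream binary-stream :external-format :utf-8))

(defun read-demo-form ()
  "Read one form from an in-memory stand-in for the socket stream."
  ;; The octets are the UTF-8 encoding of the text ("æ" 42);
  ;; the letter æ is encoded as the two bytes 195 166.
  (let* ((octets (coerce '(40 34 195 166 34 32 52 50 41)
                         '(vector (unsigned-byte 8))))
         (socket-stand-in
           (flexi-streams:make-in-memory-input-stream octets))
         (*read-eval* nil))
    (read (wrap-as-utf8 socket-stand-in))))
```

With a real connection, the stand-in would be replaced by the binary socket stream, and the existing READ-based code could stay unchanged because the flexi-stream behaves as a character stream.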


Kristoffer




On Wed, Jun 15, 2011 at 1:15 PM, Raymond Wiker <rwiker@gmail.com> wrote:
>
> On Wed, Jun 15, 2011 at 12:00 PM, Camille Troillard
> <camille@osculator.net> wrote:
>> Hi Kristoffer,
>>


Updated at: 2020-12-10 08:37 UTC