Re: Is there a way to have a tcp stream being interpreted as utf-8?
On Wed, Jun 15, 2011 at 12:00 PM, Camille Troillard
<camille@osculator.net> wrote:
> Hi Kristoffer,
>
> On 15 juin 2011, at 11:32, Kristoffer Kvello wrote:
>
>> I'm creating a tcp connection between two programs, one of which is
>> written in Lispworks. I'm writing s-expressions on the stream from the
>> "other" program, and Lispworks READs this and then evals it. This
>> works fine with one exception: the "other" program uses utf-8 and
>> Lispworks converts each byte to a character, which causes some
>> characters to be incorrect. I was looking for something like
>> :external-format which deals with this for files, but haven't
>> succeeded.
>>
>> Does anyone know how to achieve this?
>
> I'm not sure if that would work, but have you tried calling read-sequence on the input stream first, and then calling:
>
> (external-format:decode-external-string sequence :utf-8)
>
> Which will return a lisp string, with the proper encoding interpretation. You can then turn the string into a sexp using read-from-string.
>
> The remaining problem with the TCP stream is that you may receive only a part of the sexp if both programs are running on different network locations. If both are on the localhost, then you're fine.
>
>
> Best Regards,
> Camille
>
>
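On the partial-sexp problem Camille mentions: one possible way to cope
is to accumulate the decoded text and only evaluate once the reader
accepts it as a complete form. The sketch below is an illustration
only; try-read-form is a hypothetical helper, not part of LispWorks or
of any code in this thread, and it assumes each message is a single
parenthesised form.

```lisp
;; Hypothetical helper (an assumption for illustration, not an existing API).
(defun try-read-form (text)
  "Return (values FORM T) when TEXT starts with a complete form,
or (values NIL NIL) when the reader hits end of input first."
  (handler-case (values (read-from-string text) t)
    ;; The reader signals END-OF-FILE when the text stops mid-form.
    (end-of-file () (values nil nil))))

(try-read-form "(+ 1 2")    ; incomplete: keep buffering
(try-read-form "(+ 1 2)")   ; complete: safe to hand to EVAL
```

This cannot detect a truncated top-level atom (half a symbol still
reads as a symbol), so it is only suitable when every message is
wrapped in parentheses.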
I use the following in certain cases where I have a string that has
been treated as a direct 8-bit encoding, but should have been utf-8.
Supplied under the "Both-Parts Licence": if it breaks (or is already
broken), you get to keep both parts :-)
An alternative might be to use external-format:decode-external-string
(as Camille suggests), or flexi-streams:octets-to-string together with
flexi-streams:string-to-octets, but my testing indicates that the code
below is noticeably faster (or, if not "noticeably", then at least
"measurably").
(defun reinterpret-string-as-utf8 (string)
  "Useful when a string has been generated from an external source,
where the source data has been interpreted as an 8-bit (direct)
encoding - e.g., :latin-1, but should have been interpreted as :utf-8."
  (declare (optimize (speed 3) (compilation-speed 0)
                     (debug 1) (safety 1)))
  (let ((output-chars-remaining 0)   ; continuation bytes still expected
        (output-accumulator 0))      ; code point bits collected so far
    (with-output-to-string (s nil :element-type 'character)
      (flet ((output-byte (byte)
               (if (zerop output-chars-remaining)
                   ;; Start of a character: plain ASCII or a lead byte.
                   (if (= (logand byte #x80) 0)
                       (write-char (code-char byte) s)
                       (cond ((= (logand byte #xe0) #xc0) ; 110xxxxx
                              (setf output-accumulator (logand byte #x1f))
                              (setf output-chars-remaining 1))
                             ((= (logand byte #xf0) #xe0) ; 1110xxxx
                              (setf output-accumulator (logand byte #x0f))
                              (setf output-chars-remaining 2))
                             ((= (logand byte #xf8) #xf0) ; 11110xxx
                              (setf output-accumulator (logand byte #x07))
                              (setf output-chars-remaining 3))
                             ((= (logand byte #xfc) #xf8) ; 111110xx
                              (setf output-accumulator (logand byte #x03))
                              (setf output-chars-remaining 4))
                             ((= (logand byte #xfe) #xfc) ; 1111110x
                              (setf output-accumulator (logand byte #x01))
                              (setf output-chars-remaining 5))
                             (t (error "Invalid UTF-8 byte ~A" byte))))
                   ;; Continuation byte: must look like 10xxxxxx.
                   (progn
                     (assert (= (logand byte #xc0) #x80))
                     (setf output-accumulator (logior (ash output-accumulator 6)
                                                      (logand byte #x3f)))
                     (decf output-chars-remaining)
                     (when (zerop output-chars-remaining)
                       (write-char (code-char output-accumulator) s))))))
        (loop for c across string
              do (output-byte (char-code c))
              finally (assert (= output-chars-remaining 0)))))))
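
To make the bit arithmetic above concrete, here is a small worked
example of the two-byte case. It is a sketch for illustration only:
decode-two-byte-utf8 is a hypothetical helper and is not used by
reinterpret-string-as-utf8.

```lisp
;; Hypothetical helper for illustration (an assumption, not part of the code above).
(defun decode-two-byte-utf8 (lead continuation)
  "Combine a 110xxxxx lead byte and a 10xxxxxx continuation byte
into the code point they encode."
  (assert (= (logand lead #xe0) #xc0))
  (assert (= (logand continuation #xc0) #x80))
  ;; Five payload bits from the lead byte, six from the continuation byte.
  (logior (ash (logand lead #x1f) 6)
          (logand continuation #x3f)))

;; #xC3 #xA6 is the UTF-8 encoding of U+00E6 (æ):
;; (logand #xc3 #x1f) = 3, (logand #xa6 #x3f) = 38, and 3*64 + 38 = 230 = #xE6.
(decode-two-byte-utf8 #xc3 #xa6)   ; => 230
```

The longer sequences work the same way, with fewer payload bits in the
lead byte and one more 6-bit shift per continuation byte.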
#|| ; test forms
(defparameter *test-str*
  (let ((str-1 (external-format:decode-external-string
                (external-format:encode-lisp-string "æøåÆØÅtest" :utf-8)
                :latin-1)))
    (with-output-to-string (s)
      (dotimes (n 1000)
        (write-string str-1 s))
      s)))

(time (values
       (dotimes (n 100)
         (reinterpret-string-as-utf8 *test-str*))))

(time (values
       (dotimes (n 100)
         (external-format:decode-external-string
          (map '(simple-array (unsigned-byte 8) 1)
               #'char-code *test-str*)
          :utf-8))))

(time (values
       (dotimes (n 100)
         (external-format:decode-external-string
          (external-format:encode-lisp-string *test-str* :latin-1)
          :utf-8))))

(time (values
       (dotimes (n 100)
         (flexi-streams:octets-to-string
          (flexi-streams:string-to-octets *test-str*)
          :external-format :utf-8))))
||#