Lisp HUG Maillist Archive

encoding: detecting external format is mistaken?

Hi.

I need some help with encoding issues in external files.

LW 7.1.2 on Linux, with 'UTF-8' set everywhere in the locale.
With the defaults in place, LW goes wrong with UTF-8 files, while
LATIN-1 and UNICODE/UTF-16 are ok.

Comparing the output of #'stream-external-format and the unix 'file'
command, using 3 copies of a sample file, encoded in UTF-8, LATIN-1
and UTF-16 respectively:

  (mapcar #'(lambda (f)
              (list
               (let ((out (sys:run-shell-command (format nil "file ~A" f)
                                        :wait nil :output :stream)))
                 (with-open-stream (out out) (read-line out)))
               (with-open-file (ss f)
                 (stream-external-format ss))))
          '("test.utf8.txt" "test.latin1.txt" "test.unicode.txt"))

yields:

  (("test.utf8.txt: UTF-8 Unicode text" (:LATIN-1 :EOL-STYLE :LF))
   ("test.latin1.txt: ISO-8859 text" (:LATIN-1 :EOL-STYLE :LF))
   ("test.unicode.txt: Little-endian UTF-16 Unicode text" (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :LF)))

Note the :LATIN-1 suggestion for the UTF-8 file.

The 3 files:

Re: encoding: detecting external format is mistaken?

Assuming that this happens as part of an asdf/quicklisp operation, it may be sufficient to add

(pushnew :asdf-unicode *features*)

to your .lispworks file.


On 29 Oct 2020, at 14:15, anders.vinjar@bek.no wrote:

Hi.

I need some help with encoding issues in external files.

LW 7.1.2 on Linux, with 'UTF-8' set everywhere in the locale.
With the defaults in place, LW goes wrong with UTF-8 files, while
LATIN-1 and UNICODE/UTF-16 are ok.

Comparing the output of #'stream-external-format and the unix 'file'
command, using 3 copies of a sample file, encoded in UTF-8, LATIN-1
and UTF-16 respectively:

 (mapcar #'(lambda (f)
             (list
              (let ((out (sys:run-shell-command (format nil "file ~A" f)
                                       :wait nil :output :stream)))
                (with-open-stream (out out) (read-line out)))
              (with-open-file (ss f)
                (stream-external-format ss))))
         '("test.utf8.txt" "test.latin1.txt" "test.unicode.txt"))

yields:

 (("test.utf8.txt: UTF-8 Unicode text" (:LATIN-1 :EOL-STYLE :LF))
  ("test.latin1.txt: ISO-8859 text" (:LATIN-1 :EOL-STYLE :LF))
  ("test.unicode.txt: Little-endian UTF-16 Unicode text" (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :LF)))

Note the :LATIN-1 suggestion for the UTF-8 file.

The 3 files:
<test-files.tar.bz2>
Opening the UTF-8 file in the LW editor shows garbled letters
('æ'...), and the editor also erroneously reports the file as LATIN-1.

*specific-valid-file-encodings* and *file-encoding-detection-algorithm*
are left at their defaults, but I have evaluated:
(lw::set-default-character-element-type 'character)

I could force UTF-8, but would then get similar problems with any
LATIN-1 files.

Any suggestions?

Thanks,

-anders

Re: encoding: detecting external format is mistaken?

    R> Assuming that this happens as part of an asdf/quicklisp
    R> operation, it may be sufficient to add (pushnew :asdf-unicode
    R> *features*)

Thanks.

Unfortunately, the behavior has nothing to do with any asdf operation,
and adding :asdf-unicode does not change the behavior.

The minimal test case included in the original post is probably
self-sufficient in any plain LW 7.1.

-anders

_______________________________________________
Lisp Hug - the mailing list for LispWorks users
lisp-hug@lispworks.com
http://www.lispworks.com/support/lisp-hug.html

Re: encoding: detecting external format is mistaken?

>>>>> On Thu, 29 Oct 2020 14:15:10 +0100, anders vinjar said:
> 
> Opening the UTF-8 file in the lw editor presents garbled letters
> ('æ'...), and as well reports it erroneously to be LATIN-1
> 
> *specific-valid-file-encodings* and *file-encoding-detection-algorithm*
> are left as default, but: (lw::set-default-character-element-type 'character)
> 
> I could force UTF-8, but then get similar problems with any LATIN-1
> files.
> 
> Any suggestions?

LispWorks doesn't have any built-in detection of UTF-8, except when there is a
byte order mark at the start.

Try pushing :utf-8 on sys:*specific-valid-file-encodings*.  This will make it
choose UTF-8 if the first 8192 bytes of the file are valid UTF-8.

-- 
Martin Simmons
LispWorks Ltd
http://www.lispworks.com/
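
The suggestion above can be tried in two ways, shown here as a minimal
sketch: either change the setting globally, or rebind the variable around
a single operation. The file name is the one from the original post. Use
one form or the other; whether detection then picks UTF-8 still depends on
the first 8192 bytes being valid UTF-8, as described above.

```lisp
;; Global: add :utf-8 to the encodings tried during detection.
(pushnew :utf-8 sys:*specific-valid-file-encodings*)

;; Scoped alternative: rebind the special variable around one operation,
;; leaving the global value untouched.
(let ((sys:*specific-valid-file-encodings*
        (cons :utf-8 sys:*specific-valid-file-encodings*)))
  (with-open-file (ss "test.utf8.txt")
    (stream-external-format ss)))
```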


Re: encoding: detecting external format is mistaken?

>>>>> On Thu, 29 Oct 2020 15:56:50 GMT, Martin Simmons said:

    M> Try pushing :utf-8 on sys:*specific-valid-file-encodings*.  This
    M> will make it choose UTF-8 if the first 8192 bytes of the file are
    M> valid UTF-8.

Right, this works if I'm lucky.  But for typical use cases in this
project, #'specific-valid-file-encoding needs to check further into the
file than 8192 bytes, possibly the whole file.

Is this length settable through a variable, e.g. to adjust it within a scope?
Something like:

 (let ((sys:*bytes-to-check-for-valid-encodings* 20000))
   (check-my-file file))

Or could #'specific-valid-file-encoding check to the end of the file?

-anders
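
If no such setting turns out to exist, one possible workaround is to
validate the whole file by hand and pass the result as an explicit
:external-format. Below is a minimal sketch in portable Common Lisp. It
assumes only UTF-8 and LATIN-1 need to be told apart (UTF-16 files with a
byte order mark are better left to the built-in detection). The names
file-valid-utf-8-p and guess-external-format are hypothetical helpers, not
LispWorks functions, and the check is structural only: it does not reject
overlong forms or surrogate code points.

```lisp
;; Hypothetical helper, not part of LispWorks.
(defun file-valid-utf-8-p (path)
  "Return T if every byte of the file at PATH fits the UTF-8 byte pattern."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for byte = (read-byte in nil nil)
          while byte
          do (let ((continuations
                     (cond ((< byte #x80) 0)          ; ASCII
                           ((<= #xC2 byte #xDF) 1)    ; 2-byte sequence
                           ((<= #xE0 byte #xEF) 2)    ; 3-byte sequence
                           ((<= #xF0 byte #xF4) 3)    ; 4-byte sequence
                           ;; stray continuation byte or invalid lead byte
                           (t (return-from file-valid-utf-8-p nil)))))
               ;; Each continuation byte must look like 10xxxxxx.
               (dotimes (i continuations)
                 (declare (ignorable i))
                 (let ((next (read-byte in nil nil)))
                   (unless (and next (= (logand next #xC0) #x80))
                     (return-from file-valid-utf-8-p nil)))))
          finally (return t))))

;; Hypothetical helper: choose between the two encodings assumed above.
(defun guess-external-format (path)
  (if (file-valid-utf-8-p path) :utf-8 :latin-1))
```

A file could then be opened with something like
(with-open-file (in file :external-format (guess-external-format file)) ...).
A pure ASCII file is reported as :utf-8, which is harmless since both
encodings agree on ASCII.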



Updated at: 2020-12-10 08:28 UTC