READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Hi

I have the following test function on LWW 4.2.7

(defun test-rs (filename)
   (with-open-file (s filename :direction :input)
     (let ((buffer (make-array (file-length s)
                               :element-type (stream-element-type s)))
           (chars-read 0))
       (format t "Trying to READ-SEQUENCE on stream ~S of length ~D.~%"
               s
               (file-length s))
       (setf chars-read (read-sequence buffer s))
       (cond ((< chars-read (file-length s))
              (format t "Read only ~D chars instead of ~D~%"
                      chars-read
                      (file-length s))
              (subseq buffer 0 chars-read))
             (t
              buffer)))))


It turns out that FILE-LENGTH is greater than the number of characters
actually read from the file.

Has anybody observed this effect?

Cheers
--
Marco

RE: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Could it be something like a UTF8-encoded file containing non-ASCII
characters? See

<http://www.google.com/groups?selm=8765gs6qve.fsf%40bird.agharta.de&oe=UTF-8&output=gplain>

Edi.

Re: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Well,  I thought so.  But I am unsure that is the case.  The file I am  
testing was edited with LWW itself.

Now,  checking with 'wc' on a Linux box shows that the difference  
between the file length and the number of chars read is suspiciously  
close to the number of lines in the file.  The file is a DOS file.   
Maybe an end of line mismatch?

As an aside, this is a show stopper for 'albert' and LW(W)

Cheers

Marco




--
Marco Antoniotti					http://bioinformatics.nyu.edu
NYU Courant Bioinformatics Group		tel. +1 - 212 - 998 3488
715 Broadway 10th FL				fax. +1 - 212 - 998 3484
New York, NY, 10003, U.S.A.

Re: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Thanks to all who responded.  The problem seems indeed one of DOS/UNIX 
linefeed mismatch.

For the case at hand (trying to get 'albert' to work), returning the 
subsequence (memory costly operation) appears to work.

I'll investigate the external encoding feature for future cases.

However, I am not sure that the semantics of READ-SEQUENCE warrants the 
linefeed translation.  Oh well.

Cheers

marco




On Tuesday, May 11, 2004, at 08:29 America/New_York, davef@xanalys.com 
wrote:

>
>    Well,  I thought so.  But I am unsure that is the case.  The file I am
>    testing was edited with LWW itself.
>
> LispWorks is capable of editing and creating files with different
> encodings. See the Editor User Guide for details.
>
>    Now,  checking with 'wc' on a Linux box shows that the difference
>    between the file length and the number of chars read is suspiciously
>    close to the number of lines in the file.
>
> Is that difference exactly equal to the number of lines?
>
>    The file is a DOS file.  Maybe an end of line mismatch?
>
> I guess that you have a CRLF-line-terminated file. OPEN detects this
> and creates a file stream with an appropriate external format. The
> Lisp line terminator is LF, so when that file is read into Lisp the
> external-format maps each CRLF pair to LF.
>
> You can check the stream's external format by STREAM-EXTERNAL-FORMAT.
>
> LispWorks FILE-LENGTH does not take account of the external format,
> because in general it would need to read the entire file to achieve
> that. Perhaps it should return NIL rather than the file's byte-length
> in such cases. In any case, your code needs to allow for FILE-LENGTH
> returning NIL.
>
>    As an aside, this is a show stopper for 'albert' and LW(W)
>
> Perhaps you can simply call the LispWorks function FILE-STRING, which
> does do external formats. Alternatively you could hack it by specifying
> a no-conversion external format or even use a binary stream; you'll
> need to think about whether you want to see those Control-M characters
> in your Lisp strings if you take this route.
>
> --
> Dave Fox			
>
> Xanalys                 http://www.lispworks.com
> Compass House
> Vision Park, Chivers Way
> Histon
> Cambridge, CB4 9AD
> England
>
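
A minimal sketch of the check suggested above, using only standard Common Lisp functions. The name REPORT-STREAM-PROPERTIES is made up for illustration; what STREAM-EXTERNAL-FORMAT returns is implementation-dependent, and FILE-LENGTH is printed with ~S because it may be NIL.

```lisp
(defun report-stream-properties (filename)
  "Open FILENAME as a text stream and report what the implementation chose.
Returns three values: external format, FILE-LENGTH result, element type."
  (with-open-file (s filename :direction :input)
    (let ((external-format (stream-external-format s))
          (reported-length (file-length s))   ; may be NIL
          (element-type (stream-element-type s)))
      (format t "External format: ~S~%" external-format)
      (format t "FILE-LENGTH:     ~S~%" reported-length)
      (format t "Element type:    ~S~%" element-type)
      (values external-format reported-length element-type))))
```

If the external format reported for a DOS file mentions a CRLF line-end style, that would be consistent with the explanation above: the stream delivers fewer characters than FILE-LENGTH reports.
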

Re: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Ok.  However,  if you say that the translation "happens in the stream",
then I could counter that FILE-LENGTH should take that into account as
well.

Isn't interpreting the CLHS fun? :)

Cheers
--
Marco



On Tuesday, May 11, 2004, at 11:03 America/New_York, davef@xanalys.com  
wrote:

>
>    Thanks to all who responded.  The problem seems indeed one of DOS/UNIX
>    linefeed mismatch.
>
>    For the case at hand (trying to get 'albert' to work), returning the
>    subsequence (memory costly operation) appears to work.
>
>    I'll investigate the external encoding feature for future cases.
>
>    However, I am not sure that the semantics of READ-SEQUENCE warrants the
>    linefeed translation.  Oh well.
>
> I'm sure that it does warrant the linefeed translation. The
> translation is in the stream, not the particular input function.
>
> Note this in the ANSI Common Lisp specification at
> http://www.lispworks.com/reference/HyperSpec/Body/f_rd_seq.htm#read-sequence :
>
>  read-sequence is identical in effect to iterating over the indicated
>  subsequence and reading one element at a time from stream and storing
>  it into sequence, but may be more efficient than the equivalent
>  loop. An efficient implementation is more likely to exist for the
>  case where the sequence is a vector with the same element type as the
>  stream.
>
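
The equivalence quoted from the specification can be sketched as an explicit loop. This is only an illustration for character streams; READ-SEQUENCE-BY-CHAR is a hypothetical name, not part of any library. Because every element comes through READ-CHAR, whatever line-end translation the stream performs applies here exactly as it does to READ-SEQUENCE.

```lisp
(defun read-sequence-by-char (sequence stream
                              &key (start 0) (end (length sequence)))
  "Fill SEQUENCE from START below END, one READ-CHAR at a time.
Returns the index of the first element that was not updated."
  (let ((index start))
    (loop while (< index end)
          do (let ((ch (read-char stream nil nil)))
               ;; Stop at end of file, leaving the rest of SEQUENCE untouched.
               (when (null ch)
                 (return))
               (setf (elt sequence index) ch)
               (incf index)))
    index))
```
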

Re: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

Marco Antoniotti wrote:

 (defun test-rs (filename)
   (with-open-file (s filename :direction :input)
     (let ((buffer (make-array (file-length s)
                               :element-type (stream-element-type s)))
           (chars-read 0))
       (format t "Trying to READ-SEQUENCE on stream ~S of length ~D.~%"
               s
               (file-length s))
       (setf chars-read (read-sequence buffer s))
       (cond ((< chars-read (file-length s))
              (format t "Read only ~D chars instead of ~D~%"
                      chars-read
                      (file-length s))
              (subseq buffer 0 chars-read))
             (t buffer)))))

> turns out that it appears that the FILE-LENGTH is greater than the actual
> file contents.

Yes, this is the classical problem with the DOS CR-LF end-of-line encoding.
Each line end is encoded with two characters: a 'carriage return' (#\return,
character code 13) and a 'line feed' (#\linefeed, character code 10).

FILE-LENGTH apparently returns the number of bytes in the file as reported
by the operating system. But READ-SEQUENCE translates each CR-LF pair to
a single #\linefeed (or is it #\newline?) character. So your buffer will
contain fewer characters than the reported file length.

There are several solutions ('workarounds' is probably a better word here):
  1. Don't worry about it, and just return the subsequence of the buffer
     that's actually filled. That's what you do in your example, and I
     think that's a perfectly decent solution.
  2. Specify something like
      :EXTERNAL-FORMAT '(:LATIN-1 :EOL-STYLE :LF)
     as a parameter for with-open-file.
     In that case, the CR-LF pairs won't get translated to a single
     character, and you'll get a superfluous #\return (or a superfluous
     #\linefeed, depending on your point of view ;-) character for each
     line in your buffer.
  3. You could write a version of FILE-LENGTH that scans the whole file
     and only counts 1 character for each CR-LF pair. This feels like
     overkill to me.
  4. Don't use the READ-SEQUENCE trick, and just read each line with
     READ-LINE. This also solves another problem with your code: it
     won't work for files with a length greater than array-total-size-limit
     (which is 1048448 in my version of LispWorks). I think this is the
     best solution, but it probably means that you'll have to rewrite
     other parts of your program.
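
One way to sidestep FILE-LENGTH altogether, in the spirit of workarounds 1 and 4, is to read the file in fixed-size chunks and collect only what READ-SEQUENCE actually delivered. The sketch below is portable Common Lisp; FILE-CONTENTS and CHUNK-SIZE are names invented for this example, and it assumes the file is opened with the implementation's default character external format. The resulting string is still subject to the implementation's limits on string size.

```lisp
(defun file-contents (filename &key (chunk-size 4096))
  "Return the contents of FILENAME as a string without relying on FILE-LENGTH."
  (with-open-file (in filename :direction :input)
    (let ((chunk (make-string chunk-size)))
      (with-output-to-string (out)
        ;; READ-SEQUENCE returns the number of characters stored in CHUNK;
        ;; zero means end of file.
        (loop for n = (read-sequence chunk in)
              while (plusp n)
              do (write-string chunk out :end n))))))
```

Usage would be along the lines of (file-contents "some-file.lisp"), where the file name is a placeholder.
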

I hope this helps.

Regards,

Arthur Lemmens

Re: READ-SEQUENCE and FILE-LENGTH (LWW 4.2.7)

* Marco Antoniotti wrote:
> Hi
> I have the following test function on LWW 4.2.7

> (defun test-rs (filename)
>    (with-open-file (s filename :direction :input)
>      (let ((buffer (make-array (file-length s)
>                                :element-type (stream-element-type s)))
>            (chars-read 0)
>            )
>        (format t "Trying to READ-SEQUENCE on stream ~S of length ~D.~%"
>                s
>                (file-length s))
>        (setf chars-read (read-sequence buffer s))
>        (cond ((< chars-read (file-length s))
>               (format t "Read only ~D chars instead of ~D~%"
>                       chars-read
>                       (file-length s))
>               (subseq buffer 0 chars-read))
>              (t
>               buffer)))))

This is the classic Unix/ASCII mistake.  If you want to know how many
characters are in a file you either need an OS that keeps track of
this for you, or you need to decode the file to find out.  Windows and
Unix only know the number of *octets* in the file.  If you're working
in ASCII, *and* if you're on a Unix machine using its native line-end
encoding, then this happens to be the same thing, and so lots of
Unix/ASCII applications assume this will work in general, which it
won't.

In your case (from a later message), it looks like you have a file
with Windows/DOS line-ends (CR/LF), which are being translated to
Lisp's #\Newline on read.  So the octet-length of the file is longer
than the character length.
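
The gap between octet length and character length can be made concrete by decoding the file by hand. The sketch below opens the file as a binary stream and counts octets and CR LF pairs; COUNT-OCTETS-AND-CRLF is a hypothetical helper. Under the assumption of a single-byte encoding such as ASCII or Latin-1, the number of characters a CRLF-translating stream delivers would be the first value minus the second.

```lisp
(defun count-octets-and-crlf (filename)
  "Return two values: the number of octets in FILENAME and the number of
CR LF pairs (octet 13 immediately followed by octet 10)."
  (with-open-file (in filename :direction :input
                               :element-type '(unsigned-byte 8))
    (loop with previous = nil
          for octet = (read-byte in nil nil)
          while octet
          count t into octets
          count (and (eql previous 13) (= octet 10)) into crlf-pairs
          do (setf previous octet)
          finally (return (values octets crlf-pairs)))))
```
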

--tim