most efficient way to find eol?

Posted: Thu Dec 15, 2016 2:26 pm
by IdeasVacuum
My app only reads text files, but each batch may include files of different origins, written on different operating systems.
The code below is modified to run stand-alone. In the app, the Procedure uses some global vars for expedience and first checks whether the file even needs to be processed.

Code: Select all

EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff

If ReadFile(#FileIO, "C:\MyTextFile.txt")

          iFormat = ReadStringFormat(#FileIO) ;Skips the BOM, if there is one
         qFileLen = Lof(#FileIO) - Loc(#FileIO) ;Bytes left after the BOM
      If(qFileLen > 0)

              ;AllocateMemory() zero-fills, so 2 spare bytes null-terminate the text for PeekS()
              *TextBuff = AllocateMemory(qFileLen + 2)
              ReadData(#FileIO, *TextBuff, qFileLen)

             ;Determine the end of line string
                iFound = FindString(PeekS(*TextBuff, -1, iFormat), #CRLF$)
             If(iFound > 0)

                      sEol = "#CRLF$"
             Else
                         iFound = FindString(PeekS(*TextBuff, -1, iFormat), #CR$)
                      If(iFound > 0)

                             sEol = "#CR$"
                      Else
                             sEol = "#LF$" ;Assumed: also reached when the file has no Eol at all
                      EndIf
             EndIf

             Debug sEol
             FreeMemory(*TextBuff)
      EndIf
      CloseFile(#FileIO) ;Close the file even when it is empty
Else
      Debug "ReadFile failed"
EndIf

End

As a procedure, this code is called for every file in a batch (typically 20 to 30 files). Currently the whole file is loaded, because a subsequent step searches it for occurrences of a string.

Now the snag is the "Determine the end of line string" code. If the Eol is not #CRLF$ (Windows OS), a second search is performed. Is there a more efficient way to find the Eol used?

Re: most efficient way to find eol?

Posted: Thu Dec 15, 2016 3:15 pm
by Keya
I wrote this just a few days ago. It accounts for all variations of line feed/carriage return, but I think it currently only works in ASCII mode.

Code: Select all

Procedure ProcessNextLine(*buf, lenbuf)
  If lenbuf <= 0: ProcedureReturn 0: EndIf
  ; The buffer holds single-byte text, so read it as ASCII whatever the compile mode
  Debug("Next line: " + PeekS(*buf, lenbuf, #PB_Ascii))
EndProcedure

Procedure ProcessBuffer(*buffer, lencmd)
  If lencmd <= 0: ProcedureReturn 0: EndIf
  *pend = *buffer + lencmd
  *p1.Ascii = *buffer   ; start of the current line
  *p2.Ascii = *p1       ; scan position
  inline=1
  While *p2 < *pend
    Select *p2\a
      Case $0D, $0A     ; CR or LF ends the current line
        If inline=1
          ProcessNextLine(*p1, *p2-*p1)
          inline=0
        EndIf
        *p1=*p2+1       ; the next line starts after this EOL byte
      Default
        inline=1
    EndSelect
    *p2+1
  Wend
  If inline=1           ; last line without a trailing EOL
    ProcessNextLine(*p1, *p2-*p1)
  EndIf
EndProcedure

*buf = AllocateMemory(512)

sTxt.s = "one"+Chr($A)+"two"+Chr($D)+Chr($A) + "three" + Chr($D) + "four" + Chr($A)+Chr($D)+ Chr($A)+Chr($D)+ Chr($A)+Chr($D)+"five"
PokeS(*buf, sTxt, -1, #PB_Ascii)  ; store as single-byte text so the byte scan applies

ProcessBuffer(*buf, Len(sTxt))

Re: most efficient way to find eol?

Posted: Thu Dec 15, 2016 5:45 pm
by normeus
What if you only search the end? (It might also be easier to go from the end.)

Code: Select all

                ; 8 is a character count: fine for ASCII, but #PB_Unicode needs to start 16 bytes from the end
                iFound = FindString(PeekS(*TextBuff+qFileLen-8, 8, iFormat), #CRLF$)

Norm

Re: most efficient way to find eol?

Posted: Thu Dec 15, 2016 9:36 pm
by alter Mann
my suggestion:

Code: Select all

#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s=""
Define *TextBuff
Define *cChar.Byte
Define *wChar.Word
Define i.q

; for ASCii-/Unicode-File

If ReadFile(#FileIO, "C:\MyTextFile.txt")

  iFormat = ReadStringFormat(#FileIO)      ; skips the BOM, if any
  
  qFileLen = Lof(#FileIO) - Loc(#FileIO)   ; bytes left after the BOM
  
  If(qFileLen > 0)

    ; 2 spare zero bytes keep the look-ahead after a trailing #CR$ inside the buffer
    *TextBuff = AllocateMemory(qFileLen + 2)
    ReadData(#FileIO, *TextBuff, qFileLen)
    CloseFile(#FileIO)
      
    If iFormat = #PB_Ascii
      *cChar = *TextBuff
      For i=0 To qFileLen - 1 Step 1
        If *cChar\b = 13 ; #CR$ found
          *cChar + SizeOf(Byte)
          If *cChar\b = 10 ; #CRLF$ found
            sEol = "#CRLF$"
            Break
          Else
            sEol = "#CR$"
            Break
          EndIf
        ElseIf *cChar\b = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *cChar + SizeOf(Byte)
      Next i
    ElseIf iFormat = #PB_Unicode
      *wChar = *TextBuff
      qFileLen / 2 ; 2 Byte per Character
      For i=0 To qFileLen - 1 Step 1
        If *wChar\w = 13 ; #CR$ found
          *wChar + SizeOf(Word)
          If *wChar\w = 10 ; #CRLF$ found
            sEol = "#CRLF$"
            Break
          Else
            sEol = "#CR$"
            Break
          EndIf
        ElseIf *wChar\w = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *wChar + SizeOf(Word)
      Next i
    EndIf         
    Debug sEol
  EndIf
Else
  Debug "ReadFile failed"
EndIf

End
Don't know how to do this for UTF8.
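
On the UTF-8 question: in UTF-8, CR and LF are still the single bytes 13 and 10, and those values never occur inside a multi-byte sequence, so a plain byte scan should cover ASCII and UTF-8 alike. Below is a minimal sketch under that assumption. DetectEolBytes is a hypothetical helper name, not something from the code above, and UTF-16 files would still need the Word-based loop.

```purebasic
; Hypothetical helper: returns the first EOL sequence found in a byte
; buffer holding ASCII or UTF-8 text, or "" if the buffer has no EOL.
Procedure.s DetectEolBytes(*Buffer, Length.q)
  Protected *p.Ascii = *Buffer
  Protected *pEnd = *Buffer + Length
  While *p < *pEnd
    Select *p\a
      Case 13                      ; CR: look one byte ahead if there is one
        If *p + 1 < *pEnd
          *p + 1
          If *p\a = 10 : ProcedureReturn #CRLF$ : EndIf
        EndIf
        ProcedureReturn #CR$
      Case 10                      ; LF with no CR before it
        ProcedureReturn #LF$
    EndSelect
    *p + 1
  Wend
  ProcedureReturn ""
EndProcedure

Define *Mem = AllocateMemory(64)
PokeS(*Mem, "one" + #CRLF$ + "two", -1, #PB_UTF8)
Debug Len(DetectEolBytes(*Mem, MemorySize(*Mem)))   ; 2, i.e. CR+LF
```

The scan stops at the first EOL it meets, so files with mixed line endings are classified by their first line only.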

Re: most efficient way to find eol?

Posted: Thu Dec 15, 2016 10:23 pm
by IdeasVacuum
Thanks guys!

alter Mann's code is basically a souped-up version of Keya's and I'm going to add to that by incorporating the idea from normeus, checking the last 2 chars.
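
A rough sketch of that tail check, assuming single-byte text (ASCII or UTF-8) already loaded into a buffer. EolFromTail is a hypothetical name. It returns an empty string when the last line has no EOL, in which case a scan from the start of the buffer would still be needed.

```purebasic
; Hypothetical helper: looks only at the last two bytes of the buffer.
; Returns "" when the text does not end with an EOL.
Procedure.s EolFromTail(*Buffer, Length.q)
  Protected.a bLast, bPrev
  If Length < 1 : ProcedureReturn "" : EndIf
  bLast = PeekA(*Buffer + Length - 1)
  If Length > 1 : bPrev = PeekA(*Buffer + Length - 2) : EndIf
  If bLast = 10
    If bPrev = 13 : ProcedureReturn #CRLF$ : EndIf
    ProcedureReturn #LF$
  ElseIf bLast = 13
    ProcedureReturn #CR$
  EndIf
  ProcedureReturn ""               ; no EOL on the last line
EndProcedure

Define *Tail = AllocateMemory(16)
PokeS(*Tail, "abc" + #CRLF$, -1, #PB_Ascii)
Debug Len(EolFromTail(*Tail, 5))   ; 2, the text ends in CR+LF
PokeS(*Tail, "abc", -1, #PB_Ascii)
Debug Len(EolFromTail(*Tail, 3))   ; 0, no EOL on the last line
```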

...... I'll be back with a general string search challenge later.

Re: most efficient way to find eol?

Posted: Thu Dec 15, 2016 10:59 pm
by RASHAD
Hi

Code: Select all

;EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff

If ReadFile(#FileIO, "e:\report.txt")

          iFormat = ReadStringFormat(#FileIO)
         qFileLen = Lof(#FileIO)
      If(qFileLen > 0)

              *TextBuff = AllocateMemory(qFileLen)
              ReadData(#FileIO, *TextBuff, qFileLen)
             CloseFile(#FileIO)

             ;Determine the end of line string
              For char = 0 To qFileLen - 2 ;Stop one byte early: the test also reads char+1
                If PeekA(*TextBuff+char) = 13 And PeekA(*TextBuff+char+1) = 10
                  Debug "End of Line at : "+Str(char)
                  char + 1 ;Skip the LF of this pair
                EndIf
              Next
      EndIf
Else
      Debug "ReadFile failed"
EndIf

End

Re: most efficient way to find eol?

Posted: Fri Dec 16, 2016 12:05 am
by skywalk
This is what I use for text file or string input.

Code: Select all

Procedure.s SF_GetEOL(TxtIn$, IsAFile.i=0)
  ; REV:  110405, skywalk
  ;       modified from PureBasic Forum: TerryHough, www.purebasic.fr/english/viewtopic.php?p=316256#p316256
  ;       support unicode
  Protected.i inf, inf2, locBOM
  Protected.c fc, nc
  Protected.s EOL$, s$
  EOL$ = #Empty$
  If IsAFile                  ; Some Text File
    inf = ReadFile(#PB_Any,TxtIn$)
    If inf
      Protected.i Enc; = #PB_UTF8
      Enc = ReadStringFormat(inf)
      locBOM = Loc(inf)               ; Store BOM read position
      s$ = ReadString(inf, Enc)       ; Read 1st string which skips EOL marker
      Select Enc
      Case #PB_Ascii, #PB_UTF8
        FileSeek(inf,StringByteLength(s$, Enc)+locBOM, #PB_Absolute)  ; Reposition read pointer to 1st EOL character (byte length: UTF-8 characters can span several bytes)
        fc = ReadAsciiCharacter(inf)  ; Read 1st character of EOL marker
        nc = ReadAsciiCharacter(inf)  ; See if next character = EOL character
      Default
        FileSeek(inf,StringByteLength(s$)+locBOM, #PB_Absolute) ; Reposition read pointer to 1st EOL character
        fc = ReadCharacter(inf)       ; Read 1st character of EOL marker
        nc = ReadCharacter(inf)       ; See if next character = EOL character
      EndSelect
      Select fc
      Case #CR
        Select nc
        Case #LF              ; Next character = LF
          EOL$ = #CRLF$
        Default               ; Next character not a typical EOL marker
          EOL$ = #CR$
        EndSelect
      Case #LF                ; 1st character = LF
        Select nc             ; See if next character is an EOL character
        Case #CR
          EOL$ = #LFCR$       ; Next character = CR
        Default               ; Next character not a typical EOL marker
          EOL$ = #LF$
        EndSelect
      EndSelect
      CloseFile(inf)          ; Close only a file that was actually opened
    EndIf
  Else                        ; ClipBoard Text or some String variable
    ; Only check for #CR$,#LF$,#CRLF$. Dropped #LFCR$ for simplicity.
    If Len(TxtIn$)
      inf = FindString(TxtIn$, #LF$)
      If inf
        inf2 = FindString(TxtIn$, #CR$, inf-1)
        If inf2 And (inf-inf2 = 1)
          EOL$ = #CRLF$
        Else
          EOL$ = #LF$
        EndIf
      Else  ; Check for old MAC
        If FindString(TxtIn$, #CR$)
          EOL$ = #CR$
        EndIf
      EndIf
    EndIf
  EndIf
  ProcedureReturn EOL$
EndProcedure
Define.s x$ = "123" + #LF$ + "345" + #LF$
Debug Len(SF_GetEOL(x$, 0))
Debug Len(SF_GetEOL(#PB_Compiler_Home+"Examples\Sources\String.pb", 1))

Re: most efficient way to find eol?

Posted: Fri Dec 16, 2016 1:21 am
by IdeasVacuum
Hi Rashad

..... not quite, I need to know what the Eol is, rather than where it is.

Hi Skywalk

I did consider Opening the file to read and moving the read pointer to nearly the end of the file, but it's inefficient if the code is later loading the file into a buffer. Also, I discovered I can't work from the end of the file as some of the files do not have an EOL on the last line.

All

Well, it seems that the original code is no less efficient than the alternatives suggested, but that does depend on how the subsequent code will handle finding a string and displaying the text lines and line numbers. My current code for that is accurate, but not fast, at least not compared to Notepad++: my code takes just over 1 second per file and NP++ does 12 files per second! So that's my next post :)

Re: most efficient way to find eol?

Posted: Fri Dec 16, 2016 3:20 am
by Keya
Btw, if your final solution ends up using PB's general-purpose FindString(), you could improve performance there with wilbert's high speed FindData() (Boyer-Moore etc, pretty slick). Seeing as you're only looking for 2 bytes, though, I'm wondering if that might actually turn out a bit slower. I'm not sure.

Re: most efficient way to find eol?

Posted: Fri Dec 16, 2016 3:20 am
by skywalk
IdeasVacuum wrote:
Hi Skywalk

I did consider Opening the file to read and moving the read pointer to nearly the end of the file, but it's inefficient if the code is later loading the file into a buffer. Also, I discovered I can't work from the end of the file as some of the files do not have an EOL on the last line.

That is not what my code is doing? I'm only moving past the byte order mark (BOM) and reading a single line. You can modify the procedure to return a handle to the opened file if you want to save some time. It is not a bottleneck for my use case.

Re: most efficient way to find eol?

Posted: Fri Dec 16, 2016 3:47 am
by IdeasVacuum
Hi skywalk

Yes I appreciate that is not what you are doing, I qualified my remark.