most efficient way to find eol?

Post by IdeasVacuum »

My app only reads text files, but each batch may include files of different origins, written on different operating systems.
The code below is modified to run stand-alone; in the app, the Procedure uses some global vars for expedience and first checks whether the file needs to be processed at all.

Code: Select all

EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff

If ReadFile(#FileIO, "C:\MyTextFile.txt")

          iFormat = ReadStringFormat(#FileIO)
         qFileLen = Lof(#FileIO) - Loc(#FileIO) ;Bytes left to read once ReadStringFormat() has skipped any BOM
      If(qFileLen > 0)

              *TextBuff = AllocateMemory(qFileLen)
              ReadData(#FileIO, *TextBuff, qFileLen)
             CloseFile(#FileIO)

             ;Determine the end of line string
                iFound = FindString(PeekS(*TextBuff, qFileLen, iFormat), #CRLF$, 1, #PB_String_NoCase)
             If(iFound > 0)

                      sEol = "#CRLF$"
             Else
                         iFound = FindString(PeekS(*TextBuff, qFileLen, iFormat), #CR$, 1, #PB_String_NoCase)
                      If(iFound > 0)

                             sEol = "#CR$"
                      Else
                             sEol = "#LF$"
                      EndIf
             EndIf

             Debug sEol
      EndIf
Else
      Debug "ReadFile failed"
EndIf

End
As a procedure this code is called for every file of a bunch (typically 20 to 30 files). Currently, the whole file is loaded to support a subsequent search for occurrences of a string.

Now the snag is the "Determine the end of line string" code. If the Eol is not #CRLF$ (Windows OS), a second search is performed. Is there a more efficient way to find the Eol used?

Post by Keya »

I wrote this just a few days ago. It accounts for all variations of line feed/carriage return, but I think it currently only works in ASCII mode.

Code: Select all

; Called once per line with a pointer to its first byte and its length in bytes
Procedure ProcessNextLine(*buf, lenbuf)
  If lenbuf <= 0: ProcedureReturn 0: EndIf
  Debug("Next line: " + PeekS(*buf, lenbuf, #PB_Ascii))
EndProcedure

; Splits an ASCII buffer into lines. Any run of CR/LF bytes counts as one
; separator, so empty lines are skipped.
Procedure ProcessBuffer(*buffer, lencmd)
  If lencmd <= 0: ProcedureReturn 0: EndIf
  *pend = *buffer + lencmd
  *p1.Ascii = *buffer   ; start of the current line
  *p2.Ascii = *p1       ; scan position
  inline=1
  While *p2 < *pend
    Select *p2\a
      Case $0D, $0A:
        If inline=1
          ProcessNextLine(*p1, *p2-*p1)
          inline=0
          *p1=*p2+1
        Else
          *p1=*p2+1
        EndIf
      Default:
        If inline=0: inline=1: EndIf
    EndSelect
    *p2+1
  Wend
  If inline=1           ; last line without a trailing EOL
    ProcessNextLine(*p1, *p2-*p1)
  EndIf
EndProcedure

*buf = AllocateMemory(512)

sTxt.s = "one"+Chr($A)+"two"+Chr($D)+Chr($A) + "three" + Chr($D) + "four" + Chr($A)+Chr($D)+ Chr($A)+Chr($D)+ Chr($A)+Chr($D)+"five"
PokeS(*buf, sTxt, -1, #PB_Ascii) ; write one byte per character, also in a Unicode executable

ProcessBuffer(*buf, Len(sTxt))

Post by normeus »

What if you only search the end? (It might also be easier to go from the end.)

Code: Select all

                iFound = FindString(PeekS(*TextBuff+qFileLen-8, 8, iFormat), #CRLF$, 1, #PB_String_NoCase)

Norm

Post by alter Mann »

my suggestion:

Code: Select all

#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s=""
Define *TextBuff
Define *cChar.Byte
Define *wChar.Word
Define i.q

; for ASCii-/Unicode-File

If ReadFile(#FileIO, "C:\MyTextFile.txt")

  iFormat = ReadStringFormat(#FileIO)
  
  qFileLen = Lof(#FileIO)
  
  If(qFileLen > 0)

    *TextBuff = AllocateMemory(qFileLen)
    ReadData(#FileIO, *TextBuff, qFileLen)
    CloseFile(#FileIO)
      
    If iFormat = #PB_Ascii
      *cChar = *TextBuff
      For i=0 To qFileLen - 1 Step 1 ; last valid index is qFileLen - 1
        If *cChar\b = 13 ; #CR$ found
          sEol = "#CR$"
          If i < qFileLen - 1 ; only look ahead if a next byte exists
            *cChar + SizeOf(Byte)
            If *cChar\b = 10 ; #CRLF$ found
              sEol = "#CRLF$"
            EndIf
          EndIf
          Break
        ElseIf *cChar\b = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *cChar + SizeOf(Byte)
      Next i
    ElseIf iFormat = #PB_Unicode
      *wChar = *TextBuff
      qFileLen / 2 ; 2 Byte per Character
      For i=0 To qFileLen - 1 Step 1 ; last valid index is qFileLen - 1
        If *wChar\w = 13 ; #CR$ found
          sEol = "#CR$"
          If i < qFileLen - 1 ; only look ahead if a next character exists
            *wChar + SizeOf(Word)
            If *wChar\w = 10 ; #CRLF$ found
              sEol = "#CRLF$"
            EndIf
          EndIf
          Break
        ElseIf *wChar\w = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *wChar + SizeOf(Word)
      Next i
    EndIf
    Debug sEol
  EndIf
Else
  Debug "ReadFile failed"
EndIf

End
Don't know how to do this for UTF8.
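
On the UTF-8 question: in UTF-8, CR and LF are always the single bytes $0D and $0A, and neither value can occur inside a multi-byte sequence, so a byte-wise scan like the ASCII branch should work unchanged on UTF-8 data. The following is only a minimal sketch of that idea; the procedure name EolOfBytes and its parameters are made up for illustration and do not come from any post in this thread.

```purebasic
EnableExplicit

; Hypothetical helper (name and signature are assumptions, not from the thread).
; Scans raw bytes, so it suits ASCII and UTF-8 buffers but not UTF-16.
Procedure.s EolOfBytes(*Buffer, Length.q)
  Protected *p.Ascii = *Buffer
  Protected *pEnd = *Buffer + Length
  While *p < *pEnd
    Select *p\a
      Case 13                      ; CR: look one byte ahead if there is one
        *p + 1
        If *p < *pEnd
          If *p\a = 10
            ProcedureReturn "#CRLF$"
          EndIf
        EndIf
        ProcedureReturn "#CR$"
      Case 10                      ; LF with no CR in front of it
        ProcedureReturn "#LF$"
    EndSelect
    *p + 1
  Wend
  ProcedureReturn ""               ; no line break found
EndProcedure

; Usage: a UTF-8 buffer whose first line contains a two-byte character
Define Text$ = "zw" + Chr(246) + "lf" + #CRLF$ + "next"
Define *Mem = AllocateMemory(StringByteLength(Text$, #PB_UTF8) + 1)
If *Mem
  PokeS(*Mem, Text$, -1, #PB_UTF8)
  Debug EolOfBytes(*Mem, StringByteLength(Text$, #PB_UTF8))
  FreeMemory(*Mem)
EndIf
```

The sketch stops at the first line break it meets, so it assumes each file uses one convention throughout.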

Post by IdeasVacuum »

Thanks guys!

alter Mann's code is basically a souped-up version of Keya's and I'm going to add to that by incorporating the idea from normeus, checking the last 2 chars.

...... I'll be back with a general string search challenge later :twisted:

Post by RASHAD »

Hi

Code: Select all

;EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff

If ReadFile(#FileIO, "e:\report.txt")

          iFormat = ReadStringFormat(#FileIO)
         qFileLen = Lof(#FileIO)
      If(qFileLen > 0)

              *TextBuff = AllocateMemory(qFileLen)
              ReadData(#FileIO, *TextBuff, qFileLen)
             CloseFile(#FileIO)

             ;Determine the end of line string
              For char = 0 To qFileLen - 2 ; stop early: char+1 must stay inside the buffer
                If PeekA(*TextBuff+char) = 13 And PeekA(*TextBuff+char+1) = 10
                  Debug "End of Line at : "+Str(char)
                  Char + 1
                EndIf
              Next
      EndIf
Else
      Debug "ReadFile failed"
EndIf

End

Post by skywalk »

This is what I use for text file or string input.

Code: Select all

Procedure.s SF_GetEOL(TxtIn$, IsAFile.i=0)
  ; REV:  110405, skywalk
  ;       modified from PureBasic Forum: TerryHough, www.purebasic.fr/english/viewtopic.php?p=316256#p316256
  ;       support unicode
  Protected.i inf, inf2, locBOM
  Protected.c fc, nc
  Protected.s EOL$, s$
  EOL$ = #Empty$
  If IsAFile                  ; Some Text File
    inf = ReadFile(#PB_Any,TxtIn$)
    If inf
      Protected.i Enc; = #PB_UTF8
      Enc = ReadStringFormat(inf)
      locBOM = Loc(inf)               ; Store BOM read position
      s$ = ReadString(inf, Enc)       ; Read 1st string which skips EOL marker
      Select Enc
      Case #PB_Ascii, #PB_UTF8
        FileSeek(inf,StringByteLength(s$, Enc)+locBOM, #PB_Absolute)  ; Reposition read pointer to 1st EOL character (byte count, so multi-byte UTF-8 is handled)
        fc = ReadAsciiCharacter(inf)  ; Read 1st character of EOL marker
        nc = ReadAsciiCharacter(inf)  ; See if next character = EOL character
      Default
        FileSeek(inf,StringByteLength(s$)+locBOM, #PB_Absolute) ; Reposition read pointer to 1st EOL character
        fc = ReadCharacter(inf)       ; Read 1st character of EOL marker
        nc = ReadCharacter(inf)       ; See if next character = EOL character
      EndSelect
      Select fc
      Case #CR
        Select nc
        Case #LF              ; Next character = LF
          EOL$ = #CRLF$
        Default               ; Next character not a typical EOL marker
          EOL$ = #CR$
        EndSelect
      Case #LF                ; 1st character = LF
        Select nc             ; See if next character is an EOL character
        Case #CR
          EOL$ = #LFCR$       ; Next character = CR
        Default               ; Next character not a typical EOL marker
          EOL$ = #LF$
        EndSelect
      EndSelect
      CloseFile(inf)          ; Only close a file that was actually opened
    EndIf
  Else                        ; ClipBoard Text or some String variable
    ; Only check for #CR$,#LF$,#CRLF$. Dropped #LFCR$ for simplicity.
    If Len(TxtIn$)
      inf = FindString(TxtIn$, #LF$)
      If inf
        inf2 = FindString(TxtIn$, #CR$, inf-1)
        If inf2 And (inf-inf2 = 1)
          EOL$ = #CRLF$
        Else
          EOL$ = #LF$
        EndIf
      Else  ; Check for old MAC
        If FindString(TxtIn$, #CR$)
          EOL$ = #CR$
        EndIf
      EndIf
    EndIf
  EndIf
  ProcedureReturn EOL$
EndProcedure
Define.s x$ = "123" + #LF$ + "345" + #LF$
Debug Len(SF_GetEOL(x$, 0))
Debug Len(SF_GetEOL(#PB_Compiler_Home+"Examples\Sources\String.pb", 1))

Post by IdeasVacuum »

Hi Rashad

..... not quite, I need to know what the Eol is, rather than where it is.

Hi Skywalk

I did consider opening the file to read and moving the read pointer to nearly the end of the file, but that is inefficient if the code later loads the whole file into a buffer anyway. Also, I discovered I can't work from the end of the file, as some of the files do not have an EOL on the last line.

All

Well, it seems that the original code is no less efficient than the alternatives suggested, but that does depend on how the subsequent code handles finding a string and displaying the text lines and line numbers. My current code for that is accurate, but not fast, at least not compared to Notepad++: my code takes just over 1 second per file, while NP++ does 12 files per second! So that's my next post :)
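
The point about files with no EOL on the last line can be handled by treating a tail check as a shortcut only. The sketch below is a hypothetical helper, not code from any post: it inspects just the last two bytes of an ASCII or UTF-8 buffer and returns an empty string when the data does not end in a line break, so the caller knows it still has to scan forward from the start.

```purebasic
EnableExplicit

; Hypothetical helper (name and signature are assumptions, not from the thread).
; Looks only at the last two bytes of an ASCII/UTF-8 buffer.
Procedure.s EolAtTail(*Buffer, Length.q)
  Protected.a bLast, bPrev
  If Length < 1
    ProcedureReturn ""
  EndIf
  bLast = PeekA(*Buffer + Length - 1)
  If Length > 1
    bPrev = PeekA(*Buffer + Length - 2)
  EndIf
  If bLast = 10
    If bPrev = 13
      ProcedureReturn "#CRLF$"
    EndIf
    ProcedureReturn "#LF$"
  ElseIf bLast = 13
    ProcedureReturn "#CR$"
  EndIf
  ProcedureReturn ""               ; last line has no EOL: scan forward instead
EndProcedure

Define *Mem = AllocateMemory(16)
If *Mem
  PokeS(*Mem, "abc" + #CRLF$, -1, #PB_Ascii)
  Debug EolAtTail(*Mem, 5)         ; data ends with CR LF
  Debug EolAtTail(*Mem, 3)         ; "abc" only: empty result, fall back to a forward scan
  FreeMemory(*Mem)
EndIf
```

Whether the shortcut pays off depends on how many files in a batch actually end with a line break; when most do not, the forward scan alone is simpler.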

Post by Keya »

By the way, if your final solution ends up using PB's general-purpose FindString(), you could improve performance there with wilbert's high-speed FindData() (Boyer-Moore etc., pretty slick). Seeing as you're only looking for 2 bytes, though, I wonder whether that might actually turn out a bit slower; I'm not sure.
Last edited by Keya on Fri Dec 16, 2016 3:21 am, edited 1 time in total.

Post by skywalk »

IdeasVacuum wrote:
"Hi Skywalk

I did consider Opening the file to read and moving the read pointer to nearly the end of the file, but it's inefficient if the code is later loading the file into a buffer. Also, I discovered I can't work from the end of the file as some of the files do not have an EOL on the last line."

That is not what my code is doing? I'm only moving past the byte order mark (BOM) and reading a single line. You can modify the procedure to return a handle to the opened file if you want to save some time. It is not a bottleneck for my use case.

Post by IdeasVacuum »

Hi skywalk

Yes I appreciate that is not what you are doing, I qualified my remark.