most efficient way to find eol?
Posted: Thu Dec 15, 2016 2:26 pm
by IdeasVacuum
My app only reads text files, but each batch of files may include files of different origins, written on different operating systems.
The code below is modified to run stand-alone; the app's Procedure actually uses some global vars for expedience and first checks whether the file even needs to be processed.
Code: Select all
EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff
If ReadFile(#FileIO, "C:\MyTextFile.txt")
  iFormat = ReadStringFormat(#FileIO)
  qFileLen = Lof(#FileIO)
  If(qFileLen > 0)
    *TextBuff = AllocateMemory(qFileLen)
    ReadData(#FileIO, *TextBuff, qFileLen)
    CloseFile(#FileIO)
    ;Determine the end of line string
    iFound = FindString(PeekS(*TextBuff, qFileLen, iFormat), #CRLF$, 1, #PB_String_NoCase)
    If(iFound > 0)
      sEol = "#CRLF$"
    Else
      iFound = FindString(PeekS(*TextBuff, qFileLen, iFormat), #CR$, 1, #PB_String_NoCase)
      If(iFound > 0)
        sEol = "#CR$"
      Else
        sEol = "#LF$"
      EndIf
    EndIf
    Debug sEol
  EndIf
Else
  Debug "ReadFile failed"
EndIf
End
As a procedure this code is called for every file of a bunch (typically 20 to 30 files). Currently, the whole file is loaded to support a subsequent search for occurrences of a string.
Now the snag is the "Determine the end of line string" code. If the Eol is not #CRLF$ (Windows OS), a second search is performed. Is there a more efficient way to find the Eol used?
Re: most efficient way to find eol?
Posted: Thu Dec 15, 2016 3:15 pm
by Keya
I wrote this just a few days ago. It accounts for all variations of line feed/carriage return, but I think it currently only works in ASCII mode.
Code: Select all
Procedure ProcessNextLine(*buf, lenbuf)
  If lenbuf <= 0: ProcedureReturn 0: EndIf
  ; The buffer is scanned byte by byte, so read the line back as single-byte text
  Debug("Next line: " + PeekS(*buf, lenbuf, #PB_Ascii))
EndProcedure
Procedure ProcessBuffer(*buffer, lencmd)
  If lencmd <= 0: ProcedureReturn 0: EndIf
  *pend = *buffer + lencmd
  *p1.Ascii = *buffer ; start of the current line
  *p2.Ascii = *p1     ; scan position
  inline=1
  While *p2 < *pend
    Select *p2\a
      Case $0D, $0A
        ; first CR/LF after text ends the current line; further ones are skipped
        If inline=1
          ProcessNextLine(*p1, *p2-*p1)
          inline=0
        EndIf
        *p1=*p2+1
      Default
        inline=1
    EndSelect
    *p2+1
  Wend
  ; last line without a trailing line break
  If inline=1
    ProcessNextLine(*p1, *p2-*p1)
  EndIf
EndProcedure
*buf = AllocateMemory(512)
sTxt.s = "one"+Chr($A)+"two"+Chr($D)+Chr($A) + "three" + Chr($D) + "four" + Chr($A)+Chr($D)+ Chr($A)+Chr($D)+ Chr($A)+Chr($D)+"five"
PokeS(*buf, sTxt, -1, #PB_Ascii) ; store as single-byte text to match the byte-wise scan
ProcessBuffer(*buf, Len(sTxt))
Re: most efficient way to find eol?
Posted: Thu Dec 15, 2016 5:45 pm
by normeus
What if you only search the end? (It might also be easier to go from the end.)
Code: Select all
iFound = FindString(PeekS(*TextBuff+qFileLen-8, 8, iFormat), #CRLF$, 1, #PB_String_NoCase)
Norm
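The same idea can be tried without converting the tail to a string. The following is only a minimal sketch, assuming single-byte (ASCII or UTF-8) data; `TailEol` is a hypothetical helper name, not something from the thread. It returns an empty string when the buffer does not end in a line break, so the caller would still need a forward scan as a fallback for such files.

```purebasic
; Hypothetical helper (not from the thread): inspects only the last two bytes
; of an ASCII/UTF-8 buffer. Returns "" when the text does not end in a line
; break, in which case a forward scan is still needed.
Procedure.s TailEol(*Buffer, Length.q)
  Protected.a bLast, bPrev

  If Length < 1
    ProcedureReturn ""
  EndIf

  bLast = PeekA(*Buffer + Length - 1)
  If Length > 1
    bPrev = PeekA(*Buffer + Length - 2)
  EndIf

  If bLast = 10
    If bPrev = 13
      ProcedureReturn "CRLF"
    EndIf
    ProcedureReturn "LF"
  ElseIf bLast = 13
    ProcedureReturn "CR"
  EndIf

  ProcedureReturn "" ; last line has no line break
EndProcedure

; Usage: Ascii() allocates a null-terminated buffer, so skip the final byte
Define *Mem = Ascii("one" + #CRLF$ + "two" + #CRLF$)
Debug TailEol(*Mem, MemorySize(*Mem) - 1)
FreeMemory(*Mem)
```

Only two bytes are read, whatever the file size, but the result is only usable when the file actually ends with a line break.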
Re: most efficient way to find eol?
Posted: Thu Dec 15, 2016 9:36 pm
by alter Mann
my suggestion:
Code: Select all
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s=""
Define *TextBuff
Define *cChar.Byte
Define *wChar.Word
Define i.q
; for ASCII/Unicode files
If ReadFile(#FileIO, "C:\MyTextFile.txt")
  iFormat = ReadStringFormat(#FileIO)
  qFileLen = Lof(#FileIO)
  If(qFileLen > 0)
    *TextBuff = AllocateMemory(qFileLen)
    ReadData(#FileIO, *TextBuff, qFileLen)
    CloseFile(#FileIO)
    If iFormat = #PB_Ascii
      *cChar = *TextBuff
      For i=0 To qFileLen-1 Step 1 ; last valid index is qFileLen-1
        If *cChar\b = 13 ; #CR$ found
          sEol = "#CR$"
          If i < qFileLen-1 ; only look ahead while still inside the buffer
            *cChar + SizeOf(Byte)
            If *cChar\b = 10 ; #CRLF$ found
              sEol = "#CRLF$"
            EndIf
          EndIf
          Break
        ElseIf *cChar\b = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *cChar + SizeOf(Byte)
      Next i
    ElseIf iFormat = #PB_Unicode
      *wChar = *TextBuff
      qFileLen / 2 ; 2 bytes per character
      For i=0 To qFileLen-1 Step 1
        If *wChar\w = 13 ; #CR$ found
          sEol = "#CR$"
          If i < qFileLen-1 ; only look ahead while still inside the buffer
            *wChar + SizeOf(Word)
            If *wChar\w = 10 ; #CRLF$ found
              sEol = "#CRLF$"
            EndIf
          EndIf
          Break
        ElseIf *wChar\w = 10 ; #LF$ found
          sEol = "#LF$"
          Break
        EndIf
        *wChar + SizeOf(Word)
      Next i
    EndIf
    Debug sEol
  EndIf
Else
  Debug "ReadFile failed"
EndIf
End
Don't know how to do this for UTF8.
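For UTF-8 a byte-wise scan may already be enough: in UTF-8, byte values below $80 never occur inside a multi-byte sequence, so a $0D or $0A byte is always a real CR or LF. A minimal sketch under that assumption follows; `DetectEolBytes` is a hypothetical helper name, not from the thread, and it stops at the first line break it finds.

```purebasic
; Hypothetical helper (not from the thread): returns "CRLF", "CR", "LF" or ""
; for a buffer holding ASCII or UTF-8 text. UTF-16 data still needs the
; word-wise loop.
Procedure.s DetectEolBytes(*Buffer, Length.q)
  Protected *p.Ascii = *Buffer
  Protected *pEnd = *Buffer + Length

  While *p < *pEnd
    Select *p\a
      Case 13 ; CR: look one byte ahead, if there is one
        *p + 1
        If *p < *pEnd
          If *p\a = 10
            ProcedureReturn "CRLF"
          EndIf
        EndIf
        ProcedureReturn "CR"
      Case 10 ; LF on its own
        ProcedureReturn "LF"
    EndSelect
    *p + 1
  Wend

  ProcedureReturn "" ; no line break in the buffer
EndProcedure

; Usage: Chr(233) becomes a two-byte sequence in UTF-8 and does not disturb the scan
Define *Mem = UTF8("caf" + Chr(233) + #CRLF$ + "second line")
Debug DetectEolBytes(*Mem, MemorySize(*Mem) - 1) ; -1 skips the terminating null
FreeMemory(*Mem)
```

With this, the `#PB_Ascii` branch could in principle also serve files reported as `#PB_UTF8` by ReadStringFormat().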
Re: most efficient way to find eol?
Posted: Thu Dec 15, 2016 10:23 pm
by IdeasVacuum
Thanks guys!
alter Mann's code is basically a souped-up version of Keya's and I'm going to add to that by incorporating the idea from normeus, checking the last 2 chars.
...... I'll be back with a general string search challenge later

Re: most efficient way to find eol?
Posted: Thu Dec 15, 2016 10:59 pm
by RASHAD
Hi
Code: Select all
;EnableExplicit
#FileIO = 0
Define iFound.i, iFormat.i, iReadOk.i = #False
Define qFileLen.q, sEol.s
Define *TextBuff
If ReadFile(#FileIO, "e:\report.txt")
  iFormat = ReadStringFormat(#FileIO)
  qFileLen = Lof(#FileIO)
  If(qFileLen > 0)
    *TextBuff = AllocateMemory(qFileLen)
    ReadData(#FileIO, *TextBuff, qFileLen)
    CloseFile(#FileIO)
    ;Determine the end of line string
    ;Stop at qFileLen - 2 so that the look-ahead at char+1 stays inside the buffer
    For char = 0 To qFileLen - 2
      If PeekA(*TextBuff+char) = 13 And PeekA(*TextBuff+char+1) = 10
        Debug "End of Line at : "+Str(char)
        char + 1 ; skip the LF of this CRLF pair
      EndIf
    Next
  EndIf
Else
  Debug "ReadFile failed"
EndIf
End
Re: most efficient way to find eol?
Posted: Fri Dec 16, 2016 12:05 am
by skywalk
This is what I use for text file or string input.
Code: Select all
Procedure.s SF_GetEOL(TxtIn$, IsAFile.i=0)
  ; REV: 110405, skywalk
  ; modified from PureBasic Forum: TerryHough, www.purebasic.fr/english/viewtopic.php?p=316256#p316256
  ; support unicode
  Protected.i inf, inf2, locBOM
  Protected.c fc, nc
  Protected.s EOL$, s$
  EOL$ = #Empty$
  If IsAFile ; Some Text File
    inf = ReadFile(#PB_Any,TxtIn$)
    If inf
      Protected.i Enc; = #PB_UTF8
      Enc = ReadStringFormat(inf)
      locBOM = Loc(inf) ; Store BOM read position
      s$ = ReadString(inf, Enc) ; Read 1st string which skips EOL marker
      Select Enc
        Case #PB_Ascii, #PB_UTF8
          ; Use the encoded byte length: in UTF-8 a character can take more than 1 byte
          FileSeek(inf,StringByteLength(s$, Enc)+locBOM, #PB_Absolute) ; Reposition read pointer to 1st EOL character
          fc = ReadAsciiCharacter(inf) ; Read 1st character of EOL marker
          nc = ReadAsciiCharacter(inf) ; See if next character = EOL character
        Default
          FileSeek(inf,StringByteLength(s$)+locBOM, #PB_Absolute) ; Reposition read pointer to 1st EOL character
          fc = ReadCharacter(inf) ; Read 1st character of EOL marker
          nc = ReadCharacter(inf) ; See if next character = EOL character
      EndSelect
      Select fc
        Case #CR
          Select nc
            Case #LF ; Next character = LF
              EOL$ = #CRLF$
            Default ; Next character not a typical EOL marker
              EOL$ = #CR$
          EndSelect
        Case #LF ; 1st character = LF
          Select nc ; See if next character is an EOL character
            Case #CR
              EOL$ = #LFCR$ ; Next character = CR
            Default ; Next character not a typical EOL marker
              EOL$ = #LF$
          EndSelect
      EndSelect
      CloseFile(inf) ; Only close the file if it was actually opened
    EndIf
  Else ; ClipBoard Text or some String variable
    ; Only check for #CR$,#LF$,#CRLF$. Dropped #LFCR$ for simplicity.
    If Len(TxtIn$)
      inf = FindString(TxtIn$, #LF$)
      If inf
        inf2 = FindString(TxtIn$, #CR$, inf-1)
        If inf2 And (inf-inf2 = 1)
          EOL$ = #CRLF$
        Else
          EOL$ = #LF$
        EndIf
      Else ; Check for old MAC
        If FindString(TxtIn$, #CR$)
          EOL$ = #CR$
        EndIf
      EndIf
    EndIf
  EndIf
  ProcedureReturn EOL$
EndProcedure
Define.s x$ = "123" + #LF$ + "345" + #LF$
Debug Len(SF_GetEOL(x$, 0))
Debug Len(SF_GetEOL(#PB_Compiler_Home+"Examples\Sources\String.pb", 1))
Re: most efficient way to find eol?
Posted: Fri Dec 16, 2016 1:21 am
by IdeasVacuum
Hi Rashad
..... not quite, I need to know what the Eol is, rather than where it is.
Hi Skywalk
I did consider Opening the file to read and moving the read pointer to nearly the end of the file, but it's inefficient if the code is later loading the file into a buffer. Also, I discovered I can't work from the end of the file as some of the files do not have an EOL on the last line.
All
Well, it seems that the original code is no less efficient than the alternatives suggested, but that does depend on how the subsequent code will handle finding a string and displaying the text lines and line numbers. My current code for that is accurate, but not fast, at least not compared to Notepad++: my code takes just over 1 second per file and NP++ does 12 files per second! So that's my next post.

Re: most efficient way to find eol?
Posted: Fri Dec 16, 2016 3:20 am
by Keya
btw if your final solution ends up using PB's general-purpose FindString() you could improve performance there with wilbert's high speed FindData() (Boyer-Moore etc, pretty slick), although seeing as you're only looking for 2 bytes I'm wondering if that might actually turn out a bit slower, I'm not sure
Re: most efficient way to find eol?
Posted: Fri Dec 16, 2016 3:20 am
by skywalk
IdeasVacuum wrote: Hi Skywalk
I did consider Opening the file to read and moving the read pointer to nearly the end of the file, but it's inefficient if the code is later loading the file into a buffer. Also, I discovered I can't work from the end of the file as some of the files do not have an EOL on the last line.
That is not what my code is doing? I'm only moving past the byte order mark (BOM) and reading a single line. You can modify the procedure to return a handle to the opened file if you want to save some time. It is not a bottleneck for my use case.
Re: most efficient way to find eol?
Posted: Fri Dec 16, 2016 3:47 am
by IdeasVacuum
Hi skywalk
Yes I appreciate that is not what you are doing, I qualified my remark.