Find a string in *Memory
Re: Find a string in *Memory
@infratec
I confirm that it is working very well.
Thanks also for your last addition for skywalk's #PB_UTF8 question; it helped me to get the procedure for #PB_Ascii as well.
On a 15 GB file (that's about fifteen milliard characters! **), under #PB_Ascii, loading it into memory AND searching it for a string takes 10 seconds in total.
** For Americans, it is 15 billion.
- Windows 11 Home 64-bit
- PureBasic 6.10 LTS (x64)
- 64 Gb RAM
- 13th Gen Intel(R) Core(TM) i9-13900K 3.00 GHz
- 5K monitor with DPI @ 200%
Re: Find a string in *Memory
Are you not also a radio amateur like wimapon (pa0slt), needing to extract the lines with your own call sign?

https://www.purebasic.fr/english/viewtopic.php?t=83888

Re: Find a string in *Memory
infratec wrote: Mon Apr 01, 2024 10:14 am Are you not also a radio amateur like wimapon (pa0slt), needing to extract the lines with your own call sign?
Ha ha, no, I am not a radio amateur, but the idea does indeed come from Wimapon, and I found the problem very interesting. Reading such a large file was something I had never done before, so I found out what the limits of PureBasic are, or the absence of them. For example, the help file says that a string .s is unlimited, which is misleading: there is a 2 GB barrier, meaning a limit of 1 GB of characters in Unicode. Going past this limit with X.s=Space() can behave erratically without warning. ReadData() is likewise limited to a 2 GB buffer.
Memory is, as far as I know, limited only by the amount of RAM the computer has. To fill it, I therefore need to read the file in blocks of 1 GB, as shown in the code fragment below; that is not so practical, but it is possible.
Your code for searching for a word at lightning speed is really fantastic. AZJIO's code was also very fast, but there is always a winner.

Code: Select all
Fwim=OpenFile(#PB_Any,_DirData+"WIMAPON.CSV",#PB_Ascii); a 15 GB file !
FileSZ=FileSize(_DirData+"WIMAPON.CSV")
*MEM=AllocateMemory(FileSZ)
BlockOffset=0
BlockSize=1000000000
FSZ=FileSZ
Repeat; loop because ReadData() is limited to 2GB so doing steps of max 1'000'000'000 seems to be safe (half because I think ReadData() is Unicode, thus doubled)
  ReadData(Fwim,*MEM+BlockOffset,BlockSize)
  BlockOffset+BlockSize
  If FSZ>BlockSize
    FSZ-BlockSize
  Else
    BlockSize=FSZ
    FSZ=0
  EndIf
Until FSZ=0

To boldly go....
Re: Find a string in *Memory
Have you tested this too:
https://www.purebasic.fr/english/viewto ... 77#p618177
Re: Find a string in *Memory
charvista wrote: Mon Apr 01, 2024 11:47 am AZJIO's code was also very fast, but there is always a winner
Do you want to create a competition? I showed the idea, but each specific case will call for a different approach, so the current version will not be universal.
Quote: I think ReadData() is Unicode
No. This function reads the file as binary, as is, without changing its contents.
ReadData() returns the number of bytes read. If, at the end of the file, fewer bytes are read than the block size, it is the number of bytes actually read that should matter to you, not the size of the memory block allocated for the data.
Previously there was a limit of 2 GB per process. I remained silent because I was not sure whether the situation has changed since. That is, you must add up all the variables, not just count one. As soon as you copy the data into a variable, you create another 2/3 in Unicode. You cannot free the allocated memory before you have created the Unicode variable. When I searched and replaced data in a file, I ran into a real limit of 200 MB, because the data was not only in Unicode but also duplicated, and the result had to be stored as well.
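The point about ReadData()'s return value can be turned into a block loop that advances by the number of bytes actually read instead of by the requested block size. The following is only a minimal sketch, not code from this thread: the procedure name LoadFileInBlocks(), the *Size output parameter and the variable names are made up for illustration, and the 1 GB block size is simply the value used earlier in the topic.

```purebasic
; Hypothetical helper (not from the thread): load a whole file into memory in blocks,
; advancing by the number of bytes ReadData() actually returned.
EnableExplicit

#BlockSize = 1024 * 1024 * 1024 ; 1 GB per request, as used earlier in this topic

Procedure.i LoadFileInBlocks(FileName$, *Size.Quad)
  Protected File.i, *Mem, Total.q, Done.q, Want.q, Got.q

  File = ReadFile(#PB_Any, FileName$)
  If File = 0
    ProcedureReturn 0 ; file could not be opened
  EndIf

  Total = Lof(File)
  If Total > 0
    *Mem = AllocateMemory(Total, #PB_Memory_NoClear)
  EndIf

  If *Mem
    While Done < Total
      ; never request more than what is left, so the buffer cannot be overrun
      Want = Total - Done
      If Want > #BlockSize
        Want = #BlockSize
      EndIf
      Got = ReadData(File, *Mem + Done, Want)
      If Got <= 0
        Break ; read error or unexpected end of file
      EndIf
      Done + Got ; advance by the bytes really read, not by the block size
    Wend
    *Size\q = Done ; report how many bytes are valid in the buffer
  EndIf

  CloseFile(File)
  ProcedureReturn *Mem
EndProcedure

; Usage (the file name is only an example)
Define LoadedBytes.q
Define *FileMem = LoadFileInBlocks("WIMAPON.CSV", @LoadedBytes)
If *FileMem
  Debug "Bytes loaded: " + Str(LoadedBytes)
  FreeMemory(*FileMem)
EndIf
```

The caller should search only the first LoadedBytes bytes of the buffer, since a short read leaves the rest uninitialised (the buffer is allocated with #PB_Memory_NoClear).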
Re: Find a string in *Memory
infratec wrote: Mon Apr 01, 2024 12:53 pm Have you tested this too:
https://www.purebasic.fr/english/viewto ... 77#p618177
Not at that moment, but I have now.
And there is a significant speed difference between the code you wrote in the other topic and my final test program.
Your code (I only added the ElapsedMilliseconds()):
Code: Select all
EnableExplicit
#BufferSize = 65535
Define.i InFile, OutFile, ReadLen, Offset, Found, State, q0, q1, ms
Define *Buffer, *BufferEnd, *Ptr.Ascii, *LastEOL
Define Filename$, Line$
Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
  InFile = ReadFile(#PB_Any, Filename$)
  If InFile
    q0.i=ElapsedMilliseconds()
    OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
    If OutFile
      q0=ElapsedMilliseconds()
      *Buffer = AllocateMemory(#BufferSize, #PB_Memory_NoClear)
      If *Buffer
        *LastEOL = *Buffer
        While Not Eof(InFile)
          ReadLen = ReadData(InFile, *Buffer + Offset, MemorySize(*Buffer) - Offset)
          ;Debug ReadLen
          *BufferEnd = *Buffer + ReadLen
          *Ptr = *Buffer
          While *Ptr < *BufferEnd
            If *Ptr\a < ' '
              If Found
                WriteData(OutFile, *LastEOL, *Ptr - *LastEOL)
                WriteStringN(OutFile, "")
                Found = #False
              EndIf
              *Ptr + 1
              *LastEOL = *Ptr + 1
            Else
              Select State
                Case 0
                  If *Ptr\a = 'P'
                    State + 1
                  EndIf
                Case 1
                  If *Ptr\a = 'A'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 2
                  If *Ptr\a = '0'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 3
                  If *Ptr\a = 'S'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 4
                  If *Ptr\a = 'L'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 5
                  If *Ptr\a = 'T'
                    Found = #True
                  EndIf
                  State = 0
              EndSelect
            EndIf
            *Ptr + 1
          Wend
          Offset = *Ptr - *LastEOL
          CopyMemory(*LastEOL, *Buffer, Offset)
          *LastEOL = *Buffer
        Wend
        FreeMemory(*Buffer)
      EndIf
      CloseFile(OutFile)
    EndIf
    CloseFile(InFile)
  EndIf
EndIf
q1=ElapsedMilliseconds()
ms=q1-q0
MessageRequester("Time elapsed",StrD(ms/1000,2)+" seconds")
My final test program:
Code: Select all
EnableExplicit
DisableDebugger
#SEP = Chr(166); separator
#GBytes = 1024 * 1024 * 1024
Procedure.i zFindMemoryAscii(*Memory,*ToFind.Ascii,*MemPos.string,FindFromEnd.i=#False,MemoryLength.i=0,ToFindLength.i=0,ReturnOffset.i=#False,*StartAddress.Ascii=#Null)
  Protected.i ByteLength, Result
  Protected *MemEnd, *Ptr.Ascii, *FoundAt, *Ptr2.Ascii, *ToFindTmp.Ascii, *ToFindEnd
  If ToFindLength = 0
    ByteLength = MemorySize(*ToFind)
  Else
    ByteLength = ToFindLength
  EndIf
  *ToFindTmp = *ToFind
  *ToFindEnd = *ToFind + ByteLength
  If FindFromEnd
    If *StartAddress
      *Ptr = *StartAddress - ByteLength
    Else
      *Ptr = *Memory + MemorySize(*Memory) - ByteLength
    EndIf
    Repeat
      If *Ptr\a = *ToFindTmp\a
        *Ptr2 = *Ptr + 1
        *ToFindTmp + 1
        While *Ptr2\a = *ToFindTmp\a
          *Ptr2 + 1
          *ToFindTmp + 1
          If *ToFindTmp = *ToFindEnd
            *FoundAt = *Ptr
            Break 2
          EndIf
        Wend
        *ToFindTmp = *ToFind
      EndIf
      *Ptr - 1
    Until *Ptr <= *Memory
  Else
    *MemEnd = *Memory + MemorySize(*Memory) - ByteLength
    If *StartAddress
      If *StartAddress\a = *ToFind\a
        *Ptr = *StartAddress + 1
      Else
        *Ptr = *StartAddress
      EndIf
    Else
      *Ptr = *Memory
    EndIf
    Repeat
      If *Ptr\a = *ToFindTmp\a
        *Ptr2 = *Ptr + 1
        *ToFindTmp + 1
        While *Ptr2\a = *ToFindTmp\a
          *Ptr2 + 1
          *ToFindTmp + 1
          If *ToFindTmp = *ToFindEnd
            *FoundAt = *Ptr
            Break 2
          EndIf
        Wend
        *ToFindTmp = *ToFind
      EndIf
      *Ptr + 1
    Until *Ptr >= *MemEnd
  EndIf
  If ReturnOffset
    If *FoundAt
      *MemPos\s+Str(*FoundAt - *Memory)+#SEP; string containing all the found Offsets
    EndIf
    Result = *FoundAt - *Memory
  Else
    If *FoundAt
      *MemPos\s+Str(*FoundAt - *Memory)+#SEP; string containing all the found Offsets
    EndIf
    Result = *FoundAt
  EndIf
  ProcedureReturn Result
EndProcedure
Procedure.i zPos(MatchStr.s,ScanStr.s,Relate.s="=",IncVal.i=1,OccurVal.i=1)
  Protected.i LenMatchStr,LenScanStr,StartVal,EndVal,ScanPos,OccurCnt
  Protected.s Item
  If Relate=""
    Relate="="
  EndIf
  If IncVal=0
    IncVal=1
  EndIf
  If Relate="=" And IncVal=1 And OccurVal=1
    ProcedureReturn FindString(ScanStr,MatchStr); much faster
  EndIf
  LenMatchStr.i=Len(MatchStr)
  If FindString(":^",Relate,1)
    LenMatchStr=1
  EndIf
  LenScanStr.i=Len(ScanStr)
  If Not LenMatchStr Or Not LenScanStr Or OccurVal<0
    ProcedureReturn 0
  EndIf
  If IncVal>0
    StartVal.i=1
    EndVal.i=LenScanStr
  Else
    StartVal.i=LenScanStr
    EndVal.i=1
  EndIf
  ScanPos.i=StartVal
  OccurCnt.i=0
  While ScanPos*Sign(IncVal)<=EndVal*Sign(IncVal)
    Item.s=Mid(ScanStr,ScanPos,LenMatchStr)
    If Relate="="
      If MatchStr=Item : OccurCnt+1 : EndIf
    ElseIf Relate="<"
      If MatchStr<Item : OccurCnt+1 : EndIf
    ElseIf Relate=">"
      If MatchStr>Item : OccurCnt+1 : EndIf
    ElseIf Relate="<=" Or Relate="=<"
      If MatchStr<=Item : OccurCnt+1 : EndIf
    ElseIf Relate=">=" Or Relate="=>"
      If MatchStr>=Item : OccurCnt+1 : EndIf
    ElseIf Relate="<>" Or Relate="><"
      If MatchStr<>Item : OccurCnt+1 : EndIf
    ElseIf Relate=":"
      If FindString(MatchStr,Item,1) : OccurCnt+1 : EndIf
    ElseIf Relate="^"
      If Not FindString(MatchStr,Item,1) : OccurCnt+1 : EndIf
    Else
      ProcedureReturn 0
    EndIf
    If OccurVal>0
      If OccurCnt=OccurVal
        If OccurCnt
          ProcedureReturn ScanPos
        Else
          ProcedureReturn 0
        EndIf
      EndIf
    EndIf
    ScanPos+IncVal
  Wend
  If OccurVal=0
    ProcedureReturn OccurCnt
  EndIf
EndProcedure
Procedure.s zTrim(Var.s,Method.i=2,Char.s=" ")
  If Method.i=0 Or Method.i=2
    While Mid(Var.s,1,1)=Char.s
      Var.s=Mid(Var.s,2)
    Wend
  EndIf
  If Method.i=1 Or Method.i=2
    While Mid(Var.s,Len(Var.s),1)=Char.s
      Var.s=Mid(Var.s,1,Len(Var.s)-1)
    Wend
  EndIf
  ProcedureReturn Var.s
EndProcedure
Procedure.s zStringGetItem(String.s,ItemNumber.i,Delimiter.s=#SEP,TrimTrailingSpaces.i=0)
  Protected.i C
  Protected.s Extract
  If String.s>"" And Right(String.s,1)<>Delimiter.s
    String.s+Delimiter.s
  EndIf
  C.i=CountString(String.s,Delimiter.s)
  If ItemNumber.i>C.i Or ItemNumber.i=0
    ;zErr(47,"zStringGetItem()")
    ProcedureReturn ""
  EndIf
  Extract.s=Mid(String,zPos(Delimiter,Delimiter+String,"=",1,ItemNumber),zPos(Delimiter,String,"=",1,ItemNumber)-zPos(Delimiter,Delimiter+String,"=",1,ItemNumber))
  If TrimTrailingSpaces=1
    Extract=zTrim(Extract,1)
  EndIf
  ProcedureReturn Extract
EndProcedure
Define.i InFile, OutFile, ReadLen, Offset, Found, State, q0, q1, ms, BlockOffset, BlockSize, FileSZ, FSZ, NoOfFoundItems, i, j
Define *Mem, *SearchStr, *Offset
Define Filename$, Line$
Define.s SearchStr
Define.string MemPos
Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
  InFile = OpenFile(#PB_Any, Filename$, #PB_Ascii)
  If InFile
    q0.i=ElapsedMilliseconds()
    OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
    If OutFile
      SearchStr.s="JF5UJS"; "PA0SLT"; "JF5UJS"; "VT3UIT""
      *SearchStr=Ascii(SearchStr)
      FileSZ=FileSize(Filename$)
      *Mem=AllocateMemory(FileSZ)
      BlockOffset=0
      BlockSize=#GBytes
      FSZ=FileSZ
      Repeat
        ReadData(InFile, *Mem+BlockOffset, BlockSize)
        BlockOffset+BlockSize
        If FSZ>BlockSize
          FSZ-BlockSize
        Else
          BlockSize=FSZ
          FSZ=0
        EndIf
      Until FSZ=0
      MemPos\s=""
      *Offset=0
      Repeat
        *Offset=zFindMemoryAscii(*Mem,*SearchStr,@MemPos,#False,0,MemorySize(*SearchStr)-1,#False,*Offset)
      Until Not *Offset
      If MemPos\s=""
        MessageRequester("Empty","The selected item "+SearchStr+" was not found.")
      Else
        ;MessageRequester("Filled","The selected item "+SearchStr+" was found at:"+#CRLF$+MemPos\s)
        NoOfFoundItems=zPos(#SEP,MemPos\s,"=",1,0); get the number of occurences of the separator
        ;MessageRequester("How many?","There are "+Str(NoOfFoundItems)+" found")
        Dim All.i(NoOfFoundItems,2)
        Dim Ln.s(NoOfFoundItems)
        For i=0 To NoOfFoundItems-1
          All(i,0)=Val(zStringGetItem(MemPos\s,i+1,#SEP))
          ;MessageRequester("now getting #"+Str(i+1)+" at offset ",Str(All(i,0)))
          For j=All(i,0) To All(i,0)-500 Step -1
            If j=0
              All(i,1)=0
              Break
            ElseIf PeekS(*Mem+j-1,1,#PB_Ascii)=Chr(#LF); the last character from Hex 0D0A
              All(i,1)=j
              Break
            EndIf
          Next j
          For j=All(i,0) To All(i,0)+500
            If PeekS(*Mem+j,1,#PB_Ascii)=Chr(#CR); the first character from Hex 0D0A
              All(i,2)=j-1
              Break
            EndIf
          Next j
          Ln(i)=PeekS(*Mem+All(i,1),All(i,2)-All(i,1)+1,#PB_Ascii)
          WriteStringN(OutFile, Ln(i),#PB_Ascii)
          ;MessageRequester("Found "+SearchStr+" at Offset "+Str(All(i,0)), "From "+Str(All(i,1))+" to "+Str(All(i,2))+#CRLF$+Ln(i)+#CRLF$+#CRLF$+"2nd item (# seconds since 1970-01-01:"+#CRLF$+zStringGetItem(Ln(i),2,","))
        Next i
      EndIf
      CloseFile(OutFile)
      FreeMemory(*Mem)
    EndIf
    CloseFile(InFile)
  EndIf
  q1.i=ElapsedMilliseconds()
  ms=q1-q0
  MessageRequester("Time elapsed",StrD(ms/1000,2)+" seconds")
EndIf
I also included three helper procedures so that the program is complete and nothing is missing.
On my computer, your code took around 21.93 s and mine 10.54 s.
It is better to compile them as standalone .exe files so that the timings reflect the real speed (i.e. no debugger).
Re: Find a string in *Memory
The true winners are the PureBasic users who benefit from all the great code @infratec, @AZJIO and all the others kindly provide. Thank you for your code and the time you spend working on it.
Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari
Re: Find a string in *Memory
Quote: Do you want to create a competition?
No, but the fastest way is obviously the most interesting way (we are too lazy to wait).
I agree that different approaches are interesting; we learn from them (we understand a lot by seeing what happens).
Quote: ReadData() is Unicode - No.
Good to know, thank you. I then tested with a doubled BlockSize, but the final speed is the same, so I left the code at a safe 1 GB block.
Quote: Previously there was a limit of 2 GB per process.
Technology evolves. But yes, you are right, we must be cautious: Windows itself and the other programs running at the same time all consume memory, so we must balance memory usage against speed. Testing is the best way to see whether all is well.

Re: Find a string in *Memory
normeus wrote: Mon Apr 01, 2024 5:44 pm The true winners are the PureBasic users who benefit from all the great code @infratec, @AZJIO and all the others kindly provide. Thank you for your code and the time you spend working on it.
You're welcome.

Re: Find a string in *Memory
I set the buffer only to 65536 bytes.
Please increase it to 100MB or 1GB and test it again.
Btw. I don't know what kind of file the CSV is.
I assumed it is an ascii file.
Re: Find a string in *Memory
Quote: Please increase it to 100MB or 1GB and test it again.
I tested every buffer configuration 3 times (to see whether the timings are consistent); here are the results:
22.06s - 22.04s - 22.21s with buffer=32'768
21.31s - 21.15s - 21.27s with buffer=65'536
20.89s - 20.83s - 20.86s with buffer=256'000
21.13s - 21.42s - 21.34s with buffer=500'000
20.91s - 20.60s - 20.68s with buffer=1'000'000
21.43s - 21.50s - 21.52s with buffer=500'000'000
20.91s - 20.60s - 20.68s with buffer=1'000'000'000
22.31s - 22.43s - 22.41s with buffer=15'000'000'000
Conclusion: increasing the buffer to about 1 MB is probably the best (safest) option; you gain up to about 1 second without using too much memory.
On the other hand, testing my version gives different values:
11.08s - 11.14s - 10.99s with buffer=65'536
10.96s - 11.05s - 10.94s with buffer=1'000'000
10.90s - 10.92s - 10.80s with buffer=1'000'000'000
08.19s - 08.13s - 08.10s with buffer=15'000'000'000
When the whole 15 GB file is loaded into memory in one go, we get the best results (once it even went slightly under 8 seconds!).
Quote: Btw. I don't know what kind of file the CSV is. I assumed it is an ascii file.
A .csv file is a format for exporting the data of, for example, an MS-Excel sheet as readable text, with the values separated by commas or by another specified character, so that other programs that support .csv files can import them. So yes, it is a pure ASCII file (or maybe even Unicode if there are special characters; I have not tested that, but wimapon's csv file is ASCII).
The free LibreOffice spreadsheet can do that too, not as "Export" but as "Save As".
Re: Find a string in *Memory
charvista
Example using "Select State": you can avoid spending memory on a buffer at all by using ReadByte()