Find a string in *Memory

Post by Little John »

I see.

Post by charvista »

@infratec
I confirm that it is working very well.
Thanks also for your last addition for skywalk's #PB_UTF8 question; it helped me get the procedure working for #PB_Ascii as well.
On a 15 GB file (that's about fifteen milliard characters! **), under #PB_Ascii, loading it into memory AND searching it for a string takes 10 seconds in total. :shock:

** For Americans, it is 15 billion. :wink:
- Windows 11 Home 64-bit
- PureBasic 6.10 LTS (x64)
- 64 Gb RAM
- 13th Gen Intel(R) Core(TM) i9-13900K 3.00 GHz
- 5K monitor with DPI @ 200%

Post by infratec »

charvista wrote: Sun Mar 31, 2024 4:57 pm On a 15 GB file ...
Are you perhaps also a radio amateur like wimapon (pa0slt), needing to extract the lines with your own call sign :?:
https://www.purebasic.fr/english/viewtopic.php?t=83888

:mrgreen:

Post by charvista »

infratec wrote: Mon Apr 01, 2024 10:14 am Are you perhaps also a radio amateur like wimapon (pa0slt), needing to extract the lines with your own call sign :?:
Ha ha, no, I am not a radio amateur, but the idea does indeed come from wimapon, and I found the problem very interesting. Reading such a large file was something I had never done before, so I found out where the limits of PureBasic are, or where there are none. For example, the help file's statement that a string (.s) is unlimited is misleading, because it has a 2 GB barrier, which means about 1 billion characters in Unicode (2 bytes per character). Going past this limit (e.g. with X.s=Space()) can behave erratically without warning. ReadData() also appears to be limited to a 2 GB buffer.
As far as I know, memory is limited only by the amount of RAM the computer has. To fill the memory, I therefore read the file in blocks of 1 GB, as shown in the code fragment below; that is not so practical, but it works.
Your code for searching for a word at lightning speed is really fantastic. AZJIO's code was also very fast, but there is always a winner :wink:

Code: Select all

Fwim=OpenFile(#PB_Any,_DirData+"WIMAPON.CSV",#PB_Ascii); a 15 GB file !
    FileSZ=FileSize(_DirData+"WIMAPON.CSV")
    *MEM=AllocateMemory(FileSZ)
    BlockOffset=0
    BlockSize=1000000000
    FSZ=FileSZ
    
    Repeat; loop because ReadData() is limited to 2GB so doing steps of max 1'000'000'000 seems to be safe (half because I think ReadData() is Unicode, thus doubled)
        ReadData(Fwim,*MEM+BlockOffset,BlockSize)
        BlockOffset+BlockSize
        If FSZ>BlockSize
            FSZ-BlockSize
        Else
            BlockSize=FSZ
            FSZ=0
        EndIf
    Until FSZ=0
I must say that PureBasic is a fantastic programming language, and it continues to evolve. Since April 1st I have been using the new PB 6.10, and that's no April Fool's joke. :mrgreen:
To boldly go....

Post by infratec »

Have you tested this too:
https://www.purebasic.fr/english/viewto ... 77#p618177

Post by AZJIO »

charvista wrote: Mon Apr 01, 2024 11:47 am AZJIO's code was also very fast, but there is always a winner :wink:
Do you want to create a competition? I only showed the idea; each specific case calls for a different approach, so the current version is not universal.
charvista wrote: Mon Apr 01, 2024 11:47 am ReadData() is Unicode
No. This function reads the file as-is, as raw binary data, without changing its contents.
ReadData() returns the number of bytes read. If the last block at the end of the file is shorter than the requested block size, what matters is the number of bytes actually read, not the size of the memory block allocated for the data.
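As a minimal sketch of that point (not anyone's posted code): the loop below advances by the value ReadData() returns and never requests more bytes than the buffer still has room for. The file name and the 1 GB #ChunkSize are placeholders chosen for illustration.

```purebasic
EnableExplicit

#ChunkSize = 1000000000 ; bytes per ReadData() call, an arbitrary choice for this sketch

Define File.i, FileSZ.q, Done.q, Want.q, Got.q
Define *Mem
Define Filename$

Filename$ = "WIMAPON.CSV" ; hypothetical path, adjust as needed

FileSZ = FileSize(Filename$) ; -1 = not found, -2 = directory
If FileSZ > 0
    File = ReadFile(#PB_Any, Filename$)
    If File
        *Mem = AllocateMemory(FileSZ, #PB_Memory_NoClear)
        If *Mem
            While Done < FileSZ
                Want = FileSZ - Done        ; never ask for more than the buffer can still hold
                If Want > #ChunkSize
                    Want = #ChunkSize
                EndIf
                Got = ReadData(File, *Mem + Done, Want)
                If Got <= 0
                    Break                   ; read error or unexpected end of file
                EndIf
                Done + Got                  ; advance by what was actually read
            Wend
            Debug "Bytes loaded: " + Str(Done)
            FreeMemory(*Mem)
        EndIf
        CloseFile(File)
    EndIf
EndIf
```

If Done ends up smaller than FileSZ, only the first Done bytes of the buffer hold valid data.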
charvista wrote: Mon Apr 01, 2024 11:47 am because it has a 2 GB barrier,
Previously there was a limit of 2 GB per process. I stayed silent because I was not sure whether that has changed since. That is, you must add up all the variables, not just look at one. As soon as you copy the data into a string variable, you create another copy in Unicode that is twice the size, so it accounts for 2/3 of the total. You cannot free the allocated memory before the Unicode variable has been created. When I searched and replaced data in a file, I ran into a practical limit of about 200 MB, because the data was not only in Unicode but also duplicated, and the result had to be stored as well.
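A small sketch of the doubling, assuming a Unicode build of PureBasic (the default in recent versions). The sample text is made up for the example.

```purebasic
EnableExplicit

Define *Ascii, Text$, AsciiBytes.i, UnicodeBytes.i

*Ascii = Ascii("PA0SLT,2024-03-31,14.0956")       ; ASCII buffer: 1 byte per character plus a 0 terminator
If *Ascii
    AsciiBytes   = MemorySize(*Ascii) - 1         ; payload without the terminator
    Text$        = PeekS(*Ascii, -1, #PB_Ascii)   ; copy into a normal PureBasic string
    UnicodeBytes = StringByteLength(Text$)        ; bytes used by the string copy, without terminator
    Debug "ASCII buffer: " + Str(AsciiBytes) + " bytes"
    Debug "String copy : " + Str(UnicodeBytes) + " bytes"
    FreeMemory(*Ascii)                            ; until this point both copies exist at the same time
EndIf
```

While both the buffer and the string are alive, the process holds the original data plus a copy that is twice as large.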

Post by charvista »

infratec wrote: Mon Apr 01, 2024 12:53 pm Have you tested this too:
https://www.purebasic.fr/english/viewto ... 77#p618177
Not at the time, but I have now.
And there is a significant speed difference between the code you wrote in the other topic and my final test program.
Your code (I only added the ElapsedMilliseconds()):

Code: Select all

EnableExplicit

#BufferSize = 65535


Define.i InFile, OutFile, ReadLen, Offset, Found, State, q0, q1, ms
Define *Buffer, *BufferEnd, *Ptr.Ascii, *LastEOL
Define Filename$, Line$


Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
    InFile = ReadFile(#PB_Any, Filename$)
    If InFile
        
        q0.i=ElapsedMilliseconds()
        OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
        If OutFile

            q0=ElapsedMilliseconds()      
            *Buffer = AllocateMemory(#BufferSize, #PB_Memory_NoClear)
            If *Buffer
                *LastEOL = *Buffer
                
                
                While Not Eof(InFile)
                    
                    ReadLen = ReadData(InFile, *Buffer + Offset, MemorySize(*Buffer) - Offset)
                    
                    ;Debug ReadLen
                    
                    *BufferEnd = *Buffer + Offset + ReadLen ; the carried-over partial line (Offset bytes) precedes the newly read data
                    *Ptr = *Buffer
                    
                    While *Ptr < *BufferEnd
                        
                        If *Ptr\a < ' '
                            
                            If Found
                                WriteData(OutFile, *LastEOL, *Ptr - *LastEOL)
                                WriteStringN(OutFile, "")
                                Found = #False
                            EndIf
                            
                            *Ptr + 1
                            *LastEOL = *Ptr + 1
                        Else
                            Select State
                                Case 0 
                                    If *Ptr\a = 'P'
                                        State + 1
                                    EndIf
                                Case 1
                                    If *Ptr\a = 'A'
                                        State + 1
                                    Else
                                        State = 0
                                    EndIf
                                Case 2
                                    If *Ptr\a = '0'
                                        State + 1
                                    Else
                                        State = 0
                                    EndIf
                                Case 3
                                    If *Ptr\a = 'S'
                                        State + 1
                                    Else
                                        State = 0
                                    EndIf
                                Case 4
                                    If *Ptr\a = 'L'
                                        State + 1
                                    Else
                                        State = 0
                                    EndIf
                                Case 5
                                    If *Ptr\a = 'T'
                                        Found = #True
                                    EndIf
                                    State = 0
                            EndSelect
                        EndIf
                        
                        *Ptr + 1
                        
                    Wend
                    
                    Offset = *Ptr - *LastEOL
                    CopyMemory(*LastEOL, *Buffer, Offset)
                    *LastEOL = *Buffer
                    
                Wend
                
                FreeMemory(*Buffer)
            EndIf
            
            CloseFile(OutFile)
            
        EndIf
        
        CloseFile(InFile)
    EndIf
EndIf
q1=ElapsedMilliseconds()
ms=q1-q0
MessageRequester("Time elapsed",StrD(ms/1000,2)+" seconds")
And my code (using your procedure):

Code: Select all

EnableExplicit
DisableDebugger

#SEP = Chr(166); separator
#GBytes = 1024 * 1024 * 1024

Procedure.i zFindMemoryAscii(*Memory,*ToFind.Ascii,*MemPos.string,FindFromEnd.i=#False,MemoryLength.i=0,ToFindLength.i=0,ReturnOffset.i=#False,*StartAddress.Ascii=#Null)
    Protected.i ByteLength, Result
    Protected *MemEnd, *Ptr.Ascii, *FoundAt, *Ptr2.Ascii, *ToFindTmp.Ascii, *ToFindEnd
    If ToFindLength = 0
        ByteLength = MemorySize(*ToFind)
    Else
        ByteLength = ToFindLength
    EndIf
    *ToFindTmp = *ToFind
    *ToFindEnd = *ToFind + ByteLength
    If FindFromEnd
        If *StartAddress
            *Ptr = *StartAddress - ByteLength
        Else
            *Ptr = *Memory + MemorySize(*Memory) - ByteLength
        EndIf
        Repeat
            If *Ptr\a = *ToFindTmp\a
                *Ptr2 = *Ptr + 1
                *ToFindTmp + 1
                While *Ptr2\a = *ToFindTmp\a
                    *Ptr2 + 1
                    *ToFindTmp + 1
                    If *ToFindTmp = *ToFindEnd
                        *FoundAt = *Ptr
                        Break 2
                    EndIf
                Wend
                *ToFindTmp = *ToFind
            EndIf
            *Ptr - 1
        Until *Ptr <= *Memory
    Else
        *MemEnd = *Memory + MemorySize(*Memory) - ByteLength
        If *StartAddress
            If *StartAddress\a = *ToFind\a
                *Ptr = *StartAddress + 1
            Else
                *Ptr = *StartAddress
            EndIf
        Else
            *Ptr = *Memory
        EndIf
        Repeat
            If *Ptr\a = *ToFindTmp\a
                *Ptr2 = *Ptr + 1
                *ToFindTmp + 1
                While *Ptr2\a = *ToFindTmp\a
                    *Ptr2 + 1
                    *ToFindTmp + 1
                    If *ToFindTmp = *ToFindEnd
                        *FoundAt = *Ptr
                        Break 2
                    EndIf
                Wend
                *ToFindTmp = *ToFind
            EndIf
            *Ptr + 1
        Until *Ptr >= *MemEnd
    EndIf
    If ReturnOffset
        If *FoundAt
            *MemPos\s+Str(*FoundAt - *Memory)+#SEP; string containing all the found Offsets
        EndIf
        Result = *FoundAt - *Memory
    Else
        If *FoundAt
            *MemPos\s+Str(*FoundAt - *Memory)+#SEP; string containing all the found Offsets
        EndIf
        Result = *FoundAt
    EndIf
    ProcedureReturn Result
EndProcedure

Procedure.i zPos(MatchStr.s,ScanStr.s,Relate.s="=",IncVal.i=1,OccurVal.i=1)
    Protected.i LenMatchStr,LenScanStr,StartVal,EndVal,ScanPos,OccurCnt
    Protected.s Item
    If Relate=""
        Relate="="
    EndIf
    If IncVal=0
        IncVal=1
    EndIf
    If Relate="=" And IncVal=1 And OccurVal=1
        ProcedureReturn FindString(ScanStr,MatchStr); much faster
    EndIf
    LenMatchStr.i=Len(MatchStr)
    If FindString(":^",Relate,1)
        LenMatchStr=1
    EndIf
    LenScanStr.i=Len(ScanStr)
    If Not LenMatchStr Or Not LenScanStr Or OccurVal<0
        ProcedureReturn 0
    EndIf
    If IncVal>0
        StartVal.i=1
        EndVal.i=LenScanStr
    Else
        StartVal.i=LenScanStr
        EndVal.i=1
    EndIf
    ScanPos.i=StartVal
    OccurCnt.i=0
    While ScanPos*Sign(IncVal)<=EndVal*Sign(IncVal)
        Item.s=Mid(ScanStr,ScanPos,LenMatchStr)
        If Relate="="
            If MatchStr=Item : OccurCnt+1 : EndIf
        ElseIf Relate="<"
            If MatchStr<Item : OccurCnt+1 : EndIf
        ElseIf Relate=">"
            If MatchStr>Item : OccurCnt+1 : EndIf
        ElseIf Relate="<=" Or Relate="=<"
            If MatchStr<=Item : OccurCnt+1 : EndIf
        ElseIf Relate=">=" Or Relate="=>"
            If MatchStr>=Item : OccurCnt+1 : EndIf
        ElseIf Relate="<>" Or Relate="><"
            If MatchStr<>Item : OccurCnt+1 : EndIf
        ElseIf Relate=":"
            If FindString(MatchStr,Item,1) : OccurCnt+1 : EndIf
        ElseIf Relate="^"
            If Not FindString(MatchStr,Item,1) : OccurCnt+1 : EndIf
        Else
            ProcedureReturn 0
        EndIf
        If OccurVal>0
            If OccurCnt=OccurVal
                If OccurCnt
                    ProcedureReturn ScanPos
                Else
                    ProcedureReturn 0
                EndIf
            EndIf
        EndIf
        ScanPos+IncVal
    Wend
    If OccurVal=0
        ProcedureReturn OccurCnt
    EndIf
EndProcedure

Procedure.s zTrim(Var.s,Method.i=2,Char.s=" ")
    If Method.i=0 Or Method.i=2
        While Mid(Var.s,1,1)=Char.s
            Var.s=Mid(Var.s,2)
        Wend
    EndIf
    If Method.i=1 Or Method.i=2
        While Mid(Var.s,Len(Var.s),1)=Char.s
            Var.s=Mid(Var.s,1,Len(Var.s)-1)
        Wend
    EndIf
    ProcedureReturn Var.s
EndProcedure

Procedure.s zStringGetItem(String.s,ItemNumber.i,Delimiter.s=#SEP,TrimTrailingSpaces.i=0)
    Protected.i C
    Protected.s Extract
    If String.s>"" And Right(String.s,1)<>Delimiter.s
        String.s+Delimiter.s
    EndIf
    C.i=CountString(String.s,Delimiter.s)
    If ItemNumber.i>C.i Or ItemNumber.i=0
        ;zErr(47,"zStringGetItem()")
        ProcedureReturn ""
    EndIf
    Extract.s=Mid(String,zPos(Delimiter,Delimiter+String,"=",1,ItemNumber),zPos(Delimiter,String,"=",1,ItemNumber)-zPos(Delimiter,Delimiter+String,"=",1,ItemNumber))
    If TrimTrailingSpaces=1
        Extract=zTrim(Extract,1)
    EndIf
    ProcedureReturn Extract
EndProcedure



Define.i InFile, OutFile, ReadLen, Offset, Found, State, q0, q1, ms, BlockOffset, BlockSize, FileSZ, FSZ, NoOfFoundItems, i, j
Define *Mem, *SearchStr, *Offset
Define Filename$, Line$
Define.s SearchStr
Define.string MemPos


Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
    
    InFile = OpenFile(#PB_Any, Filename$, #PB_Ascii)
    If InFile
        
        q0.i=ElapsedMilliseconds()
        OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
        If OutFile
            
            SearchStr.s="JF5UJS"; "PA0SLT"; "JF5UJS"; "VT3UIT""
            *SearchStr=Ascii(SearchStr)
            FileSZ=FileSize(Filename$)
            *Mem=AllocateMemory(FileSZ)
            BlockOffset=0
            BlockSize=#GBytes
            FSZ=FileSZ
            
            Repeat
                ReadData(InFile, *Mem+BlockOffset, BlockSize)
                BlockOffset+BlockSize
                If FSZ>BlockSize
                    FSZ-BlockSize
                Else
                    BlockSize=FSZ
                    FSZ=0
                EndIf
            Until FSZ=0
            
            MemPos\s=""
            *Offset=0
            Repeat
                *Offset=zFindMemoryAscii(*Mem,*SearchStr,@MemPos,#False,0,MemorySize(*SearchStr)-1,#False,*Offset)
            Until Not *Offset
            
            If MemPos\s=""
                MessageRequester("Empty","The selected item "+SearchStr+" was not found.")
            Else
                ;MessageRequester("Filled","The selected item "+SearchStr+" was found at:"+#CRLF$+MemPos\s)
                NoOfFoundItems=zPos(#SEP,MemPos\s,"=",1,0); get the number of occurrences of the separator
                ;MessageRequester("How many?","There are "+Str(NoOfFoundItems)+" found")
                Dim All.i(NoOfFoundItems,2)
                Dim Ln.s(NoOfFoundItems)
                For i=0 To NoOfFoundItems-1
                    All(i,0)=Val(zStringGetItem(MemPos\s,i+1,#SEP))
                    ;MessageRequester("now getting #"+Str(i+1)+" at offset ",Str(All(i,0)))
                    For j=All(i,0) To All(i,0)-500 Step -1
                        If j=0
                            All(i,1)=0
                            Break
                        ElseIf PeekS(*Mem+j-1,1,#PB_Ascii)=Chr(#LF); the last character from Hex 0D0A
                            All(i,1)=j
                            Break
                        EndIf
                    Next j
                    For j=All(i,0) To All(i,0)+500
                        If PeekS(*Mem+j,1,#PB_Ascii)=Chr(#CR); the first character from Hex 0D0A
                            All(i,2)=j-1
                            Break
                        EndIf
                    Next j
                    Ln(i)=PeekS(*Mem+All(i,1),All(i,2)-All(i,1)+1,#PB_Ascii)
                    WriteStringN(OutFile, Ln(i),#PB_Ascii)
                    ;MessageRequester("Found "+SearchStr+" at Offset "+Str(All(i,0)), "From "+Str(All(i,1))+" to "+Str(All(i,2))+#CRLF$+Ln(i)+#CRLF$+#CRLF$+"2nd item (# seconds since 1970-01-01:"+#CRLF$+zStringGetItem(Ln(i),2,","))
                Next i
            EndIf
            CloseFile(OutFile)
            FreeMemory(*Mem)
        EndIf
        CloseFile(InFile)
    EndIf
    q1.i=ElapsedMilliseconds()
    ms=q1-q0
    MessageRequester("Time elapsed",StrD(ms/1000,2)+" seconds")
EndIf
I modified your procedure to take an additional parameter, MemPos\s, passed by reference, which collects the offsets in a string, separated by a separator character.
I also added three helper procedures so that the code is complete and compiles on its own.
On my computer, your code took around 21.93 s and mine 10.54 s.
It is better to compile both as standalone .exe files so that the timings reflect the real speed (i.e. without the debugger).
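To show only the by-reference part in isolation, here is a minimal sketch. AppendOffset() and the "|" separator are made up for this example and are not part of the program above; the built-in String structure is what makes the pointer access *MemPos\s possible.

```purebasic
EnableExplicit

#Sep$ = "|" ; hypothetical separator for this sketch

; Appends one offset to the caller's string through a pointer to a String structure.
Procedure AppendOffset(*MemPos.String, Offset.q)
    *MemPos\s + Str(Offset) + #Sep$
EndProcedure

Define MemPos.String

AppendOffset(@MemPos, 1024)
AppendOffset(@MemPos, 98765)

Debug MemPos\s                          ; 1024|98765|
Debug CountString(MemPos\s, #Sep$)      ; number of collected offsets
Debug StringField(MemPos\s, 2, #Sep$)   ; second offset as text
```

For splitting the collected string, the built-in CountString() and StringField() may already be enough, which could make helper procedures for counting and extracting items unnecessary for this particular job.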

Post by normeus »

The true winners are the PureBasic users who benefit from all the great code that @infratec, @AZJIO and all the others kindly provide. Thank you for your code and the time you spend working on it.

Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari

Post by charvista »

AZJIO wrote: Do you want to create a competition?
No, but the fastest way is obviously the most interesting one (we are too lazy to wait :mrgreen: )
I agree that different approaches are interesting; we learn from them (we understand a lot by seeing what happens).
AZJIO wrote: (on "ReadData() is Unicode") No.
Good to know, thank you. I then tested with a doubled BlockSize, but the final speed is the same, so I left the code at a safe 1 GB block.
AZJIO wrote: Previously there was a limit of 2 GB per process.
Technology evolves. But yes, you are right, we must be cautious: Windows itself and the other programs running at the same time all consume memory, so we must balance memory usage against speed. Testing is the best way to see whether all is well. :wink:

Post by charvista »

normeus wrote: Mon Apr 01, 2024 5:44 pm The true winners are the PureBasic users who benefit from all the great code that @infratec, @AZJIO and all the others kindly provide. Thank you for your code and the time you spend working on it.
You're welcome 8)

Post by infratec »

I set the buffer only to 65536 bytes.
Please increase it to 100MB or 1GB and test it again.

Btw. I don't know what kind of file the CSV is.
I assumed it is an ascii file.

Post by charvista »

infratec wrote: Please increase it to 100MB or 1GB and test it again.
I tested every buffer configuration 3 times (to check that the timings are consistent); here are the results:
22.06s - 22.04s - 22.21s with buffer=32'768
21.31s - 21.15s - 21.27s with buffer=65'536
20.89s - 20.83s - 20.86s with buffer=256'000
21.13s - 21.42s - 21.34s with buffer=500'000
20.91s - 20.60s - 20.68s with buffer=1'000'000
21.43s - 21.50s - 21.52s with buffer=500'000'000
20.91s - 20.60s - 20.68s with buffer=1'000'000'000
22.31s - 22.43s - 22.41s with buffer=15'000'000'000
Conclusion: increasing the buffer to about 1 MB is probably the best (safest) option: it gains roughly 1 second and does not use too much memory.

My version, on the other hand, gives different values:
11.08s - 11.14s - 10.99s with buffer=65'536
10.96s - 11.05s - 10.94s with buffer=1'000'000
10.90s - 10.92s - 10.80s with buffer=1'000'000'000
08.19s - 08.13s - 08.10s with buffer=15'000'000'000
When the whole 15 GB file is loaded into memory, we get the best results (once it even went slightly under 8 seconds!)
infratec wrote: Btw. I don't know what kind of file the CSV is. I assumed it is an ascii file.
A .csv file is a plain-text format for exporting spreadsheet data, for example from MS Excel, with the values separated by commas or another specified character, so that other programs that support .csv files can import them. So yes, it is a pure ASCII file (or possibly Unicode if it contains special characters; I have not tested that, but wimapon's CSV file is ASCII).
The free LibreOffice spreadsheet can do that too, not as "Export" but via "Save As".

Post by AZJIO »

charvista, regarding the example using "Select State": you can avoid allocating a buffer at all by using ReadByte().
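A rough sketch of that idea, assuming a pure ASCII file and the call sign PA0SLT as the search text. It only counts matches instead of extracting lines, and the simple fallback on a mismatch is only correct for patterns whose first character does not occur again later in the pattern. Whether it is faster than the buffered versions would need to be measured.

```purebasic
EnableExplicit

Define File.i, State.i, PatLen.i, i.i, Hits.q
Define Value.a
Define Filename$, Pattern$

Pattern$ = "PA0SLT"                 ; search text, taken from the example above
PatLen   = Len(Pattern$)

Dim Pat.a(PatLen - 1)               ; pattern as bytes, so the loop compares plain numbers
For i = 0 To PatLen - 1
    Pat(i) = Asc(Mid(Pattern$, i + 1, 1))
Next i

Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
    File = ReadFile(#PB_Any, Filename$)
    If File
        While Not Eof(File)
            Value = ReadByte(File)  ; one byte at a time, no user-allocated buffer
            If Value = Pat(State)
                State + 1
                If State = PatLen   ; whole pattern matched
                    Hits + 1
                    State = 0
                EndIf
            ElseIf Value = Pat(0)
                State = 1           ; the mismatching byte may itself start a new match
            Else
                State = 0
            EndIf
        Wend
        CloseFile(File)
        Debug "Occurrences: " + Str(Hits)
    EndIf
EndIf
```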