quick reading a file of 15 GByte.... how?

SMaag · Post by **SMaag** » Mon Mar 25, 2024 12:52 pm

@wimapon

That's the same problem I have: very large industrial production log files in CSV style to analyse. For that purpose I started a FastString and a CSV project, but they are still under development. Maybe we can work together on this!

The way to solve that problem is:

1. Read the file into memory in blocks. Check the available RAM and decide how much you can use. It is best to use 4 KiB-aligned reads, because that is the sector size of HDDs and the page size of an x64 OS.
For universal buffer handling I wrote a module which I want to use for CSV and BigString handling:
https://github.com/Maagic7/PureBasicFra ... _Buffer.pb

2. Search the block for your string.
At the position found, you have to search for the line start and the line end:
- to get the start of the line, search backwards from the found string for the previous EndOfLine
- to get the end of the line, search forwards for the next EndOfLine
- copy the line out of memory

Note: EndOfLine may differ between Windows (CR+LF) and Linux (LF)!
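
The search and line extraction from step 2 could look roughly like the sketch below. It is only a minimal illustration, assuming ASCII data that is already in a memory block. `FindBytes` and `ExtractLine` are hypothetical helper names (not part of the linked modules), and the byte-by-byte search is a plain stand-in for a faster SSE routine.

```purebasic
EnableExplicit

; Hypothetical helper: address of the first occurrence of *Needle in
; [*Buf, *BufEnd), or 0 if it is not found. Plain byte-wise search.
Procedure.i FindBytes(*Buf, *BufEnd, *Needle, NeedleLen)
  Protected *p = *Buf
  Protected *Last = *BufEnd - NeedleLen
  While *p <= *Last
    If CompareMemory(*p, *Needle, NeedleLen)
      ProcedureReturn *p
    EndIf
    *p + 1
  Wend
  ProcedureReturn 0
EndProcedure

; Hypothetical helper: return the ASCII line that contains the byte at *Match.
Procedure.s ExtractLine(*Buf, *BufEnd, *Match)
  Protected *Start.Ascii = *Match
  Protected *Stop.Ascii  = *Match

  ; search backwards for the previous LF (10) or the start of the block
  While *Start > *Buf
    *Start - 1
    If *Start\a = 10
      *Start + 1
      Break
    EndIf
  Wend

  ; search forwards for CR (13), LF (10) or the end of the block,
  ; so both Windows and Linux line endings are handled
  While *Stop < *BufEnd
    If *Stop\a = 10 Or *Stop\a = 13 : Break : EndIf
    *Stop + 1
  Wend

  ProcedureReturn PeekS(*Start, *Stop - *Start, #PB_Ascii)
EndProcedure

; Small demo on an in-memory block (the sample data is made up)
Define Block$ = "a,1" + #CRLF$ + "x,PA0SLT,7" + #CRLF$ + "b,2" + #CRLF$
Define Size = StringByteLength(Block$, #PB_Ascii)
Define *Buf = AllocateMemory(Size + 1)
Define *Needle = Ascii("PA0SLT")
Define *Hit

If *Buf And *Needle
  PokeS(*Buf, Block$, -1, #PB_Ascii)
  *Hit = FindBytes(*Buf, *Buf + Size, *Needle, 6)
  If *Hit
    Debug ExtractLine(*Buf, *Buf + Size, *Hit)
  EndIf
  FreeMemory(*Needle)
  FreeMemory(*Buf)
EndIf
```

In a real run the block would come from `ReadData()`, and a line that is cut off at the end of the block would have to be carried over into the next read.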

Here are the links to the SSE string functions.
They need SSE 4.2 support:
https://www.purebasic.fr/english/viewtopic.php?t=83316
https://github.com/Maagic7/PureBasicFra ... ringSSE.pb

Another FastString module, which works with standard SSE, is here:
https://github.com/Maagic7/PureBasicFra ... tString.pb

That's all the code I wrote for industrial BigString handling and GB-sized files!
But for the moment it's a low priority project for me.

wimapon · Post by **wimapon** » Mon Mar 25, 2024 12:54 pm

Okay folks,
I will experiment with that.
So thank you very much for your help.

It is always very nice to get help when you have a problem !!!

Wim

jacdelad · Post by **jacdelad** » Mon Mar 25, 2024 1:00 pm

How about reading a huge block and parsing it with RegEx? Then read the next block, and so on. The only thing to do is make sure that the block ends with a line break (move the file pointer back to the last line break for the next read operation) and test it with several block sizes.
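
A rough sketch of that idea, assuming an ASCII file and lines that are much shorter than one block. `UsableLength` and `ScanFile` are hypothetical names, the 4 MiB block size is only a starting value for experiments, and converting each block to a string with `PeekS()` costs extra time that a pure memory search would avoid.

```purebasic
EnableExplicit

#BlockSize = 4 * 1024 * 1024   ; assumption: 4 MiB per read; test several sizes

; Hypothetical helper: number of bytes up to and including the last LF in the
; block, or the whole block if it contains no LF at all.
Procedure.i UsableLength(*Buf, Length)
  Protected *p.Ascii = *Buf + Length
  While *p > *Buf
    *p - 1
    If *p\a = 10
      ProcedureReturn *p - *Buf + 1
    EndIf
  Wend
  ProcedureReturn Length
EndProcedure

; Hypothetical helper: print every regex match found in an ASCII file.
Procedure ScanFile(FileName$, Pattern$)
  Protected File, Regex, *Buf, ReadLen, Usable

  File = ReadFile(#PB_Any, FileName$)
  If File = 0 : ProcedureReturn : EndIf

  Regex = CreateRegularExpression(#PB_Any, Pattern$, #PB_RegularExpression_MultiLine | #PB_RegularExpression_AnyNewLine)
  *Buf  = AllocateMemory(#BlockSize)

  If Regex And *Buf
    While Not Eof(File)
      ReadLen = ReadData(File, *Buf, #BlockSize)
      If ReadLen <= 0 : Break : EndIf

      Usable = ReadLen
      If Not Eof(File)
        ; cut the block at its last line break and move the file pointer
        ; back, so the next read starts with a complete line
        Usable = UsableLength(*Buf, ReadLen)
        FileSeek(File, Usable - ReadLen, #PB_Relative)
      EndIf

      If ExamineRegularExpression(Regex, PeekS(*Buf, Usable, #PB_Ascii))
        While NextRegularExpressionMatch(Regex)
          Debug RegularExpressionMatchString(Regex)
        Wend
      EndIf
    Wend
  EndIf

  If Regex : FreeRegularExpression(Regex) : EndIf
  If *Buf  : FreeMemory(*Buf) : EndIf
  CloseFile(File)
EndProcedure

; Example call (file name is hypothetical): whole lines containing PA0SLT
; ScanFile("production.csv", "^.*PA0SLT.*$")
```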

useful · Post by **useful** » Mon Mar 25, 2024 1:13 pm

wimapon wrote: Mon Mar 25, 2024 12:37 pm It depends of the kind of processing i am doing.
something between 2 and 10 times a day when i am processing.
And it is rather irritating to wait 2 minutes when you are busy and thinking.

I don't quite understand! Do you need to process 2 to 10 files of 15 gigabytes each? Or are you processing the same file multiple times?
You wrote that you are making a selection according to the "PA0SLT" criterion and that this is enough for you.
P.S. I am almost sure that you will inevitably come to the need to apply some kind of indexing mechanism.
P.P.S. In this case, indexing can be done in the process of data accumulation, and not before the analysis itself.
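
As an illustration of what such an index could look like, here is a minimal sketch that maps a key to the file offsets of the lines containing it. Everything specific in it is an assumption: the names `IndexEntry`, `BuildIndex` and `ShowLines`, the "," separator, the key sitting in column 2, and the file name. If the index were written while the data accumulates, only the lookup part would be needed at analysis time.

```purebasic
EnableExplicit

Structure IndexEntry
  List Offsets.q()        ; file positions of all lines that carry this key
EndStructure

; Hypothetical helper: one pass over the file, remembering where each line starts.
; KeyColumn (1-based) and the "," separator are assumptions about the CSV layout.
Procedure BuildIndex(File, KeyColumn, Map Index.IndexEntry())
  Protected Pos.q, Line$, Key$
  FileSeek(File, 0)
  While Not Eof(File)
    Pos   = Loc(File)                      ; offset of the line about to be read
    Line$ = ReadString(File, #PB_Ascii)
    Key$  = StringField(Line$, KeyColumn, ",")
    If Key$ <> ""
      AddElement(Index(Key$)\Offsets())    ; Index(Key$) is created on first use
      Index(Key$)\Offsets() = Pos
    EndIf
  Wend
EndProcedure

; Hypothetical helper: print every line stored under Key$ by seeking straight to it.
Procedure ShowLines(File, Key$, Map Index.IndexEntry())
  If FindMapElement(Index(), Key$)
    ForEach Index()\Offsets()
      FileSeek(File, Index()\Offsets())
      Debug ReadString(File, #PB_Ascii)
    Next
  EndIf
EndProcedure

NewMap Index.IndexEntry()
Define File = ReadFile(#PB_Any, "production.csv")   ; hypothetical file name
If File
  BuildIndex(File, 2, Index())                      ; assumption: key is in column 2
  ShowLines(File, "PA0SLT", Index())
  CloseFile(File)
EndIf
```

Building the index still needs one full pass over the file, so it only pays off when the same file is queried several times.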

NicTheQuick · Post by **NicTheQuick** » Mon Mar 25, 2024 1:21 pm

I just tried something similar with GNU grep and needed 1 minute and 24 seconds to find a small string in a 19 GiB file, which means that it searched through it at ~230 MiB/s.
That was on a quite fast PC (see my profile). Interestingly, the CPU time needed was just 2.8 seconds for searching and 3.9 seconds for kernel calls. I am not sure what took the rest of the time. Since my NVMe SSD is capable of reading 7 GiB/s, it is definitely not the bottleneck.

Anyway, GNU grep (source here) is usually quite efficient with finding stuff and outputting the line with the match. A lot of brain juice was put into that tool. So I don't think it will get much faster than this without preprocessing and indexing your data.

Another idea: What about creating a background job that constantly waits for the kind of files you are talking about and starts processing them? Or does the search pattern change every time? There are different search algorithms out there that can boost performance in certain circumstances. For example, if you are searching for the same thing over and over you can use the Aho–Corasick algorithm, and if you want to search for different patterns in the same data, a suffix array (built with induced sorting, for example) could be helpful.

infratec · Post by **infratec** » Mon Mar 25, 2024 2:26 pm

You can also try this:

Code:

EnableExplicit

#BufferSize = 65535


Define.i InFile, OutFile, ReadLen, Offset, Found, State
Define *Buffer, *BufferEnd, *Ptr.Ascii, *LastEOL
Define Filename$, Line$


Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
  InFile = ReadFile(#PB_Any, Filename$)
  If InFile
    
    OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
    If OutFile
      
      *Buffer = AllocateMemory(#BufferSize, #PB_Memory_NoClear)
      If *Buffer
        *LastEOL = *Buffer
        
        
        While Not Eof(InFile)
          
          ReadLen = ReadData(InFile, *Buffer + Offset, MemorySize(*Buffer) - Offset)
          
          ;Debug ReadLen
          
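          ; valid data = carried-over partial line (Offset bytes) + newly read bytes
          *BufferEnd = *Buffer + Offset + ReadLen
          *Ptr = *Buffer
          State = 0   ; the carried-over partial line is rescanned from its start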
          
          While *Ptr < *BufferEnd
            
            If *Ptr\a < ' '
              
              If Found
                WriteData(OutFile, *LastEOL, *Ptr - *LastEOL)
                WriteStringN(OutFile, "")
                Found = #False
              EndIf
              
              *Ptr + 1
              *LastEOL = *Ptr + 1
            Else
              Select State
                Case 0 
                  If *Ptr\a = 'P'
                    State + 1
                  EndIf
                Case 1
                  If *Ptr\a = 'A'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 2
                  If *Ptr\a = '0'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 3
                  If *Ptr\a = 'S'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 4
                  If *Ptr\a = 'L'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 5
                  If *Ptr\a = 'T'
                    Found = #True
                  EndIf
                  State = 0
              EndSelect
            EndIf
            
            *Ptr + 1
            
          Wend
          
          Offset = *Ptr - *LastEOL
          CopyMemory(*LastEOL, *Buffer, Offset)
          *LastEOL = *Buffer
          
        Wend
        
        FreeMemory(*Buffer)
      EndIf
      
      CloseFile(OutFile)
    EndIf
    
    CloseFile(InFile)
  EndIf
EndIf

But it depends on:
1. The file is stored in ASCII format.
2. Line endings are 2 characters (CR+LF).

You can increase the buffer size constant to improve speed.

SMaag · Post by **SMaag** » Tue Mar 26, 2024 12:12 pm

Here is a timing demo. If I did it right, it is a search in a 512-character (1 KB) string,
with the search string at the end of Big$.

This is repeated for Loops = (1024*1024) * 30,
which should be 30 GB of string data searched in total.

My Ryzen 5800 needs 492 ms to do this!
That seems too fast to me. I hope there is no bug!

Code:

EnableExplicit

Procedure.i SSE_FindStr(*String, *StringToFind)
; Attention: This function is in beta state!
; ============================================================================
; NAME: SSE_FindStr
; DESC: Try to find StringToFind in String with SSE operation (PCmpIStrI)
; DESC: Search for the needle in the haystack
; DESC: This Function is for 2Byte Character Strings only
; VAR(*String): Pointer to String (Haystack)
; VAR(*StringToFind): Pointer to StringToFind (Needle)
; RET.i: If found: The startposition in Characters [1..n]. Otherwise 0
; ============================================================================    
      
  ;DisableDebugger
  
  ; TODO! Solve the 16Byte align problem
  
  CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
    
    CompilerIf #PB_Compiler_64Bit 
      Protected memRAX, memRDX
      ; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1. 
      ; The matching process does not include the terminating null-characters, but it stops there
      ; RAX = haystack (Heuhaufen), RDX = needle (Nadel)
      
      ; XMM0 XMM1 XMM2 XMM3 XMM4
      ; XMM1 = [String1] : XMM2=[String2]
      
      !MOV RAX, [p.p_String]        ; haystack
      !MOV RDX, [p.p_StringToFind]  ; needle
      !MOVDQU XMM2, DQWORD[RDX] ; load the first 16 bytes of needle (String to find)
  
     	!SUB RAX, 16		; Avoid extra jump in main loop
         
      ; ----------------------------------------------------------------------
      ; Find the first possible match of 16-byte fragment in haystack
      ; ----------------------------------------------------------------------
      !FindStr_MainLoop:
        !ADD RAX, 16      ; Step up Counter
        !MOVDQU XMM1, DQWORD[RAX]
       ;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
        !PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
        ; now RCX contains the offset in WORDS where a match was found
       	; Loop while ZF=0 and CF=0:
      	;	1) We find a null in s1(RAX) ZF=1
        ;	2) We find a char that does not match CF=1 
      !JA FindStr_MainLoop
      ; Jump if CF=0, we found only matching chars  
      !JNC FindStr_StrNotFound
      
      ; possible match found at WordOffset in RCX
      !ADD RCX, RCX ; Word to Byte
      !ADD RAX, RCX ; save the possible match start
              
      !MOV [p.v_memRDX], RDX ; mov edi, edx; save RDX
      !MOV [p.v_memRAX], RAX ; mov esi, eax; save RAX
      
      ; ----------------------------------------------------------------------
      ; Compare String, at possible match position in haystack, with needle
      ; ----------------------------------------------------------------------
      !SUB RDX, RAX
      !SUB RAX, 16  ; counter
      
      !PXOR XMM3, XMM3          ; XMM3 = 0
      
      ; compare the strings
      !FindStr_Compare:
        !ADD RAX, 16  ; Counter
        !MOVDQU XMM1, DQWORD[RAX+RDX] ; Haystack          
        ; mask out invalid bytes in the haystack
       ;!PCMPISTRM XMM3, XMM1, 1011000b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK  ; for ASCII Strings
        !PCMPISTRM XMM3, XMM1, 1011001b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
        ; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
        !MOVDQU XMM4, DQWORD[RAX] ; haystack  
        !PAND XMM4, XMM0
        
       ;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
        !PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
       	; Loop while ZF=0 and CF=0:
      	;	1) We find a null in s1(RDX+RCX) ZF=1      {JA CF=0 & ZF=0} {JE : ZF=1)
        ;	2) We find a char that does not match CF=1 {JC, JNC}
        ; 3) We find a null in s2 SF=1               {JS, JNS}
        ;!JS FindStr_StrNotFound 
      !JA FindStr_Compare ; CF=0 AND ZF=0
      
      !MOV RDX, [p.v_memRDX]
      !MOV RAX, [p.v_memRAX]
      !JNC FindStr_StrFound
      
      ;!SUB RAX, 15  ; for ASCII Strings
      !SUB RAX, 14
      !JMP FindStr_MainLoop
      
      !FindStr_StrNotFound:
        !XOR RAX, RAX
        !JMP FindStr_End
        
      !FindStr_StrFound:
        ; because RAX contains the Pointer we have to calculate the Char-No.
        !SUB RAX, [p.p_String]    ; Sub the Haystack Start-Pointer
        !SHR RAX, 1  ; Byte to Word: not needed for ASCII Strings
        !ADD RAX, 1  ; Add 1 to start with 1 as first Char-No.
      !FindStr_End:
      ProcedureReturn  ; !RAX

    CompilerElse  ; #PB_Compiler_32Bit
      
      Protected memEAX, memEDX
      ; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1. 
      ; The matching process does not include the terminating null-characters, but it stops there
      ; EAX = haystack (Heuhaufen), EDX = needle (Nadel)
      
      ; XMM0 XMM1 XMM2 XMM3 XMM4
      ; XMM1 = [String1] : XMM2=[String2]
      
      !MOV EAX, [p.p_String]        ; haystack
      !MOV EDX, [p.p_StringToFind]  ; needle
      !MOVDQU XMM2, DQWORD[EDX] ; load the first 16 bytes of needle (String to find)
  
     	!SUB EAX, 16		; Avoid extra jump in main loop
         
      ; ----------------------------------------------------------------------
      ; Find the first possible match of 16-byte fragment in haystack
      ; ----------------------------------------------------------------------
      !FindStr_MainLoop:
        !ADD EAX, 16      ; Step up Counter
        !MOVDQU XMM1, DQWORD[EAX]
       ;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
        !PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
        ; now ECX contains the offset in WORDS where a match was found
       	; Loop while ZF=0 and CF=0:
      	;	1) We find a null in s1(EAX) ZF=1
        ;	2) We find a char that does not match CF=1 
      !JA FindStr_MainLoop
      ; Jump if CF=0, we found only matching chars  
      !JNC FindStr_StrNotFound
      
      ; possible match found at WordOffset in ECX
      !ADD ECX, ECX ; Word to Byte
      !ADD EAX, ECX ; save the possible match start
              
      !MOV [p.v_memEDX], EDX ; mov edi, edx; save EDX
      !MOV [p.v_memEAX], EAX ; mov esi, eax; save EAX
      
      ; ----------------------------------------------------------------------
      ; Compare String, at possible match position in haystack, with needle
      ; ----------------------------------------------------------------------
      !SUB EDX, EAX
      !SUB EAX, 16  ; counter
      
      !PXOR XMM3, XMM3          ; XMM3 = 0
      
      ; compare the strings
      !FindStr_Compare:
        !ADD EAX, 16  ; Counter
        !MOVDQU XMM1, DQWORD[EAX+EDX] ; Haystack          
        ; mask out invalid bytes in the haystack
       ;!PCMPISTRM XMM3, XMM1, 1011000b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK  ; for ASCII Strings
        !PCMPISTRM XMM3, XMM1, 1011001b   ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
        ; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
        !MOVDQU XMM4, DQWORD[EAX] ; haystack  
        !PAND XMM4, XMM0
        
       ;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
        !PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
       	; Loop while ZF=0 and CF=0:
      	;	1) We find a null in s1(EDX+ECX) ZF=1      {JA CF=0 & ZF=0} {JE : ZF=1)
        ;	2) We find a char that does not match CF=1 {JC, JNC}
        ; 3) We find a null in s2 SF=1               {JS, JNS}
        ;!JS FindStr_StrNotFound 
      !JA FindStr_Compare ; CF=0 AND ZF=0
      
      !MOV EDX, [p.v_memEDX]
      !MOV EAX, [p.v_memEAX]
      !JNC FindStr_StrFound
      
      ;!SUB EAX, 15  ; for ASCII Strings
      !SUB EAX, 14
      !JMP FindStr_MainLoop
      
      !FindStr_StrNotFound:
        !XOR EAX, EAX
        !JMP FindStr_End
        
      !FindStr_StrFound:
        ; because EAX contains the Pointer we have to calculate the Char-No.
        !SUB EAX, [p.p_String]    ; Sub the Haystack Start-Pointer
        !SHR EAX, 1  ; Byte to Word: not needed for ASCII Strings
        !ADD EAX, 1  ; Add 1 to start with 1 as first Char-No.
      !FindStr_End:
     
      ProcedureReturn 
    CompilerEndIf  ; #PB_Compiler_32Bit
    
  CompilerElse    ; C-Backend
    
    ; for now use PB FindString. So it will work on other Platforms too.
    ; maybe provide a C optimized version in the future
    Protected *pStr.String = *String
    Protected *pStrToFind.String = *StringToFind
    
    ProcedureReturn FindString(*pStr\s, *pStrToFind\s)    
  CompilerEndIf    
  
EndProcedure

Define BigSize = 512 ; 512 Char = 1KB
Define Big$=Space(BigSize) 
Define Search$ = "search"
Define LB = Len(Search$)*2

Debug "Len(Big$)= " + Str (Len(Big$))
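; Offsets are in bytes and Big$ is BigSize * SizeOf(Character) bytes long,
; so this places Search$ 5 characters before the end of Big$
CopyMemory(@Search$, @Big$ + BigSize * SizeOf(Character) - LB - 10, LB)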
Debug Search$
Debug "---- Big ----"
Debug Right(Big$, 50)
Debug "-------------"

Define pos

pos = SSE_FindStr(@Big$, @Search$)

Debug "found " + Search$ + " at Character " + pos

; do the timing test only without Debugger
CompilerIf Not #PB_Compiler_Debugger
  #Loops = (1024*1024) * 30  ; 1024*1024 => 1 GB * 30 => 30GB to search
  
  Define I, t1
  
  t1 = ElapsedMilliseconds()
  For I = 1 To #Loops
    pos = SSE_FindStr(@Big$, @Search$)
  Next
  t1 = ElapsedMilliseconds() - t1
  
  Define Msg$
  
  Msg$ = Str(t1) + " " + "ms"
  
  MessageRequester("Timing", Msg$)
CompilerEndIf

Bitblazer · Post by **Bitblazer** » Wed Mar 27, 2024 11:39 am

Maybe somebody else has the time to write a PureBasic example of virtual file mapping on Windows. If it is done right, that is probably unbeatable because it uses the MMU.
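
A minimal sketch of such a file mapping, under several assumptions: Windows only, a 64-bit executable, and the Win32 calls `CreateFileMapping_()`, `MapViewOfFile_()`, `UnmapViewOfFile_()` and `CloseHandle_()` as PureBasic exposes them. The name `CountLinesMapped` and the 256 MiB window are hypothetical, and counting LF bytes is only a placeholder for the real search. Whether this beats plain `ReadData()` would have to be measured.

```purebasic
EnableExplicit

CompilerIf #PB_Compiler_OS = #PB_OS_Windows

  ; Hypothetical helper: count the LF bytes of a file through read-only views.
  ; Returns -1 if the file cannot be opened.
  Procedure.q CountLinesMapped(FileName$)
    Protected File, hMap, *View, *p.Ascii, *End
    Protected.q Size, Offset, Chunk, Lines
    Protected.q Window = 256 * 1024 * 1024   ; view size, a multiple of 64 KiB

    File = ReadFile(#PB_Any, FileName$)
    If File = 0 : ProcedureReturn -1 : EndIf

    Size = Lof(File)
    If Size > 0
      hMap = CreateFileMapping_(FileID(File), 0, #PAGE_READONLY, 0, 0, 0)
      If hMap
        While Offset < Size
          Chunk = Size - Offset
          If Chunk > Window : Chunk = Window : EndIf

          ; view offsets must be multiples of the allocation granularity (64 KiB)
          *View = MapViewOfFile_(hMap, #FILE_MAP_READ, Offset >> 32, Offset & $FFFFFFFF, Chunk)
          If *View = 0 : Break : EndIf

          *p   = *View
          *End = *View + Chunk
          While *p < *End
            If *p\a = 10 : Lines + 1 : EndIf   ; placeholder for the real search
            *p + 1
          Wend

          UnmapViewOfFile_(*View)
          Offset + Chunk
        Wend
        CloseHandle_(hMap)
      EndIf
    EndIf

    CloseFile(File)
    ProcedureReturn Lines
  EndProcedure

  ; Example call (file name is hypothetical):
  ; Debug CountLinesMapped("production.csv")

CompilerEndIf
```

A search for a string like "PA0SLT" would additionally have to handle matches that straddle two views, for example by overlapping consecutive views by a few bytes.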

A much better and faster solution would be to do the selection during the initial file creation instead of processing the complete file afterwards. There are many ways to do that, even for software that you can't change or recompile, but that would require much more information.

PureBasic Forums - English
