@wimapon
That's the same problem I have: very large industrial production log files in CSV style to analyse. For that purpose I started a FastString and a CSV project, but both are still under development. Maybe we can work together on this!
The way to solve that problem is:
1. Read the file in blocks into memory. Check the available RAM and decide how much you can use. It is best to use 4k-aligned reads, because that is the sector size of an HDD and the page size of the x64 OS.
For universal buffer handling I wrote a module which I want to use for CSV and BigString handling:
https://github.com/Maagic7/PureBasicFra ... _Buffer.pb
2. Search in the block for your string.
At the position found, you have to search for the line start and line end (see the sketch after this list):
- to get the start of the line, search backwards from the found string for an end-of-line
- to get the end of the line, search forwards for an end-of-line
- copy the line out of the memory
Note: the end-of-line sequence may differ between Windows (CRLF) and Linux (LF)!
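To illustrate the two steps, here is a rough sketch (not taken from the module above; it assumes an ASCII file, LF line endings, a placeholder file name and search term, and it ignores lines that cross a block boundary):
Code:
EnableExplicit

#BlockSize = 1024 * 4096   ; 4 MiB, a multiple of the 4 KiB page/sector size

; Search backwards and forwards from a hit for the surrounding end-of-line characters
; and copy the line out of the buffer.
Procedure.s ExtractLineAt(*Block, BlockLen.i, ByteOffset.i)
  Protected *LineStart = *Block + ByteOffset
  Protected *LineEnd   = *Block + ByteOffset
  While *LineStart > *Block And PeekA(*LineStart - 1) <> 10      ; reverse search for LF
    *LineStart - 1
  Wend
  While *LineEnd < *Block + BlockLen And PeekA(*LineEnd) <> 10   ; forward search for LF
    *LineEnd + 1
  Wend
  ProcedureReturn PeekS(*LineStart, *LineEnd - *LineStart, #PB_Ascii)
EndProcedure

Define File = ReadFile(#PB_Any, "huge.log")          ; placeholder file name
If File
  Define *Buf = AllocateMemory(#BlockSize, #PB_Memory_NoClear)
  Define ReadLen.i, Pos.i, Block$
  While Not Eof(File)
    ReadLen = ReadData(File, *Buf, #BlockSize)       ; 1. read the file block by block
    Block$  = PeekS(*Buf, ReadLen, #PB_Ascii)        ;    (converting to a PB string is simple but slow;
    Pos     = FindString(Block$, "PA0SLT")           ;     a direct memory/SSE search is much faster)
    If Pos                                           ; 2. search the block for the term
      Debug ExtractLineAt(*Buf, ReadLen, Pos - 1)    ;    and cut out the surrounding line
    EndIf
  Wend
  FreeMemory(*Buf)
  CloseFile(File)
EndIf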
Here is the link to the SSE-String functions
This needs SSE 4.2 support
https://www.purebasic.fr/english/viewtopic.php?t=83316
https://github.com/Maagic7/PureBasicFra ... ringSSE.pb
Another FastString module, which works with standard SSE, is here:
https://github.com/Maagic7/PureBasicFra ... tString.pb
That's all the code I wrote for industrial BigString handling and GB-sized files!
But for the moment it's a low priority project for me.
Re: quick reading a file of 15 GByte.... how?
Okay folks,
I will experiment with that.
So thank you very much for your help.
It is always very nice to get help when you have a problem!
Wim
Re: quick reading a file of 15 GByte.... how?
How about reading a huge block and parsing it with RegEx? Then read the next block, and so on. The only things to do are to make sure the block ends with a line break (move the file pointer back to the last line break for the next read operation) and to test it with several block sizes.
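Something like this rough sketch, perhaps (assuming an ASCII file, LF line endings; the pattern and file name are placeholders):
Code:
EnableExplicit

#Block = 8 * 1024 * 1024   ; block size to experiment with

Define Regex = CreateRegularExpression(#PB_Any, "^.*PA0SLT.*$", #PB_RegularExpression_MultiLine)
Define File  = ReadFile(#PB_Any, "huge.log")
If Regex And File
  Define *Buf = AllocateMemory(#Block)
  Define ReadLen.i, LastEOL.i, Text$
  While Not Eof(File)
    ReadLen = ReadData(File, *Buf, #Block)
    ; find the last line break in the block ...
    LastEOL = ReadLen
    While LastEOL > 0 And PeekA(*Buf + LastEOL - 1) <> 10
      LastEOL - 1
    Wend
    If LastEOL = 0 : LastEOL = ReadLen : EndIf   ; no line break at all: take the whole block
    ; ... and move the file pointer back so the next read starts right after it
    FileSeek(File, Loc(File) - (ReadLen - LastEOL))
    ; parse the complete lines of this block with the RegEx
    Text$ = PeekS(*Buf, LastEOL, #PB_Ascii)
    If ExamineRegularExpression(Regex, Text$)
      While NextRegularExpressionMatch(Regex)
        Debug RegularExpressionMatchString(Regex)   ; one matching line per hit
      Wend
    EndIf
  Wend
  FreeMemory(*Buf)
  CloseFile(File)
EndIf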
Good morning, that's a nice tnetennba!
PureBasic 6.21/Windows 11 x64/Ryzen 7900X/32GB RAM/3TB SSD
Synology DS1821+/DX517, 130.9TB+50.8TB+2TB SSD
Re: quick reading a file of 15 GByte.... how?
It's impossible to understand! Do you need to process 2 to 10 files of 15 gigabytes each, or are you processing the same file multiple times?

wimapon wrote: Mon Mar 25, 2024 12:37 pm
It depends on the kind of processing I am doing.
Something between 2 and 10 times a day when I am processing.
And it is rather irritating to wait 2 minutes when you are busy and thinking.

You wrote that you are making a selection according to the "PA0SLT" criterion, and this is enough for you.
P.S. I am almost sure that you will inevitably come to the need to apply some kind of indexing mechanism.
P.P.S. In this case, indexing can be done in the process of data accumulation, and not before the analysis itself.
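For illustration, a very small sketch of that idea (file handles, names and the criterion are placeholders): whenever a line is appended to the big log, its byte offset is also stored in a small side index if it contains the criterion, so the later analysis only has to seek to the recorded offsets.
Code:
; Append one line to the log and, if it matches, remember its start offset in the index file.
Procedure AppendLogLine(Log, Index, Line$)
  If FindString(Line$, "PA0SLT")
    WriteQuad(Index, Loc(Log))          ; byte offset of this line in the big file
  EndIf
  WriteStringN(Log, Line$, #PB_Ascii)
EndProcedure

; During analysis: read the stored offsets, FileSeek() to each one and read only those lines.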
Dawn will come inevitably.
NicTheQuick
Re: quick reading a file of 15 GByte.... how?
I just tried something similar with GNU grep and needed 1 minute and 24 seconds to find a small string in a 19 GiB file, which means it was able to search through it at ~230 MiB/s.
That was on a quite fast PC (see my profile). Interestingly, the CPU time needed was just 2.8 seconds for searching and 3.9 seconds for kernel calls. I am not sure what took the rest of the time. Since my NVMe SSD is capable of reading 7 GiB/s, it is definitely not the bottleneck.
Anyway, GNU grep (source here) is usually quite efficient at finding stuff and outputting the line with the match. A lot of brain juice was put into that tool. So I don't think it will get much faster than this without preprocessing and indexing your data.
Another idea: what about creating a background job that is constantly waiting for this kind of file you are talking about and starts processing it? Or does the search pattern change every time? There are different search algorithms out there that can boost performance in certain circumstances. For example, if you are searching for the same thing over and over, you can use the Aho–Corasick algorithm, and if you want to search different patterns on the same data, a suffix array (built with induced sorting, SA-IS) could be helpful.
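Just for comparison from within PureBasic, a small sketch that shells out to grep and collects the matching lines (it assumes grep is on the PATH; the pattern and file name are placeholders):
Code:
Define Prog = RunProgram("grep", "PA0SLT huge.log", "", #PB_Program_Open | #PB_Program_Read)
If Prog
  While ProgramRunning(Prog) Or AvailableProgramOutput(Prog)
    If AvailableProgramOutput(Prog)
      Debug ReadProgramString(Prog)   ; one matching line per call
    EndIf
  Wend
  CloseProgram(Prog)
EndIf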
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Re: quick reading a file of 15 GByte.... how?
You can also try this:
But it depends on:
1. The file is stored in ASCII format
2. The EOL is 2 characters (CRLF)
You can increase the buffer size constant to improve speed.
Code:
EnableExplicit

#BufferSize = 65535

Define.i InFile, OutFile, ReadLen, Offset, Found, State
Define *Buffer, *BufferEnd, *Ptr.Ascii, *LastEOL
Define Filename$, Line$

Filename$ = OpenFileRequester("Choose a CSV file", "", "CSV|*.csv", 0)
If Filename$
  InFile = ReadFile(#PB_Any, Filename$)
  If InFile
    OutFile = CreateFile(#PB_Any, Filename$ + ".extract.csv")
    If OutFile
      *Buffer = AllocateMemory(#BufferSize, #PB_Memory_NoClear)
      If *Buffer
        *LastEOL = *Buffer
        While Not Eof(InFile)
          ; read the next chunk behind the bytes carried over from the previous pass
          ReadLen = ReadData(InFile, *Buffer + Offset, MemorySize(*Buffer) - Offset)
          ;Debug ReadLen
          *BufferEnd = *Buffer + Offset + ReadLen
          *Ptr = *Buffer + Offset   ; continue scanning after the carried-over bytes
          While *Ptr < *BufferEnd
            If *Ptr\a < ' '
              ; end of line: write it out if the pattern was found on it
              If Found
                WriteData(OutFile, *LastEOL, *Ptr - *LastEOL)
                WriteStringN(OutFile, "")
                Found = #False
              EndIf
              State = 0           ; the pattern matcher restarts on every new line
              *Ptr + 1            ; skip the second EOL character
              *LastEOL = *Ptr + 1
            Else
              ; simple state machine matching the characters of "PA0SLT"
              Select State
                Case 0
                  If *Ptr\a = 'P'
                    State + 1
                  EndIf
                Case 1
                  If *Ptr\a = 'A'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 2
                  If *Ptr\a = '0'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 3
                  If *Ptr\a = 'S'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 4
                  If *Ptr\a = 'L'
                    State + 1
                  Else
                    State = 0
                  EndIf
                Case 5
                  If *Ptr\a = 'T'
                    Found = #True
                  EndIf
                  State = 0
              EndSelect
            EndIf
            *Ptr + 1
          Wend
          ; move the unfinished last line to the start of the buffer
          ; (assumes no line is longer than #BufferSize)
          Offset = *Ptr - *LastEOL
          MoveMemory(*LastEOL, *Buffer, Offset)   ; MoveMemory, because the regions may overlap
          *LastEOL = *Buffer
        Wend
        FreeMemory(*Buffer)
      EndIf
      CloseFile(OutFile)
    EndIf
    CloseFile(InFile)
  EndIf
EndIf
Re: quick reading a file of 15 GByte.... how?
Here is a timing demo. If I did it right, it is a search in a 512-character = 1 KB string.
The search string is at the end of Big$.
Then do this for Loops = (1024*1024) * 30.
That should be 30 GB of string to search in total.
My Ryzen 5800 needs 492 ms to do this!
It seems too fast to me. I hope there is no bug!
Code:
EnableExplicit
Procedure.i SSE_FindStr(*String, *StringToFind)
; Attention: This function is in beta state!
; ============================================================================
; NAME: SSE_FindStr
; DESC: Try to find StringToFind in String with SSE operation (PCmpIStrI)
; DESC: Search for the needle in the haystack
; DESC: This Function is for 2Byte Character Strings only
; VAR(*String): Pointer to String (Haystack)
; VAR(*StringToFind): Pointer to StringToFind (Needle)
; RET.i: If found: the start position in characters [1..n]; otherwise 0
; ============================================================================
;DisableDebugger
; TODO! Solve the 16Byte align problem
CompilerIf #PB_Compiler_Backend = #PB_Backend_Asm
CompilerIf #PB_Compiler_64Bit
Protected memRAX, memRDX
; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1.
; The matching process does not include the terminating null-characters, but it stops there
; RAX = haystack, RDX = needle
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2]
!MOV RAX, [p.p_String] ; haystack
!MOV RDX, [p.p_StringToFind] ; needle
!MOVDQU XMM2, DQWORD[RDX] ; load the first 16 bytes of needle (string to find)
!SUB RAX, 16 ; Avoid extra jump in main loop
; ----------------------------------------------------------------------
; Find the first possible match of 16-byte fragment in haystack
; ----------------------------------------------------------------------
!FindStr_MainLoop:
!ADD RAX, 16 ; Step up Counter
!MOVDQU XMM1, DQWORD[RAX]
;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
!PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
; now RCX contains the offset in WORDS where a match was found
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(RAX) ZF=1
; 2) We find a char that does not match CF=1
!JA FindStr_MainLoop
; Jump if CF=0, we found only matching chars
!JNC FindStr_StrNotFound
; possible match found at WordOffset in RCX
!ADD RCX, RCX ; Word to Byte
!ADD RAX, RCX ; save the possible match start
!MOV [p.v_memRDX], RDX ; mov edi, edx; save RDX
!MOV [p.v_memRAX], RAX ; mov esi, eax; save RAX
; ----------------------------------------------------------------------
; Compare string, at the possible match position in the haystack, with the needle
; ----------------------------------------------------------------------
!SUB RDX, RAX
!SUB RAX, 16 ; counter
!PXOR XMM3, XMM3 ; XMM3 = 0
; compare the strings
!FindStr_Compare:
!ADD RAX, 16 ; Counter
!MOVDQU XMM1, DQWORD[RAX+RDX] ; Haystack
; mask out invalid bytes in the haystack
;!PCMPISTRM XMM3, XMM1, 1011000b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; for ASCII Strings
!PCMPISTRM XMM3, XMM1, 1011001b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
!MOVDQU XMM4, DQWORD[RAX] ; haystack
!PAND XMM4, XMM0
;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
!PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(RDX+RCX) ZF=1 {JA CF=0 & ZF=0} {JE : ZF=1)
; 2) We find a char that does not match CF=1 {JC, JNC}
; 3) We find a null in s2 SF=1 {JS, JNS}
;!JS FindStr_StrNotFound
!JA FindStr_Compare ; CF=0 AND ZF=0
!MOV RDX, [p.v_memRDX]
!MOV RAX, [p.v_memRAX]
!JNC FindStr_StrFound
;!SUB RAX, 15 ; for ASCII Strings
!SUB RAX, 14
!JMP FindStr_MainLoop
!FindStr_StrNotFound:
!XOR RAX, RAX
!JMP FindStr_End
!FindStr_StrFound:
; because RAX contains the Pointer we have to calculate the Char-No.
!SUB RAX, [p.p_String] ; Sub the Haystack Start-Pointer
!SHR RAX, 1 ; Byte to Word: not needed for ASCII Strings
!ADD RAX, 1 ; Add 1 to start with 1 as first Char-No.
!FindStr_End:
ProcedureReturn ; !RAX
CompilerElse ; #PB_Compiler_32Bit
Protected memEAX, memEDX
; Returns a pointer To the first occurrence of str2 in str1, Or a null pointer If str2 is Not part of str1.
; The matching process does not include the terminating null-characters, but it stops there
; EAX = haystack, EDX = needle
; XMM0 XMM1 XMM2 XMM3 XMM4
; XMM1 = [String1] : XMM2=[String2]
!MOV EAX, [p.p_String] ; haystack
!MOV EDX, [p.p_StringToFind] ; needle
!MOVDQU XMM2, DQWORD[EDX] ; load the first 16 bytes of needle (string to find)
!SUB EAX, 16 ; Avoid extra jump in main loop
; ----------------------------------------------------------------------
; Find the first possible match of 16-byte fragment in haystack
; ----------------------------------------------------------------------
!FindStr_MainLoop:
!ADD EAX, 16 ; Step up Counter
!MOVDQU XMM1, DQWORD[EAX]
;!PCMPISTRI XMM2, XMM1, 1100b ; EQUAL_ORDERED ; for ASCII Strings
!PCMPISTRI XMM2, XMM1, 1101b ; EQUAL_ORDERED + UNSIGNED_WORDS; 11001b
; now ECX contains the offset in WORDS where a match was found
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(EAX) ZF=1
; 2) We find a char that does not match CF=1
!JA FindStr_MainLoop
; Jump if CF=0, we found only matching chars
!JNC FindStr_StrNotFound
; possible match found at WordOffset in ECX
!ADD ECX, ECX ; Word to Byte
!ADD EAX, ECX ; save the possible match start
!MOV [p.v_memEDX], EDX ; mov edi, edx; save EDX
!MOV [p.v_memEAX], EAX ; mov esi, eax; save EAX
; ----------------------------------------------------------------------
; Compare string, at the possible match position in the haystack, with the needle
; ----------------------------------------------------------------------
!SUB EDX, EAX
!SUB EAX, 16 ; counter
!PXOR XMM3, XMM3 ; XMM3 = 0
; compare the strings
!FindStr_Compare:
!ADD EAX, 16 ; Counter
!MOVDQU XMM1, DQWORD[EAX+EDX] ; Haystack
; mask out invalid bytes in the haystack
;!PCMPISTRM XMM3, XMM1, 1011000b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; for ASCII Strings
!PCMPISTRM XMM3, XMM1, 1011001b ; EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK + UNSIGNED_WORDS
; PCMPISTRM writes as result a Mask To XMM0, we used BYTE_MASK
!MOVDQU XMM4, DQWORD[EAX] ; haystack
!PAND XMM4, XMM0
;!PCMPISTRI XMM1, XMM4, 0011000b ; EQUAL_EACH + NEGATIVE_POLARITY ; for ASCII Strings
!PCMPISTRI XMM1, XMM4, 0011001b ; EQUAL_EACH + NEGATIVE_POLARITY + UNSIGNED_WORDS
; Loop while ZF=0 and CF=0:
; 1) We find a null in s1(EDX+ECX) ZF=1 {JA CF=0 & ZF=0} {JE : ZF=1)
; 2) We find a char that does not match CF=1 {JC, JNC}
; 3) We find a null in s2 SF=1 {JS, JNS}
;!JS FindStr_StrNotFound
!JA FindStr_Compare ; CF=0 AND ZF=0
!MOV EDX, [p.v_memEDX]
!MOV EAX, [p.v_memEAX]
!JNC FindStr_StrFound
;!SUB EAX, 15 ; for ASCII Strings
!SUB EAX, 14
!JMP FindStr_MainLoop
!FindStr_StrNotFound:
!XOR EAX, EAX
!JMP FindStr_End
!FindStr_StrFound:
; because EAX contains the Pointer we have to calculate the Char-No.
!SUB EAX, [p.p_String] ; Sub the Haystack Start-Pointer
!SHR EAX, 1 ; Byte to Word: not needed for ASCII Strings
!ADD EAX, 1 ; Add 1 to start with 1 as first Char-No.
!FindStr_End:
ProcedureReturn
CompilerEndIf ; #PB_Compiler_32Bit
CompilerElse ; C-Backend
; for now use PB FindString. So it will work on other Platforms too.
; maybe provide a C optimized version in the future
Protected *pStr.String = *String
Protected *pStrToFind.String = *StringToFind
ProcedureReturn FindString(*pStr\s, *pStrToFind\s)
CompilerEndIf
EndProcedure
Define BigSize = 512 ; 512 Char = 1KB
Define Big$=Space(BigSize)
Define Search$ = "search"
Define LB = Len(Search$)*2
Debug "Len(Big$)= " + Str (Len(Big$))
CopyMemory(@Search$, @Big$ + BigSize-LB-10 , LB)
Debug Search$
Debug "---- Big ----"
Debug Right(Big$, 50)
Debug "-------------"
Define pos
pos = SSE_FindStr(@Big$, @Search$)
Debug "found " + Search$ + " at Character " + pos
; do the timing test only without Debugger
CompilerIf Not #PB_Compiler_Debugger
#Loops = (1024*1024) * 30 ; 1024*1024 => 1 GB * 30 => 30GB to search
Define I, t1
t1 = ElapsedMilliseconds()
For I = 1 To #Loops
pos = SSE_FindStr(@Big$, @Search$)
Next
t1 = ElapsedMilliseconds() - t1
Define Msg$
Msg$ = Str(t1) + " " + "ms"
MessageRequester("Timing", Msg$)
CompilerEndIf
Re: quick reading a file of 15 GByte.... how?
Maybe somebody else has the time to do a PureBasic example for doing virtual file mapping on Windows. If it is done right, that is probably unbeatable because it uses the MMU.
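For illustration, a rough sketch of how such a read-only mapping could look with the Windows API (x64 only, minimal error handling, placeholder file name):
Code:
EnableExplicit

Define File$ = "huge.log"
Define hFile = CreateFile_(File$, #GENERIC_READ, #FILE_SHARE_READ, 0, #OPEN_EXISTING, 0, 0)
If hFile <> #INVALID_HANDLE_VALUE
  Define hMap = CreateFileMapping_(hFile, 0, #PAGE_READONLY, 0, 0, 0)   ; map the whole file
  If hMap
    Define *View = MapViewOfFile_(hMap, #FILE_MAP_READ, 0, 0, 0)        ; one big read-only view
    If *View
      ; *View now points at the complete file content; the OS pages it in on demand,
      ; so it can be scanned with the same pointer/SSE search routines as a RAM buffer.
      Debug PeekS(*View, 80, #PB_Ascii)   ; e.g. peek at the first 80 bytes
      UnmapViewOfFile_(*View)
    EndIf
    CloseHandle_(hMap)
  EndIf
  CloseHandle_(hFile)
EndIf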
A much better and faster solution would be to do the work during the initial file creation instead of processing the complete file afterwards. There are many ways to do that, even for software that you can't change or recompile, but that would require much more information.