ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 2:17 pm
by danny88
Hi
I have a folder with thousands of html files that I need to read and process.
The script runs well and fast at first, then the performance suddenly drops and it becomes slower and slower.
I suspect the ReadFile() function but I'm not sure.
Here is my code :
Code: Select all
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

OpenConsole()
text.s = ""
cin.s = ""
If ExamineDirectory(0, "FICHES\", "*") ; folder containing subfolders
  ; ignore "." and ".."
  NextDirectoryEntry(0)
  NextDirectoryEntry(0)
  While NextDirectoryEntry(0)
    subfolder.s = "FICHES\" + DirectoryEntryName(0) ; subfolders containing 10k html files each
    If ExamineDirectory(1, subfolder, "*.htm")
      While NextDirectoryEntry(1)
        counter + 1
        text = ""
        ReadFile(2, subfolder + "\" + DirectoryEntryName(1))
        text = ReadString(2, #PB_File_IgnoreEOL)
        CloseFile(2)
        cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
        PrintN(Str(counter) + " : " + cin)
      Wend
      FinishDirectory(1)
      Delay(1000)
    EndIf
  Wend
EndIf
Input()
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 2:23 pm
by NicTheQuick
It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 2:26 pm
by danny88
As for the Delay(1000) after each subfolder, I thought it would help clear any cache, as you said.
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 2:36 pm
by danny88
You are right. I was working on an external SSD.
I've just tried on an internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 3:01 pm
by Demivec
What follows is not related directly to your question concerning the use of ReadFile() but does pertain to the time for execution in the ExtractString() procedure.
In the procedure you perform two identical searches for both the prefix and suffix of the string you want to extract. You can cut this time, possibly in half, by moving the assignment of the results of these searches to before they are used in the initial conditional statement.
As it appears in your first post:
Code: Select all
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure
After implementing the suggested change:
Code: Select all
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  prefix_pos = FindString(str , prefix)
  If prefix_pos = 0
    ProcedureReturn ""
  Else
    suffix_pos = FindString(str, suffix, prefix_pos)
    If suffix_pos = 0
      ProcedureReturn ""
    EndIf
  EndIf
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 3:30 pm
by Axolotl
Hi,
I was looking at your code, and would suggest some changes...
Maybe this ends in some speed improvements as well.
BTW:
1. I like to use constants instead of numbers or strings (especially if I use them more than once).
2. I always check the function results.
3. My ExtractString() implementation is similar to Demivec's, for the same reasons (I guess).
Code: Select all
EnableExplicit

#RootPath$ = "FICHES\"
#FileMask$ = "*.htm"
; #RootPath$ = "C:\Temp\"
; #FileMask$ = "*.txt"

#DIR_Folders = 0
#DIR_Files = 1
#FILE_Input = 2

; Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
;   If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
;     ProcedureReturn ""
;   EndIf
;   prefix_pos = FindString(str , prefix)
;   suffix_pos = FindString(str, suffix, prefix_pos)
;   ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
; EndProcedure
;
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  Protected prefix_pos, suffix_pos

  prefix_pos = FindString(str, prefix)
  If prefix_pos > 0
    suffix_pos = FindString(str, suffix, prefix_pos)
    If suffix_pos > 0
      ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
    ;Else : Debug "No Suffix found"
    EndIf
  ;Else : Debug "No Prefix found"
  EndIf
  ProcedureReturn ""
EndProcedure

Procedure Main()
  Protected counter
  Protected text.s, cin.s, subfolder.s

  If OpenConsole()
    PrintN("Start App .... ")
    If ExamineDirectory(#DIR_Folders, #RootPath$, "*") ; folder containing subfolders
      ; ; ignore "." and ".."
      ; NextDirectoryEntry(0)
      ; NextDirectoryEntry(0)
      PrintN("Folder '" + #RootPath$ + "'")
      While NextDirectoryEntry(#DIR_Folders)
        If DirectoryEntryType(#DIR_Folders) = #PB_DirectoryEntry_Directory
          If DirectoryEntryName(#DIR_Folders) = "." Or DirectoryEntryName(#DIR_Folders) = ".."
            Continue ; skip these
          EndIf
          subfolder = #RootPath$ + DirectoryEntryName(#DIR_Folders) ; subfolders containing 10k html files each
          PrintN("Sub Folder '" + subfolder + "'")
          If ExamineDirectory(#DIR_Files, subfolder, #FileMask$)
            While NextDirectoryEntry(#DIR_Files)
              If DirectoryEntryType(#DIR_Files) = #PB_DirectoryEntry_File
                PrintN(" File '" + subfolder + "\" + DirectoryEntryName(#DIR_Files) + "'")
                counter + 1
                text = ""
                If ReadFile(#FILE_Input, subfolder + "\" + DirectoryEntryName(#DIR_Files))
                  text = ReadString(#FILE_Input, #PB_File_IgnoreEOL)
                  CloseFile(#FILE_Input)
                  cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
                  PrintN(Str(counter) + " : " + cin)
                ;Else
                ;  PrintN("Error: Cannot read '" + DirectoryEntryName(#DIR_Files) + "' in '" + subfolder + "'")
                EndIf
              EndIf ; DirectoryEntryType()
            Wend
            FinishDirectory(#DIR_Files)
            ;Delay(1000)
          EndIf
        EndIf ; DirectoryEntryType()
      Wend ; NextDirectoryEntry()
      FinishDirectory(#DIR_Folders)
      PrintN("Done ")
    EndIf
    PrintN("Enter to exit.")
    Input()
  Else
    Debug "OpenConsole failed. "
  EndIf ;
  ProcedureReturn 0
EndProcedure

End Main()
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 3:30 pm
by NicTheQuick
How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 4:08 pm
by danny88
USB type C
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 4:12 pm
by NicTheQuick
That's just the connector type and not the protocol. USB-C can be everything from USB 1.0 to Thunderbolt 5.
But after thinking about it a bit longer, I don't think that's your issue here. I just think your SSD doesn't have a lot of cache.
Re: ReadFile() in a loop becomes slower and slower
Posted: Mon Nov 11, 2024 7:01 pm
by AZJIO
Axolotl wrote: Mon Nov 11, 2024 3:30 pm
FindString() function returns 0 or a positive number
Re: ReadFile() in a loop becomes slower and slower
Posted: Tue Nov 12, 2024 11:03 am
by SMaag
If you deal with 10k strings which might be long, I see some string issues too.
Code: Select all
; here you copy the complete file into a String : text
text = ReadString(2, #PB_File_IgnoreEOL)
; here at the function call PB copies the complete text into a new virtual var, which is passed to ExtractString
cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")

; Update: I did not read the post from Axolotl first. His version is the better solution!
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  ; FindString starts at each call from the beginning of the String. If you have a lot of long strings this needs time
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  ; here 2 more FindString
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure
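To speed up string operations you can use a pointer for ExtractString, and eliminate 2 FindString operations
Code: Select all
; *str points to a String structure, so the file text is not copied on each call.
; Caller side: declare text.String, fill text\s = ReadString(...), then call ExtractString(@text, ...)
Procedure.s ExtractString(*str.String, prefix.s, suffix.s) ; function to extract a text between two delimiters
  Protected prefix_pos, suffix_pos
  prefix_pos = FindString(*str\s, prefix)
  If prefix_pos
    suffix_pos = FindString(*str\s, suffix, prefix_pos) ; search for the suffix only after the prefix
  EndIf
  If prefix_pos And suffix_pos
    ProcedureReturn Mid(*str\s, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
  Else
    ProcedureReturn ""
  EndIf
EndProcedure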
You can eliminate the Len(prefix) in the Mid() function too, because Len(prefix) is calculated 10k*2 times and the result is always the same.
Pass the Len_prefix as extra Parameter to ExtractString()
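A minimal sketch of that idea, assuming the caller computes the prefix length once before the file loop. The names ExtractStringLen and prefix_len and the sample HTML line are made up for illustration, and the gain is probably small compared with the file I/O:

```purebasic
EnableExplicit

; Hypothetical variant: the caller passes Len(prefix), so it is not recomputed on every call.
Procedure.s ExtractStringLen(str.s, prefix.s, prefix_len, suffix.s)
  Protected prefix_pos, suffix_pos
  prefix_pos = FindString(str, prefix)
  If prefix_pos
    ; start looking for the suffix right after the prefix
    suffix_pos = FindString(str, suffix, prefix_pos + prefix_len)
    If suffix_pos
      ProcedureReturn Mid(str, prefix_pos + prefix_len, suffix_pos - prefix_pos - prefix_len)
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure

Define prefix.s = ~"cin\" value=\""
Define prefix_len = Len(prefix)   ; computed once, outside the file loop
Define sample.s = ~"<input name=\"cin\" value=\"AB123\" id=\"x\">"   ; made-up test line

Debug ExtractStringLen(sample, prefix, prefix_len, ~"\" id=")   ; shows AB123
```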
But I guess the bottleneck is to read the directory
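One way to check that guess is to time the file reads and the string extraction separately. The following is only a sketch: the subfolder name "FICHES\SUB1" is a placeholder, and ElapsedMilliseconds() has only millisecond resolution, so very fast operations may add up to less than their real cost.

```purebasic
EnableExplicit

Procedure.s ExtractString(str.s, prefix.s, suffix.s)
  Protected prefix_pos, suffix_pos
  prefix_pos = FindString(str, prefix)
  If prefix_pos
    suffix_pos = FindString(str, suffix, prefix_pos)
    If suffix_pos
      ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure

Define folder.s = "FICHES\SUB1"   ; placeholder subfolder name
Define text.s, cin.s
Define t0.q, read_ms.q, extract_ms.q, total_start.q, count

total_start = ElapsedMilliseconds()
If ExamineDirectory(0, folder, "*.htm")
  While NextDirectoryEntry(0)
    If DirectoryEntryType(0) = #PB_DirectoryEntry_File
      t0 = ElapsedMilliseconds()
      If ReadFile(1, folder + "\" + DirectoryEntryName(0))
        text = ReadString(1, #PB_File_IgnoreEOL)
        CloseFile(1)
        read_ms + ElapsedMilliseconds() - t0      ; time spent opening and reading

        t0 = ElapsedMilliseconds()
        cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
        extract_ms + ElapsedMilliseconds() - t0   ; time spent in string processing
        count + 1
      EndIf
    EndIf
  Wend
  FinishDirectory(0)
EndIf

Debug "Files: " + Str(count)
Debug "Read (ms): " + Str(read_ms)
Debug "Extract (ms): " + Str(extract_ms)
Debug "Total incl. directory scan (ms): " + Str(ElapsedMilliseconds() - total_start)
```

If the read time dominates and grows from one subfolder to the next while the extract time stays flat, the slowdown is on the storage side rather than in ExtractString().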
Re: ReadFile() in a loop becomes slower and slower
Posted: Tue Nov 12, 2024 11:15 am
by SMaag
I have a question about the ~
ExtractString(text, ~"cin\" value=\"", ~"\" id=")
what is the function of ~
I have never seen that. And I could not find it in the PB-Help
if I do a debug of it
Code: Select all
Procedure.s ExtractString(str.s, prefix.s, suffix.s) ; function to extract a text between two delimiters
  Debug prefix
  Debug suffix
EndProcedure

text.s ="Text"
cin$ = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
i get:
cin" value="
" id=
Re: ReadFile() in a loop becomes slower and slower
Posted: Tue Nov 12, 2024 11:56 am
by NicTheQuick
With ~ you can use escaped characters inside a string.
It is mentioned in the help but I cannot find it either right now. In this regard the help is not very useful. Some pages seem hidden and the search function has always been very bad.
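A small sketch of what the ~ prefix does, using the prefix string from this thread. The variable names are only for illustration:

```purebasic
EnableExplicit

; With the ~ prefix, \" inside the literal stands for a double quote character.
Define escaped.s = ~"cin\" value=\""
Define manual.s  = "cin" + Chr(34) + " value=" + Chr(34)   ; same text built without escapes

Debug escaped
If escaped = manual
  Debug "Both strings are identical"
EndIf

Debug ~"Line 1\nLine 2"   ; \n inserts a line feed
Debug ~"Tab\tseparated"    ; \t inserts a tab
```

Without the ~ prefix a backslash is an ordinary character, which is why paths like "FICHES\" work as plain strings.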
Re: ReadFile() in a loop becomes slower and slower
Posted: Tue Nov 12, 2024 1:13 pm
by HeX0R