ReadFile() in a loop becomes slower and slower

Post by danny88 »

Hi

I have a folder with thousands of HTML files that I need to read and process.

The script runs well and fast at first, then the performance suddenly drops and it gets slower and slower.
I suspect the ReadFile() function, but I'm not sure.

Here is my code:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

OpenConsole()
text.s = ""
cin.s = ""

If ExamineDirectory(0, "FICHES\", "*")	; folder containing subfolders 
  ; ignore "." and ".."
  NextDirectoryEntry(0)
  NextDirectoryEntry(0)
  
  While NextDirectoryEntry(0)
    subfolder.s = "FICHES\" + DirectoryEntryName(0)	; subfolders containing 10k html files each
    
    If ExamineDirectory(1, subfolder, "*.htm")
      While NextDirectoryEntry(1)
        counter + 1
        text = ""
        ReadFile(2, subfolder + "\" + DirectoryEntryName(1))
        text = ReadString(2, #PB_File_IgnoreEOL)
        CloseFile(2)
        cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
        PrintN(Str(counter) + " : " + cin)
      Wend
      FinishDirectory(1)
      Delay(1000)
    EndIf
  Wend
EndIf

Input()

Post by NicTheQuick »

It's most likely a cache: either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw., why do you have a `Delay(1000)` in there?
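
One way to check whether the slowdown really comes from reading the files is to time each subfolder separately and compare the numbers. The following is only a minimal sketch, not code from this thread: the helper `TimeFolder()` and the subfolder name `FICHES\0001` are hypothetical, and the time it reports includes the directory enumeration as well as the reads.

```purebasic
EnableExplicit

; Hypothetical helper, not from the thread: reads every matching file in one folder
; and returns the elapsed milliseconds, or -1 if the folder cannot be examined.
Procedure.q TimeFolder(folder.s, mask.s)
  Protected start.q = ElapsedMilliseconds()
  Protected text.s

  If ExamineDirectory(0, folder, mask) = 0
    ProcedureReturn -1
  EndIf

  While NextDirectoryEntry(0)
    If DirectoryEntryType(0) = #PB_DirectoryEntry_File
      ; Directory and file objects use separate ID ranges, so 0 can be used for both.
      If ReadFile(0, folder + "\" + DirectoryEntryName(0))
        text = ReadString(0, #PB_File_IgnoreEOL)
        CloseFile(0)
      EndIf
    EndIf
  Wend
  FinishDirectory(0)

  ProcedureReturn ElapsedMilliseconds() - start
EndProcedure

; "FICHES\0001" is a placeholder subfolder name.
Debug TimeFolder("FICHES\0001", "*.htm")
```

If the time per folder grows while the file count stays about the same, the storage side is the more likely culprit than the string processing.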
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.

Post by danny88 »

NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely a cache: either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw., why do you have a `Delay(1000)` in there?
As for the Delay(1000) after each subfolder, I thought it would help clear any cache, as you said.

Post by danny88 »

NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely a cache: either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw., why do you have a `Delay(1000)` in there?
You are right. I was working on an external SSD.

I've just tried on an internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?

Post by Demivec »

What follows is not related directly to your question concerning the use of ReadFile() but does pertain to the time for execution in the ExtractString() procedure.

In the procedure you perform two identical searches for both the prefix and suffix of the string you want to extract. You can cut this time, possibly in half, by moving the assignment of the results of these searches to before they are used in the initial conditional statement.

As it appears in your first post:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure
After implementing the suggested change:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  prefix_pos = FindString(str , prefix)
  If prefix_pos = 0
     ProcedureReturn ""
  Else
     suffix_pos = FindString(str, suffix, prefix_pos)
     If suffix_pos = 0 
       ProcedureReturn ""
     EndIf
  EndIf
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

Post by Axolotl »

Hi,
I was looking at your code and would suggest some changes.
Maybe this will bring some speed improvements as well.
BTW:
1. I like to use constants instead of numbers or strings (especially if I use them more than once).
2. I always check the functions' results.
3. My ExtractString() implementation is similar to Demivec's, for the same reasons (I guess).

Code: Select all

EnableExplicit 

#RootPath$ = "FICHES\" 
#FileMask$ = "*.htm"
; #RootPath$ = "C:\Temp\" 
; #FileMask$ = "*.txt"

#DIR_Folders  = 0 
#DIR_Files    = 1 
#FILE_Input   = 2 


; Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
;   If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
;     ProcedureReturn ""
;   EndIf
;   prefix_pos = FindString(str , prefix)
;   suffix_pos = FindString(str, suffix, prefix_pos)
;   ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
; EndProcedure
; 

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  Protected prefix_pos, suffix_pos  

  prefix_pos = FindString(str, prefix) 
  If prefix_pos > 0 
    suffix_pos = FindString(str, suffix, prefix_pos) 
    If suffix_pos > 0 
      ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
   ;Else : Debug "No Suffix found" 
    EndIf 
 ;Else : Debug "No Prefix found" 
  EndIf 
  ProcedureReturn "" 
EndProcedure 

Procedure Main() 
  Protected counter 
  Protected text.s, cin.s, subfolder.s 
  
  If OpenConsole()  

    PrintN("Start App .... ") 

    If ExamineDirectory(#DIR_Folders, #RootPath$, "*")	; folder containing subfolders 
;   ; ignore "." and ".."
;   NextDirectoryEntry(0)
;   NextDirectoryEntry(0)

      PrintN("Folder '" + #RootPath$ + "'") 
      While NextDirectoryEntry(#DIR_Folders)
        If DirectoryEntryType(#DIR_Folders) = #PB_DirectoryEntry_Directory 
          If DirectoryEntryName(#DIR_Folders) = "." Or DirectoryEntryName(#DIR_Folders) = ".." 
            Continue  ; skip these 
          EndIf 
  
          subfolder = #RootPath$ + DirectoryEntryName(#DIR_Folders)	; subfolders containing 10k html files each
          PrintN("Sub Folder '" + subfolder + "'") 

          If ExamineDirectory(#DIR_Files, subfolder, #FileMask$) 
            While NextDirectoryEntry(#DIR_Files)
              If DirectoryEntryType(#DIR_Files) = #PB_DirectoryEntry_File 
                PrintN("  File '" + subfolder + "\" + DirectoryEntryName(#DIR_Files) + "'") 

                counter + 1
                text = ""
                If ReadFile(#FILE_Input, subfolder + "\" + DirectoryEntryName(#DIR_Files))
                  text = ReadString(#FILE_Input, #PB_File_IgnoreEOL)
                  CloseFile(#FILE_Input)
                  cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
                  PrintN(Str(counter) + " : " + cin)
               ;Else 
               ;  PrintN("Error: Cannot read '" + DirectoryEntryName(#DIR_Files) + "' in '" + subfolder + "'") 
                EndIf 

              EndIf ; DirectoryEntryType() 
            Wend 
            FinishDirectory(#DIR_Files)
           ;Delay(1000)
          EndIf

        EndIf ; DirectoryEntryType() 
      Wend ; NextDirectoryEntry() 
      FinishDirectory(#DIR_Folders) 

      PrintN("Done ") 
    EndIf

    PrintN("Enter to exit.") 
    Input()

  Else 
    Debug "OpenConsole failed. " 
  EndIf ; 
  ProcedureReturn 0 
EndProcedure 

End Main() 
Just because it worked doesn't mean it works.
PureBasic 6.04 (x86) and <latest stable version and current alpha/beta> (x64) on Windows 11 Home. Now started with Linux (VM: Ubuntu 22.04).

Post by NicTheQuick »

danny88 wrote: Mon Nov 11, 2024 2:36 pm
You are right. I was working on an external SSD.

I've just tried on an internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?
How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.

Post by danny88 »

NicTheQuick wrote: Mon Nov 11, 2024 3:30 pm
How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.
USB type C

Post by NicTheQuick »

danny88 wrote: Mon Nov 11, 2024 4:08 pm
NicTheQuick wrote: Mon Nov 11, 2024 3:30 pm How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.
USB type C
That's just the connector type and not the protocol. USB-C can be everything from USB 1.0 to Thunderbolt 5. ;)

But after thinking a bit longer about it, I don't think that's your issue here. I just think your SSD doesn't have much cache.

Post by AZJIO »

Axolotl wrote: Mon Nov 11, 2024 3:30 pm

Code: Select all

  If prefix_pos > 0

This can be shortened to:

Code: Select all

  If prefix_pos

The FindString() function returns 0 or a positive number.

Post by SMaag »

If you deal with 10k strings, which might be long, I see some string issues too.

Code: Select all


; here you copy the complete file into a string: text
text = ReadString(2, #PB_File_IgnoreEOL)

; here, at the function call, PB copies the complete text into a new temporary variable, which is passed to ExtractString
cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")

; Update: I did not read Axolotl's post first. His version is the better solution!
Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  ; FindString starts from the beginning of the string at each call. If you have a lot of long strings, this takes time
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  
  ; here are 2 more FindString calls
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

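To speed up the string operations, you can pass a pointer to ExtractString() and eliminate 2 FindString() calls:

Code: Select all

Procedure.s ExtractString(*str.String, prefix.s, suffix.s)	; function to extract a text between two delimiters
  ; *str points to a String structure, so the text is not copied at the call.
  ; The caller then needs something like: Define text.String : text\s = ReadString(...) : ExtractString(@text, ...)
  Protected prefix_pos, suffix_pos

  prefix_pos = FindString(*str\s, prefix)
  If prefix_pos
    ; search for the suffix only after the prefix, so an earlier occurrence of the suffix is ignored
    suffix_pos = FindString(*str\s, suffix, prefix_pos + Len(prefix))
    If suffix_pos
      ProcedureReturn Mid(*str\s, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure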

You can eliminate the Len(prefix) in the Mid() function too, because you calculate Len(prefix) 10k*2 times and the result is always the same.
Pass Len_prefix as an extra parameter to ExtractString().
But I guess the bottleneck is reading the directory.
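
The idea of passing the prefix length as an extra parameter could look roughly like the sketch below. It is only an illustration under assumptions: the name `ExtractBetween()`, the parameter `prefix_len` and the sample HTML line are made up here, and the saving is probably small compared with the file I/O.

```purebasic
EnableExplicit

; Hypothetical variant of ExtractString(): the caller computes Len(prefix) once
; and passes it in, so it is not recalculated for every file.
Procedure.s ExtractBetween(str.s, prefix.s, prefix_len, suffix.s)
  Protected start_pos, suffix_pos

  start_pos = FindString(str, prefix)
  If start_pos
    start_pos + prefix_len                          ; first character after the prefix
    suffix_pos = FindString(str, suffix, start_pos) ; search only behind the prefix
    If suffix_pos
      ProcedureReturn Mid(str, start_pos, suffix_pos - start_pos)
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure

Define prefix.s = ~"cin\" value=\""
Define suffix.s = ~"\" id="
Define prefix_len = Len(prefix)                     ; computed once, outside the loop

; Made-up sample line standing in for the content of one HTML file.
Define sample.s = ~"<input name=\"cin\" value=\"AB123\" id=\"cin\">"
Debug ExtractBetween(sample, prefix, prefix_len, suffix)   ; AB123
```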

Post by SMaag »

I have a question about the ~ in
ExtractString(text, ~"cin\" value=\"", ~"\" id=")

What is the function of ~?
I have never seen that, and I could not find it in the PB help.

If I debug it:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  Debug prefix
  Debug suffix
EndProcedure

text.s ="Text"
cin$ = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
I get:
cin" value="
" id=

Post by NicTheQuick »

With ~ you can use escaped characters inside a string.
It is mentioned in the help, but I can't find it right now either. In this regard the help is not very useful. Some pages seem hidden, and the search function has always been very bad.
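
A small sketch of what the ~ prefix does: inside an escaped literal, `\"` stands for a double quote, so the search pattern from the first post can be written without Chr(34). The variable names are made up for this illustration.

```purebasic
EnableExplicit

; The same string written twice: once as an escaped literal, once with Chr(34).
Define escaped.s = ~"cin\" value=\""
Define classic.s = "cin" + Chr(34) + " value=" + Chr(34)

If escaped = classic
  Debug "Both forms produce: " + escaped
EndIf

; Other escape sequences work the same way, for example \t (tab) and \\ (backslash).
Debug ~"C:\\Temp\tfile.htm"
```

Without the ~ prefix a backslash is an ordinary character, which is why the Windows paths elsewhere in this thread (such as "FICHES\") need no escaping.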