ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 2:17 pm
by danny88
Hi

I have a folder with thousands of html files that I need to read and process.

The script runs well and fast at first, then performance suddenly drops and it gets slower and slower.
I suspect the ReadFile() function but I'm not sure.

Here is my code :

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

OpenConsole()
text.s = ""
cin.s = ""

If ExamineDirectory(0, "FICHES\", "*")	; folder containing subfolders 
  ; ignore "." and ".."
  NextDirectoryEntry(0)
  NextDirectoryEntry(0)
  
  While NextDirectoryEntry(0)
    subfolder.s = "FICHES\" + DirectoryEntryName(0)	; subfolders containing 10k html files each
    
    If ExamineDirectory(1, subfolder, "*.htm")
      While NextDirectoryEntry(1)
        counter + 1
        text = ""
        ReadFile(2, subfolder + "\" + DirectoryEntryName(1))
        text = ReadString(2, #PB_File_IgnoreEOL)
        CloseFile(2)
        cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
        PrintN(Str(counter) + " : " + cin)
      Wend
      FinishDirectory(1)
      Delay(1000)
    EndIf
  Wend
EndIf

Input()


Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 2:23 pm
by NicTheQuick
It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
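
One way to check whether the reads or the string search dominate is to time them separately over a single subfolder. The following is only a minimal sketch: the folder name `FICHES\sample` is a hypothetical placeholder, and because ElapsedMilliseconds() has millisecond resolution, the per-file sums are rough.

```purebasic
EnableExplicit

Define folder$ = "FICHES\sample"   ; hypothetical subfolder, adjust to your data
Define text.s, t0.q, readMs.q, searchMs.q, files, pos

OpenConsole()
If ExamineDirectory(0, folder$, "*.htm")
  While NextDirectoryEntry(0)
    If DirectoryEntryType(0) = #PB_DirectoryEntry_File
      ; time spent opening and reading the file
      t0 = ElapsedMilliseconds()
      If ReadFile(1, folder$ + "\" + DirectoryEntryName(0))
        text = ReadString(1, #PB_File_IgnoreEOL)
        CloseFile(1)
        readMs + ElapsedMilliseconds() - t0

        ; time spent searching the text
        t0 = ElapsedMilliseconds()
        pos = FindString(text, ~"cin\" value=\"")
        searchMs + ElapsedMilliseconds() - t0
        files + 1
      EndIf
    EndIf
  Wend
  FinishDirectory(0)
EndIf

PrintN(Str(files) + " files, read: " + Str(readMs) + " ms, search: " + Str(searchMs) + " ms")
Input()
```

If the read total grows from one subfolder to the next while the search total stays flat, that points at the drive or its cache rather than at the string code.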

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 2:26 pm
by danny88
NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
As for the Delay(1000) after each subfolder, I thought it would help clear any cache, as you said.

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 2:36 pm
by danny88
NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
You are right. I was working on an external SSD.

I've just tried it on the internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 3:01 pm
by Demivec
What follows is not related directly to your question concerning the use of ReadFile() but does pertain to the time for execution in the ExtractString() procedure.

In the procedure you search for both the prefix and the suffix twice. You can cut this time, possibly in half, by assigning the results of the searches to variables first and then using those variables in the conditional statement.

As it appears in your first post:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

After implementing the suggested change:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  prefix_pos = FindString(str , prefix)
  If prefix_pos = 0
     ProcedureReturn ""
  Else
     suffix_pos = FindString(str, suffix, prefix_pos)
     If suffix_pos = 0 
       ProcedureReturn ""
     EndIf
  EndIf
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 3:30 pm
by Axolotl
Hi,
I was looking at your code, and would suggest some changes...
Maybe this ends in some speed improvements as well.
BTW: 1. I like to use constants instead of numbers or strings (especially if I use them more than once).
2. I always check the functions' results.
3. My ExtractString() implementation is similar to Demivec's, for the same reasons (I guess).

Code: Select all

EnableExplicit 

#RootPath$ = "FICHES\" 
#FileMask$ = "*.htm"
; #RootPath$ = "C:\Temp\" 
; #FileMask$ = "*.txt"

#DIR_Folders  = 0 
#DIR_Files    = 1 
#FILE_In      = 2 


; Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
;   If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
;     ProcedureReturn ""
;   EndIf
;   prefix_pos = FindString(str , prefix)
;   suffix_pos = FindString(str, suffix, prefix_pos)
;   ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
; EndProcedure
; 

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  Protected prefix_pos, suffix_pos  

  prefix_pos = FindString(str, prefix) 
  If prefix_pos > 0 
    suffix_pos = FindString(str, suffix, prefix_pos) 
    If suffix_pos > 0 
      ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
   ;Else : Debug "No Suffix found" 
    EndIf 
 ;Else : Debug "No Prefix found" 
  EndIf 
  ProcedureReturn "" 
EndProcedure 

Procedure Main() 
  Protected counter 
  Protected text.s, cin.s, subfolder.s 
  
  If OpenConsole()  

    PrintN("Start App .... ") 

    If ExamineDirectory(#DIR_Folders, #RootPath$, "*")	; folder containing subfolders 
;   ; ignore "." and ".."
;   NextDirectoryEntry(0)
;   NextDirectoryEntry(0)

      PrintN("Folder '" + #RootPath$ + "'") 
      While NextDirectoryEntry(#DIR_Folders)
        If DirectoryEntryType(#DIR_Folders) = #PB_DirectoryEntry_Directory 
          If DirectoryEntryName(#DIR_Folders) = "." Or DirectoryEntryName(#DIR_Folders) = ".." 
            Continue  ; skip these 
          EndIf 
  
          subfolder = #RootPath$ + DirectoryEntryName(#DIR_Folders)	; subfolders containing 10k html files each
          PrintN("Sub Folder '" + subfolder + "'") 

          If ExamineDirectory(#DIR_Files, subfolder, #FileMask$) 
            While NextDirectoryEntry(#DIR_Files)
              If DirectoryEntryType(#DIR_Files) = #PB_DirectoryEntry_File 
                PrintN("  File '" + subfolder + "\" + DirectoryEntryName(#DIR_Files) + "'") 

                counter + 1
                text = ""
                If ReadFile(#FILE_In, subfolder + "\" + DirectoryEntryName(#DIR_Files))
                  text = ReadString(#FILE_In, #PB_File_IgnoreEOL)
                  CloseFile(#FILE_In)
                  cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
                  PrintN(Str(counter) + " : " + cin)
               ;Else 
               ;  PrintN("Error: Cannot read '" + DirectoryEntryName(#DIR_Files) + "' in '" + subfolder + "'") 
                EndIf 

              EndIf ; DirectoryEntryType() 
            Wend 
            FinishDirectory(#DIR_Files)
           ;Delay(1000)
          EndIf

        EndIf ; DirectoryEntryType() 
      Wend ; NextDirectoryEntry() 
      FinishDirectory(#DIR_Folders) 

      PrintN("Done ") 
    EndIf

    PrintN("Enter to exit.") 
    Input()

  Else 
    Debug "OpenConsole failed. " 
  EndIf ; 
  ProcedureReturn 0 
EndProcedure 

End Main() 

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 3:30 pm
by NicTheQuick
danny88 wrote: Mon Nov 11, 2024 2:36 pm
NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
You are right. I was working on an external SSD.

I've just tried it on the internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?
How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 4:08 pm
by danny88
NicTheQuick wrote: Mon Nov 11, 2024 3:30 pm
danny88 wrote: Mon Nov 11, 2024 2:36 pm
NicTheQuick wrote: Mon Nov 11, 2024 2:23 pm It's most likely the file cache of your system, either your operating system's file cache or the cache in your hard drive or SSD. At first it's fast, and after a while it gets slower.
Btw. Why do you have a `Delay(1000)` in there?
You are right. I was working on an external SSD.

I've just tried it on the internal SSD and it works just fine. Is there a way to avoid the limitation on the external SSD?
How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.
USB type C

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 4:12 pm
by NicTheQuick
danny88 wrote: Mon Nov 11, 2024 4:08 pm
NicTheQuick wrote: Mon Nov 11, 2024 3:30 pm How is that external SSD connected? How big is its cache if it even has one? Use a fast USB connection like USB 3.2, USB 4.0 if available or Thunderbolt.
USB type C
That's just the connector type and not the protocol. USB-C can be anything from USB 1.0 to Thunderbolt 5. :wink:

But after thinking about it a bit longer, I don't think that's your issue here. I just think your SSD doesn't have a lot of cache.

Re: ReadFile() in a loop becomes slower and slower

Posted: Mon Nov 11, 2024 7:01 pm
by AZJIO
Axolotl wrote: Mon Nov 11, 2024 3:30 pm

Code: Select all

  If prefix_pos > 0

Code: Select all

  If prefix_pos
The FindString() function returns 0 or a positive number, so the explicit `> 0` comparison is not needed.

Re: ReadFile() in a loop becomes slower and slower

Posted: Tue Nov 12, 2024 11:03 am
by SMaag
If you deal with 10k strings, which might be long, I see some string issues too.

Code: Select all


	; here you copy the complete file into a String : text
       text = ReadString(2, #PB_File_IgnoreEOL)
       
       ;  here at the function call PB copies the complete text into a new virtual var, which is passed to ExtractString
       cin = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
   
   ; Update: I did not read the post from Axolotl first. His version is the better solution!    
 Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
 ; FindString starts from the beginning of the string at each call. If you have a lot of long strings, this takes time
  If FindString(str, prefix) = 0 Or FindString(str, suffix) = 0
    ProcedureReturn ""
  EndIf
  
  ; here 2 more FindString
  prefix_pos = FindString(str , prefix)
  suffix_pos = FindString(str, suffix, prefix_pos)
  ProcedureReturn Mid(str, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
EndProcedure
 	
To speed up string operations you can pass a pointer to ExtractString, and eliminate 2 FindString operations:

Code: Select all

Procedure.s ExtractString(*str.String, prefix.s, suffix.s)	; *str points to a String structure (call with @var, where var.String), so the text is not copied
  Protected prefix_pos, suffix_pos
  prefix_pos = FindString(*str\s, prefix)
  If prefix_pos
    suffix_pos = FindString(*str\s, suffix, prefix_pos)	; search for the suffix only after the prefix
  EndIf
  
  If prefix_pos And suffix_pos 
    ProcedureReturn Mid(*str\s, prefix_pos + Len(prefix), suffix_pos - prefix_pos - Len(prefix))
  Else
    ProcedureReturn ""
  EndIf
  
EndProcedure

You can eliminate the Len(prefix) in the Mid() function too, because you calculate Len(prefix) 10k*2 times and the result is always the same.
Pass the prefix length as an extra parameter to ExtractString().
But I guess the bottleneck is reading the directory.
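
The extra-parameter idea could look roughly like this. It is only a sketch: the parameter name `prefix_len` and the sample HTML line are made up for illustration, and starting the suffix search at `prefix_pos + prefix_len` (rather than at `prefix_pos`) is an additional assumption so that the suffix cannot match inside the prefix.

```purebasic
EnableExplicit

; prefix_len is passed in so that Len(prefix) is not recomputed on every call.
Procedure.s ExtractString(str.s, prefix.s, prefix_len, suffix.s)
  Protected prefix_pos, suffix_pos
  prefix_pos = FindString(str, prefix)
  If prefix_pos
    ; start looking for the suffix right after the end of the prefix
    suffix_pos = FindString(str, suffix, prefix_pos + prefix_len)
    If suffix_pos
      ProcedureReturn Mid(str, prefix_pos + prefix_len, suffix_pos - prefix_pos - prefix_len)
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure

Define prefix.s = ~"cin\" value=\""
Define suffix.s = ~"\" id="
Define prefix_len = Len(prefix)   ; computed once, outside the file loop
Define html.s = ~"<input name=\"cin\" value=\"AB123\" id=\"cin\">"   ; made-up sample line

Debug ExtractString(html, prefix, prefix_len, suffix)   ; AB123
```

Compared with the file reads, the saving from this is probably small, but it costs nothing to compute the length once.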

Re: ReadFile() in a loop becomes slower and slower

Posted: Tue Nov 12, 2024 11:15 am
by SMaag
I have a question about the ~
ExtractString(text, ~"cin\" value=\"", ~"\" id=")

What is the function of ~?
I have never seen it, and I could not find it in the PB help.

If I debug it:

Code: Select all

Procedure.s ExtractString(str.s, prefix.s, suffix.s)	; function to extract a text between two delimiters
  Debug prefix
  Debug suffix
EndProcedure

text.s ="Text"
cin$ = ExtractString(text, ~"cin\" value=\"", ~"\" id=")
I get:
cin" value="
" id=

Re: ReadFile() in a loop becomes slower and slower

Posted: Tue Nov 12, 2024 11:56 am
by NicTheQuick
With ~ you can use escaped characters inside a string.
It is mentioned in the help, but I cannot find it right now either. In this regard the help is not very useful. Some pages seem hidden, and the search function has always been very bad.
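
A small sketch of what the ~ prefix does, using the prefix string from this thread. The variable names are just for illustration; Chr(34) is the double-quote character.

```purebasic
; ~"..." is an escaped string literal: \" stands for a double quote inside it.
Define escaped.s = ~"cin\" value=\""
; The same text built without escapes.
Define plain.s   = "cin" + Chr(34) + " value=" + Chr(34)

Debug escaped                 ; cin" value="
Debug Bool(escaped = plain)   ; 1
```

Without the ~, a backslash in a PureBasic string literal is just a backslash, which is why paths like "FICHES\" work as written.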

Re: ReadFile() in a loop becomes slower and slower

Posted: Tue Nov 12, 2024 1:13 pm
by HeX0R