Find in Files


Post by IdeasVacuum »

Related to this post about finding a file's EOL: most efficient way to find eol?

I wrote an app to find all occurrences of a string in a bunch of files - I wrote this because a bug in UltraEdit caused its find-in-files to fail. That bug is still there, but pre-bug, UltraEdit could process a bunch of files very quickly - though not as fast as Notepad++.

As-is, my app does what I need. It's a bit more flexible than the built-in functions of UE and NP++ too. It is, however, much slower. Below is a stand-alone code snippet (use your own test file and test search string). It is not representative of the whole app, but it uses the exact string search loop, the same loop I use in other util apps:

Code:

EnableExplicit

#FileIO = 0

Structure StrList
sPath.s
sFile.s
sLnum.s
sStrf.s
EndStructure

Global NewList gStrFound.StrList() ;Info displayed in a ListIcon, User can delete selected  rows

Global  igFormat.i = 0
Global qgFileLen.q = 0
Global  *gTextBuff
Global sgEol.s = #CRLF$

Global dgStartTime.d, dgElapsedTime.d

Procedure PfFindString()
;#----------------------
;example find "Sleep"
Protected   qStart.q, qLen.q
Protected sFindStr.s = "Sleep"
Protected iPosn.i, iLine.i, iFound.i
Protected sLine.s

              SetPriorityClass_(GetCurrentProcess_(), #HIGH_PRIORITY_CLASS) ;#ABOVE_NORMAL_PRIORITY_CLASS)

              ;ForEach gFilesList() ;bunch of files already filtered as containing the string

              If ReadFile(#FileIO, "C:\MyTextFile.txt")

                        igFormat = ReadStringFormat(#FileIO)
                       qgFileLen = Lof(#FileIO)
                    If(qgFileLen > 0)

                                              *gTextBuff = AllocateMemory(qgFileLen)
                            ReadData(#FileIO, *gTextBuff, qgFileLen)
                           CloseFile(#FileIO)

                              ;On average (29 .cpp files in test, filtered to 12)
                              ;this search loop takes about 1.4 seconds per file (compiled exe);
                              ;17 seconds to process all files. Typical file 1000 - 3000 lines, 2 to 12 hits
                              ;NotePad++ processes all the files in just 1 second

                              dgStartTime = ElapsedMilliseconds()

                              qStart = 1 : iLine = 0 : qLen = 0

                              Repeat
                                                   iPosn = FindString(PeekS(*gTextBuff, qgFileLen, igFormat), sgEol, qStart, #PB_String_NoCase)
                                                If(iPosn > 0)

                                                            sLine = Mid(PeekS(*gTextBuff, qgFileLen, igFormat), qStart, (iPosn - qStart))
                                                           qStart = iPosn + Len(sgEol)
                                                            iLine = iLine + 1
                                                           iFound = FindString(sLine, sFindStr, 1, #PB_String_NoCase)
                                                        If(iFound > 0)

                                                              AddElement(gStrFound())
                                                                        ;gStrFound()\sPath = GetPathPart(gFilesList())
                                                                        ;gStrFound()\sFile = GetFilePart(gFilesList())
                                                                         gStrFound()\sLnum = Str(iLine)
                                                                         gStrFound()\sStrf = Trim(sLine)

                                                                Debug gStrFound()\sLnum + "  " + gStrFound()\sStrf
                                                        EndIf
                                                EndIf
                              Until(iPosn = 0)

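                              dgElapsedTime = ElapsedMilliseconds() - dgStartTime
                              Debug dgElapsedTime / 1000

                              FreeMemory(*gTextBuff) ;release the buffer, otherwise every file processed leaks its copy
                      Else
                              CloseFile(#FileIO) ;empty file: nothing was read, but the file must still be closed
                      EndIf
              EndIf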

              ;Next

              SetPriorityClass_(GetCurrentProcess_(), #NORMAL_PRIORITY_CLASS)
EndProcedure

PfFindString()
End

I was expecting this method to be faster than it is. It is OK, and I can of course wait 20 seconds for the results, but is it the best way? Is there a faster way to find every occurrence of the string and capture the line and line number?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.

Post by wilbert »

IdeasVacuum wrote: Is there a faster way to find every occurrence of the string and capture the line and line number?
Yes, use your own custom search functions.

One thing that makes things more complicated, and a lot slower, is that you want a case-insensitive search.
Personally I would probably create an asm based procedure which counts EOL characters very fast but also halts when the first character of your search string is encountered.
I would also not detect the EOL character in advance but determine it within the search function (It's always best to process each character of the file only once if possible).
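A rough sketch of that single-pass idea, written in plain PureBasic rather than asm. It assumes the buffer holds ASCII text and that LF (alone or as part of CRLF) ends a line. `LowerAscii()` and `ScanBuffer()` are hypothetical names, not anything from the PureBasic library, and a line with several hits is reported once per hit:

```purebasic
EnableExplicit

; Fold ASCII 'A'..'Z' to lower case; every other byte passes through unchanged.
Procedure.a LowerAscii(c.a)
  If c >= 'A' And c <= 'Z'
    ProcedureReturn c + 32
  EndIf
  ProcedureReturn c
EndProcedure

; Hypothetical helper: walks qLen bytes of ASCII text once, counting line feeds
; on the way, and reports the line number of each case-insensitive match.
Procedure.i ScanBuffer(*Buff, qLen.q, sFind.s)
  Protected *p.Ascii, *End, *Needle
  Protected iFindLen.i = Len(sFind)
  Protected iLine.i = 1, iHits.i, i.i, iMatch.i, aFirst.a

  If iFindLen = 0 Or qLen < iFindLen
    ProcedureReturn 0
  EndIf

  *Needle = Ascii(LCase(sFind))   ; lower-cased ASCII copy of the search string
  aFirst  = PeekA(*Needle)
  *p      = *Buff
  *End    = *Buff + qLen

  While *p < *End
    If *p\a = 10                  ; LF: covers both LF and CRLF files
      iLine + 1
    ElseIf LowerAscii(*p\a) = aFirst And *p + iFindLen <= *End
      ; First character matches, so compare the rest of the search string
      iMatch = #True
      For i = 1 To iFindLen - 1
        If LowerAscii(PeekA(*p + i)) <> PeekA(*Needle + i)
          iMatch = #False
          Break
        EndIf
      Next
      If iMatch
        iHits + 1
        Debug "line " + Str(iLine)
      EndIf
    EndIf
    *p + 1
  Wend

  FreeMemory(*Needle)
  ProcedureReturn iHits
EndProcedure

; Usage with an in-memory test text (assumption: three short ASCII lines)
Define sText.s = "Sleep(10)" + #LF$ + "x = 1" + #LF$ + "call sleep now"
Define *Text = Ascii(sText)
Debug ScanBuffer(*Text, Len(sText), "SLEEP")
FreeMemory(*Text)
```

Each byte is looked at once for the line count, and the inner comparison only runs where the first character already matches. Capturing the text of the line as well would mean remembering the address just after the last LF.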

You could also try to use ReadString instead of ReadData so you already know one line is read.
Windows (x64)
Raspberry Pi OS (Arm64)

Post by Fred »

There are several issues with your code: calling PeekS() on the whole file for every line is very time-consuming, especially if your file is big. If you want to work on a text file, just do the PeekS() once, outside the loop. Mid() is slow with a big string, as it always starts from the beginning of the string to reach the requested position. As wilbert said, you could use ReadString(), which reads the file line by line in a fast, buffered way.

Here is a reworked version which should work a lot faster (I hope it still does what you want :P)

Code:

;EnableExplicit

#FileIO = 0

Structure StrList
  sPath.s
  sFile.s
  sLnum.s
  sStrf.s
EndStructure

Global NewList gStrFound.StrList() ;Info displayed in a ListIcon, User can delete selected  rows

Global  igFormat.i = 0
Global qgFileLen.q = 0
Global  *gTextBuff
Global sgEol.s = #CRLF$

Global dgStartTime.d, dgElapsedTime.d

Procedure PfFindString()
  ;#----------------------
  ;example find "nDownloadingSize"
  Protected   qStart.q, qLen.q
  Protected sFindStr.s = "nDownloadingSize"
  Protected iPosn.i, iLine.i, iFound.i
  Protected sLine.s, FileContent$
  
  SetPriorityClass_(GetCurrentProcess_(), #HIGH_PRIORITY_CLASS) ;#ABOVE_NORMAL_PRIORITY_CLASS)
  
  ;ForEach gFilesList() ;bunch of files already filtered as containing the string
  
  If ReadFile(#FileIO, "C:\AMD\InternetReadFileEx.txt")
    
    igFormat = ReadStringFormat(#FileIO)
    qgFileLen = Lof(#FileIO)
    If(qgFileLen > 0)
      
      dgStartTime = ElapsedMilliseconds()
      
      qStart = 1 : iLine = 0 : qLen = 0
      
      Repeat
        sLine = ReadString(#FileIO)
        iLine + 1 ;count every line read so the reported line number is correct
        iFound = FindString(sLine, sFindStr, 1, #PB_String_NoCase)
        If(iFound > 0)
          
          AddElement(gStrFound())
          ;gStrFound()\sPath = GetPathPart(gFilesList())
          ;gStrFound()\sFile = GetFilePart(gFilesList())
          gStrFound()\sLnum = Str(iLine)
          gStrFound()\sStrf = Trim(sLine)
          
          Debug gStrFound()\sLnum + "  " + gStrFound()\sStrf
        EndIf
      Until Eof(#FileIO)
      
      dgElapsedTime = ElapsedMilliseconds() - dgStartTime
      Debug dgElapsedTime / 1000
    EndIf
    CloseFile(#FileIO) ;the file was opened successfully, so close it whether or not it was empty
  EndIf
  
  SetPriorityClass_(GetCurrentProcess_(), #NORMAL_PRIORITY_CLASS)
EndProcedure

PfFindString()
End

Post by Mistrel »

Fred is amazing in that he is even here to answer questions. Fred, you're such a hero.

Post by IdeasVacuum »

Thanks Fred!
I always thought that if the whole file was in memory, it could be searched faster than reading line-by-line from the hard drive :shock:

Post by IdeasVacuum »

Thanks for your advice too, Wilbert. It's a shame that it is necessary to have a case-insensitive search. For most files it would be unnecessary, but of course all files have to be catered for.

Post by IdeasVacuum »

So, a single PeekS() is now used to determine if a file contains the search string. If it does, it's read line-by-line using ReadString() to find all occurrences and report them.

With this approach, results are delivered faster than I have seen in UltraEdit and at least as fast as Notepad++ :)
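The final code was not posted, so the following is only a minimal sketch of that two-stage approach. `FindInFile()` is a hypothetical helper, the file path and search string are the placeholders from the first post, and files in encodings other than ASCII, UTF-8 or UTF-16 LE are simply skipped, which is an assumption of this sketch:

```purebasic
EnableExplicit

; Hypothetical helper (name and signature are not from the thread).
; Returns the number of lines in sFile that contain sFindStr, ignoring case.
Procedure.i FindInFile(sFile.s, sFindStr.s)
  Protected iFile.i, iFormat.i, qLen.q, *Buff
  Protected sLine.s, iLine.i, iHits.i, iContains.i

  iFile = ReadFile(#PB_Any, sFile)
  If iFile = 0
    ProcedureReturn 0
  EndIf

  iFormat = ReadStringFormat(iFile)   ; also skips a BOM if there is one
  If iFormat <> #PB_Ascii And iFormat <> #PB_UTF8 And iFormat <> #PB_Unicode
    CloseFile(iFile)                  ; assumption: other encodings are skipped
    ProcedureReturn 0
  EndIf

  ; Stage 1: load the text once and test it with a single PeekS()/FindString()
  qLen = Lof(iFile) - Loc(iFile)      ; bytes left after the BOM
  If qLen > 0
    *Buff = AllocateMemory(qLen + 4)  ; zero-filled, so the text is null-terminated
    If *Buff
      ReadData(iFile, *Buff, qLen)
      iContains = FindString(PeekS(*Buff, -1, iFormat), sFindStr, 1, #PB_String_NoCase)
      FreeMemory(*Buff)
    EndIf
  EndIf

  ; Stage 2: only files that passed stage 1 are read again, line by line
  If iContains > 0
    FileSeek(iFile, 0)
    ReadStringFormat(iFile)           ; skip the BOM again
    While Eof(iFile) = 0
      sLine = ReadString(iFile, iFormat)
      iLine + 1
      If FindString(sLine, sFindStr, 1, #PB_String_NoCase) > 0
        iHits + 1
        Debug Str(iLine) + "  " + Trim(sLine)
      EndIf
    Wend
  EndIf

  CloseFile(iFile)
  ProcedureReturn iHits
EndProcedure

Debug FindInFile("C:\MyTextFile.txt", "Sleep")
```

Most files in a large set fail stage 1 after one whole-file FindString(), so the line-by-line pass only runs on the files that actually contain a hit.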

Thanks again guys.