Code optimization (indent lines)?

oO0XX0Oo
User
Posts: 78
Joined: Thu Aug 10, 2017 7:35 am

Post by oO0XX0Oo »

Hi,

This is a small console app that takes three parameters (the last one is optional):
- A file (full path), UTF-8 encoded with BOM, containing full paths of files / folders separated by #CRLF$
- The "root level" of indentation
- The number of spaces to prepend per nesting level

It creates this:

Code: Select all

Chromium_x64 [64.0.3282.140]
    !Do not update chrlauncher if NOT necessary.txts
    Chromium.lnk
    @HowTo Configure
        Add user scripts.txt
        Install an extension.txt
        Extensions
            @IDs of installed extensions.txt
            Cookie AutoDelete.txt
from that (content of the file: R:\test.txt):

Code: Select all

R:\Chromium_x64 [64.0.3282.140]
R:\Chromium_x64 [64.0.3282.140]\!Do not update chrlauncher if NOT necessary.txts
R:\Chromium_x64 [64.0.3282.140]\Chromium.lnk
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure\Add user scripts.txt
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure\Install an extension.txt
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure\Extensions
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure\Extensions\@IDs of installed extensions.txt
R:\Chromium_x64 [64.0.3282.140]\@HowTo Configure\Extensions\Cookie AutoDelete.txt
When called like this:

Code: Select all

indenter.exe "R:\test.txt" 2 4
It can process e.g. 72,000 files in ~500 ms. That isn't slow, but I guess there is room for optimization.
Any hints on what could be improved?

One requirement: the final file must not end with a trailing newline (because of this restriction I'm using both
WriteStringN() and WriteString())...

Code: Select all

EnableExplicit

#PB_Compiler_IsMainFile = #True


; *************************************************************************************************

XIncludeFile "#includes\constants.pbi" : UseModule Consts


; *************************************************************************************************
; How to call this app
; Param 0 = file (UTF-8 BOM) in double quotes if it contains spaces
; Param 1 = root indentation level
; Param 2 = spaces per level (4 by default) - optional!
Procedure Main()
  Protected.i rootIndentation, spacesPerLevel, hFile, encoding, i
  Protected.s file, path, newFile, line

  ; Validate parameters
  If CountProgramParameters() < 2
    End #ExitCode_MissingParameter
  EndIf

  file = ProgramParameter(0)
  If FileSize(file) = -1
    End #ExitCode_FileDoesNotExist
  ElseIf FileSize(file) = 0
    End #ExitCode_FileIsEmpty
  EndIf
  rootIndentation = Val(ProgramParameter(1))

  ; Parameter is optional, default = 4
  If CountProgramParameters() <> 3
    spacesPerLevel = 4
  Else
    spacesPerLevel = Val(ProgramParameter(2))
  EndIf

  ; Get path of the file (for the new one to write)
  path = GetPathPart(file)

  ; Full path of new file
  newFile = path + GetFilePart(file, #PB_FileSystem_NoExtension) + "_new." + GetExtensionPart(file)

  NewList lines.s()

  ; Read file as UTF8 and place each line in a list
  hFile = OpenFile(#PB_Any, file, #PB_File_SharedRead|#PB_UTF8)
  If hFile
    encoding = ReadStringFormat(hFile)
    While Eof(hFile) = #False
      AddElement(lines())
      line = ReadString(hFile)
      lines() = LSet(" ", (CountString(line, "\") - rootIndentation) * spacesPerLevel) + GetFilePart(line)
    Wend
    CloseFile(hFile)
  EndIf

  ; Write all lines to a new file
  i = 1
  hFile = CreateFile(#PB_Any, newFile, #PB_File_SharedWrite|#PB_File_NoBuffering|#PB_UTF8)
  If hFile
    encoding = WriteStringFormat(hFile, #PB_UTF8)
    If ListSize(lines()) > 0
      ForEach lines()
        If i <> ListSize(lines())
          WriteStringN(hFile, lines(), #PB_UTF8)
        Else
          WriteString(hFile, lines(), #PB_UTF8)
        EndIf
        i + 1
      Next
    EndIf
    CloseFile(hFile)
  EndIf
  FreeList(lines())

  ; Overwrite old file
  If FileSize(newFile) > 0
    DeleteFile(file, #PB_FileSystem_Force)
    If RenameFile(newFile, file)
      End #ExitCode_Ok
    Else
      End #ExitCode_FileMoveFailed
    EndIf
  Else
    End #ExitCode_CreateNewFileFailed
  EndIf
EndProcedure


; *************************************************************************************************

; Call the main procedure
Main()
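
The "no trailing newline" requirement can also be met with a single write path: emit the line break before every line except the first. This is only a minimal sketch; the procedure name `WriteLinesNoTrailingNewline` and the demo file name are made up for illustration, and #CRLF$ is assumed as the separator.

```purebasic
EnableExplicit

; Hypothetical helper: writes all list elements to fileName, separated by CRLF,
; without a line break after the last element.
Procedure.i WriteLinesNoTrailingNewline(fileName.s, List lines.s())
  Protected hFile, first = #True

  hFile = CreateFile(#PB_Any, fileName, #PB_UTF8)
  If hFile = 0
    ProcedureReturn #False
  EndIf

  WriteStringFormat(hFile, #PB_UTF8)        ; UTF-8 BOM, as in the original code
  ForEach lines()
    If first
      first = #False
    Else
      WriteString(hFile, #CRLF$, #PB_UTF8)  ; separator goes before lines 2..n
    EndIf
    WriteString(hFile, lines(), #PB_UTF8)
  Next

  CloseFile(hFile)
  ProcedureReturn #True
EndProcedure

; Usage with two demo lines
NewList demo.s()
AddElement(demo()) : demo() = "a"
AddElement(demo()) : demo() = "    test1.txt"
WriteLinesNoTrailingNewline(GetTemporaryDirectory() + "indent_demo.txt", demo())
```

This drops the counter `i` and the repeated ListSize() comparison inside the loop.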

wilbert
PureBasic Expert
Posts: 3942
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Code optimization (indent lines)?

Post by wilbert »

oO0XX0Oo wrote: It can process e.g. 72,000 files in ~500 ms. This isn't slow but I guess there is room for optimization.
Any hints on what could be improved?

How many files do you need to process?
There are several things you could do.

As far as I can tell you don't need a list.
You could open the input and output file at the same time.

If both the input and output file are UTF8, you could also process the files without using the PB string functions.
If you know the input file won't exceed a few megabytes, you could also load the entire file at once into memory.
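
The first two suggestions (no list, both files open at once) could look roughly like the sketch below. It still uses the PB string functions. `IndentStream` is a hypothetical name, and the indentation formula is simply the one from the first post, clamped at zero, so treat it as an assumption rather than the definitive rule.

```purebasic
EnableExplicit

; Hypothetical helper: converts inFile to outFile line by line.
; Returns the number of lines written, or -1 if a file could not be opened.
Procedure.i IndentStream(inFile.s, outFile.s, rootIndentation.i, spacesPerLevel.i)
  Protected src, dst, count, depth
  Protected line.s

  src = ReadFile(#PB_Any, inFile, #PB_UTF8)
  If src = 0 : ProcedureReturn -1 : EndIf
  dst = CreateFile(#PB_Any, outFile, #PB_UTF8)
  If dst = 0 : CloseFile(src) : ProcedureReturn -1 : EndIf

  ReadStringFormat(src)               ; skip the BOM if there is one
  WriteStringFormat(dst, #PB_UTF8)    ; write a UTF-8 BOM

  While Eof(src) = #False
    line = ReadString(src)
    ; formula from the first post (assumption), clamped so it never goes negative
    depth = CountString(line, "\") - rootIndentation
    If depth < 0 : depth = 0 : EndIf
    If count > 0
      WriteString(dst, #CRLF$)        ; separator before every line except the first
    EndIf
    WriteString(dst, Space(depth * spacesPerLevel) + GetFilePart(line))
    count + 1
  Wend

  CloseFile(dst)
  CloseFile(src)
  ProcedureReturn count
EndProcedure
```

Because each line is written as soon as it has been read, memory use stays constant regardless of the number of entries.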
Windows (x64)
Raspberry Pi OS (Arm64)

Re: Code optimization (indent lines)?

Post by oO0XX0Oo »

Hi wilbert,
wilbert wrote: How many files do you need to process?
It can be up to a million items.

wilbert wrote: You could open the input and output file at the same time.
I tried that yesterday (opening both the input and the temporary output file at the same time,
converting each line and writing it directly). It wasn't noticeably faster, so the list
doesn't seem to be the limiting factor.

wilbert wrote: If both the input and output file are UTF8, you could also process the files without using the PB string functions.
What would that look like (code-wise)?

wilbert wrote: If you know the input file won't exceed a few megabytes, you could also load the entire file at once into memory.
I tried that as well. I used CountString() to count the #CRLF$ occurrences in that *buffer
and then a For ... Next loop with StringField(),
but this was incredibly slow. I don't have much experience with processing data directly in memory...
I guess this could lead to problems as well (when allocating memory), e.g. with this example:

- Let's say we have 1,000,000 files in one subfolder and assume that the
number of spaces to prepend is at least 8.
Then this:

Code: Select all

R:\a
R:\a\test1.txt
R:\a\test2.txt
would turn into this:

Code: Select all

a
        test1.txt
        test2.txt
8 spaces > length of "R:\a\" (which is 5).
In this example 13 characters were removed ("R:\" once, "R:\a\" twice) but 16 spaces were added.

In other words: the output can be larger than the input, so a buffer sized to the input might not be sufficient,
and I don't know whether the minimum amount of memory to allocate can be determined correctly up front...
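
A safe (if generous) upper bound can be computed in one pass, under the assumption that every "\" contributes at most spacesPerLevel spaces and that stripping the path prefix only ever removes bytes. The sketch below counts backslashes in a UTF-8 buffer; `MaxOutputSize` is a made-up name, and UTF8() needs a reasonably recent PureBasic version.

```purebasic
EnableExplicit

; Hypothetical helper: upper bound (in bytes) for the converted size of a UTF-8 buffer.
; Assumption: each "\" adds at most spacesPerLevel spaces; nothing else makes the output grow.
Procedure.q MaxOutputSize(*buffer, size.q, spacesPerLevel.i)
  Protected *p.Ascii = *buffer
  Protected *end = *buffer + size
  Protected.q backslashes

  While *p < *end
    If *p\a = '\'        ; byte $5C never occurs inside a UTF-8 multi-byte sequence
      backslashes + 1
    EndIf
    *p + 1
  Wend

  ProcedureReturn size + backslashes * spacesPerLevel
EndProcedure

; Usage with the three-line example from above (36 bytes, 5 backslashes)
Define text.s = "R:\a" + #CRLF$ + "R:\a\test1.txt" + #CRLF$ + "R:\a\test2.txt"
Define *utf8 = UTF8(text)
Debug MaxOutputSize(*utf8, StringByteLength(text, #PB_UTF8), 8) ; 36 + 5 * 8 = 76
FreeMemory(*utf8)
```

For the example, the bound is 76 bytes while the actual result needs 39, so allocating this much is always enough at the cost of some slack. Processing the file in fixed-size chunks and writing as you go avoids the question entirely.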

Re: Code optimization (indent lines)?

Post by wilbert »

You can try something like the code below.
It uses a 1 megabyte memory buffer.

Code: Select all

#bufferSize = $100000 ; 1 megabyte buffer 

Procedure Main()
  Protected *m.Ascii, *m0, *end, *buffer, *spaces
  Protected.i rootIndentation, spacesPerLevel, indent, transport
  Protected.i src, dst, bytesRead
  
  rootIndentation = 2
  spacesPerLevel = 4
  
  *buffer = AllocateMemory(#bufferSize, #PB_Memory_NoClear)
  *spaces = AllocateMemory(spacesPerLevel, #PB_Memory_NoClear)
  FillMemory(*spaces, spacesPerLevel, 32)
  
  src = OpenFile(#PB_Any, "test.txt")
  dst = CreateFile(#PB_Any, "test_new.txt")
    
  Repeat
    bytesRead = ReadData(src, *buffer + transport, #bufferSize - transport)
    If bytesRead
      ; calculate end
      *end = *buffer + transport + bytesRead
      ; process buffer
      *m = *buffer
      *m0 = *m
      While *m < *end
        If *m\a = '\'
          ; write indentation
          *m0 = *m + 1 : indent + 1
          If indent >= rootIndentation
            WriteData(dst, *spaces, spacesPerLevel)
          EndIf
        ElseIf *m\a = #LF
          ; write file part
          WriteData(dst, *m0, *m - *m0 + 1)
          *m0 = *m + 1 : indent = 0
        EndIf
        *m + 1
      Wend
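      ; carry the unfinished tail (bytes after the last "\" or LF) to the start of the buffer;
      ; MoveMemory is used because source and destination may overlap
      transport = *m - *m0
      MoveMemory(*m0, *buffer, transport)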
    EndIf
  Until bytesRead = 0
  
  If *m0 <> *m
    ; write last file part
    WriteData(dst, *m0, *m - *m0)
  Else
    ; remove CR/LF at end of dst
    While *m > *buffer
      *m - 1
      If *m\a = #CR Or *m\a = #LF
        FileSeek(dst, -1, #PB_Relative)
      Else
        Break  
      EndIf
    Wend
    TruncateFile(dst)
  EndIf
  
  CloseFile(dst)
  CloseFile(src)
  FreeMemory(*spaces)
  FreeMemory(*buffer)
  
EndProcedure

Main()

Re: Code optimization (indent lines)?

Post by oO0XX0Oo »

Holy cow, that's fast!

Instead of about 500 ms for the test file (still 72k entries), it now needs only about 75 ms.
That's roughly 6-7 times faster...

One little quirk:
If the src file does not end with an empty new line, the last line in dst will not contain the content
of the last (converted) line from src. The spaces to prepend (if there are any) are still written, though.

It's not a deal breaker. I can control what is written to the src file so I can let it end with a new empty line.

I've added the necessary truncation to the dst file by myself (that wasn't the hard part) :)

Now I need to investigate what you're doing in your code (line by line). That's harder :mrgreen:

Thanks a lot wilbert!

Re: Code optimization (indent lines)?

Post by wilbert »

oO0XX0Oo wrote: If the src file does not end with an empty new line, the last line in dst will not contain the content
of the last (converted) line from src. The spaces to prepend (if there are any) will be written though.

I noticed that myself also.
I updated the code and added a check at the end of the main processing; don't know if you have seen that.
If the original file didn't end with LF, it writes the file part of the last line.
If it did end with CR or LF, it removes those from the destination file (as you indicated).
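
The second case (stripping CR / LF from the end of the destination) can also be done as a separate step on a finished file. This is a minimal sketch of that idea, not the code from the post above; `TrimTrailingNewlines` is a hypothetical name.

```purebasic
EnableExplicit

; Hypothetical helper: removes all trailing CR / LF bytes from an existing file.
; Returns the number of bytes removed, or -1 if the file cannot be opened.
Procedure.q TrimTrailingNewlines(fileName.s)
  Protected file, b.a
  Protected.q size, pos

  If FileSize(fileName) < 0 : ProcedureReturn -1 : EndIf  ; OpenFile() would create a missing file
  file = OpenFile(#PB_Any, fileName)
  If file = 0 : ProcedureReturn -1 : EndIf

  size = Lof(file)
  pos = size
  While pos > 0
    FileSeek(file, pos - 1)          ; look at the last byte that is still kept
    b = ReadAsciiCharacter(file)
    If b = #CR Or b = #LF
      pos - 1
    Else
      Break
    EndIf
  Wend

  FileSeek(file, pos)                ; TruncateFile() cuts at the current file position
  TruncateFile(file)
  CloseFile(file)
  ProcedureReturn size - pos
EndProcedure
```

Walking backwards byte by byte is fine here because only a handful of bytes at the end of the file are ever inspected.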

Re: Code optimization (indent lines)?

Post by oO0XX0Oo »

Thanks again, wilbert!

Your modified version works absolutely fine now, including truncating the dst file
and even if the src file doesn't end with a CR/LF...