Faster file handling

Posted: Sat Aug 05, 2006 3:02 pm
by Tipperton
Continued from:

http://www.purebasic.fr/english/viewtopic.php?t=23007

I re-wrote a program I had originally written in PowerBasic into PureBasic because I wanted to give it a GUI to make it easier to use. PowerBasic's DDT system is overly complex compared to PureBasic's Gadget system.

End result: The PowerBasic program finished the file in less time than it took the PureBasic program to do one percent of the file!

So I know file operations (specifically string reading and writing) can be a lot faster.

The file is in comma separated fields and all I'm doing is combining a couple of the fields into one field, formatting a couple of others, and deleting a few others. Then writing the result to an output file.

Here's the code:

Code: Select all

If ReadFile(0, sTempName)
  FileBuffersSize(0, 65536)
  If CreateFile(1, sOutputName19)
    FileBuffersSize(1, 65536)
    If CreateFile(2, sOutputName20)
      FileBuffersSize(2, 65536)
      Dim asInputData.s(26)
      Dim asOutputData.s(24)
      Dim alFieldLength(24)
      SetGadgetText(#Gadget_fraProgress, "Processing: Preparing records")
      qInputLength.q=Lof(0)
      lOldPercent.l=-1
      Repeat
        DoEvents()
        lNewPercent.l=Int(Loc(0)/qInputLength*100+0.5)
        If lNewPercent<>lOldPercent
          SetGadgetState(#Gadget_barProgress, lNewPercent)
          lOldPercent=lNewPercent
        EndIf
        sInputLine.s=ReadString(0, #PB_Ascii)
        If CountString(sInputLine, ",")=26
          For lIndex=0 To 26
            asInputData(lIndex)=StringField(sInputLine, lIndex+1, ",")
          Next lIndex
          If Val(asInputData(4))=0 Or Trim(asInputData(4))="999999999"
            asInputData(4)=""
          EndIf
          If Val(asInputData(1))=5000 Or Val(asInputData(1))=5001 Or Val(asInputData(1))=8888
            Continue
          EndIf
          If UCase(Trim(asInputData(5)))="UNKNOWN" And Len(Trim(asInputData(4)))=0
            Continue
          EndIf
          If UCase(Trim(asInputData(5)))="AGGREGATE"
            Continue
          EndIf
          asOutputData(0)=asInputData(0)
          asOutputData(1)=RSet(Trim(Str(Val(asInputData(1)))), 5, "0")
          asOutputData(2)=RSet(Trim(Str(Val(asInputData(2))%100)), 2, "0")
          asOutputData(3)=RSet(Trim(Str(Val(asInputData(3)))), 5, "0")
          If Len(Trim(asInputData(4)))
            asOutputData(4)=RSet(Trim(asInputData(4)), 9, "0")
          Else
            asOutputData(4)=""
          EndIf
          If UCase(Trim(asInputData(8)))="P"
            asOutputData(5)=""
            For lIndex=5 To 7
              asOutputData(5)+RemoveString(asInputData(lIndex), " ")+" "
            Next lIndex
            asOutputData(5)=Trim(asOutputData(5))
          Else
            asOutputData(5)=Trim(asInputData(5))
            While FindString(asOutputData(5), "  ", 1)
              asOutputData(5)=ReplaceString(asOutputData(5), "  ", " ")
            Wend
          EndIf
          For lIndex=7 To 26
            asOutputData(lIndex-2)=asInputData(lIndex)
          Next lIndex
          sOutputLine.s=""
          For lIndex=0 To 24
            lFieldLen=Len(Trim(asOutputData(lIndex)))
            If lFieldLen>alFieldLength(lIndex)
              alFieldLength(lIndex)=lFieldLen
            EndIf
            Select lIndex
              Case 0
                sOutputField.s=Trim(asOutputData(0))+",N,N"
              Case 7, 14, 17, 18, 21
                sOutputField=Trim(asOutputData(lIndex))
              Case 24
                sOutputField=Trim(asOutputData(24))+","+Chr(34)+Chr(34)
              Default
                sOutputField=Chr(34)+Trim(asOutputData(lIndex))+Chr(34)
            EndSelect
            If Len(Trim(sOutputLine))
              sOutputLine+","
            EndIf
            sOutputLine+sOutputField
          Next lIndex
          If Val(asInputData(2))<2000
            WriteStringN(1, sOutputLine, #PB_Ascii)
          Else
            WriteStringN(2, sOutputLine, #PB_Ascii)
          EndIf
        EndIf ; CountString
      Until Eof(0) Or quitfrmUCPFilePrep
      If CreateFile(3, GetPathPart(sOutputName)+"fieldlengths.txt")
        For lIndex=0 To 24
          WriteStringN(3, Str(lIndex)+". "+Str(alFieldLength(lIndex)))
        Next lIndex
        CloseFile(3)
      EndIf
      CloseFile(2)
      quitfrmUCPFilePrep=1
    EndIf ; CreateFile
    CloseFile(1)
  EndIf ; CreateFile
  CloseFile(0)
EndIf ; ReadFile
I had run into this before (as did other PureBasic users) with v3.x and switched to PowerBasic. The problem with PowerBasic is that it assumes you know and understand Windows API programming, because its DDT (Dynamic Dialog Tools) system is just a set of built-in commands that mirror the API. When I discovered that PureBasic had released version 4 I decided to try it. It's nice, but it appears the file reading and writing is still as slow as it was before.

BTW: I saw references to Rings' FastFile library but it doesn't appear to be on any of the PureBasic support sites like PureProject or PureArea. Is it no longer available?

Re: Faster file handling

Posted: Sat Aug 05, 2006 3:07 pm
by PB
What does DoEvents() do in the loop? I'd remove it, if possible, because if it's
releasing the CPU or something then that will definitely slow down the loop in
a big way.

Re: Faster file handling

Posted: Sat Aug 05, 2006 3:28 pm
by Tipperton
PB wrote:What does DoEvents() do in the loop? I'd remove it, if possible, because if it's releasing the CPU or something then that will definitely slow down the loop in a big way.
It just watches for the Abort button to be pressed and, if it is, sets the quitfrmUCPFilePrep variable to 1.

Here's the code:

Code: Select all

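Procedure DoEvents()
  ; WindowEvent() does not block; it returns 0 when no event is pending.
  EventID=WindowEvent()
  GadgetID=EventGadget()
  If EventWindow()=#Window_frmUCPFilePrep
    ; Close box or Exit button: raise the flag that the main loop tests.
    ; (Assumes quitfrmUCPFilePrep is declared Global elsewhere.)
    If EventID=#PB_Event_CloseWindow
      quitfrmUCPFilePrep=1
    ElseIf EventID=#PB_Event_Gadget And GadgetID=#Gadget_btnExit
      quitfrmUCPFilePrep=1
    EndIf
  EndIf
EndProcedure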
I suppose I could add a line counter and only run DoEvents once every 100 lines or so... (there are about 3.5 million lines in the file)

Even better, I decided to try moving it to the code that updates the progress bar. And that did the trick, it's acceptably fast now! :D

Thanks!

Posted: Sat Aug 05, 2006 3:53 pm
by b1be
you might try PureFile.

haven't tested it

Posted: Sat Aug 05, 2006 4:18 pm
by Tipperton
b1be wrote:you might try PureFile.
I looked at it, but its documentation said that PureBasic 4 (which is what I'm using) has buffered file I/O, so it only really applies to v3.

Re: Faster file handling

Posted: Sat Aug 05, 2006 4:22 pm
by Tipperton
Tipperton wrote:Even better, I decided to try moving it to the code that updates the progress bar. And that did the trick, it's acceptably fast now! :D
That turned out to cause the window to be rather sluggish since event handling wasn't being done often enough so I went with the line counter method and set it to run DoEvents every 1,000 lines. That vastly improved the window's responsiveness without much loss in performance.
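
A minimal sketch of the line counter approach, for illustration only. The constant #EventInterval, the flag quitRequested and the stub DoEvents() are hypothetical stand-ins, not the code from the actual program, and the For loop merely stands in for the ReadString() loop.

```purebasic
; Hypothetical sketch: poll window events once every #EventInterval lines.
#EventInterval = 1000        ; assumed value, tune for responsiveness

Global quitRequested.l       ; hypothetical flag, set by the real event handler

Procedure DoEvents()
  ; Stub. The real version would call WindowEvent() and set quitRequested.
EndProcedure

lLineCount.l = 0
For lLine.l = 1 To 5000      ; stands in for the ReadString() loop
  lLineCount + 1
  If lLineCount >= #EventInterval
    DoEvents()               ; runs once per 1,000 lines instead of every line
    lLineCount = 0
  EndIf
  ; ... process one line here ...
  If quitRequested
    Break
  EndIf
Next lLine
```

Since events are polled less often this way, it may help for each call to drain everything that is pending, that is, to keep calling WindowEvent() until it returns 0, so the window does not fall behind.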

Now I'm going through and looking for unneeded redundancies.

Question: In many Basics the Str function adds a space either before or after the converted number. Does PureBasic's Str function also do this? If not, I could remove a bunch of Trim calls.

(Little bits of info like this would be really nice to have in the documentation...)

Posted: Sat Aug 05, 2006 4:41 pm
by Fred

Code: Select all

Debug "a"+Str(12546)+"a"
You can also try to use a thread if you want (check the threadsafe option).

Posted: Sat Aug 05, 2006 6:23 pm
by netmaestro
sInputLine.s=ReadString(0, #PB_Ascii)
That's your bottleneck right there. ReadString is fine for file sizes of a few MB, but if you're working with data in gigabyte-sized files you have to read the data in decent-sized blocks if you're looking for speed. 20-50 MB should work well. Just process the strings in the memory buffer once they're in; you'll see performance you won't believe.
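
A rough sketch of what block reading could look like, not a tuned implementation. The file name, the small #BlockSize and the ProcessLine() helper are assumptions made for the demo; it writes its own test file so it can be run as is. Whether this actually beats buffered ReadString() for a given file would need measuring.

```purebasic
; Sketch only: read a file in fixed-size blocks and split lines in memory.
#BlockSize = 65536                 ; small for the demo; use MBs for big files

Global lLines.l

Procedure ProcessLine(sLine.s)     ; hypothetical placeholder for field handling
  lLines + 1
EndProcedure

sName.s = "blockdemo.txt"          ; assumed demo file in the current directory
If CreateFile(0, sName)
  For i = 1 To 1000
    WriteStringN(0, "a,b,c," + Str(i), #PB_Ascii)
  Next i
  CloseFile(0)
EndIf

*Buffer = AllocateMemory(#BlockSize)
If *Buffer
  If ReadFile(0, sName)
    sCarry.s = ""                  ; partial line carried over between blocks
    While Eof(0) = 0
      lRead = ReadData(0, *Buffer, #BlockSize)
      lStart = 0
      For lPos = 0 To lRead - 1
        If PeekB(*Buffer + lPos) = 10          ; LF ends a line
          lLen = lPos - lStart
          If lLen > 0
            sCarry + PeekS(*Buffer + lStart, lLen, #PB_Ascii)
          EndIf
          ProcessLine(RemoveString(sCarry, Chr(13)))  ; drop CR from CRLF
          sCarry = ""
          lStart = lPos + 1
        EndIf
      Next lPos
      lLen = lRead - lStart        ; bytes after the last LF in this block
      If lLen > 0
        sCarry + PeekS(*Buffer + lStart, lLen, #PB_Ascii)
      EndIf
    Wend
    If Len(sCarry)                 ; last line without a trailing line break
      ProcessLine(RemoveString(sCarry, Chr(13)))
    EndIf
    CloseFile(0)
  EndIf
  FreeMemory(*Buffer)
EndIf
Debug lLines                       ; number of lines seen
```

The carry string is what keeps a line intact when it straddles two blocks; without it, the tail of one block and the head of the next would be processed as two separate records.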

Posted: Sat Aug 05, 2006 6:27 pm
by maw
Let me get this right..

First you state that PB's file handling is really poor. Then you state that PB's string handling is slow, or so you have read. And then, to everyone's surprise I'm sure :roll: , it turns out it was really your coding that was behind it... And now you want better documentation for something that can be tested with one line of code once (and for the record, I have never used a basic that inserted a space when using Str).

Fred, you deserve a medal for your patience!! In fact, you deserve a bunch of medals!!! :lol:

Posted: Sat Aug 05, 2006 6:33 pm
by Fred
:lol:

Posted: Sat Aug 05, 2006 6:45 pm
by Tipperton
maw wrote:First you state that PB's file handling is really poor.
Yup! If you had read the entire thread, you'd have read that I had tried this with v3.x and that I and others had noted that PureBasic's file handling (especially strings) was slow. Why else would Rings have written the FastFile library?
maw wrote:Then you state that PB's string handling is slow, or so you have read.
Bingo! I read it, but hadn't tested it, nor did I claim that it was. When someone asked if my record processing could be the bottleneck, I said possibly, since I had read that some string functions weren't as fast as they could be and my processing does do mostly string work.
maw wrote:And then, to everyone's surprise I'm sure :roll: , it turns out it was really your coding that was behind it.
And I suppose you've never written code before that needed tweaking?
maw wrote:And now you want better documentation for something that can be tested with one line of code once (and for the record,
PureBasic is nice but its documentation is rather weak. I'm getting tired of having to write test programs to find out something that could have been in the documentation. I've lost track of how many such programs I've written....
maw wrote:I have never used a basic that inserted a space when using Str).
Really!? Where have you been? Every Basic I've used except PureBasic puts a leading space in the output from the Str command.

Posted: Sat Aug 05, 2006 6:53 pm
by drahneir
Hello,
Tipperton is half right. Str(-1) would give no leading space, but Str(1) makes a space for the positive sign.

Posted: Sat Aug 05, 2006 7:00 pm
by Tipperton
drahneir wrote:Str(-1) would give no leading space, but Str(1) makes a space for the positive sign.
Exactly! I guess I should have said that myself....

Posted: Sat Aug 05, 2006 7:29 pm
by blueznl
but, after all this, i'm still a little confused, are you still saying that purebasic has slow file handling compared with powerbasic? that's the one thing that is not entirely clear to me

you're sure you're using pb4?

Posted: Sat Aug 05, 2006 8:17 pm
by Tipperton
blueznl wrote:you're sure you're using pb4?
Yes, I'm sure. I had initially bought PureBasic back in 2003, but in 2004 switched to PowerBasic, so PureBasic was no longer installed. When I decided to try PureBasic again after getting fed up with PowerBasic's overly complex GUI handling, I had to re-install it and, since 4.00 had been released, installed it instead of the 3.9x I had been using.
blueznl wrote:i'm still a little confused, are you still saying that purebasic has slow file handling compared with powerbasic? that's the one thing that is not entirely clear to me
No, probably not.

That was my initial assessment because the PowerBasic version could process the whole file in about 10 minutes and the PureBasic version wasn't even at 1% complete after 10 minutes. I remembered that there were comments about PureBasic 3.x file handling being slow, so I figured that file handling in 4 was the same.

That thought turns out to be wrong. The cause of the slowness was where and how I was calling the DoEvents function. By changing it from being called for every processed line to once every 1,000 lines I got a significant speed increase while still allowing Windows the chance to process events from this and other running programs.

Although I haven't actually timed the two, I would guess that the two programs are now very close in speed with the main difference probably being that the PowerBasic version is a console application while the PureBasic version is a Windows application. I'm sure having a GUI sucks some speed out of a program.

I really do like PureBasic a lot and will be switching most (if not all) of my work to it instead of PowerBasic. It's so easy to hand code a GUI in PureBasic, although I do have PureVisionXP so I don't have to. In PowerBasic you either have to be a Windows API wiz or use a form designer to do it, and even then it's so complex....

Now that I understand the mistake I made (with this project) my only real want is more complete documentation. The current documentation is good; it's just missing little bits and pieces that would make using PureBasic easier if the information was in there.

A good example is the ProgramParameter() function. The current help file says that if you pass no parameters it will return the next command line parameter. It also says that you can pass a number to it and it will return that command line parameter number. What it doesn't say is whether they are numbered from zero or one.
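
The numbering question can be settled with a short test along these lines. It assumes the index form of the parameter function and CountProgramParameters() are available in the installed version; the zero-based loop bounds are the assumption being checked, not a documented fact.

```purebasic
; Sketch: list every command-line parameter with the index used to fetch it.
lCount = CountProgramParameters()
Debug "Parameters: " + Str(lCount)
For lIndex = 0 To lCount - 1        ; assumes numbering starts at 0
  Debug Str(lIndex) + ": " + ProgramParameter(lIndex)
Next lIndex
```

Run it with a few known arguments, for example one two three. If index 0 shows the first argument, the numbering is zero-based; if index 0 comes back empty and the last argument never appears, it starts at one.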

Another example is that the documentation says that function return values are always longs unless otherwise noted, but Lof() returns a quad and there's no mention of that. I know that because in another project I needed to get the size of a file that grew to 6 GB. If Lof() returned a long it would not have returned the correct value after the file got over 2 GB in size, but it always returned the correct size.

Just little things like that. I noticed that someone had "volunteered" to work on the help file to improve it. I'll wait and see if anything happens; if not, and I feel I have the time to devote to it, I may consider doing something there.