Speed optimizing (Strings, again :-)

Just starting out? Need help? Post your questions and find answers here.
User avatar
Michael Vogel
Addict
Addict
Posts: 2797
Joined: Thu Feb 09, 2006 11:27 pm
Contact:

Speed optimizing (Strings, again :-)

Post by Michael Vogel »

I do some a simple compression for millions of numbers in a text file. There are different categories (#A, #B, #C,...) which are needed with different precision (#Fuzzy=0, 1,...)

The routine works fine and depending on the given values, the results are very good (in my case better than 20:1). My question is, if someone will have a look at these routines, if they can be optimized in speed (and in compression level, if you like):

Code: Select all

; Define

	; Check Routines...

	RandomSeed(0)

	#Number=20
	#Fuzzy=#False

	Global Dim ArrVals.s(#Number)
	Global Dim ArrComp.s(#Number)

	Global LenValue
	Global LenCompressed

	; Compression Routine...

	Enumeration
		#A
		#B
		#C
		#Buffersize=#C
	EndEnumeration

	Enumeration
		#InitialValue
		#PreviousValue
	EndEnumeration

	Global Dim Buffer.s(#Buffersize,#PreviousValue)
	#Nil="%"


; EndDefine

Procedure.s RemoveChars(nr,s.s)

	Protected i,a,b
	Protected type,flag

	Enumeration
		#InitialOnly
		#PreviousAndInitial
		#PreviousOnly
	EndEnumeration



	If Buffer(nr,#InitialValue)="";#Nil

		Buffer(nr,#InitialValue)=s
		Buffer(nr,#PreviousValue)=s

	Else

		i=-1
		flag=#False
		type=#PreviousAndInitial
		Repeat
			i+1
			a=PeekB(@s+i)

			If type>#InitialOnly;		also #PreviousAndInitial oder #PreviousOnly
				b=PeekB(@Buffer(nr,#PreviousValue)+i)
				;If (a*b=0) Or (a<>b)
				If (a=0) Or (b=0) Or (a<>b)
					If type=#PreviousAndInitial
						type=#InitialOnly
					Else
						flag=#True
					EndIf
				EndIf
			EndIf

			If type<#PreviousOnly;	also #PreviousAndInitial oder #InitialOnly
				b=PeekB(@Buffer(nr,#InitialValue)+i)
				;If (a*b=0) Or (a<>b)
				If (a=0) Or (b=0) Or (a<>b)
					If type=#PreviousAndInitial
						type=#PreviousOnly
					Else
						flag=#True
					EndIf
				EndIf
			EndIf

		Until flag

		Buffer(nr,#PreviousValue)=s

		If i>1; es bringt nichts, ein einzelnes Zeichen zu ersetzen...
			If type=#PreviousOnly
				ProcedureReturn Chr(i+96)+PeekS(@s+i)
			Else
				ProcedureReturn Chr(i+64)+PeekS(@s+i)
			EndIf
		EndIf

	EndIf

	ProcedureReturn s

EndProcedure
Procedure.s RemoveZeros(s.s,mode)

	Protected l

	l=FindString(s,".",1); 				Mode=1: eine Nachkommastelle merken...

	If l;	Komma vorhanden...
		If mode=0
			l=Len(s)-1;		Mode=0: alle Nachkommastellen merken...
		EndIf
		While PeekB(@s+l)='0';		redundante Nullen (hinten) abschneiden...
			l-1
		Wend

		If PeekB(@s+l)='.';			"Komma" als letztes Zeichen entfernen...
			l-1
		EndIf

		ProcedureReturn  Left(s,l+1)

	EndIf

	ProcedureReturn s

EndProcedure
Procedure.s ExpandString(i,zeile.s)

	Protected c

	If Buffer(i,#InitialValue)=#Nil
		Buffer(i,#InitialValue)=PeekS(@Zeile)
		Buffer(i,#PreviousValue)=Buffer(i,#InitialValue)

	Else
		c=PeekB(@Zeile)
		If c>'@';		64
			If c>'`';	96
				Zeile=Left(Buffer(i,#PreviousValue),c-96)+PeekS(@Zeile+1)
			Else
				Zeile=Left(Buffer(i,#InitialValue),c-64)+PeekS(@Zeile+1)
			EndIf
		EndIf

		Buffer(i,#PreviousValue)=Mid(Zeile,1);	ohne Kennung (§A,...) im Buffer speichern

	EndIf

	ProcedureReturn zeile

EndProcedure


; ------------------------------------------------------------------------------------------------------------------
; Compress...
; ------------------------------------------------------------------------------------------------------------------
For i=1 To #Number

	ArrVals(i)=Str(12498+Random(4))+"."+Str(Random(1000))+Str(Random(5))+Str(Random(1))
	ArrComp(i)=RemoveChars(#B,RemoveZeros(ArrVals(i),#Fuzzy))

	LenValue+Len(ArrVals(i))
	LenCompressed+Len(ArrComp(i))

Next i

Debug Str(LenCompressed) + " <– " + Str(LenValue)


; ------------------------------------------------------------------------------------------------------------------
; Decompress...
; ------------------------------------------------------------------------------------------------------------------
Buffer(#B,0)=#Nil
Buffer(#B,1)=#Nil

For i=1 To #Number
	Debug ExpandString(#B,ArrComp(i)) +" <– " + ArrVals(i) + " –> " + ArrComp(i)
Next i

Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: Speed optimizing (Strings, again :-)

Post by Trond »

It would help with some more information, like an example text file.
User avatar
Michael Vogel
Addict
Addict
Posts: 2797
Joined: Thu Feb 09, 2006 11:27 pm
Contact:

Re: Speed optimizing (Strings, again :-)

Post by Michael Vogel »

Trond wrote:It would help with some more information, like an example text file.
I'll try to give you an example, but it is only a small fragment of an original file:

Code: Select all

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TrainingCenterDatabase xmlns="http://www.garmin.com...>
  <Activities>
    <Activity Sport="Other">
      <Id>2011-04-17T12:10:02Z</Id>
      <Lap StartTime="2011-04-17T12:10:02Z">
        <TotalTimeSeconds>1582.0000000</TotalTimeSeconds>
        <DistanceMeters>1341.4195557</DistanceMeters>
        <MaximumSpeed>0.0000000</MaximumSpeed>
        <Calories>0</Calories>
        <Intensity>Active</Intensity>
        <TriggerMethod>Manual</TriggerMethod>
        <Track>
          <Trackpoint>
            <Time>2011-04-17T12:10:02Z</Time>
            <Position>
              <LatitudeDegrees>40.5842670</LatitudeDegrees>
              <LongitudeDegrees>14.3435250</LongitudeDegrees>
            </Position>
            <AltitudeMeters>487.0500000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:10:12Z</Time>
            <Position>
              <LatitudeDegrees>40.5842510</LatitudeDegrees>
              <LongitudeDegrees>14.3434290</LongitudeDegrees>
            </Position>
            <AltitudeMeters>488.0100000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:10:28Z</Time>
            <Position>
              <LatitudeDegrees>40.5843280</LatitudeDegrees>
              <LongitudeDegrees>14.3433380</LongitudeDegrees>
            </Position>
            <AltitudeMeters>488.0100000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:10:51Z</Time>
            <Position>
              <LatitudeDegrees>40.5844310</LatitudeDegrees>
              <LongitudeDegrees>14.3433330</LongitudeDegrees>
            </Position>
            <AltitudeMeters>486.5600000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:11:04Z</Time>
            <Position>
              <LatitudeDegrees>40.5844150</LatitudeDegrees>
              <LongitudeDegrees>14.3432680</LongitudeDegrees>
            </Position>
            <AltitudeMeters>486.5600000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:12:10Z</Time>
            <Position>
              <LatitudeDegrees>40.5844150</LatitudeDegrees>
              <LongitudeDegrees>14.3432670</LongitudeDegrees>
            </Position>
            <AltitudeMeters>487.0500000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:12:38Z</Time>
            <Position>
              <LatitudeDegrees>40.5843660</LatitudeDegrees>
              <LongitudeDegrees>14.3432090</LongitudeDegrees>
            </Position>
            <AltitudeMeters>486.5600000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:12:50Z</Time>
            <Position>
              <LatitudeDegrees>40.5842900</LatitudeDegrees>
              <LongitudeDegrees>14.3430510</LongitudeDegrees>
            </Position>
            <AltitudeMeters>485.1200000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:13:09Z</Time>
            <Position>
              <LatitudeDegrees>40.5842130</LatitudeDegrees>
              <LongitudeDegrees>14.3427890</LongitudeDegrees>
            </Position>
            <AltitudeMeters>482.2400000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:13:29Z</Time>
            <Position>
              <LatitudeDegrees>40.5841670</LatitudeDegrees>
              <LongitudeDegrees>14.3425660</LongitudeDegrees>
            </Position>
            <AltitudeMeters>477.4300000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:14:08Z</Time>
            <Position>
              <LatitudeDegrees>40.5841640</LatitudeDegrees>
              <LongitudeDegrees>14.3424970</LongitudeDegrees>
            </Position>
            <AltitudeMeters>476.4700000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:14:54Z</Time>
            <Position>
              <LatitudeDegrees>40.5841060</LatitudeDegrees>
              <LongitudeDegrees>14.3422970</LongitudeDegrees>
            </Position>
            <AltitudeMeters>473.5900000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:15:13Z</Time>
            <Position>
              <LatitudeDegrees>40.5840420</LatitudeDegrees>
              <LongitudeDegrees>14.3420300</LongitudeDegrees>
            </Position>
            <AltitudeMeters>470.7000000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:15:14Z</Time>
            <Position>
              <LatitudeDegrees>40.5840420</LatitudeDegrees>
              <LongitudeDegrees>14.3420300</LongitudeDegrees>
            </Position>
            <AltitudeMeters>470.7000000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:15:49Z</Time>
            <Position>
              <LatitudeDegrees>40.5840450</LatitudeDegrees>
              <LongitudeDegrees>14.3420000</LongitudeDegrees>
            </Position>
            <AltitudeMeters>470.7000000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:16:01Z</Time>
            <Position>
              <LatitudeDegrees>40.5840030</LatitudeDegrees>
              <LongitudeDegrees>14.3418350</LongitudeDegrees>
            </Position>
            <AltitudeMeters>466.8600000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:16:12Z</Time>
            <Position>
              <LatitudeDegrees>40.5839650</LatitudeDegrees>
              <LongitudeDegrees>14.3416780</LongitudeDegrees>
            </Position>
            <AltitudeMeters>463.4900000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:16:13Z</Time>
            <Position>
              <LatitudeDegrees>40.5839610</LatitudeDegrees>
              <LongitudeDegrees>14.3416670</LongitudeDegrees>
            </Position>
            <AltitudeMeters>463.4900000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:16:28Z</Time>
            <Position>
              <LatitudeDegrees>40.5839310</LatitudeDegrees>
              <LongitudeDegrees>14.3414800</LongitudeDegrees>
            </Position>
            <AltitudeMeters>460.1300000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:16:47Z</Time>
            <Position>
              <LatitudeDegrees>40.5838970</LatitudeDegrees>
              <LongitudeDegrees>14.3411640</LongitudeDegrees>
            </Position>
            <AltitudeMeters>455.3200000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:09Z</Time>
            <Position>
              <LatitudeDegrees>40.5839310</LatitudeDegrees>
              <LongitudeDegrees>14.3408810</LongitudeDegrees>
            </Position>
            <AltitudeMeters>451.0000000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:28Z</Time>
            <Position>
              <LatitudeDegrees>40.5839610</LatitudeDegrees>
              <LongitudeDegrees>14.3407190</LongitudeDegrees>
            </Position>
            <AltitudeMeters>447.6300000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:29Z</Time>
            <Position>
              <LatitudeDegrees>40.5839650</LatitudeDegrees>
              <LongitudeDegrees>14.3407120</LongitudeDegrees>
            </Position>
            <AltitudeMeters>447.6300000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:40Z</Time>
            <Position>
              <LatitudeDegrees>40.5840570</LatitudeDegrees>
              <LongitudeDegrees>14.3406340</LongitudeDegrees>
            </Position>
            <AltitudeMeters>444.2700000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:43Z</Time>
            <Position>
              <LatitudeDegrees>40.5840610</LatitudeDegrees>
              <LongitudeDegrees>14.3406290</LongitudeDegrees>
            </Position>
            <AltitudeMeters>443.7900000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:17:55Z</Time>
            <Position>
              <LatitudeDegrees>40.5841640</LatitudeDegrees>
              <LongitudeDegrees>14.3405130</LongitudeDegrees>
            </Position>
            <AltitudeMeters>441.3800000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:18:05Z</Time>
            <Position>
              <LatitudeDegrees>40.5842210</LatitudeDegrees>
              <LongitudeDegrees>14.3404370</LongitudeDegrees>
            </Position>
            <AltitudeMeters>438.0200000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:18:17Z</Time>
            <Position>
              <LatitudeDegrees>40.5843580</LatitudeDegrees>
              <LongitudeDegrees>14.3404020</LongitudeDegrees>
            </Position>
            <AltitudeMeters>434.1700000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:18:18Z</Time>
            <Position>
              <LatitudeDegrees>40.5843730</LatitudeDegrees>
              <LongitudeDegrees>14.3404040</LongitudeDegrees>
            </Position>
            <AltitudeMeters>434.1700000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:18:31Z</Time>
            <Position>
              <LatitudeDegrees>40.5845410</LatitudeDegrees>
              <LongitudeDegrees>14.3404490</LongitudeDegrees>
            </Position>
            <AltitudeMeters>431.7700000</AltitudeMeters>
          </Trackpoint>
          <Trackpoint>
            <Time>2011-04-17T12:18:43Z</Time>
            <Position>
              <LatitudeDegrees>40.5846370</LatitudeDegrees>
              <LongitudeDegrees>14.3404770</LongitudeDegrees>
            </Position>
            <AltitudeMeters>427.9200000</AltitudeMeters>
          </Trackpoint>
        </Track>
      </Lap>
      <Notes>Italia - Termini</Notes>
    </Activity>
  </Activities>

  <Author xsi:type="Application_t">
    <Name>Garmin Training Center(r)</Name>
    <Build>
      <Version>
        <VersionMajor>3</VersionMajor>
        <VersionMinor>6</VersionMinor>
        <BuildMajor>4</BuildMajor>
      </Version>
  </Author>

</TrainingCenterDatabase>
And as a result of compacting this data you should get this:

Code: Select all

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TrainingCenterDatabase ...>
<Activities>
N
<Id>2011-04-17T12:10:02Z</Id>
§*2011-04-17T12:10:02Z
O1582.
D1341.4195557
M20110417121002
B40.584267
L14.343525
A487.05
ML12
BG51
LF429
AB8.01
ML28
BF328
LF338
Af
ML51
BF431
Lh3
AB6.56
MK104
Bg15
LF268
Af
MK210
Bi
Lh7
AF
Ml38
BF366
Lg09
AB6.56
Ml50
BG9
LF051
AB5.12
MK309
BG13
LE2789
AB2.24
Ml29
BF167
Lf566
A477.43
MK408
Bh4
Lf497
Ab6.47
Ml54
Bg06
Lf297
Ab3.59
MK513
BF042
Lf03
Ab0.7
Mm4
Bi
Lh
Ae
Ml49
Bh5
Lf
Ae
MK601
Bg03
LE1835
A466.86
Ml12
BE3965
Lf678
Ab3.49
Mm3
Bh1
Lg67
Af
Ml28
Bg31
Lf48
Ab0.13
Ml47
Bf897
Lf164
A455.32
MK709
Bf931
LE0881
Ab1.
Ml28
Bg61
Lf719
A447.63
Mm9
Bh5
Lh2
Af
Ml40
BF057
Lf634
Ab4.27
Mm3
Bg61
Lg29
Ab3.79
Ml55
BF164
Lf513
Ab1.38
MK805
BG21
Lf437
A438.02
Ml17
BF358
Lg02
Ab4.17
Mm8
Bg73
Lh4
Af
Ml31
BF541
Lg49
Ab1.77
Ml43
BF637
Lg77
A427.92
<Notes>Italia - Termini</Notes>
N§
</Activities>
</TrainingCenterDatabase>
The first character in the output file is a prefix indicating the data type (A=Altitude, B=Latitude, L=Longitude, M=Time etc.) followed by the compressed data value.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Re: Speed optimizing (Strings, again :-)

Post by Trond »

The packer library seems to compress it just as well, why don't you use it?
User avatar
Michael Vogel
Addict
Addict
Posts: 2797
Joined: Thu Feb 09, 2006 11:27 pm
Contact:

Re: Speed optimizing (Strings, again :-)

Post by Michael Vogel »

Trond wrote:The packer library seems to compress it just as well, why don't you use it?
Maybe my approach is not very clever, but compressing is not the only target -- the source files have sizes of 250 MB and more and the compressed results are even (a bit) smaller than zipped files.

Beside a kind of "backward compatibility" it is very important to access each record quickly, which is done by some additional hash tables.

Another idea I had at the beginning was to use binary data for the floating point values, but because of the use of doubles this would cost even more space.
User avatar
skywalk
Addict
Addict
Posts: 4211
Joined: Wed Dec 23, 2009 10:14 pm
Location: Boston, MA

Re: Speed optimizing (Strings, again :-)

Post by skywalk »

If it were me, I would read in these 250MB files into a relational database such as SQLite or a homegrown internal array of structures. It doesn't make sense to repeatedly process the string content beyond an initial load into a structured vehicle. Then your "queries" would be lightning.
The nice thing about standards is there are so many to choose from. ~ Andrew Tanenbaum
kinglestat
Enthusiast
Enthusiast
Posts: 746
Joined: Fri Jul 14, 2006 8:53 pm
Location: Malta
Contact:

Re: Speed optimizing (Strings, again :-)

Post by kinglestat »

if file is always structured the same, and you are looking at speed I would suggest RLE for spaces and token replacements; data inside token start and token end (which you could use same token since you will always have a pair) you can compress with your PB or any such for further compression - though I dont think its necessary

indexing each token using pb maps and lists (Inside the map) will give you a tree structure you can search and walk - I use this approach in a DB tool I did which copies a database from mysl to mssql, mainly I store the tables in a list and there is another linked list inside table reference, references related tables.

cheers
I may not help with your coding
Just ask about mental issues!

http://www.lulu.com/spotlight/kingwolf
http://www.sen3.net
Post Reply