Very large numbers...

Everything else that doesn't fall into one of the other PB categories.
BackupUser
PureBasic Guru
PureBasic Guru
Posts: 16777133
Joined: Tue Apr 22, 2003 7:42 pm

Post by BackupUser »

Restored from previous forum. Originally posted by Hi-Toro.

Help! I'm trying to parse a folder to read all the file sizes, but I'm running into a problem: I can't represent the huge byte counts using either longs or floats. I think I need to use 64-bit numbers (or even just strings!). I can create a structure to hold a 64-bit number, eg:

Code: Select all

Structure Bit64
    high.l
    low.l
EndStructure
... but I don't know how to place a large value into that. An alternative that builds up a string would also be fine, but again I don't know how to do this :/

Here's an example of a number I want to display (I've tried with a float variable and StrF too BTW):

Code: Select all

bignumber = 4090514635
MessageRequester ("", Str (bignumber), 0)
Any help would be much appreciated... :)


--
See ya,
James L Boyd.
http://www.hi-toro.com/
--
BackupUser
PureBasic Guru
PureBasic Guru
Posts: 16777133
Joined: Tue Apr 22, 2003 7:42 pm

Post by BackupUser »

Restored from previous forum. Originally posted by Pupil.

Here's some procedures i made for a project i was working on, maybe there are better ways, but it worked satisfactory for me:

Code: Select all

; Structure for doublewords
Structure DoubleType
 low.l
 high.l
EndStructure

; Structure for BCD numbers
Structure BCDType
 BCD.b[12] ; only 10 is needed really
EndStructure

Structure CharType
 StructureUnion
  char.b
  c.b[0]
 EndStructureUnion
EndStructure

Dim BytePrefix.s(10)

Procedure.s StrDouble(*ptr.DoubleType, flag.w)
 tmp.BCDType
 a.DoubleType

 If flag ; output result like this 5.92GB
  a\low = *ptr\low
  a\high = *ptr\high
  count = 0
  Repeat
   If a\high > 0 Or a\high  $100000 Or a\low > 10
     count+1
     quit = #TRUE
    EndIf
   EndIf
  Until quit
  address.l = @a
  MOV ebx, address
  !FILD qword [ebx]
  address = @tmp
  MOV ebx, address
  !FBSTP [ebx]
 Else
  address.l = @tmp
  MOV ebx, *ptr
  !FILD qword [ebx]
  MOV ebx, address
  !FBSTP [ebx]
 EndIf
 result.s = ""
 For i = 0 To 9
  result = Chr(tmp\BCD[i]>>4&$f + '0') + Chr(tmp\BCD[i]&$f + '0') + result
 Next
 While ((Asc(result)  '9')) And Len(result) > 0
  result = Right(result, Len(result)-1)
 Wend
 
 If result = ""
  result = "0"
 EndIf
 
 If flag
  result + "," + Right(StrF(rest/1024,2), 2) + " " + BytePrefix(count)
 EndIf
 ProcedureReturn result
EndProcedure


Procedure ValDouble(*dest.DoubleType, src.s)
 tmp.BCDType
 address.l = @tmp
 *ptr.CharType = @src
 i = Len(src)-1
 count = 0
 For count = 0 To 8
  If i >= 0
   If *ptr\c[i] >= '0' And *ptr\c[i] = 0
   If *ptr\c[i] >= '0' And *ptr\c[i] <= '9'
    tmp\BCD[count] | ((*ptr\c[i] - '0')<<4)
   EndIf
   i-1
  EndIf
 Next
 MOV ebx, address
 !FBLD [ebx]
 MOV ebx, *dest
 !FISTP qword [ebx]
EndProcedure


Restore Label_BytePrefix
For i = 0 To 6
 Read BytePrefix(i)
Next

ValDouble(@dl.DoubleType, "4090514635")

Debug StrDouble(@dl, 1)

End

DataSection

Label_BytePrefix:
Data.s "Bytes", "kB", "MB", "GB", "TB", "PB", "EB"
EndDataSection
BackupUser
PureBasic Guru
PureBasic Guru
Posts: 16777133
Joined: Tue Apr 22, 2003 7:42 pm

Post by BackupUser »

Restored from previous forum. Originally posted by Hi-Toro.

Ah, now that looks perfect for my needs! Thanks a lot, Pupil! :)


--
See ya,
James L Boyd.
http://www.hi-toro.com/
--
oldefoxx
Enthusiast
Enthusiast
Posts: 532
Joined: Fri Jul 25, 2003 11:24 pm

64 (and more) bit long adds

Post by oldefoxx »

each register in a x86 is either a half register (8 bits or one byte) long, a full register of 16 bits, or an extended register of 32 bits (which happens to correspond to a long in PureBASIC). If you consider the AX register (15 bits), you can refer to the upper 8 bits (AU), the lower 8 bits (AL), the whole register (AX, remember?), or the extended register (EAX), There are three other registers that you can treat in essentially the same way: BX, CX, and DX. In addition, you have two index registers (SI and DI, meaning Source Index and Destination Index) two or more Segment addresses (DS, ES, and possibly GS and GS), and some other more specialized registers that you don't normally mess with. Except for the SP m(stack pointer) and the EP (Extra Pointer), which together, are often used to preserve your existing stack while letting you extend it temporarily for other purposes.


Is all that clear? Of course not. But the thing is, the computer registers and instructions set support two types of numerical calculations: Signed, similar to a Long, and unsigned, consistent with a double word. Now you can get the effect of 84-bit adds and subtracts if you take two registers and treat them as two halves of a larger, 64-bit register. For this example, lets say you treat the register pair EBX and EAX in that manner, with EBX being the higher 32-bits. And lets say you add some long to that pair - but wait! perhaps the long is so long that the value wrapped around to become negative by setting the sign bit (upper most bit in a long)! Is that a problem? Not really, because we are going to perform unsigned arithmetic when we add the values together.

It basically breaks down this way:

up64.l=0
lo64.l-0
Repeat
;get the next file size here and put it into file32.l
file32,l=LOF(filex)
CLOSE filex
MOV ebx, up84 ;load the hi 32 bits
MOV eax,lo64 ;load the low 32 bits
ADD eax,file32 ;add normally the length of the current file
ADC ebx,0 ;add with carry a zero to the upper bits
MOV up64,ebx ;now save the results
MOV lo64,eax ;for both upper and lower bits
UNTIL ... ;until you have no more files.

The reason you cannot do the following instead is that each instruction takes two references, in one case two memory locations for one instruction, and in the second case, a constant (0) and a memory address. This is not possible in Assembler, as whenever you have two
references for one instruction, either one or both references must refer to a register. For instance, you could do a MOV au,al which would involve two registers (or half registers in this case), or you can do an ADD ax,bx, to get the effect of ax=ax+bx.

ADD lo64,len32 ;two memory addresses, not possible
ADC up64,0 ;a memory address and constant, wrong

Note that if you find 64-bit calculations to be too short, you can create three-and four-register sets. For adds, you just have to remember to use the ADD for the lowest register first, then ADC for each register above that. Subtract works in a simular fashion. You start with SUB for the low register, then SBB for each one above that in turn. The ADC does a 1-bit overflow up to the next higher register, and SBB does a 1-bit borrow from the next register higher. Adding or subtracting 0 is just a way to get the carry or borrow bit reconciled with the next higher register so that you get an extended result.

Assembler makes this easy. It is a struggle when trying to do it with Longs and normal (signed) arithmetic, which is what the compiler does.
has-been wanna-be (You may not agree with what I say, but it will make you think).
oldefoxx
Enthusiast
Enthusiast
Posts: 532
Joined: Fri Jul 25, 2003 11:24 pm

Post by oldefoxx »

Sorry, asfully tired tonight. Some typos I didn't catch. 15-bit reference in one place instead of 16, 84-bits stated (should have been 64 bits, typo),and meant the len32.1 = LOF(filex) instead of file32=LOF(filex) for consistency with ADD eax,len32 instruction. All assembler instructions should have started with a ! (exclamation mark) on the line to ensure that they are recognized as ASM codes for faster action.
has-been wanna-be (You may not agree with what I say, but it will make you think).
PB
PureBasic Expert
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

> I'm trying to parse a folder to read all the file sizes, but I'm running into
> a problem: I can't represent the huge byte counts using either longs or
> floats.

See this thread:

viewtopic.php?t=2267
D'Oldefoxx
User
User
Posts: 30
Joined: Wed Aug 06, 2003 5:52 pm
Location: Florida, USA

Big Integer Arithmetic

Post by D'Oldefoxx »

Floats sometimes get a bad rep. Actually, they scale very well, and should be well capable of representing any file size you encounter. The problem is, they cannot do this accurately, down to the very last digit. And if a result from a function is intended to be a Long, then simply converting it at that point to a Float isn't going ro fix the problem -- the significant part of the number has already been lost. Now there are several things that can be done to try to avoid this, and one of them is to use af ASM language at the junction where the procedure is called, so that the returned value can be dealt with in its component parts - assuming that the routine has all those component parts present in its registers.

If the language you are using does not return all the component parts, you may have to move a bit upstream, possibly using a WindoewAPI call in place of the built-in language support. For DOS compatability, Windows still supports the ability to call Interrupts, which likewise returns values in its registers. So ultimately, let's just say that you are going to end up with all the component parts of some very large number in two or more registers that somehow have to be combined into one large number.

At this point you could easily decide that a Float is not the way to go. True, it should scale sufficiently for the job, but you probably do not want to lose the accuracy of your byte count. As structure involving two Longs could also work, but the math gets a bit trickyy, what with carries and everything. You want a painless way to do this, and you really find you are not happy with the fact that the language isn't giving you much help at this point.

One solution has been the adoption of string arithmetic. That is, any number can be represented in string form, such as "1234567890123", and made virtually any size. Depending upon the length of an individual string, the number of digits allowed could be hundreds, thousands, or even more digits in length. The limit in PureBASIC is a maximum of 250 characters in a string. Strings can also be used to represent a decimal part by allowing a decimal point {",") somewhere in the string. but often this is not supported, as it makes the resulting computations somewhat more difficult to perform.

Now imagine if you had two numbers in string form to add together, such as "1234567890123456" plus "45678901234", How would you perform
this? Going back to the way you learned to add in school, you would
approach it somthing like this:

Code: Select all

                   "1234567890123456"
                  +     "45678901234"
In other words, you would approach it with the two values aligned from the right., And then you add, digit-by-digit, from right to left, remembering if you have a carry for the next position or not/ Now when adding two numbers, digit by digit, the worse case senerio is adding teo "9"'s for the first add. The result is "18", which is "8" plus a carry of "1". On the second and subsequent adds, the worse case is adding two "9".s with a carry of "1". This combination would give you a "19", which would be a "9" with a carry of "1". In other words, if there is a carry, it will always be a "1", nothing more, unless you are adding a column of numbers together.

To make the job simpler. it would help if the length of both strings was the same. You can accomplish this easily by adding a string of leading "0"'s to the shorter string like so"

Code: Select all

   Numb1$=    "1234567890123456"
   Numb2$=    "0000045678901234"
 
And you could use a FOR - NEXT loop to range through the whole sequence

Code: Select all

carry=0
ans$=""
For a=Len(numb1$) To 1Step -1
     digit1=Val(Mid(numb1$,a,1))
     digit2=Val(Mid(numb2$,a,1))
     digit3=digit1+digit2+carry
     ans$=Str(digit3 Mod 10)+ans$
     carry=Int(digit3/10)
Next
If carry
  Ans$="1"+ans$
EndIf
But wait! Can;t we use the fact that the computer can process multiple digits at the same time as a way to make this faster? Of course we can!
For instance, suppose we want to process the results in terms of six digits at a time. We just have to make sure that the length of both digit strings is a multiple of six

Code: Select all

a=Len(numb1$)
b=Len(numb2$)
If a<b 
  a=b                                   ;pick the longer string to start with
EndIf
b=(a*6+5)/6                        ;force a roundup if over a multiple of 6
If b>Len(numb1$)
   numb1$=String(b-Len(numb1$),"0")+numb1$
EndIf
If b>Len(numb2$)
    numb2$=String(b-Len(numb2$),"0")+numb2$
EndIf
carry=0
ans$=""
For a=b-5 to 1 Step -6
   digits1=Val(Mid(numb1$,a,6)
   digits2=Val(Mid(numb2$,a,6)
   digits3=digits1+digits2+carry
   tmpans$=Str(digits3)
   ans$=right(tmpans$,6)+ans$
   carry=Len(tmpans$)=7
Next
If carry
  ans$="1"+ans$
EndIf
Now back to the heading of this message: You can find a number of libraries on the internet that support string arithmetic operations. These generally go by the name of BIG INT or something similar. Some claim to support string calculations of up to 2 GBytes, and include not only additiion and subtrtaction, but multiply, divide, logs, trigs, and other processes. Obviously the more advanced libraries are handling decimal points and extended decimal places just fine. But the bad thing about string arithmetic is that it is not very efficient. Each character, encoded into 8 bits, has only ten states reflected: The ASCII codes for digits "0" through "9". If you subdivided each byte into 4 bits (sometimes called a nyble),
you would still have 16 possible states, of which the first ten could represent the values 0 through 9 (from 0000 to 1011). This is called
Binary Coded DEcimal, or BCD for short. That means a string of 250 characters could represent as many as 500 digits!. Of course there would
have to be some added conversion time, and the ANS$ string could not just simply be printed to the console as-is, but it would mean more efficient packing of the values.

But if you are going that far, you could also consider that each byte on its own could represent 256 values, from 0 to 255. That compares to just the 100 possible states those same bits would represent in BCD, or just ten states in ASCII form. In bimary form then, 250 characters would represent 250^256 states (you work it out, I don't have anything on hand that will handle numbers that large)..

So either the code above should help you get started with some version of string arithmetic, or you can now go looking for a library on the internet that has all the routines you could ever want or use. :P
Has-been Wanna-be (You may not like what I say, but it will make you think)
Post Reply