Restored from previous forum. Originally posted by MrVainSCL.
Ok, i am back with a big article about float handling and how to get correct results! I would give you an url for this tutorial, but please notice that i dont have this tutorial from any website! A good friend (scene-coder) of me gave me this article (i refound it on one of my older CD´s) !!! Hope the following stuff will help some of you and maybe we will have (automatic) correct working floats in one of the next pb versions!?

signed off
MrVainSCL aka Thorsten
Original article by Ilya Palopezhentsev (iliks)
As all of you know, a computer has two common types of numbers: integer and real. Obviously, they have different representations within a computer. At the first glance this doesn't matter much for a programmer but that is a mistake. These two types of numbers have one huge difference: integer operations are absolutely precise while real ones are not. There are two possible representations of real numbers - with fixed or with floating point. In this article I will cover only floating point numbers. They are more appropriate to the real life calculations than numbers with fixed point whose only advantage is the speed of handling.
I will try to show you eight facts which are often forgotten or neglected but which are very important if you want to get predictable and precise results. Of course precise results are a must for scientific and engineer calculations, and therefore they are interesting for me, but who knows - games and demos nowadays have a great deal of high math in them so this is applicable for them too.
Just one note: throughout the article I use the term "meaningful signs" / "significant digits". They are the same and tell us how many numbers stand after the leftmost zeroes:
0.0010 has two significant digits (1 and 0)
0.001 has one significant digit
345 has three significant digits (there are no zeroes before)
Notice that 0.0010 isn't equal to 0.001 because in the first case we strictly know that digit after 1 is 0, in the second case we don't know anything about it - and probably 0.001 was got from rounding of the number 0.0008? or 0.0005? Who knows?
So, let us look at some common bottlenecks of floating point numbers which should be taken into account:
I. Inaccuracy of floating point numbers
For a start something simple but not very obvious: Not every real number can be represented in a computer exactly. All numbers are transferred with a precision loss. This could give interesting results.
For example, the number 0.7 couldn't be represented exactly. If we write the following program:
Code: Select all
a = 0.7
print *,a
it will show us 0.6999999.
II. Unevenness of floating point numbers
Floating point numbers aren't equally spaced. They don't have an equal density at the whole domain of definition.
For example, let's consider two pairs of numbers:
First pair: 0 and 1
Second pair: 1000 and 1001
This fact is quite unbelievable but true: in a computer there are more floating point numbers between 0 and 1 than between 1000 and 1001 despite they have equal distance within a pair. So, the more the magnitude is, the more the precision loss is.
While around 0 you could easily represent numbers with a lot of digits after a comma, you can't do this around big numbers.
Why? The next fact will tell us the answer
III. 7/15 meaningful signs
32bit floating point numbers ("float" in C, "single" in Pascal, "real" in Fortran) hold only 7 meaningful digits. If you print a float variable in some programming language you (probably) will get 10 signs on screen. The two last digits are incorrect, you should drop them out.
(64bit floating point ("double" in C, "double" in Pascal, "double precision" in Fortran) holds about 15 significant digits.)
Floating point numbers are represented in exponential form:
1.00000000E+00
This is 1. You see, if the magnitude is small, the precision could be up to 7 digits (i.e. we could fill these zeroes with some numbers and they will stay).
But for numbers with a high magnitude, for example
1000000.0E+00
things are different.
You see, we have only one position after a comma. (Because the number has only 7 meaningful positions and 6 are taken by integer part.) So there are only about 10 numbers between 1000000 and 1000001:
1000000.0
1000000.1
1000000.2
1000000.3
...
1000000.9
You see, the less integer part is the more points after a comma we can represent in a computer.
IV. Comparison of floating point numbers
Floating point numbers shouldn't be compared with usual programming language's construction "==". Can you imagine?
The correct form for comparison on equality:
Code: Select all
instead of
if (x==y)
you should write
if (abs(x-y)
V. Precision loss
Each operation on a floating point number "eats" a certain degree of precision from this number. In quite a raw form it could be expressed this way: each operation on floating point number eats one correct digit from the result.
For example, you've entered a number with 7 correct digits. After a single operation with it the result will have only 6 correct digits and so on.
This isn't always so cruel but...
The moral is the following:
MINIMIZE THE NUMBER OF CALCULATIONS WITH YOUR DATA!
From this point 6 goes:
VI. Iterative calculations
Quite a widespread error. I confess, I made it in the past too.
Imagine we have two boundaries, a and b and step h. We must iterate from a to b with this step.
How does an unexperienced programmer do this? The answer:
Code: Select all
x = a
while (x
How to do it the right way? The answer:
Code: Select all
for i:=0 to N do
begin
x = a + i*h;
//calculate something from x
end
Oh yeah, assembler freaks will cry "we get multiplication on each iteration instead of a single addition!!!" Yes, this is slower but this is more correct.
You see, things that are fast are not always precise.
Want a concrete example?
Consider an integer
Code: Select all
pi/2
/ |pi/2
\ sin(x)dx = -cos(x)| = 0 - (-1) = 1
/ |0
0
It looks like:
Code: Select all
b
/
\ f(x)dx = h*( f(a)/2 + f(b)/2 + f + f + ... + f )
/ 1 2 n-1
a
Code: Select all
x - x = h
j i
I will just give you Fortran procedures from my current university task on "Calculations programming" subject:
1) "Stupid" implementation, each iteration we add h to d, it holds the current x.
Code: Select all
C a and b are limits of integration, N is the number of subranges
C we divide this interval into
REAL function tr_int(a,b,N)
h = (b-a)/N
c = (f(a)+f(b))/2.0
d = a+h
DO 1 i=1,N-1
c = c+f(d)
d = d+h
1 CONTINUE
tr_int = c*h
end
Code: Select all
REAL function tr_int(a,b,N)
h = (b-a)/N
c = (f(a)+f(b))/2.0
DO 1 i=1,N-1
c = c+f(a+i*h)
1 CONTINUE
tr_int = c*h
end
As I said earlier, let's choose a=0, b=pi/2 and consider the routines' output on various N's. I didn't write down concrete integer values, I printed the absolute error between correct integer value, 1, and the value calculated by procedure. Of course the less this error is the more correct our procedure is.
So, the comparison:
Code: Select all
___________________________________________
| N | "stupid" | "correct" |
------------------------------------------|
| 10 | 2.05703103E-03 | 2.05703103E-03 |
| 100 | 2.02221490E-05 | 2.05217548E-05 |
| 1000 | 1.25641850E-06 | 1.04455262E-06 |
| 10000 | 8.21720641E-06 | 1.29347461E-06 |
-------------------------------------------
You see, when N gets high the absolute error in "stupid" routine gets higher than in the "correct" one. The result will be more clear on higher N's.
VII. Make an action Towards the subtraction
Subtraction is the most imprecise operation from all algebraical operations in a computer. You should avoid it whenever it's possible, especially beware of subtraction of very close numbers because due to the round-off error they could turn to 0! (And if this number was in denominator you would get a "division on zero" error.)
I'm retyping an example from book "Numerical Methods" by E.Volkov:
"We should calculate sqrt(543)-sqrt(540) where 543 and 540 are precise numbers. (i.e. all of their digits are significant) We have
Code: Select all
sqrt(543) = 23.30....
sqrt(540) = 23.23....
sqrt(543) - sqrt(540) ~ 23.3 - 23.2 = 0.1
We have only one significant digit in the result.
Let's escape from the subtraction:
sqrt(543) - sqrt(540) = 3 / (sqrt(543)+sqrt(540))
(This was done by multiplication and division by sqrt(543)-sqrt(540).)
This is equal to 3/(23.3+23.2) = 3/46.5 = 0.0645, we still have 3 significant digits in the result."
VIII. Don't mix up both low and high, Big eats small without a sigh
The last tip: avoid operations where one operand is a small number and other is big. Small operand will be absorbed by big one. Explanation lies in the same 7 significant digits of mantissa in floating point numbers:
Code: Select all
100000.00E+00
+
1E-09
--------------
Code: Select all
100000.00
+
0,000000001
-------------
100000.000000001
IX. Conclusion
I have shown eight important facts about floating point arithmetic to you. I hope it will make your programs more clear and results more precise. Now you see - you can't deal with such numbers the same way you do with integers.
Mail me your comments and questions, please, if you have something to say. It's very hard to get motivated to write something without any feedback. I tried to show you everything in entertaining and easy way, so why not to say to me if I've succeeded or not?
This should be the authors mail: [url]mailto:iliks@pochtamt.ru[/url] or [url]mailto:iliks@mail.ru[/url]
PIII450, 256MB Ram, 80GB HD + 6,4 GB, RivaTNT, DirectX8.1, SB AWE64, Win2000 + all Updates...
greetz
MrVainSCL! aka Thorsten