ASM Tip and Tough Challenge

oldefoxx · Post by **oldefoxx** » Mon Apr 19, 2004 7:07 pm

Code updated for 5.20+

Here is a tip for anyone what wants to multiply a number by ten, and to do it fast.

Rather than multiply the number by 10, which is comparitively slow, you can do the same operation with a series of shifts and adds, like so:

Code: Select all

EnableASM        
somenbr.l=12345
          MOV   eax,somenbr
          ASL    eax,1
          MOV   ebx,eax
          ASL    eax,2
          ADD  eax,ebx
          MOV  somenbr,eax

This effectively adds somenbr*8+somenbr*2 ==> somenbr*10, so that if you did it correctly, DEBUG somenber should show 123450. This trick is commonly used when converting string values into numeric form, so that as you go through each digit character, you subtract 48 ("0" through "9" are coded as 48 through 57 in ascii), multiply the accumulated value by ten (using the technique above), then adding the new digit value to the total.

Now here is the challenge: Although multiply is slower than shifts and adds, it is not near as slow as division is. So if you wanted to reverse the process, of converting numeric form back into a string, you would want a way to extract each decimal digit, commonly by dividing by ten and using the remainder, and convert it back into an ascii character by adding 48 to it. And whereas you worked from left to right in processing the string value originally, you would build up the string in this process by adding the characters from right to left.

Code: Select all

        somenbr.l=12345
        s.s=""
        Repeat
           digit.b=somenbr % 10
           s=str(digit)+s
           somenbr=Int(somenbr/10) 
        Until somenbr=0
        Debug s

Now both the Mod operator (%) and the division step (/) involve divisions, and this means this effort is much slower than the first process. Yet in concept, the second method is nearly the exact reverse of the first one. In fact, if we go into ASM where we can take advantage of each instruction and register, we can attempt to consolidate the (%) and (/) outcomes together, reducing the number of divisions by half.

Unfortunately for us, the design of the x86 architecture destroys the original contents of the registers, so whether our division involved the ax, dx,ax pair, eax, or edx,eax pair, we are unable to repeat the process of division to obatain additional digits. Instead, we still have to use other registers, the stack, or some temporary memory location to track interum results. This is not efficient.

So the challenge is, to try to come up with a reversal of the original approach -- instead of using shift left and adds to effectively multiply by 10, devise a way to use right shifts and subtracts to effectively divide by 10.

I think you will find, as I have this far, that this is far from a trivial effort. It seems to enforce the concept that some processes cannot be reversed.
It's not that I am setting forth an impossible challenge - as far as I know, it should be achievable - but it is certainly not trivial in nature.

El_Choni · Post by **El_Choni** » Mon Apr 19, 2004 8:21 pm

What about this to multiply by ten?:

Code: Select all

lea eax, [eax*4+eax]
add eax, eax

Road Runner · Post by **Road Runner** » Mon Apr 19, 2004 8:49 pm

Suppose it is a 5 digit number..
Multiply by 1/10,000 (for 6 digits, use 1/100,000, 7 digits use 1/1,000,000 etc)
This leaves the first digit and a remainder.
Then Multiply the remainder repeatedly by 10 to obtain each of the following digits. This way there is one multiply per digit and no divides.

El_Choni · Post by **El_Choni** » Mon Apr 19, 2004 9:01 pm

Divide by ten (dividend in eax, quotient in edx):

Code: Select all

XOR edx, edx
DEC edx
DivLoop:
INC edx
SUB eax, 10
JNS l_divloop

Road Runner · Post by **Road Runner** » Mon Apr 19, 2004 9:32 pm

This does what is required by faster than the suggested method of trying to divide by shift/subtract.

Code: Select all

'convert a 5 digit number to ascii
'on entry esi points to location of answer
'eax contains the 5 digit number to convert

!mov ebx,10         ;useful constant
!mov ecx,429497     ;about to do div 10000 by reciprocal multiply
                    '429497 = 2^32/10000  rounded up

!mul ecx            ;div 5 digits by 10000, remainder is in eax

!add edx,&h30       ;top digit is left in edx, convert to ascii
!mov [esi],dl       ;store top digit

!mul ebx            ;multiply by 10 to get next digit into edx
!add edx,&h30       ;convert to ascii
!mov [esi+1],dl     ;and store

!mul ebx
!add edx,&h30       ;convert to ascii
!mov [esi+2],dl     ;and store

!mul ebx
!add edx,&h30       ;convert to ascii
!mov [esi+3],dl     ;and store

!mul ebx
!add edx,&h30       ;convert to ascii
!mov [esi+4],dx     ;and store, storing dx adds the zero termination for the string


'at this point, ESI points to the first character of a zero terminated string
'containing the final result

oldefoxx · Post by **oldefoxx** » Mon Apr 19, 2004 9:58 pm

The idea of using repeated subtractions or a multiplication using the recriprocal value of a power of 10 are interesting approaches, but subtraction takes many loops with large numbers, and multiplication assumes a fixed length digital number to convert, as you want some
fractional equivalent (1/100, 1/1000, 1/10000, etc.) The subtraction process then varies in time to perform, and the multiplication effort is unweldywith different ranges of numbers.

Not saying these are bad approaches, but they lack the elegance of the multiply-by-ten using shifts and adds. However, if you are interested in the variety of attacks, and time is not a consideration, you can also convert numbers from any base to any other base using logs. This would definately be a slow approach, though.

No, I'm looking for someone to come up with a simple and elegant method, though I know that it will in no way approach the multiply-by-ten simplicity.

Road Runner · Post by **Road Runner** » Mon Apr 19, 2004 11:29 pm

multiplication assumes a fixed length digital number to convert,

Assume the largest possible size of number and then throw away the leading zeroes, that'll give a routine to convert all sizes of number. 10 digit, signed numbers can be fully converted in around 100 CPU clks using this sort of method.

oldefoxx · Post by **oldefoxx** » Tue Apr 20, 2004 5:36 pm

Like the Multiply-by-10, the Divide-by-10 approach would benefit from
not having to perform multiply operations.

What I have discerned is that in order to accomplish this, rather than doing a divide using 10, you could achieve the same end if you could effect a divide by 1.25 and a divide by 8.

Now this seems a contridiction, since I am suggesting two divides in place of one. But the divide by 8 is merely shifting to the right 3 places. That reduces the problem to how to accomplish a divide by 1.25 by alternate means.

The reciprocal value of 1.25 is .8. This is not a convenient value to work with in binary. so the invert and multiply approach is not attractive at this point.

However, 1.25 can be expressed easily in powers of two, suggesting that it might be more easily resolved. So I am looking at making this happen, and maybe an approach for doing this will be the needed breaktrhough.

fweil · Post by **fweil** » Thu Apr 22, 2004 6:28 pm

Well, I do not answer exactly the post, but played a bit with PB and FASM instructions.

Have fun to change / embed it !

Code: Select all

;
; FWeil : 20040421
;
; The best way to divide any integer by 10 !
;
; They are many possibilities giving unexpected results.
;
; Just consider that to write y = x / 10 takes 5 to 10 seconds. Examining
; optimization cases took me a full 20 hours day of work. As it is.
; And probably then other PB coders will add some more days on it.
;
; By using the Console for output gives the most accurate results. Never test
; such tuning code using the debugger And Debug output.
;
; The code is written so that all embedding is 'outside' the process to
; evaluate. I tried To make it well so that you should only focus on the given
; request : how To divide fast an integer by 10 ...
;
; Notice that using Inline ASM does not apply well on floating operations, so
; that I prefered To use ! prefixed ASM lines.
;

;
; May you want To test up To 100 algorithms !
;
#MaxAlgo = 100
Dim tz.l( #MaxAlgo)
Dim Good.l( #MaxAlgo)

;
; Well ... from there just understand that I only expected to find a good template for testing possibilities and code results.
; As you will see, not all code parts give a good result in value, and you will not find good results in time.
; But it was a good self-training I made for dividing by 10 either using PB or FASM calls.
;
; For a good understanding of possible use of FASM in PB programs, do never forget that you manipulate the CPU and its registers
; which means that you work at the same level the compiler does. And the results you will obtain may exciting in a sample code
; but disastrous when expecting to use it in true application if you do not care enough about what registers do (and especially using
; the stack by pushing / poping values.
;
; I did not work enough for processing the whole possible knowledge required for doing better FASM samples, but I guess that
; just some small tuning from experts would give good values and better times.
;
OpenConsole()
  ;
  ; Let's use Count1 loops doing Count2 calculation of each algorythm. The x
  ; start value may vary Or not wether you decide To use an increment inside
  ; each Count1 loop. The total calculation you will have is Count1 * Count2.
  ; So may you adapt both paramters according To your CPU For seeing results at
  ; a human life scale time.
  ;
  ; The Algo variable is the algorithm index for a nice rendering of results
  ;
  Count1.l = 40 : Count2.l = 250000 : x.l = 123456789 : Increment.l = 17 ; I inserted a random x value below
  ;
  ; Here is the main loop inside which you will test each algorithm by
  ; processing Count2 loops inside.
  ;
  ;
  ; The variable x is there for testing most possible cases. So you can
  ; choose To adjust either iCount1 And iCount2 And x start value Or x
  ; increment. The cases studies should be made with varying values so that
  ; rounding issues may be detected (ie by incrementing x so that the decimal
  ; resulting part varies from 0.0 to 0.9. May you prefer to use Random($FFFFFFFF) ... as you like it !
  ;
  consoleLocate(1, 0)
  PrintN("Title         Algo#     Result      Time OK/False Pass#")

  For iCount1 = 1 To Count1 : x = Random($7FFFFFFF) : Check.l = x / 10 ; Check is the regular result you should find.

    Algo = 1 : Title.s = "PB   Integers : " : tz = GetTickCount_() : For i = 1 To Count2
      y = x / 10
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : k = 10 : Title.s = "PB   Integers : " : tz = GetTickCount_() : For i = 1 To Count2
      y = x / k
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : Title.s = "PB   Floats   : " : tz = GetTickCount_() : For i = 1 To Count2
      y = Round(x / 8 / 1.25, 0)
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : Title.s = "PB   Floats   : " : tz = GetTickCount_() : For i = 1 To Count2
      y = Round(x * 0.1, 0)
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : Title.s = "PB   Floats   : " : tz = GetTickCount_() : For i = 1 To Count2
      y = Round(x / 10, 0)
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : Title.s = "ASM  Integers : " : tz = GetTickCount_() : For i = 1 To Count2
      !  PUSH   dword [v_x]
      !  POP    dword [v_y]
      !  MOV   eax, dword[v_x]
      !  MOV   ebx, 10
      !  CDQ
      !  IDIV   ebx
      !  MOV   dword [v_y],eax
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *      ") : EndIf

    Algo + 1 : f.f = 1.25 : Title.s = "ASM  Floats   : " : tz = GetTickCount_() : For i = 1 To Count2
      !  MOV    ebx,dword [v_x]
      !  SAR    ebx,3
      !  MOV    dword [esp-4],ebx
      !  FILD   dword [esp-4]
      !  FDIV   dword [v_f]
      !  FISTP  dword [v_y]
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *      ") : EndIf

    Algo + 1 : f.f = 1.25 : Title.s = "ASM  Floats   : " : tz = GetTickCount_() : For i = 1 To Count2
      !  PUSH   dword [v_x]
      !  POP    dword [v_y]
      !  MOV    ebx,dword [v_x]
      !  SAR    ebx,3
      !  MOV    dword [esp-4],ebx
      !  FILD   dword [esp-4]
      !  FDIV   dword [v_f]
      !  FISTP  dword [v_y]
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : f1.f = 0.1 : f2.f : Title.s = "ASM  Floats   : " :  tz = GetTickCount_() : For i = 1 To Count2
      !  MOV    ebx,dword [v_x]
      !  MOV    dword [esp-4],ebx
      !  FILD   dword [esp-4]
      !  FMUL   dword [v_f1]
      !  FISTP  dword [v_y]
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    Algo + 1 : f1.f = 10 : f2.f : Title.s = "ASM  Floats   : " :  tz = GetTickCount_() : For i = 1 To Count2
      !  MOV    ebx,dword [v_x]
      !  MOV    dword [esp-4],ebx
      !  FILD   dword [esp-4]
      !  FDIV   dword [v_f1]
      !  FISTP  dword [v_y]
    Next : tz(Algo) = tz(Algo) + GetTickCount_() - tz : ConsoleLocate(1, Algo) : Print(Title + RSet(Str(Algo), 2, " ") + "  " + RSet(Str(y), 10, " ") + "   "  + "  " + RSet(Str(tz(Algo)), 5, " ")) : If y = Check : Good(Algo) + 1 : Print("    OK  ") : Else : Print("  Failed") : EndIf : Print(RSet(Str(Good(Algo)), 4, " ") + "/" + RSet(Str(iCount1), 4, " ")) : If Good(Algo) = iCount1 : PrintN("") : Else : PrintN(" *     ") : EndIf

    ConsoleLocate(1, Algo + 2)
    PrintN("Pass " + Str(iCount1) + " / " + Str(Count1))
    sDone.s = Str(Count2 * iCount1 ) : Done.s = "" : j = 0 : For i = Len(sDone) To 1 Step -1 : Done = Mid(sDone, i, 1) + Done : j + 1 : If j % 3 = 0 And i <> 1 : Done = "." + Done : EndIf : Next
    sToDo.s = Str(Count1 * Count2) : ToDo.s = "" : j = 0 : For i = Len(sToDo) To 1 Step - 1 : ToDo = Mid(sToDo, i, 1) + ToDo : j + 1 : If j % 3 = 0 And i <> 1 : ToDo = "." + ToDo : EndIf : Next
    ConsoleLocate(1, Algo + 3)
    PrintN("Done " + Done + " / " + ToDo + " operations for each algorithm")
    If Inkey() <> "" : Break : EndIf
  Next iCount1
Repeat : Until Inkey() <> "" : CloseConsole()

fsw · Post by **fsw** » Thu Apr 22, 2004 6:36 pm

Hey Francois, haven't seen you for a while.
Hope all goes well.

Franco

fweil · Post by **fweil** » Thu Apr 22, 2004 6:38 pm

Nice to be back there ... hope it will last. I will increase my post count much in the next days ...

fsw · Post by **fsw** » Thu Apr 22, 2004 6:43 pm

oldefoxx · Post by **oldefoxx** » Thu Apr 22, 2004 8:16 pm

The multiply by 2^32/10000 constant is an interesting approach. I would never have thought of it. Each succeeding digit can be obtained through the process of multiplying by 10. And potentially, you could perform the multiply-by-10 operations with a series of three shifts and an add.

However, making this happen turns out to be pretty messy. The main problem is that binary cannot represent exact equivalents of fractions based on powers of 10. 1/10th, as an example, becomes 1/8 - 1/32 + 1/64 - 1/128 - 1/512 + ...

And since you cannot aimultaneously add and subtract bits, you have to find another approximation that only consists of bits being added, such as 1/16 + 1/32 + 1/256 + 1/1024 + 1/2048 + 1/4096 + 1/8192 + 1/16384 + ...

This is a converging sequence, and while more and more bits means getting ever closer to the value of 1/10, you can never get exactly there.
That is why most financial calculations are done in terms of pennies for accuracy, and only reducted to dollars (N/100) in the final stages to minimize the error that will enevitably occur. This is because in a binary system, 100 can be expressed exactly, but 1/100 cannot.

And the problem is, it still takes a tremendous number of bits to even get close when trying to handle decimal fractions in a binary environment. Far beyond the number available to us in a 32 bit computer. Even trying to take advantage of shift-rotate instructions which allow us to carry one bit into the shift-rotate of another register or memory address, does not give us a very effective means of addressing this problem. We would need ASM instructions that would allow us to designate multiple addresses or registers for a common shift effort. The closest thing we have for doing that is the multiply and divide instructions, which simultaneously effect the DX and AX registers. And we already know that they are slow, because they are not optimized for shifting operations.

The real answer would be to have a DIV10 ASM instruction that performs this complicated and very specific task in hardware. A MUL10 instruction might also be useful, but the big need would be for a DIV10 -- unless you are satisfied with the speed that the clumsy software solutions that are required.

I once worked on a computer that had a DIV10 instruction in it, which I used, but until now I never appreciated its importance. I hadn't fully understood the difficulty of doing a divide by 10 using integer operations without resorting to the DIV operator. You might think that a DIV 10 and a DIV10 operation would be the same, but you can optimize a hardware approach so that it is just as fast as a normal shift operation.

But if you are going to talk about optimizing hardware, you might want to think about allowing operations over variable length structures, rather than fixed length registers. I understood that the old IBM 360 architecture was designed so that you could have word lengths of any size, making them well suited to handling numbers of great size while still retaining perfect precision. If my memory has not failed me, they also used decimal arithmetic in their hardware calculations, making them even better suited to finance.

Anyway, the idea of multiplying by a constant, bound by 2^32/10000, and then performing multiply by 10 to collect the digits from left to right, worked for numbers of five digits or so. But the loss of precision in representing 2^32/10000000 showed that handling numbers with a large number of digits was going to be much harder using that technique.

Whereas the process of using a N % 10 and Int(N/10) process should cause no such problems

Post by **Berikco** » Thu Apr 22, 2004 9:18 pm

fweil wrote:Nice to be back there ... hope it will last. I will increase my post count much in the next days ...

Hello Fweil

Nice to have you back!

Road Runner · Post by **Road Runner** » Thu Apr 22, 2004 11:47 pm

oldefoxx,
I thought up the multiply by 2^32/10000 on my own and I thought it was a really neat trick.. but then, as usual with neat tricks, I found it was already common knowledge.
Get a copy of the Athlon optimization manual from AMD and read the section on Integer Optimizations. AMD even provide a program (if you get the CD) which calculates the required multiplier so you can do almost any division by a constant.

There is no problem with the numbers not being exactly represented because, if the right constant is used in the multiply, the error is always less than one bit. On a modern, fast CPU you can test all possible outcomes in a few minutes to verify it for yourself.

PureBasic Forums - English

ASM Tip and Tough Challenge

ASM Tip and Tough Challenge

Interesting Approaches, but comparable?

Getting Away from Multiply Instruction

Still Working on it