It is currently Mon Sep 23, 2019 1:44 am

All times are UTC + 1 hour




Post new topic Reply to topic  [ 15 posts ] 
Author Message
 Post subject: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Sun Dec 16, 2018 9:16 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Thu Apr 30, 2009 5:23 pm
Posts: 302
Location: Côtes d'Azur, France
How do you optimise a x86 code to x64 ?

I converted a x86 to x64 code but it's actually 2 times slowler in x64.
I was expecting the x64 code to be as fast as x86 with these simple macros.

Code:
CompilerIf #PB_Compiler_Processor=#PB_Processor_x64
    Macro eax : rax : EndMacro
    Macro ebx : rbx : EndMacro
    Macro ecx : rcx : EndMacro
    Macro edx : rdx : EndMacro
    Macro esi : rsi : EndMacro
    Macro edi : rdi : EndMacro
    Macro ebp : rbp : EndMacro
    Macro esp : rsp : EndMacro
    Macro dword : qword :EndMacro
    Macro octets : 8 :EndMacro
    Debug "Heapx64"
CompilerElse
    Macro octets : 4 :EndMacro
    Debug "x86"
CompilerEndIf

_________________
There are 2 methods to program bugless.
But only the third works fine.

Win10, Pb x64 5.70 LTS


Last edited by Fig on Mon Dec 17, 2018 6:16 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 1:34 am 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
Hello fig,

could you show an little example to test this gap?


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 10:49 am 
Offline
Administrator
Administrator

Joined: Fri May 17, 2002 4:39 pm
Posts: 13614
Location: France
Without code, we can't tell but x64 has way more registers available, so you should use them instead of stack


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 5:39 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Thu Apr 30, 2009 5:23 pm
Posts: 302
Location: Côtes d'Azur, France
I didn't want to bother you with code, sorry.

1/ I would like to understand why I am wrong thinking x86 is as fast as x64 if you just replace x86 register by x64.
My logic was:
Fact (?) :
My cpu's memory bus is 64 bits large. My Cpu as well and also my Os.
Hypothesis :
If I ask for a 32bit dword she (yes my computer is a girl) has to move 64 bits anyway.

So if i replace 32bits register by 64 ones, it should be as fast without any further optimisation.
My result shows me wrong ...

2/ I was looking for general rules or strategy to optimise x86 in x64.
As suggested, I can use more register to prevent access to memory in loop.
I guess I can also keep a 32 bits data format, transfer 2 in one qword and compute double data (if my code fits this logic)
Anything else ?


3/ Some code... But It's probably pretty indigest and clumsy as it is. Nobody should force himeself into this.
Warning, it took arround 80 sec on my computer to get a result to my speed test (there are two of them), so be patient if you try this.
It's a hash table home made for my specific need. It uses chained clusters of 15 lines (I record 7 values per line and use X and Y coordonates as key)
I compare its speed to Pb's in my use (ie: 8 adds, one search and find result, one search and don't find result)
I compare also x64 vs x86 speed.
My point is not to compare Pb general Map to a specific Hashtable, of course.
Quote:
; Add 8 000 000 elements and SEARCH/FIND 500 000 elements et Search and don't find 500 000 elements
; ___x86______ x64
; 26224ms___40305ms :My Hash add elements
; 80436ms___87073ms ;Pb Hash add elements
;___144ms_____585ms ;My Hash Searching for elements
;___589ms_____890ms ;Pb Hash Searching for elements


[

_________________
There are 2 methods to program bugless.
But only the third works fine.

Win10, Pb x64 5.70 LTS


Last edited by Fig on Tue Jan 15, 2019 9:27 pm, edited 4 times in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 5:59 pm 
Offline
PureBasic Expert
PureBasic Expert

Joined: Sun Aug 08, 2004 5:21 am
Posts: 3443
Location: Netherlands
Fig wrote:
I was looking for general rules or strategy to optimize x86 in x64.

I think it's hard to give a general rule.
If you look at this document
https://www.agner.org/optimize/instruction_tables.pdf
you will see that on most processors a 64 bit division for example is much slower compared to doing a 32 bit division.
So it's not wise to simply replace every dword with a qword when you convert your code.
Memory alignment can also make a difference when it comes to performance.

_________________
macOS 10.14 Mojave, PB 5.62 x64


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 6:11 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Thu Apr 30, 2009 5:23 pm
Posts: 302
Location: Côtes d'Azur, France
I will get rid of Div by shifting and substracting to obtain the rest... Thank you for pointing this.
Concerning memory alignment, it shouldn't be a problem as all my data are 64 bits when I'am in x64 mode, i don't mix different sizes.

Nethertheless, what's wrong with my logic ? I have a doubt... Do computers with x64 cpu have 64bits memory bus width ? (not inside the cpu, i mean bus between CPU/Ram) :?:

_________________
There are 2 methods to program bugless.
But only the third works fine.

Win10, Pb x64 5.70 LTS


Last edited by Fig on Mon Dec 17, 2018 7:10 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 6:34 pm 
Offline
Addict
Addict
User avatar

Joined: Sat Feb 19, 2005 2:46 pm
Posts: 1792
Location: Pas-de-Calais, France
Maybe you could try some registers only tests, then register to memory, and so on. For what I know, it's never as simple and logic as it seems. For example, having a 48 cores CPU versus 2 cores, when program is not made for this. Or program logic not adapted to specific cpu/cache sizes. Somehow, if a JIT compiler can adjust its optimization scheme to a specific platform, a fixed compiled program can also have specific paths.


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 9:46 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
Not easy...

First, you begin your procedures by a transit like :
Code:
Procedure ProcName(*Ptr)

MyHighLevelVar = *Ptr + x ; transit
...

EnableASM
...
Reserve RBP to base procedure argument pointor.
Code:
Procedure ProcName(*Ptr)

; no transit

EnableASM
mov RBP, [p.p_ptr]

; no ASM transit either : use directly in operations statements
; i.e. :
add Rbx, [RBP+8]
; etc...
No there is no 64 bits addressing between CPU and RAM, 64 bits is the data RAM way size. But you have no access to this. For each threaded code, you have a 64 bits memory partial access : internal hardware converts your virtual addresses to physical addresses.


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 10:01 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
Code:
INC R14

+
Code:
Mov rax,r14
Mul Rbp

=
Code:
Add rax,rbp

no mul.


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 10:13 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
Code:
MOV edi,edx ;sauvegarde l'adresse du cluster précédent
MOV ebp,dword [edx+octets] ;charge l'adresse du cluster suivant
CMP ebp,0
Jcc Label

Use pipeline
Code:
Xor ebp,ebp
Mov edi,edx
Add ebp,dword
Jcc Label

Same on X64


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 10:40 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
Code:
ProcedureDLL SuperiorPower2(number.i)
EnableASM
MOV ebx,[p.v_number]
BSR ecx,ebx
MOV eax,1
SAL eax,cl
CMP eax,ebx
JE _suite
SAL eax,1
!_suite:
ProcedureReturn
DisableASM
EndProcedure

I do not know how is [p.v_number] behavior after your initial Exx -> Rxx macro.

For the rest, it seems that a useless high 32 bits running occurs but I did not find where... No computer...


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Mon Dec 17, 2018 10:48 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
fig wrote:
I will get rid of Div by shifting and substracting to obtain the rest... Thank you for pointing this
Just a AND mask to get modulo.
Code:
! AND RAX, 1 ; rest of modulo 2
! AND Rbx, 3 ; rest of modulo 4
! AND RCX, 7 ; rest of modulo 8
; etc...


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Tue Dec 18, 2018 11:37 am 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Sun Jun 22, 2003 7:43 pm
Posts: 426
Location: Germany, Saarbrücken
Because I am a noob with ASM, I just want to point out some architecture dependent ideas, but maybe you know them already.
  • If your CPU wants a specific value from your memory it reads a fixed sized memory block into one cache line, so it can have faster access to that specific address and the memory around it.
  • If you need fast access to a bunch of data try to align it properly, so your CPU does not have to read from two memory locations and merge the data together before it can use it.
  • Try to store data in continuous memory blocks instead of linking small bunches of memory together like LinkedLists do. This way cache misses are way more infrequent.
  • When traversing through a multi dimensional array always traverse the way the elements are stored in the physical memory. For example when iterating over each pixel of an image, your outer for loop should iterate over Y and your inner loop over X.

_________________
Electronics, Crazy & Interesting Stuff, all that with text, image and sound? Click here!

The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Tue Dec 18, 2018 3:10 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Tue Sep 22, 2009 10:41 pm
Posts: 434
NicTheQuick wrote:
Because I am a noob with ASM,
Hello,

I forgot introducing my suggestings by such a message. I am noob too. CPU is very complex. I just took many hours, any years ago, to measure several ASM statements groups with RDTSC.

Old forum users published too, many very good tests any years ago, I do not remember the links in this forum.

I think fig is searching to manage discontinuous datas, what it seems to be solved by using special technical ways.

X86/X64 treat discontinuous datas, obsviously by storing code addresses in cache. I can say this, after several performance tests when several jumping calls do not respect a linear benchmark. The problem here is that the informations are not code addresses but datas addresses, what it grows the coding difficulty level.

It would be like if slots datas were subroutine codes, what it would require to modify ASM code on the fly.


Top
 Profile  
Reply with quote  
 Post subject: Re: Do you have rules to optimise x86 x64 conversion ?
PostPosted: Fri Dec 21, 2018 8:50 pm 
Offline
Enthusiast
Enthusiast
User avatar

Joined: Thu Apr 30, 2009 5:23 pm
Posts: 302
Location: Côtes d'Azur, France
I'll have some times to test your suggestions, Olliver, thank you very much ! :D
I'll also try small tests as suggested by Djes, to better understand how it works.

Nicthequick wrote:
Try to store data in continuous memory blocks instead of linking small bunches of memory together like LinkedLists do. This way cache misses are way more infrequent.

I checked every condition, except this one but I did my best to solve it.
Hastable need either to allocate linkedlist elements or to copy and reallocate the whole table if it gets full (open adressing).
So I just use linked chunks of memory. As clusters, they are a whole contigous memory area and as linked, I just allocate a new cluster when one is full.(or load factor is up to a value...)

_________________
There are 2 methods to program bugless.
But only the third works fine.

Win10, Pb x64 5.70 LTS


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 15 posts ] 

All times are UTC + 1 hour


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  

 


Powered by phpBB © 2008 phpBB Group
subSilver+ theme by Canver Software, sponsor Sanal Modifiye