Monthly Archives: September 2008

Porting your programs to PureBasic 64bit

The first public beta of PureBasic 64bit is out today, so the quick ones among you will soon start porting your projects to it. Here are a few tips from our own experience of porting the PureBasic tools to x64:

1) Port your project first to 4.30 x86 and fully test it there before moving on

The 4.30 update introduces a number of incompatible changes (many because of the x64 introduction), not all of which show up as compiler errors when code is used the old way. So instead of jumping directly to the x64 version and immediately suspecting wrong variable sizes and the like, first make sure that the crashes you get are not caused by one of these changes and are in fact unrelated to your program now being a 64bit one. If you plan to release both a 32bit and a 64bit version of your program, this is a logical first step anyway.

The most important change in this respect is probably the Read command. To avoid problems with the different variable sizes on x86 and x64 (which could lead to hard to find bugs), we decided to change the command so that it no longer takes its type from the variable, but instead, like Procedure and others, from a type postfix. This way you can match the Read type with the type in the DataSection, independent of the type of the variable you read the data into (you just cannot read a float into an integer type directly, but reading a byte into a quad works, for example). Unfortunately, this change means that a 4.20 program which does “Read a.b” will now read an integer (4/8 bytes), as that is the default type if no postfix is specified. So the choice was between having this problem once now, or for all eternity when developing a program for both 32bit and 64bit, and we decided to go with the former. The best and most foolproof way to deal with this is to do a “Find in Files” over all your code for the Read command and add a postfix to every one of them.
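
A small sketch of the new style (the label and values are made up): the postfix on Read decides how much is read, independent of the variable that receives it.

Value.q = 0
Restore Numbers
Read.l Value   ; reads a long (4 bytes), even though Value is a quad
Debug Value
DataSection
  Numbers:
  Data.l 12345
EndDataSection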

Another source of possibly hard to locate errors is the change in the LinkedList commands. They now return the element pointer, and no longer the list header pointer. So if you used the return value of AddElement() and the like, you should check them all and make the needed changes.
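
A minimal sketch of the new behaviour (list name made up):

NewList Numbers.l()
*Element = AddElement(Numbers())   ; 4.30: returns a pointer to the new element itself
If *Element
  Numbers() = 42                   ; the new element is also the current one
EndIf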

Less problematic (because easily visible) changes are the ComboBoxGadget() height parameter, and the removal of the ButtonImageGadget() backward compatibility (you now need SetGadgetAttribute() to change the image). All other incompatible changes should raise compiler errors or warnings, so these are easy to spot.
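
For the latter, the call looks roughly like this (just a fragment, assuming a button #Button and an image #Image already exist):

SetGadgetAttribute(#Button, #PB_Button_Image, ImageID(#Image))   ; 4.30 way to change the image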

Once all this is dealt with and your program runs fine with 4.30 x86, it's time to dive into x64 programming…

2) .l is evil… very, very, very evil!

Many people have the habit of appending .l to every variable, even though it has always been the default type. Well, now is the time to stop that practice. Not only is a too small type slower on x64, you also have to constantly worry about what you store in the variable, as many things need 8 bytes of storage on x64 (pointers, handles, #PB_Any values, etc). Do not let the urge to “save memory” guide you here. 64bit systems have a ton of memory, and for normal variables this is really no argument.

From now on, you should treat the long type as you treated words and bytes so far. Use them only when really needed inside structures, or for Arrays/Lists where memory concerns really become an issue. For every normal variable just leave out the type (which will default to .i), or if you really cannot shake the habit, use .i instead. Just do this consistently, even for counter variables and the like. This will save you a ton of trouble, trust me. Truncated pointers are the worst kind of bug to try to track down.

Another good habit is to simply use a real *Pointer instead of an integer variable whenever a pointer is stored. This is not a requirement, as the .i type works just as well, but in the long term it increases the readability of the code in my opinion.
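
A short sketch of what that looks like in practice (the names are just examples):

Count.i = 0                              ; .i matches the native register/pointer size
Image   = CreateImage(#PB_Any, 16, 16)   ; no postfix defaults to .i, so a #PB_Any value fits
*Buffer = AllocateMemory(1024)           ; a real pointer documents the intent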

3) Working directly with memory – keep the pointer size in mind

When working with raw memory, we all probably used “*Pointer + 4” at one point to skip a pointer or long value. You now need to keep in mind to use 8 here in 64bit mode. Using fixed numbers is discouraged here altogether: either use the SizeOf() compiler function, or define yourself a #PointerSize constant to make this very transparent.
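
For example (just a sketch):

#PointerSize = SizeOf(Integer)   ; 4 on x86, 8 on x64
*Pointer = AllocateMemory(64)
*Pointer + SizeOf(Integer)       ; skip one pointer-sized value
*Pointer + #PointerSize          ; same thing via the constant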

4) API stuff

The API is quite transparent between the 32bit and 64bit versions. We updated all API structure definitions to have the correct types, so most API code should work on x64 out of the box.

Some things need to be considered though:

  • A number of API types also switch sizes, such as HANDLE, LONG_PTR etc. However, a number of types do not change size, such as DWORD, LONG, etc. For “in” parameters that are passed by value to a function this is not a real problem, but if you need to pass a pointer to a variable that receives some data, you need to check which type is needed.
  • Do NOT use Get/SetWindowLong_() for subclassing anymore. These remain 32bit even on x64. Use Get/SetWindowLongPtr_() instead, which works correctly everywhere (see the small fragment after this list). There is really no need to use SetWindowLong_() over SetWindowLongPtr_() anyway, so the best option is a search and replace to switch all of them right away.
  • If you happen to use some API structure that is not predefined by PB, be careful about the padding. There is a lot more padding involved in the x64 structures, and the padding behavior is a bit too complex to explain quickly. If possible, check the PB structure size against the output of a C compiler like VC8 (the 64bit compiler comes with the Windows SDK), or ask on the forum if you are unsure about the makeup of the structure.
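
Here is a small fragment of the subclassing mentioned above. It assumes that window #0 is already open; the callback just forwards everything, and #GWL_WNDPROC is reused since it has the same value as the GWLP_ variant:

Global OldProc   ; default .i type, so the full pointer fits on x64

Procedure WindowCallback(hWnd, Msg, wParam, lParam)
  ; just pass everything on to the original window procedure
  ProcedureReturn CallWindowProc_(OldProc, hWnd, Msg, wParam, lParam)
EndProcedure

OldProc = SetWindowLongPtr_(WindowID(0), #GWL_WNDPROC, @WindowCallback())
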
5) Other

There is not much more to it. Calling the PB functions is straightforward on x64, as long as you do not use a too small type to store the return values (#PB_Any etc).

If you use ASM code, the x64 fasm commands can be used in the x64 version without trouble. The #PB_Compiler_Processor constant is a good help here for keeping separate CompilerIf sections of code for each processor.
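
A sketch of how such a section can look:

CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
  ! mov rax, 0   ; 64bit fasm code
CompilerElse
  ! mov eax, 0   ; 32bit fasm code
CompilerEndIf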

All in all, the x64 port of the PB programs like the IDE was much easier than we expected. If you get the variable types right, it's pretty much done. Now good luck with your x64 ports!

The killer StatusBar

There are good bugs and there are bad bugs and a few are just downright nasty. We had some of those this week. So instead of our planned schedule, we had a fun little “bug week” digging through endless valgrind output trying to make sense of it all.

The good bugs are those where the symptoms directly provide a clue as to where to start looking. The bad bugs are those where you start with no clue at all, but it can be narrowed down eventually. The nasty bugs are those where the symptoms point into the entirely wrong direction, so you spend the time investigating totally unrelated code, second guessing your own sanity all the way until you stumble on the solution more or less by accident. The mother of all nasty bugs is one where in the end, it turns out the cause for all of the mess was something very simple and trivial. We had one of those this week.

The symptoms: The debugger on Linux suddenly failed to work on Kubuntu with PB 4.20. Running a program with debug output would cause a crash with an X error. The weird part was that it worked fine from a root account.

The trackdown: The root account thing was the most puzzling symptom. It suggests some form of access rights problem, which in itself is weird, as the debugger does not do anything that should require special rights. Because of the X error, I suspected some gtk problem, so I fixed all gtk warnings and errors given while running the PB IDE. This took a while and is almost a story in its own right, but it was not the cause of the crash. After a search in this direction turned up nothing, I turned to valgrind in the hope of finding some clue as to where the crash came from. Here it got even weirder, as the valgrind output on Kubuntu was an endless list of “invalid read access beyond end of buffer” errors (which are usually serious), whereas the output on Suse was almost empty. After another few hours of searching, I tried the same on a regular Ubuntu and got the same errors but no crash. We never figured out what these were, but they appear to come from some external library, and they did not cause the crash, so it is not really our problem. Using the valgrind thread analysis tool was an interesting exercise (the debugger does a lot in threads), but it led nowhere as well.

Long story short, in the end I finally managed to track it all down to a HideWindow() call (to show the debugger window), so it had to be some visual element on that window. From there it was just a comment/uncomment test to find the part that caused it all.

The actual cause: In the end, it turns out the tiny factor that led to all this mess was the StatusBar of the debug output window. More specifically, AddStatusBarField() was called with a too high value. As there was (so far) no way to have a statusbar field go all the way to the right edge of the window, a common workaround was to just call AddStatusBarField(9999), which should be bigger than any window ever is. This worked ok so far, but this particular call used 99999, and apparently that was too much. Some kind of allocation failed and caused the X error. To solve this for the future, you can now call AddStatusBarField(#PB_Ignore), and the field will be sized to take all the remaining free space. There is also a check in the command now against too large fields.
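
So if you want a field that stretches all the way to the right, the 4.30 way looks like this (assuming window 0 is already open):

If CreateStatusBar(0, WindowID(0))
  AddStatusBarField(120)          ; a fixed width field
  AddStatusBarField(#PB_Ignore)   ; takes up all the remaining space
EndIf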

This is somewhat disappointing. If you work on a bug for a long time, you want it to at least be something significant. Some big thing that you finally figured out. Not such a small and insignificant function call. It's an awful lot of time to waste, just because of an extra ‘9’. Well, I think I'll get over it 🙂

In addition to this one, we also had the fun of dealing with a heisenbug: a crash in the debugger that just disappeared as soon as we inserted printf() statements to try and narrow it down.

Anyway, we are back on track now. The (hopefully) final round of alpha versions is out to the testers and if all goes well, you can expect a public beta quite soon. (As usual, we’re not setting any dates though. You never know what comes along)

LinkedList memory management

For 4.30, I worked a bit on LinkedLists and how they allocate their elements. The original motivation was to ensure that linked list elements are as close together in memory as possible, limiting the penalty of page faults and cache misses when accessing/sorting large lists. In the end, the benefit extends to the more normal cases as well, not just these large extremes.

The obvious solution is to allocate the linked list elements in larger blocks, but this requires another level of memory management above that of the OS. What I have learned from my past encounters with memory management is that the OS memory management is very hard to beat, or even come close to, in performance, given how complex the task of managing memory is. However, with linked lists there is a fact that the OS functions (having to work in all scenarios) cannot take full advantage of: a linked list needs a large number of small, equally sized memory blocks.

Let's start with some statistics:

My system runs Vista64 with 8Gb of RAM, but I did my tests with PB 32bit. So the limits described are purely due to the limitations of the process's own address space, and not due to overall system memory. I tested with a list of type Long. This is a very small example, but not a very unlikely one at that. Including the linked list pointers, every element is 12 bytes in size. I tried to squeeze as many elements into the list as possible before getting a crash and then did some statistics on the heap. The results with a plain 4.20 final are these:

List elements:   40000000
Blocks total:    40000000
Size total:      457.76 Mb
Size average:    12 Byte
Overhead total:  457.76 Mb

First thing to notice is the heap overhead: it is exactly the same as the total amount of allocated memory. This is because the heap requires 12 bytes on average to manage an allocated memory block, so this is a very bad usage/overhead ratio. The second thing to notice is that 450Mb is not much usable memory, given that the process should have almost its whole 2Gb of available user address space for this list. This suggests a huge amount of fragmentation, filling up all the space with not even 500Mb of usable data.

So what's the better solution? It is to allocate the memory in larger blocks, with the minimum possible amount of extra management data per element. Also, a free element must be found very fast; if there has to be a search for a free spot, it will kill all performance. I ended up with a quite good solution requiring only 4 bytes of extra data per element, and finding a free element as well as freeing an element in O(1), meaning without any kind of loop to find the free spot. I won't bore you with the details here. What I ended up with is actually not a new idea; it is in fact quite old. Had I found the following paper earlier, I could have saved myself some thinking :-). If you are interested, the paper is available here. My implementation is a bit different from what the paper describes, but the basic idea is the same.
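
To give a rough idea of the principle, here is a simplified sketch (not the actual 4.30 implementation, and all the names are made up): the free slots form a list that is threaded through the unused elements themselves, so both allocating and freeing is just a pointer swap.

#SlotSize  = 12     ; payload size per element (example value)
#SlotCount = 1000   ; elements per allocation block

Structure Pool
  *Block       ; one big allocation block (the real thing keeps a list of blocks)
  *FirstFree   ; head of the free list
EndStructure

Procedure InitPool(*P.Pool)
  *P\Block = AllocateMemory(#SlotSize * #SlotCount)
  ; thread the free list through the slots: each free slot stores the
  ; address of the next free slot in its first bytes
  For i = 0 To #SlotCount - 2
    PokeI(*P\Block + i * #SlotSize, *P\Block + (i + 1) * #SlotSize)
  Next
  PokeI(*P\Block + (#SlotCount - 1) * #SlotSize, 0)
  *P\FirstFree = *P\Block
EndProcedure

Procedure AllocSlot(*P.Pool)       ; O(1): pop the head of the free list
  *Slot = *P\FirstFree
  If *Slot
    *P\FirstFree = PeekI(*Slot)
  EndIf
  ProcedureReturn *Slot
EndProcedure

Procedure FreeSlot(*P.Pool, *Slot) ; O(1): push the slot back onto the free list
  PokeI(*Slot, *P\FirstFree)
  *P\FirstFree = *Slot
EndProcedure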

So, here is the result from 4.30:

List elements:   120000000
Blocks total:    1977
Size total:      1.79 Gb
Size average:    948.57 Kb
Overhead total:  15.55 Kb

So we can see: many more elements fit into the list, and much more of the memory is actually used. The overhead is now very small because the average block size is about a megabyte, so there are not that many blocks. But there are even more benefits from this:

  • Allocating/freeing elements is faster, because the number of alloc/free calls is greatly reduced.
  • Walking the list (and also sorting it, of course) is faster, although not as much as I had hoped. This effect should be more visible on systems with less physical memory available, where the OS is forced to swap pages out to disk more often.
  • ClearList() is mighty fast now, as it does not have to walk the list to free all elements. All it has to do is walk the list of blocks and free those (so in the above case that's roughly 2000 operations instead of 120000000). A small exception are string lists, as the list must still be walked to free all the strings anyway.

So what's the catch?

Yes, there is a catch. There always is, but in this case the benefits much outweigh the costs. The problem is that the list will have unused but allocated elements when not all blocks are full. To limit this, allocation blocks grow dynamically depending on the list size, so the overhead of unused but allocated elements is very small compared to the overall list size.

The bigger problem arises when freeing part of the list. A block can only be freed when all of its elements are unused. Consider the (highly unlikely) case that I use the list above, then free all elements but 2000 of them and, unlucky as I am, there is exactly one element remaining in every single memory block. This means no memory is freed at all. This sounds like a big problem at first, but really it's not that big of a deal. When allocating new elements, every available space is used first, so when removing/adding elements on a frequent basis (which is usually the case) the number of unused slots will still be very low. The only bad case is the above: making a list very big, then deleting random elements (but not all of them) and then never touching the list again. This is a very unlikely case. Consider also this: in the stats from 4.20, 50% of the used memory was heap overhead. So even if now almost half of the allocated elements are not even used, you are still better off than before, because the memory fragmentation is greatly reduced.

So overall I think this is a great improvement. Something to look forward to.