Category Archives: Libraries

Library development

New PureArea.net Interview with Fred and freak released

Hi folks, Happy New Year 2016!

I’m proud to present a new interview I conducted with the main PureBasic coders, Frederic ‘alphaSND’ Laboureur and Timo ‘freak’ Harter.

You can read it on my website www.PureArea.net, just go to the ‘News’ section of 4th January 2016.

I hope you enjoy reading this (in my opinion) interesting interview!

(Direct link to the interview)

Enter GTK3

Since PureBasic 5.40 LTS, GTK3 is the default GUI toolkit used by PureBasic on Linux. It may seem like a no-brainer update for our users, but the fact is that GTK3 doesn’t fit the PureBasic GUI library well, and we need several ugly hacks here and there to make it work at all.

From the start, we decided that PureBasic should handle pixel-perfect sizing of GUI elements to allow maximum flexibility. On top of that, we added a Dialog library, which provides automatic layout depending on the font size and allows you to design complex resizable windows. It’s nice, but not mandatory: depending on your needs, you may or may not use it. Up to GTK2, the toolkit worked the same way: you could define a pixel-perfect size and the toolkit would honor it. Now it’s a different story: the toolkit can decide that the size you set programmatically is too small. That’s a big change, as you can end up with a deformed GUI and, more importantly, the running program no longer reflects the code the programmer wrote.

You may wonder why such a change was implemented. My guess is that it forces the use of layouts for everything, so the toolkit can support themes more efficiently and without changes from the programmer. That may be true, but GTK3 is one of the first toolkits to do so, and forcing the programmer’s hand isn’t always a great thing, especially when it removes flexibility: you can write a layout engine on top of pixel-perfect sizing, but you can’t do pixel-perfect sizing on top of a layout engine. And for some specific apps, running on a proprietary OS, it can be an issue, as theming is irrelevant there.

The cost of empty lists

So now we have the ability to embed lists, maps and arrays inside structures. It’s a very useful feature, and we can already see people building crazy data structures with all sorts of nesting in them, even tree-like structures. I want to talk a little bit about the costs associated with these objects. The fact is, even empty lists or maps take up memory, and if you create a lot of them, you should be aware of that. The sizes I give below are all for the x86 versions (so 4 bytes per pointer); for x64, you can usually double that. Not all fields are actually pointer-sized, so in reality it is less than that, but it is a good approximation. My base scenario is an array of a structure type with an embedded list, but this applies to all other combinations as well.

Structure MyStruct
  Something.i
  List Elements.s()
EndStructure

Dim Table.MyStruct(99)
Automatic initialization on creation

First of all, it’s important to realize that structure content is automatically initialized on creation. This means that if you create a structured variable with an embedded list, you don’t have to call NewList in order to use the embedded list. However, it also means that simply by creating the structured variable, you also created a (still empty) list which takes up some memory. The same applies to my example array: by creating the array with 100 elements (0-99), I also created 100 empty lists.

You could say this is a waste of memory, and you would not be entirely wrong. After all, if not all elements of the array are actually used to store data, that is a lot of unused memory. We discussed this at length when implementing the feature. An alternative approach would have been to leave all such dynamic elements uninitialized on creation and require the user to explicitly call NewList before using them in each element. This would conserve memory when not all array elements are intended to hold list items, but it puts the burden of ensuring that all dynamic elements are initialized before they are accessed on the user, and introduces a whole new source of bugs. It’s also extra work for the user: if people define such a structure, they intend to put things into it, and forcing them to do an extra initialization step each time when PB can easily do it for them just wastes the programmer’s energy.

So in the interest of having a really easy to use feature, we decided to go with automatic initialization. After all, if you really need full control and do not shy away from some extra work, then building your own linked data structures in memory is not that hard. If you want something simple that just works, then this feature is for you. Of course it’s never wrong to know about the costs, so you know what happens when your data set gets large.

Memory usage

How much memory does an empty list take? Aside from the actual space it takes up inside the structure, which is 2 pointers for a list (1 pointer for an array or map), a header structure is allocated as well. It stores information about the list such as its type (for sorting and cleanup), the pointers to the first and last elements, as well as cached index information for a fast SelectElement() where possible. The list header is currently 48 bytes on x86. As described in another entry, we also have custom memory allocation for list elements, for faster allocation/freeing of memory. These allocation routines also take up 24 bytes of header data per list.

So in my list-in-array example above, I have created 100*12 bytes of actual element data in the array, plus another 100*72 bytes for the empty lists. While this extra data is still not very much, it is considerably more than the array element data itself, so it can become an issue if you make your array very large. If you knew, for example, that you will never have more than 18 elements in the embedded list, you could replace it with a static array of 18 elements inside the structure and still have the same memory usage.

Empty maps have a much larger impact. They also come with a header and an allocator header, but an empty map additionally reserves space for its hash table, even if there are no elements in it. The default table size is 512 entries, so that is another 2048 bytes (512 entries * 4 bytes) for an empty map on x86.

Arrays are a little different, as you can declare inside the structure how many elements they should have on creation. Arrays also come with a header, but nothing else. Here it is more important to keep in mind that if the elements of the embedded array in turn contain embedded lists, maps or arrays, you have to take their costs into account as well, as they will be initialized too when the embedded array is created.

One final thing to keep in mind is that lists and maps allocate the memory for their elements in blocks of 16 elements minimum (for speed and to reduce fragmentation). If you create many lists with just a few elements in them (say 2 per list for a binary tree structure), you end up with memory used for 16 elements per list instead. Note though that this is just allocation, not initialization: the memory is reserved for the 16 elements themselves, but no contained dynamic list, map or array of the unused slots is initialized.

Bottom line

This post should be seen in the right context. The cost of the embedded objects may seem large compared to the size of the structure they are put in, but that doesn’t change the fact that they are still very cheap objects. Lists, maps and arrays are all memory-only objects. This means that all they need is memory: no handles, no system objects, nothing else. Memory is not much of a concern these days, so you can create many of them and even create/destroy them frequently. This really isn’t a problem. Take my array example from above: even if we Dim the array to 10000 elements, we still only use up about 820 KB of memory, which is not much by today’s standards.

The only place where you really need to concern yourself with this is if your data sets get really large and memory does become an issue. In this case the advice is simple: Avoid “sparse” data structures where a large part of the dynamic elements are allocated but not used. You can also call FreeList(), FreeMap(), FreeArray() or ClearStructure() on any element you do not need. You then have to re-initialize it if you want to use it later with NewList, NewMap, Dim or InitializeStructure() respectively.

Always keep in mind the quote from Donald Knuth on optimization:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning; he will be wise to look carefully at the critical code; but only after that code has been identified.

In this context, this means not to worry about this issue when designing your data structures. Do not get tempted to optimize away a few bytes of extra memory with funny tricks. A clear and simple layout of the data in your program is worth much more than that (and will save you many headaches from chasing bugs). Only if you see that your program really has a memory problem is it time to start thinking of ways to cut down on overhead like this.

In this spirit, have fun designing crazy nested data structures with the new dynamic elements! 🙂

2DDrawing in v4.40 under the hood

A lot changed in the 2DDrawing library with the v4.40 version. In fact, it basically underwent a complete rewrite; only very few code parts remain from the previous version. Since it has come up in various questions and bug reports on the forum, I am going to explain the new design a little bit and what consequences it has on things like mixing API and PB commands for drawing.

Let’s start off with what the 4.31 library looked like:

The Windows version of the library was based entirely on GDI (the Windows drawing API). GDI allows output on a window, image, printer, sprites and the screen with the same commands, so for the most part only the code that set up the drawing needed to be output-specific, and the rest was shared (hence the separation of the output functions like ImageOutput() from StartDrawing()). One detail that a lot of users relied on was that StartDrawing() returned the GDI handle to the drawing context (HDC), which made it easy to mix API and PB commands in the drawing operation, even without much knowledge of GDI. The internal data structures of the library were published a long time ago in the PB Library SDK, and since the library didn’t change much over time, there is still some code that relies on them.

Linux was a different story. Here every output requires its own API (Gtk for window and image, libgnomeprint for the printer, and some pixel-based routines for SDL output). So the Linux library had a plugin-like system, just as some other PB libraries have, to select the right command depending on the output. The Mac OSX version basically had only one output, for images. The printer library created a temporary image and sent it to the printer when drawing was done. There was no drawing for windows, sprites or the screen on OSX. The drawing functions shared the pixel-based drawing code we used for SDL on Linux.

Why the library was rewritten:

The reason was the alpha channel. As explained in a previous entry, I worked on improving the capabilities of PB when it comes to displaying alpha-channel images. After that was finished, I felt that at least some 2DDrawing support was needed in addition to the DrawAlphaImage() command. The problem here is that GDI does not support the alpha channel, even though it can handle 32-bit images. Since Windows 98 you can load images with an alpha channel and display them with the AlphaBlend() API, but that is pretty much it. Any manipulation of the image with GDI will cause the alpha information to be lost. The same problem exists with Gdk on Linux: the only drawing functions that exist work on GdkDrawable objects, and these do not support the alpha channel. GDI+ is the Windows replacement for GDI which can deal with the alpha channel, but we needed a cross-platform solution to this problem. So we made the decision to create our own drawing functions, similar to the pixel-based ones for SDL and OSX, which could handle the alpha channel in all drawing commands. As you can see from the result, I went a step further and added the gradient drawing support (which also would not be possible with the API functions).

The new design:

Since our pixel-based routines can only draw on images, we now need separate “subsystems” for drawing even on Windows. So now we have this plugin architecture on all OS. The real drawing functions are called through pointers which are set when the drawing is started, so the speed impact is minimal. For those concerned about executable sizes: a drawing subsystem is only linked when the corresponding output function is used in the code. So if you do not use ImageOutput(), none of the image drawing code will be linked.

So there is now a GDI subsystem on Windows and the new “PixelBuffer” one. The PixelBuffer subsystem does all its drawing directly on the memory of the output image with no API involvement. The only exception is the drawing of text. We did not want to add the overhead and license troubles of including an external font rendering engine like freetype, so DrawText() uses GDI to render the text to a special device context, from which the text is then drawn to the real output with the alpha channel and gradients taken into account. It works in a similar way on the other OS, where the native API is used to render the text before it is copied to the real output. There is of course a speed penalty from this, but it cannot be avoided, and it is questionable how much faster a separate rendering engine would be at this task.

Things to watch out for on Windows:

I tried my best to keep the library backward compatible. If you use PB drawing commands only, then there should be no difference at all. If you mix them with API calls, there are some things to watch out for:

  • As the “HDC result from StartDrawing()” is very commonly used, I kept that behavior the same. So even though the PB commands do not use it for drawing, a device context is still created and returned from StartDrawing(), which you can use to draw on the image with the API.
  • You can still mix API drawing functions with PB functions on the same output without problems. The only thing that changed here is that GDI functions which change the global state of the device context no longer affect the PB drawing functions like they used to, because the PB functions do not actually use that device context. So a function like SetWorldTransform() will no longer cause DrawImage() to draw a rotated image. However, it should still work if you use the BitBlt() API.
  • You have to be aware that GDI functions will erase the alphachannel. So if you draw on a 32bit PB image and use a GDI function, the modified area will look transparent after that. The best way to avoid this problem is to use 24bit images as output when API commands are involved. (We changed the default image depth to 24bit in 4.40 beta2 to make the transition easier)
  • PB images are now always DIBs (device-independent bitmaps) to allow for the pixel-based drawing. The #PB_Image_DisplayFormat flag used to create a DDB (device-dependent bitmap), but this is no longer supported; #PB_Image_DisplayFormat now has the value 32 and creates a 32-bit image instead. Some GDI functions (like GetDIBits()) expect a DDB, so you may run into trouble there.
  • DrawImage() can still draw API-created images (including DDBs and icons). Again, you have to be aware that a 32-bit API-created bitmap will probably have all alpha values set to 0, which PB interprets as fully transparent; using 24-bit images is the solution here as well. Also, DrawImage() expects DIBs to be bottom-up (the default on Windows). If you use top-down DIBs, they will be drawn upside down. There is no way to avoid that, as Windows does not provide a way to find out the pixel orientation of a DIB.
  • If you relied on the library’s internal data structures, you are out of luck: nothing on the library’s internals is as it was before.

To sum it up: stick to 24-bit images, and in most cases mixing PB drawing with API drawing will work just as it did before.

Things to watch out for on Linux:

I am not aware of code that mixed 2DDrawing with API calls on Linux. StartDrawing() used to return the GdkDrawable handle for both ImageOutput() and WindowOutput(). It now returns it only for WindowOutput(), as there is no drawable anymore when drawing on an image. Backward compatibility wasn’t possible here.

One thing to note in general is that PB images are now GdkPixbuf objects and no longer GdkPixmap. This is a big improvement, as many Gtk functions expect a GdkPixbuf nowadays, so it is actually easier to use PB images with Gtk functions.

Things to watch out for on OSX:

Not much to say here. Since the 2DDrawing support was so poor before, it only got better with this release. There is now a separate QuickDraw-based subsystem for WindowOutput(). Yes, I know QuickDraw is deprecated, but it was much easier to implement this way than to go for a fully Quartz-based one. I have plans to do a Quartz subsystem one day, to get better PrinterOutput() (for the moment, it still works with images as an intermediate step).

One thing to note is the new output for sprites and the screen with OpenGL (this applies to all OS): there is no way to modify the data in video memory, so if you call StartDrawing(), the whole sprite data or the entire screen content is transferred to main memory for the drawing. This is very slow, so doing 2DDrawing every time you refresh the screen is too costly. The better way is to draw on sprites just once and then display them over and over. Still, this is better than having no 2DDrawing for sprites/screen at all.

Future plans:

There is still room for optimisation in the new 2DDrawing code; for now, the focus was on getting a stable implementation. The plan is also to eventually have the alpha drawing routines for SpriteOutput() and ScreenOutput(), but this will need some more work to support all possible pixel formats for the screen. Stay tuned…

Alpha-channel improvements

Alpha-channel improvements on Windows

This one is for all those who are quick to complain when a new feature is added that they themselves don’t need. As it turns out, sometimes changes made in areas you may not care about at all can trigger changes in places that you may find useful after all. One of these areas is the IDE.

It all started very small and insignificant: inspired by the recently reanimated thread here, I decided to add the ability to easily swap out the IDE-provided icon set for a custom icon theme. How hard could this be? It was a rather small thing, really, and quite quickly implemented in the IDE code. Here is where you would say: “Why do I need themes? Better focus on something else!” Well, better read on…

Of course I had to test this, and for that purpose I created a theme with the Silk icons also mentioned in that thread (the result is quite nice, btw, and will be included in the PB package). They use a lot of semi-transparency, and converting them to icons (non-32-bit, for older Windows versions) just didn’t look good. So I tried loading and using the png files directly. Well, oops: the PB ImageMenu doesn’t support the alpha channel. In fact, it doesn’t even support non-icon images at all. So I’ll just change the drawing code to support alpha-channel images, how hard could this be?

Those who know me a little better know that I tend to think way too far ahead with these kinds of things. What good is alpha-channel support in the menu when not even the ImageGadget can properly display alpha-channel images? So let’s just add it there as well, and then there is the ButtonImageGadget too. Seriously, how hard could this be? 🙂 (Turns out this one was actually tough: the nice solution I had worked up for Vista/XP just didn’t want to work on the older Windows versions.)

Alpha-channel improvements on Linux

The line doesn’t end there, though. There is also the Linux version. The Image lib on Linux didn’t support the full alpha channel so far; it only supported fully transparent pixels through a 1-bit mask. And there I am, right back at the beginning, with a set of ugly-looking icons in the menu. So, let’s just add the full alpha channel to the Linux Image lib. How hard… well, this really is a bigger deal. It means a rewrite of the entire lib, plus large parts of the 2DDrawing lib and every other command that deals with images. We had this on our list for quite a while already though, because the lib still used a GdkPixmap to represent an image and not a GdkPixbuf, which is the much better choice. So I went ahead and rewrote these commands as well, and it was worth it. The lib can now use the full alpha channel, and since the Gtk widgets all support GdkPixbufs, every Gadget that deals with images can use it too (for example ListIcon, Tree etc.). Oh, and while I was at it, I dived a bit into SDL and implemented DrawAlphaImage() for Linux too. 🙂

This brings a new “problem”: with Linux supporting the alpha channel in every GUI element, the Windows version is lagging behind again, because it still requires icon files for some GUI things. That cannot stand! XP and Vista actually solve this quite well with their 32-bit icon support, so this wasn’t so hard to do. Older versions however don’t support that, so to get the best possible result, an alpha-channel image is converted to an icon with a mask to get at least some transparency. You can now pass non-icon images to all functions that take an image on Windows too, and they will be converted to an icon as needed.

For once, the Mac version was the easy target in all this, because it already came with full alpha-channel support from the start. It only lacked the ImageMenu, but I added that as well.

So, with v4.40 you will be able to use alpha-channel images for all GUI elements, and it will look good on all OS. No more converting to .ico to get transparency on Windows! If you plan to target Windows versions older than XP, you might want to avoid semi-transparent pixels for things like toolbars, so they still look good without the 32-bit icon support. If you plan to target Windows 95 or NT4, you are out of luck, as they don’t have alpha-channel support at all. 😛

Not so useless after all, eh?

New debugger features for userlibraries

A number of new functions were added to the DebuggerModule.h file in the SDK with 4.30, and since there is no documentation for them except the header file itself, I am going to explain them here briefly. They are mostly about localizing the debugger messages for libraries, but there are also some useful functions to validate function parameters. I am using C to explain things, even though most userlibraries are probably created using TailBite, because C is what this stuff was written in/for, and not everything can be directly translated to PB code (like the variable-argument functions). If you use TailBite, you’ll have to use some tricks to make it all work. Note that all Message and Key parameters take ASCII strings as input, even in unicode mode.

Error reporting:

In addition to PB_DEBUGGER_SendError(), which was there before, there is now also PB_DEBUGGER_SendWarning(). It works just like SendError, but issues a warning instead of an error. Warnings are handled depending on the level that the user chooses: they can be ignored, just added to the log, or handled like an error (program stop etc.). The intention is to report anything that is probably wrong, but could also be valid input, as a warning, so the user is not constantly annoyed by errors that aren’t really errors. The PB libraries currently report the following things as warnings:

  • Missing files for LoadImage(), LoadSprite(), etc. This often indicates that the user just mistyped the filename, but it is also perfectly valid to use these functions to try to load a nonexistent file and handle the failure properly.
  • Passing something that isn’t a PB procedure to a function that expects a callback. This helps catch cases like CreateThread(@Function, 0) where the ‘()’ was simply forgotten, which is a very common mistake. It is however perfectly legal to pass a pointer to an external (maybe dll) function or a label here, which is why this is not an error. A null pointer or a procedure pointer with the wrong prototype is however handled as an error, because those clearly are not valid input.
  • Using OnError functions while the debugger is enabled, or without OnError lines support enabled.
  • On Linux: Gtk warnings/errors. These mostly do not cause the program to fail, which is why they are just warnings. The debugger catches them since 4.30 because this way you get a PB line number for the warnings, which was not possible before.

There is also a SendDebuggerWarning macro, to keep the same style as the SendDebuggerError() macro before it.

Parameter validation:

PB_DEBUGGER_CheckLabel() takes a pointer and returns true if the given pointer represents a label in the code. This does not detect labels in direct Asm, so a failed check should only produce a warning.

PB_DEBUGGER_CheckProcedure() can check whether a pointer is a procedure in the code and, if so, whether the procedure matches a given prototype. This is a vararg function and takes the following arguments:

  • The pointer to check
  • The expected return type. The type is specified using the values of the PB type constants (#PB_Long, #PB_String etc). They are defined for C in the PureLibrary.h header.
  • The expected number of arguments to the procedure.
  • The type of each argument, the same way as the return type.

The possible return values are listed in DebuggerModule.h and indicate whether the function was found and whether the prototype matches. Note that the automatic promotion of parameters is taken into account: for example, a word is always expanded to a long on the stack on x86, so passing PB_Word as the expected parameter will also match a procedure that has a long at this parameter position (as the stack layout is the same). As mentioned above, a mismatching prototype or a null pointer should be treated as an error, but a nonzero pointer that does not point to a PB procedure may still be valid, so better just issue a warning there.

PB_DEBUGGER_FileExists() was there before and simply checks whether the given file exists. Unlike the other functions, its input has to be a unicode string in unicode mode.

To complete the list, there are also the PB_DEBUGGER_FullScreen, PB_DEBUGGER_Unicode and PB_DEBUGGER_Thread variables, which indicate the settings/state of the program. There is also the threadlocal structure PB_DebuggerGlobals, which currently only tells you whether the program is inside a StartDrawing() block. To get this info, use PB_Object_GetThreadMemory() from Object.h with the PB_DEBUGGER_Globals variable to get the actual pointer to the threadlocal structure. All of this existed before 4.30 as well.

Localisation of debugger messages: Using the common error messages

There is a list of very common error messages for checking input parameters. These can be used within your library even without providing your own translation files. You can see the possible messages in the [Common] section of the Debugger.Catalog file in the PureBasic folder. The functions to access them are:

char *PB_DEBUGGER_GetCommonLanguage(char *Key, ...);
void  PB_DEBUGGER_SendCommonError(char *Key, ...);
void  PB_DEBUGGER_SendCommonWarning(char *Key, ...);

As you can see, they are vararg functions; they work just like printf(). Most common messages have placeholders for strings/numbers, and you have to pass arguments to fill them. GetCommonLanguage() just returns the translated language string; the other two directly send a warning/error. Again, there are the SendCommonError, SendCommonWarning and GetCommonLanguage macros for better readability.

Localisation of debugger messages: Using custom messages

You can also provide your own messages and your own Catalog files. This involves the following steps:

First you have to include the information about your language data and the default strings with your debugger functions in the form of a PB_Language structure. Here is an example:

static PB_Language LanguageTable =
{
  "Libraries.catalog",
  "PB_Libraries",
  "OnError",
  0,
  {
    "DebuggerPresent", "The OnError library may not catch all errors with enabled debugger.",
    "LinesDisabled",   "The OnError lines support is not enabled.",
    "NoHandler",       "This function can only be called inside an error handler.",
    "", "",
  }
};

These are the fields:

  • Default Filename for your catalog file (without path). This filename is checked first, but if it cannot be loaded, all Catalog files are searched for the correct data. So this is more of a hint.
  • The value of the “Application” key in the [LanguageInfo] group inside the Catalog file. This is what truly identifies your Catalog, so it has to be unique.
  • The name of the group inside the Catalog file in which the following Keys are located. Note that the group names of all libraries share the same namespace, so group names have to be unique too. Either use your library name here (as the PB libs do), or prefix your names with something unique. Multiple libraries can share the same Catalog file (as the PB libs do), but one PB_Language structure has to correspond to one group. If two libraries refer to the same group, their values will overwrite each other.
  • The ‘Initialized’ field must be set to 0 initially. The debugger uses it to track whether it has already loaded this language data.
  • Following is the list of Key-Value pairs with the default language. The list is terminated by two empty strings.

Now you can use the language you defined (even without any actual Catalog files). As with the common language, there are 3 functions to get the translations:

M_PBFUNCTION(char *) PB_DEBUGGER_GetLanguage(PB_Language *data, char *Key);
M_PBFUNCTION(void)   PB_DEBUGGER_SendTranslatedError(PB_Language *data, char *Key);
M_PBFUNCTION(void)   PB_DEBUGGER_SendTranslatedWarning(PB_Language *data, char *Key);

These are not varargs and don’t support the printf-like syntax. The reason is that we access them mostly through macros, and support for vararg macros is not very good among C compilers. The first argument is a pointer to the PB_Language structure, and the second one is the language key to look for. If you use C and name your PB_Language structure ‘LanguageTable’ as in the example above, then you can use the GetLanguage, SendTranslatedError and SendTranslatedWarning macros to leave out the pointer argument for simplicity.

If all this works, you can start creating translations in Catalog files. The format is the simple preferences format; just look at Libraries.Catalog for an example. The [LanguageInfo] group and its “Application” and “Language” keys are mandatory; the other fields in [LanguageInfo] are for information only and not used by the debugger. Place your catalogs in the PB folder, where the other ones are.

The actual language that is used is determined by the compiler or IDE, not the debugger. You can set it for the command-line compiler with the /LANGUAGE switch. The IDE sets the language to the one it uses itself.

Parallel frenzy

Ever since I got a QuadCore, I have been looking for ways to make it simple for people to use the extra power in today’s computers. Only a handful of the programs available today can really use this power. Thread programming is not easy: it is hard to do without introducing hidden bugs, and as I found out myself, it is even harder to do in a way that fully utilizes the available resources and does not spend all its time waiting on synchronisation points. Writing a piece of code that can max out 4 cores and still do something useful is quite a challenge 🙂 And if the trend continues, even 8-core machines are not unimaginable.

So how do we make this simpler? Even if writing a fully threaded program remains a challenge, maybe we can introduce commands that make it simpler to parallelize smaller portions of a program. The most obvious place to start is the sort commands, as these are CPU-intensive tasks which parallelize well. So I wrote some parallel sort routines. The implementation can scale up to as many cores as are available, so in 4.40, at least the sorting commands can utilize the full power of the CPU. I wrote a small framework to make parallelizing PB commands easier, so other commands may benefit as well (although most commands are simply not CPU-intensive enough for this to be useful).

SortArray() vs. parallel SortArray()

Here you see some test results for SortArray() on my QuadCore (showing the sort times for an array of longs). I tested the original version, as well as a 2- and a 4-thread run of the command, to see how it would perform on a DualCore or a QuadCore. The improvement of the parallel variants is quite constant: the 2-thread test took on average 52% of the execution time of the serial version, while the 4-thread version takes on average 31% of the time. The improvement actually increases a little the larger the array gets.

Even though this is a good improvement, why not more? After all, 31% is only about a factor of 3, while we would expect a factor of 4 from the 4-thread variant. Some of this is due to management overhead, but there is also a part which simply cannot be parallelized: not all threads can get to work immediately. The first partition step has to be done in one thread before 2 threads can get to work on the second step of the sorting (one on each half of the data). Only at the third level (and all subsequent ones) can 4 threads even find work. This serial part is always there, and it limits the improvement that can be gained by parallelizing the rest, no matter how many cores are present (this is called Amdahl’s law).

SortList() vs. parallel SortList()

The results for SortList() are not as good. Here the results vary depending on the list size, with a best of 55% but an average of 71% for the 2-thread test, and a best of 47% but an average of 67% for the 4-thread version. Parallelizing a mergesort involves more overhead than a quicksort, as there is some waiting time until two smaller sort steps can be merged into a larger one. Since the larger sort steps come last and depend on the smaller ones (unlike with quicksort), this cannot be avoided. Also, since a list is spread more widely through memory than an array, cache effects play a role too. Still, the results for both SortList() and SortArray() are a good improvement in my opinion.

How hard it is to write code that fully utilizes the CPU power can be seen in the parallel SortList(). I went through a number of different implementations for this command, trying to come up with more compact versions (which use less code or fewer memory reads/writes), only to find that they performed worse than the original serial version due to cache effects. For example, I found a mergesort code on the net which uses only a handful of variables and a minimal number of read operations. I implemented it in asm, using exclusively registers to store the variables. Instinct tells us this is an optimal solution, but the way the code walks the list makes it read and overwrite values in the cache on every pass. This makes it run slow as hell for large lists, even when executed on 4 cores. The key word here is “locality”: memory locations which are close together must be processed together, so that the values are still in the cache (or at least in main memory for large lists). The current implementation performs quite well in this respect.

By the way, some of this locality optimisation is already present in the 4.30 release. One of the reasons why I wrote the block allocator, which I talked about in my first blog post, was to make sure that list elements are closer together. This way both a sort and a normal walk through the list can be executed in a more cache-friendly way. The block allocation was ready in time for 4.30; the parallel stuff did not make it into that release.

We have some more new things planned related to threads and parallel programming, which should make it simpler to write threaded programs in PB, but the rest is a surprise… 😉

The killer StatusBar

There are good bugs and there are bad bugs and a few are just downright nasty. We had some of those this week. So instead of our planned schedule, we had a fun little “bug week” digging through endless valgrind output trying to make sense of it all.

The good bugs are those where the symptoms directly provide a clue as to where to start looking. The bad bugs are those where you start with no clue at all, but can eventually narrow it down. The nasty bugs are those where the symptoms point in entirely the wrong direction, so you spend your time investigating totally unrelated code, second-guessing your own sanity all the way, until you stumble on the solution more or less by accident. The mother of all nasty bugs is one where, in the end, it turns out the cause of the whole mess was something very simple and trivial. We had one of those this week.

The symptoms: The debugger on Linux suddenly failed to work on KUbuntu with PB 4.20. Running a program with debug output would cause a crash with an X error. The weirdest part was that it worked fine from a root account.

The trackdown: The root account thing was the most puzzling symptom. It suggests some form of access rights problem, which in itself is weird, as the debugger does not do anything that should require special rights. Because of the X error, I suspected some gtk problem, so I fixed all gtk warnings and errors given while running the PB IDE. This took a while and is almost a story in its own right, but it was not the cause of the crash. After a search in this direction turned up nothing, I turned to valgrind in the hope of finding some clue as to where the crash came from. Here it got even weirder, as the valgrind output on KUbuntu was an endless list of “invalid read access beyond end of buffer” errors (which are usually serious), while the output on Suse was almost empty. After another few hours of searching, I tried the same on a regular Ubuntu and got the same errors but no crash. We never figured out what these were, but they appear to come from some external library, and they did not cause the crash, so it’s not really our problem. Using the valgrind thread analysis tool was an interesting exercise (the debugger does a lot in threads), but it led nowhere as well.

Long story short, in the end I finally managed to track it all down to a HideWindow() call (to show the debugger window), so it had to be some visual element on that window. From there it was just a comment/uncomment test to find the part that caused it all.

The actual cause: In the end, it turned out that the tiny factor which led to all this mess was the StatusBar of the debug output window. More specifically, AddStatusBarField() was called with a value that was too high. As there was (so far) no way to have a statusbar field extend all the way to the right of the window, a common workaround was to just call AddStatusBarField(9999), which should be bigger than any window ever is. This had worked fine so far, but this call used 99999, and apparently that was too much: some kind of allocation failed and caused the X error. To solve this for the future, you can now call AddStatusBarField(#PB_Ignore), and the field will be sized to take all remaining free space. The command now also checks against too-large fields.

This is somewhat disappointing. If you work on a bug for a long time, you want it to at least be something significant. Some big thing that you finally figured out. Not such a small and insignificant function call. It’s an awful lot of time to waste, just because of an extra ‘9’. Well, I think I’ll get over it 🙂

In addition to this one, we also had the fun of dealing with a heisenbug: a crash in the debugger that just disappeared as soon as we inserted printf() statements to try to narrow it down.

Anyway, we are back on track now. The (hopefully) final round of alpha versions is out to the testers and if all goes well, you can expect a public beta quite soon. (As usual, we’re not setting any dates though. You never know what comes along)

LinkedList memory management

For 4.30, I worked a bit on LinkedLists and how they allocate their elements. The original motivation was to ensure that linkedlist elements are as close together in memory as possible, limiting the penalty of page faults and cache misses when accessing/sorting large lists. In the end, the benefit extends to the more normal cases as well, not just these large extremes.

The obvious solution is to allocate the linkedlist elements in larger blocks, but this requires another level of memory management above that of the OS. What I have learned in my past encounters with memory management is that the OS memory management is very hard to beat, or even come close to, in performance, given how complex the task of managing memory is. However, with linkedlists there is a fact of which the OS functions (having to work in all scenarios) cannot take full advantage: a linkedlist needs a large number of small, equally sized memory blocks.

Let’s start with some statistics:

My system runs Vista64 with 8 GB of RAM, but I did my tests with PB 32-bit, so the limits described are purely due to the limitations of the process’s own address space, not overall system memory. I tested with a list of type Long. This is a very small element, but not an unlikely one. Including the linkedlist pointers, every element is 12 bytes in size. I tried to squeeze as many elements into the list as possible before getting a crash, and then did some statistics on the heap. The results with a plain 4.20 final are this:

List elements:   40000000
Blocks total:    40000000
Size total:      457.76 Mb
Size average:    12 Byte
Overhead total:  457.76 Mb

The first thing to notice is the heap overhead: it is exactly the same as the total amount of allocated memory. This is because the heap requires 12 bytes on average to manage an allocated memory block, so this is a very bad usage/overhead ratio. The second thing to notice is that 450 Mb is not much usable memory, given that the process should have almost its whole 2 Gb of user address space available for this list. This suggests a huge amount of fragmentation, filling up all the space with not even 500 Mb of usable data.

So what’s the better solution? It is to allocate the memory in larger blocks, with the minimum possible amount of extra management data per element. A free element must also be found very fast; if there has to be a search for a free spot, it will kill all performance. I ended up with a quite good solution requiring only 4 bytes of extra data per element, and finding a free element, as well as freeing an element, in O(1), meaning without any kind of loop to find the free spot. I won’t bore you with the details here. What I ended up with is actually not a new idea; it is in fact quite old. Had I found the following paper earlier, I could have saved myself some thinking :-). If you are interested, the paper is available here. My implementation is a bit different from what the paper describes, but the basic idea is the same.

So, here is the result from 4.30:

List elements:   120000000
Blocks total:    1977
Size total:      1.79 Gb
Size average:    948.57 Kb
Overhead total:  15.55 Kb

So we can see: many more elements fit into the list, with much more of the memory really used. The overhead is very small now, because the average block size is about a megabyte and consequently there are not many blocks. But there are even more benefits:

  • Allocating/freeing elements is faster, because the number of alloc/free calls is greatly reduced.
  • Walking the list (and of course also sorting it) is faster, although not as much as I had hoped. This effect should be more visible on systems with less physical memory available, where the OS is forced to swap pages out to disk more often.
  • ClearList() is mighty fast now, as it does not have to walk the list to free every element. All it has to do is walk the list of blocks and free those (so in the above case, that’s about 2000 operations instead of 120000000). The one exception is string lists, as the list must still be walked to free all the strings.

So what’s the catch?

Yes, there is a catch. There always is, but in this case the benefits far outweigh the costs. The problem is that the list will contain unused but allocated elements when not all blocks are full. To limit this, allocation blocks grow dynamically depending on the list size, so the overhead of unused but allocated elements stays very small compared to the overall list size.

The bigger problem arises when freeing part of the list. A block can only be freed when all of its elements are unused. Consider the (highly unlikely) case that I use the list above, then free all but 2000 of its elements and, unlucky as I am, exactly one element remains in every single memory block. This means no memory is freed at all. This sounds like a big problem at first, but really it’s not that big of a deal. When allocating new elements, every available slot is used first, so when elements are removed/added on a frequent basis (which is usually the case), the number of unused slots will still be very low. The only bad case is the one above: making a list very big, then deleting random elements (but not all of them) and then never touching the list again. This is a very unlikely scenario. Consider also this: in the stats from 4.20, 50% of the used memory was heap overhead. So even if now almost half of the allocated elements are unused, you are still better off than before, because the memory fragmentation is greatly reduced.

So overall I think this is a great improvement. Something to look forward to.