LinkedList memory management

For 4.30, I worked a bit on linked lists and how they allocate their elements. The original motivation was to keep list elements as close together in memory as possible, limiting the penalty of page faults and cache misses when accessing or sorting large lists. In the end, the benefit also shows in the more normal cases, not just these large extremes.

The obvious solution is to allocate the list elements in larger blocks, but this requires another level of memory management above that of the OS. What I have learned in my past encounters with memory management is that the OS memory management is very hard to beat, or even come close to, in performance, given how complex the task of managing memory is. However, with linked lists there is one fact that the OS functions (which have to work in all scenarios) cannot take full advantage of: a linked list needs a large number of small, equally sized memory blocks.

Let's start with some statistics:

My system runs Vista64 with 8 Gb of RAM, but I did my tests with PB 32-bit, so the limits described are purely due to the limitations of the process's own address space, not overall system memory. I tested with a list of type Long. This is a very small element, but not an unlikely one. Including the linked list pointers, every element is 12 bytes in size. I tried to squeeze as many elements into the list as possible before getting a crash, and then did some statistics on the heap. The results with a plain 4.20 final are these:

List elements:   40000000
Blocks total:    40000000
Size total:      457.76 Mb
Size average:    12 Byte
Overhead total:  457.76 Mb

The first thing to notice is the heap overhead: it is exactly the same as the total amount of allocated memory. This is because the heap requires 12 bytes on average to manage an allocated memory block, so this is a very bad usage/overhead ratio. The second thing to notice is that 450 Mb is not much usable memory, given that the process should have almost its whole 2 Gb of user address space available for this list. This suggests a huge amount of fragmentation, filling up all the space with not even 500 Mb of usable data.

So what's the better solution? It is to allocate the memory in larger blocks, with the minimum possible amount of extra management data per element. Also, a free element must be found very fast: if there has to be a search for a free spot, it will kill all performance. I ended up with a quite good solution requiring only 4 bytes of extra data per element, and finding a free element as well as freeing an element in O(1), meaning without any kind of loop to find the free spot. I won't bore you with the details here. What I ended up with is actually not a new idea; it is in fact quite old. Had I found the following paper earlier, I could have saved myself some thinking :-). If you are interested, the paper is available here. My implementation is a bit different from what the paper describes, but the basic idea is the same.

So, here is the result from 4.30:

List elements:   120000000
Blocks total:    1977
Size total:      1.79 Gb
Size average:    948.57 Kb
Overhead total:  15.55 Kb

So we can see: many more elements fit into the list, with much more of the memory actually used by data. The overhead is very small now, because the average block size is about a megabyte, so there are not that many blocks to manage. But there are even more benefits:

  • Allocating/freeing elements is faster, because the number of alloc/free calls is greatly reduced.
  • Walking the list (and of course sorting it) is faster, although not as much as I had hoped. This effect should be more visible on systems with less physical memory available, where the OS is forced to swap pages out to disk more often.
  • ClearList() is mighty fast now, as it does not have to walk the list to free all elements. All it has to do is walk the list of blocks and free those (so in the above case that's about 2000 operations instead of 120,000,000). A small exception are string lists, as the list must still be walked to free all the strings anyway.

So what's the catch?

Yes, there is a catch. There always is, but in this case the benefits far outweigh the costs. The problem is that the list will have unused but allocated elements whenever not all blocks are full. To limit this, allocation blocks grow dynamically depending on the list size, so the overhead of unused but allocated elements stays very small compared to the overall list size.

The bigger problem arises when freeing part of the list. A block can only be freed when all of its elements are unused. Consider the (highly unlikely) case that I use the list above, then free all elements but 2000 of them, and, unlucky as I am, there is exactly one element remaining on every single memory block. This means no memory is freed at all. This sounds like a big problem at first, but really it is not that big of a deal. When allocating new elements, every available free slot is used first, so when removing and adding elements on a frequent basis (which is usually the case), the number of unused slots will still be very low. The only bad case is the one above: making a list very big, then deleting random elements (but not all of them) and then never touching the list again. This is a very unlikely scenario. Consider also this: in the 4.20 stats, 50% of the used memory was heap overhead. So even if now almost half of the allocated elements are unused, you are still better off than before, because the memory fragmentation is greatly reduced.

So overall I think this is a great improvement. Something to look forward to.

Automation to the rescue !

PureBasic has grown step by step, like every other long-running project, adding support for more platforms over time. With the forthcoming release (4.30), we have added two official distributions: Windows x64 and MacOS X x86. This led us to a problem: four additional packages to manage (full and demo for each of the platforms). Until now, we handled the packaging phase on demand, with home-made tools on Windows and scripts on Linux and OS X. Doing all the packaging was very time consuming: you have to boot the correct OS, update all the SVN repositories and pray that everything still compiles and works. Then you have to check the resulting package, make sure nothing is missing, and finally upload it to the remote purebasic.com site. As you can see, there was room for improvement. And we improved. Let's see how we proceeded:

1) We killed the Windows ‘update’ package. It was a .zip package which contained the delta files against PureBasic version… 2.00. Yeah, that was 8 years ago. This package was useful at the time, as it was way smaller and internet connections were much slower. Now the .zip is actually bigger than the regular InstallShield package, as every single file has changed since version 2.00, so there is no delta anymore. We modified the InstallShield package to support updating. Result: one package removed.

2) So starting from now, we will have 10 packages (Windows x86/x64, Linux x86, MacOS X PowerPC/x86 – all in both demo and full versions). The obvious idea was to use virtualization to have a single OS running all the PureBasic supported platforms. We got an offer from a PureBasic user to lend us a dedicated computer to use as a build server: that was perfect timing. We installed Windows Vista x64 on it and set up the corresponding virtual machines.

3) The next step was to script everything to do all the packaging phase automatically. Here are (roughly) the actions needed:

- Update all the SVN repositories (we have 5 different ones)
- Compile the compiler, libraries, IDE and all provided tools (DocumentationMaker, LibraryMaker, PureUnit etc.) from the sources
- Run the test units for the libraries and the compiler
- Create the documentation
- Create the full package
- Create the demo package (derived from the full package)
- Upload the packages on purebasic.com

And you can repeat that for every OS. We used ‘bash’ (from Cygwin on Windows and the native one on Linux/OS X) to have a common base for scripting. Windows can natively compile both the x64 and x86 versions of PureBasic, so this case was sorted. Then we used ssh (plink.exe from the PuTTY package) to connect to the various running virtual machines and remotely launch the appropriate packaging script. The automatic ssh connection is handled with ssh key authentication. After a week of work, we had our automated build server, and believe us, that's a real time and headache saver ;). Oh, and the whole build time for all versions, when nothing is built yet, is over 6 hours – on a Core2 Duo with 3 Gb of RAM. When everything is already built (i.e. no change in the libraries and such), it takes just under one hour to build everything and put it on the web site, which is quite OK.

With this new tool, we will be able to publish synchronized alpha, beta and official releases without having to crawl all over the OSes, and we hope it will increase the overall PureBasic release quality. It should also help when new platforms become available (Linux/OS X x64, anyone?).

Have fun !