For 4.30, i worked a bit on LinkedLists, and how they allocate their elements. The original motivation was to ensure that linkedlist elements are as close together in memory as possible, limiting the penalty of page faults and cache misses when accessing/sorting large lists. In the end, the benefit is also for the more normal cases and not just these large extremes.
The obvious solution is to allocate the linkedlist elements in larger blocks, but this requires another level of memory management above that of the OS. What i have learned in my past encounters with memory management is that the OS memory management is very hard to beat or even come close to in performance, given how complex the task of managing memory is. However, with linkedlists there is a fact of which the OS functions (having to work in all scenarios) cannot take full advantage: A linkedlist needs a large number of small, equally sized memory blocks.
Lets start with some statistics:
My system runs Vista64 with 8Gb of RAM but i did my test with PB 32bit. So the limits described are purely due to the limitations of the processes own address space, and not due to overall system memory. I tested with a list of type Long. This is a very small example, but not a very unlikely one at that. So including the linkedlist pointers every element is 12bytes in size. I tried to squeze as many elements into the list as possible before getting a crash and then did some statistics on the heap. The results with a plain 4.20 final are this:
List elements: 40000000 Blocks total: 40000000 Size total: 457.76 Mb Size average: 12 Byte Overhead total: 457.76 Mb
First thing to notice is the heap overhead: It is exactly the same as the total amount of allocated memory. This is because the heap requires 12byte on average to manage an allocated memory block. So this is a very bad usage/overhead ratio. The second thing to notice is that 450Mb is not much usable memory, given the fact that the process should have almost its whole 2Gb of available user memory space for this list. This suggests a huge amount of fragmentation, filling up all space with not even 500Mb of usable data.
So whats the better solution ? It is to allocate the memory in larger blocks, with a minimum possible amount of extra management data per element. Also a free element must be found very fast. If there has to be a search for a free spot it will kill all performance. I ended up with a quite good solution requiring only 4byte of extra data per element, and findining a free element as well as freeing an element in O(1), meaning without any kind of loop to find the free spot. I won’t bore you with the details here. What i ended up is actually not a new idea. It is in fact quite old. Had i found the following paper earlier, i could have saved myself some thinking :-). If you are intrested, the paper is available here. My implementation is a bit different from what the paper describes, but the basic idea is the same.
So, here is the result from 4.30:
List elements: 120000000 Blocks total: 1977 Size total: 1.79 Gb Size average: 948.57 Kb Overhead total: 15.55 Kb
So we can see: Many more elements fit into the list, with much more really used memory. The overhead is very small now because the average block size is about a Megabyte and there are not that many blocks because of that. But there are even more benefits from this:
- Allocating/freeing elements is faster, because the number of alloc/free calls is greatly reduced.
- Walking the list (and also sorting it of course) is faster, although not as much as i had hoped. This effect should be more visible on systems with less physical memory available, where the OS is forced to swap pages out to disk more often.
- ClearList() is mighty fast now, as it does not have to walk the list to free all elements. All it has to do is walk the list of blocks and free those. (so in the above case thats 2000 operations instead of 120000000). A tiny exception here are string lists, as the list must be walked to free all strings anyway.
So whats the catch ?
Yes, there is a catch. There always is, but in this case the benefits much outweigh the costs. The problem is that the list will have unused but allocated elements when not all blocks are full. To limit this, allocation blocks grow dynamically depending on the list size, so the overhead of unused but allocated elements is very small compared to the overall list size.
The bigger problem arises when freeing part of the list. A block can only be freed when all its elements are unused. Consider the (highly unlikely) case that i use the list above, then free all elements but 2000 of them and unlucky as i am, there is exactly one element remaining on every single memory block. This means no memory is freed at all. This sounds like a big problem at first, but really its not that big of a deal. When allocating new elements, every available space is used first, so when removing/adding elements on a frequent basis (which is usually the case) the number of unused slots will still be very low. The only bad case is the above: making a list very big, then deleting random elements (not all) and then never touching the list again. This is a very unlikely case. Consider also this: In the stats from 4.20, 50% of the used memory was heap overhead. So even if now almost half of the allocated elements are not even used, you are still better off than before because the memory fragmentation is greatly reduced.
So overall i think this is a great improvement. Something to look forward to.