
Parallel frenzy

Ever since I got a QuadCore, I have been looking for ways to make it simple for people to use this extra power in today’s computers. Only a handful of programs available today can really use it. Thread programming is not easy: it is hard to do without introducing hidden bugs, and as I found out myself, it is even harder to do in a way that fully utilizes the available resources and does not spend all its time waiting on synchronisation points. Writing a piece of code that can max out 4 cores and still accomplish something useful is quite a challenge 🙂 And if trends continue, even 8-core machines are not unimaginable.

So how do we make this simpler? Even if writing a fully threaded program remains a challenge, maybe we can introduce commands that make it simpler to parallelize smaller portions of a program. The most obvious place to start is the sort commands, as these are CPU-intensive tasks which can be parallelized well. So I wrote some parallel sort routines. The implementation can scale up to as many cores as are available, so in 4.40 at least the sorting commands can utilize the full power of the CPU. I wrote a small framework to make parallelizing PB commands easier, so other commands may benefit as well (although most commands are simply not CPU-intensive enough for this to be useful).
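To illustrate the basic idea, here is a simplified sketch in plain PB. This is not the actual 4.40 implementation; the structure HalfRange and the procedure SortHalf are made up for this example. SortArray() can already sort a sub-range of an array, so two threads can each sort one half while the main thread waits and then merges the results, much like the top level of a mergesort. It assumes the threadsafe compiler option is enabled and that two concurrent SortArray() calls on disjoint ranges do not interfere:

```purebasic
; Simplified sketch only, not the actual 4.40 code:
; each thread sorts one half of a global array, the
; main thread merges the two sorted halves afterwards.

#Count = 1000000

Structure HalfRange
  First.l  ; first index of the range to sort
  Last.l   ; last index of the range to sort
EndStructure

Global Dim Values.l(#Count - 1)
Global Dim Merged.l(#Count - 1)

Procedure SortHalf(*Range.HalfRange)
  ; each thread works on its own, non-overlapping range
  SortArray(Values(), #PB_Sort_Ascending, *Range\First, *Range\Last)
EndProcedure

; fill the array with random longs
For i = 0 To #Count - 1
  Values(i) = Random(2147483647)
Next

Lower.HalfRange\First = 0
Lower\Last = #Count / 2 - 1
Upper.HalfRange\First = #Count / 2
Upper\Last = #Count - 1

Thread1 = CreateThread(@SortHalf(), @Lower)
Thread2 = CreateThread(@SortHalf(), @Upper)
WaitThread(Thread1)  ; the merge has to wait for both halves,
WaitThread(Thread2)  ; which is the serial part of the work

; serial merge of the two sorted halves
a = 0
b = #Count / 2
For i = 0 To #Count - 1
  If a > #Count / 2 - 1       ; lower half exhausted
    Merged(i) = Values(b) : b + 1
  ElseIf b > #Count - 1       ; upper half exhausted
    Merged(i) = Values(a) : a + 1
  ElseIf Values(a) <= Values(b)
    Merged(i) = Values(a) : a + 1
  Else
    Merged(i) = Values(b) : b + 1
  EndIf
Next
```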

SortArray() vs. parallel SortArray()

Here you see some test results for SortArray() on my QuadCore (showing the sort times for an array of longs). I tested the original version as well as a 2-thread and a 4-thread run of the command, to see how it would perform on a DualCore or a QuadCore. The improvement of the parallel variants is quite constant: the 2-thread test took on average 52% of the execution time of the serial version, while the 4-thread version takes on average 31% of the time. The improvement actually increases a little bit the larger the array gets.
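For anyone who wants to reproduce the serial baseline, this is roughly how such a timing can be taken from PB code (the 2- and 4-thread numbers come from internal builds, so this sketch only measures the normal SortArray()):

```purebasic
; Rough timing sketch for the serial SortArray() baseline.
#Count = 1000000
Dim Numbers.l(#Count - 1)

For i = 0 To #Count - 1
  Numbers(i) = Random(2147483647)
Next

Start = ElapsedMilliseconds()
SortArray(Numbers(), #PB_Sort_Ascending)
Debug "Sorting " + Str(#Count) + " longs: " + Str(ElapsedMilliseconds() - Start) + " ms"
```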

Even though this is a good improvement, why not more? After all, 31% is only about a factor of 3, while we would expect a factor of 4 from the 4-thread variant. Some of this is due to management overhead, but there is also a part which simply cannot be parallelized: not all threads can get to work immediately. The first partition step has to be done by a single thread before 2 threads can get to work on the second step of the sorting (one on each half of the data). Only at the third level (and all subsequent ones) can 4 threads even find work. This serial part is always there, and it limits the improvement that can be gained by parallelizing the rest, no matter how many cores are present (this is called Amdahl’s law).
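To put this into numbers: Amdahl’s law says that with n threads and a parallelizable fraction p of the work, the best possible execution time is (1 − p) + p/n of the serial time. Assuming roughly 90% of the sort parallelizes (an estimate for illustration, not a measured figure), 2 threads give 0.1 + 0.9/2 = 55% and 4 threads give 0.1 + 0.9/4 = 32.5%, which is quite close to the measured 52% and 31%. Even with infinitely many cores, the time could never drop below the serial 10%.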

SortList() vs. parallel SortList()

The results for SortList() are not as good. Here the results vary depending on the list size, with a best of 55% but an average of 71% for the 2-thread test, and a best of 47% but an average of 67% for the 4-thread version. Parallelizing a mergesort involves more overhead than a quicksort, as there is some waiting time until two smaller sort steps can be merged into a larger one. Since the larger sort steps come last and depend on the smaller ones (unlike with quicksort), this cannot be avoided. Also, since a list is spread more widely through memory than an array, cache effects play a role too. Still, the results for both SortList() and SortArray() are a good improvement in my opinion.

How hard it is to write code that fully utilizes the CPU power can be seen in the parallel SortList(). I went through a number of different implementations for this command, trying to come up with more compact versions (which use less code or fewer memory reads/writes), only to find out that they performed worse than the original serial version due to cache effects. For example, I found a mergesort code on the net which uses only a handful of variables and a minimal number of read operations. I implemented it in asm, using exclusively registers to store the variables. Instinct tells us that this is an optimal solution, but the way the code reads the list makes it read and overwrite values in the cache on every run. This makes it run slow as hell for large lists, even if executed on 4 cores. The key word here is “locality”: memory locations which are close together must be processed together, so that the values are still in the cache (or at least in main memory for large lists). The current implementation performs quite well in this respect.
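The effect is easy to demonstrate with a toy example that has nothing to do with the sort code itself: summing the same array once sequentially and once with a large stride does exactly the same amount of work, but the strided walk misses the cache on nearly every access and is typically several times slower.

```purebasic
; Toy demonstration of locality, unrelated to the actual sort code.
#Size = 16777216  ; 16M longs = 64 MB, far larger than the cache

Dim Cells.l(#Size - 1)
Define Sum.q

Start = ElapsedMilliseconds()
For i = 0 To #Size - 1              ; sequential, cache-friendly walk
  Sum + Cells(i)
Next
Debug "sequential: " + Str(ElapsedMilliseconds() - Start) + " ms"

Start = ElapsedMilliseconds()
For j = 0 To 4095                   ; strided walk: nearly every
  For i = j To #Size - 1 Step 4096  ; access touches a cold line
    Sum + Cells(i)
  Next
Next
Debug "strided:    " + Str(ElapsedMilliseconds() - Start) + " ms"
```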

By the way, some of this locality optimisation is already present in the 4.30 release. One of the reasons why I wrote the block allocator which I talked about in my first blog post was to make sure that list elements are closer together. This way both a sort and a normal walk through the list can be executed in a more cache-friendly way. The block allocation was done in time for 4.30; the parallel stuff did not make it into that release.

We have some more new things planned related to threads and parallel programming which should make it simpler to write threaded programs in PB, but the rest is a surprise… 😉