Page 1 of 1

CountingIntSort - sorting quicker than quick

Posted: Sun May 23, 2004 6:39 pm
by Froggerprogger
Code updated for 5.20+

[edit]
1st I updated the following code, it works with linked lists now, too.
2nd I read about the 'CountingSort' and realized that this one is very close to it, but this here contains less overhead (and does not work with structures for that reason - in opposite to 'original' CountingSort) - so I call it now CountingIntSort.
[/edit]


Hi!
The following sort algorithm is VERY fast.
But it has 2 handicaps:
1st: You have to keep the range of values small ~ < 1.000.000
2nd: It does only work for integer values in this version

(I'm thinking on a possibility to get it working with a 32-Bit-range, and then floats and strings, too.)

Code: Select all

;-
;-  CountingIntSort 1.1 (PB - Windows)
;- 
;-
;-  PRO:
;-  This algo is really fast (here > 6 times faster than SortArray()-Quicksort at 2.000.000 elements
;-  in range -5.000 To 5.000), and it's speed increases just linear with arraysize,
;-  so it's getting faster !! [speed is Theta(n) and not Theta(n*log2(n)) ]
;-  Further it's speed is independant of the given order of the elements.
;-  It works for Linked Lists, too.
;- 
;-  CONTRA:
;-  You'll probably need A LOT of memory:
;-  memory usage: 4 Bytes * number_of_different_values_in_array
;-  So don't use this algo for LONG-values in the whole range (you'll need to allocate 16GB memory)
;-  So use it for values that lay in a maximum range of ~1.000.000, because the greater the range
;-  the longer the memory-allocation lasts.
;-  Further it doesn't work with floats or strings.
;-
;-
;-  USAGE:
;-  You can simply call one of the two 'interface'-procedures:
;-
;-  CountingIntSort(    pointerToData [e.g. Arrayname()],
;-                      minValueInArray,
;-                      maxValueInArray)
;-
;-  CountingIntSortLL(  pointerToListelement [e.g. FirstElement(MyList())],
;-                      dataType [#Long - you may use #LL_Long also],
;-                      minValueInArray,
;-                      maxValueInArray)
;-
;-                      (Independant of pointerToListelement the whole list will be sorted)
;-
;-
;-  ...or alternatively you might mention the required data explicitely:
;-
;-  CountingIntSortEx(  pointerToDataInMemory, [or to ListElement]
;-                      maximumID [e.g. numberOfElements-1, or CountList(MyList())-1],
;-                      dataType [#Long | #LL_Long - use #LL_Long for a Linked List]
;-                      minValueInArray,
;-                      maxValueInArray)
;-
;-
;-  The values max/minValueInArray might be any integers
;-
;-
;-  by Froggerprogger, 27.05.04
;-
;-

;- declarations
; constants
#LL_Long = $FFFFFFFF

; structures
Structure LL_ELEM
  nextElem.i
  prevElem.i
  StructureUnion
    l.l
    f.f
  EndStructureUnion
  s.s
EndStructure

; declares
Declare.l CountingIntSort(*p_array, p_minVal.l, p_maxVal.l)
Declare.l CountingIntSortLL(*p_LL.LL_ELEM, p_arrType.l, p_minVal.l, p_maxVal.l)
Declare.l CountingIntSortEx(*p_array, p_maxID.l, p_arrType.l, p_minVal.l, p_maxVal.l)

;- procedures
Procedure.l CountingIntSort(*p_array, p_minVal.l, p_maxVal.l)
  ; call the main-routine with the correct parameters
  If *p_array = 0 Or PeekL(*p_array - 4) <> 5
    ProcedureReturn #False
  Else
    ProcedureReturn CountingIntSortEx(*p_array, PeekL(*p_array - 8) - 1, #PB_Long, p_minVal, p_maxVal)
  EndIf
EndProcedure

Procedure.l CountingIntSortLL(*p_LL.LL_ELEM, p_arrType.l, p_minVal.l, p_maxVal.l)
  Protected maxID.l
  Protected *LLelem.LL_ELEM 
  
  *p_LL - SizeOf(Integer)*2 ; FirstElement() returns the address to the data, so we need to rewind to get the start address of the linkedlist element
 
  If *p_LL = 0 Or *p_LL = 8
    ProcedureReturn #False
  EndIf
 
  ; search the first element
  While *p_LL\prevElem <> 0
    *p_LL = *p_LL\prevElem
  Wend
   
  ; copy the pointer to the first element
  *LLelem = *p_LL
 
  ; count the elements - as fast as CountList()
  While *LLelem\nextElem <> 0
    *LLelem = *LLelem\nextElem
    maxID + 1
  Wend

  ; call the main-routine with the correct parameters
  If (p_arrType <> #PB_Long And p_arrType <> #LL_Long) Or maxID = 0
    ProcedureReturn #False
  Else
    ProcedureReturn CountingIntSortEx(*p_LL, maxID, #LL_Long, p_minVal, p_maxVal)
  EndIf
EndProcedure

Procedure.l CountingIntSortEx(*p_array, p_maxID.l, p_arrType.l, p_minVal.l, p_maxVal.l)
 
  ; used generally
  Protected *mem
  Protected i.l, j.l
  Protected *offset
  Protected range.l
 
  ; used by stream out
  Protected hits.l
  Protected actVal.l
 
  ; used by Linked Lists
  Protected *LLelem.LL_ELEM

  ; check for valid parameters
  If *p_array = 0 Or (p_arrType <> #PB_Long And p_arrType <> #LL_Long)
    ProcedureReturn #False
  EndIf
 
  ; if it is a list then search for the first element
  If p_arrType = #LL_Long
    If *p_array = 8
      ProcedureReturn #False
    EndIf
    *LLelem = *p_array
    ; search the first element
    While *LLelem\prevElem <> 0
      *LLelem = *LLelem\prevElem
    Wend
    *p_array = *LLelem
  EndIf
 
  ; Exchange p_minVal <--> p_maxVal if necessary
  If p_minVal > p_maxVal
     actVal = p_minVal
     p_minVal = p_maxVal
     p_maxVal = actVal
  EndIf

  ; calculate the range
  range = p_maxVal - p_minVal + 1

  ; allocate Memory for sorting and quit if not successful
  *mem = AllocateMemory(4 * (range + 1))
  If *mem = 0
    ProcedureReturn #False
  EndIf


  ;- do the main sorting for type LONG
  If p_arrType = #PB_Long
 
    ; Jump in and count the number of hits
    For i=0 To p_maxID
      actVal = PeekL(*p_array + 4*i)
      If actVal < p_minVal Or actVal > p_maxVal
        FreeMemory(*mem)
        ProcedureReturn #False
      EndIf
      *offset = *mem + (actVal - p_minVal) * 4
      PokeL(*offset, PeekL(*offset) + 1)
      hits + 1
    Next
   
    ; Stream out the values of the hits in sorted order
    *offset = 0
    For i=0 To 4 * range Step 4
      hits = PeekL(*mem + i)
      If hits > 0
        actVal = p_minVal + i / 4
        For j = 1 To hits
          PokeL(*p_array + *offset, actVal)
          *offset + 4
        Next 
      EndIf
    Next
   
   
  ;- do the main sorting for type LL_LONG
  ElseIf p_arrType = #LL_Long
    ; set *LLelem to our first element
    *LLelem = *p_array
   
    ; Jump in and count the number of hits
    For i=0 To p_maxID
      actVal = *LLelem\l
      If actVal < p_minVal Or actVal > p_maxVal
        FreeMemory(*mem)
        ProcedureReturn #False
      EndIf
      *offset = *mem + (actVal - p_minVal) * 4
      PokeL(*offset, PeekL(*offset) + 1)
      *LLelem = *LLelem\nextElem
    Next

    ; set *LLelem back to our first element
    *LLelem = *p_array
   
    ; Stream out the values of the hits in sorted order
    For i=0 To 4 * range Step 4
      hits = PeekL(*mem + i)
      If hits > 0
        actVal = p_minVal + i / 4
        For j = 1 To hits
          *LLelem\l = actVal
          *LLelem = *LLelem\nextElem
        Next 
      EndIf
    Next
  EndIf
 
  FreeMemory(*mem)
 
  ProcedureReturn #True
EndProcedure





;-
;- example / speed comparison
;-
#maxID = 2000000 ; (2.000.000)
Dim A.l(#maxID)
Dim B.l(#maxID)
NewList LL.l()

lower = -5000
upper = 5000

; create 3 times the same data
For i=0 To #maxID
  A(i) = Random(upper - lower) + lower
  B(i) = A(i)
  AddElement(LL())
  LL() = A(i)
Next

MessageRequester("","Ready to start comparison...",0)

start = GetTickCount_()
  CountingIntSort(A(), lower, upper)
stop1 = GetTickCount_() - start

start = GetTickCount_()
  CountingIntSortLL(FirstElement(LL()), #PB_Long, lower, upper)
stop2 = GetTickCount_() - start

start = GetTickCount_()
  SortArray(B(), 0)
stop3 = GetTickCount_() - start

; Check if the result is in correct order
last = $80000000
For i=0 To #maxID
  If A(i) < last
    MessageRequester("","Error in CountingIntSort() : " + Str(A(i)) + " < " + Str(last),0)
    Break
  EndIf
  last = A(i)
Next
last = $80000000
ForEach LL()
  If LL() < last
    MessageRequester("","Error in CountingIntSortLL() : " + Str(LL()) + " < " + Str(last) ,0)
    Break
  EndIf
  last = LL()
Next
last = $80000000
For i=0 To #maxID
  If B(i) < last
    MessageRequester("","Error in SortArray() : " + Str(B(i)) + " < " + Str(last),0)
    Break
  EndIf
  last = B(i)
Next

MessageRequester("","CountingIntSort: " + Str(stop1) + " ms" + Chr(13)+Chr(10)+ "CountingIntSortLL: " + Str(stop2)+" ms"+Chr(13)+Chr(10)+"PB-Sort: " + Str(stop3) + " ms")

Posted: Mon May 24, 2004 12:45 pm
by KarLKoX
Very nice (and usefull) :)

Posted: Mon May 24, 2004 1:56 pm
by Dare2
Pretty impressive. :)

Posted: Mon May 24, 2004 3:16 pm
by Froggerprogger
...fixed 3 'bugs':

- changed '((' to '('
- changed ')) + 1' to '+ 1 )' (whoops :) )

- the 'docs' were talking of DirectJumpSort() instead of DirectSort()
hmmmm. Or did DirectJumpSort even sound better ?

Posted: Tue May 25, 2004 8:05 pm
by oldefoxx
Writing effective Sort processes can be really interesting work. Using
recursive calls does have a down side, in that you can exceed the limits imposed by the size of the available stack space or your memory. You can use a Global structure in place of the stack, which gives you some external control over the size and limits (as well as determining how much you really need as well).

One problem with some sort processes is their disregard for any prior sort order - that is, if I have five arrays related to each other, say LastName(), FirstName(), MI(), Phone(), DOB(), and I wanted to sort these into a given order, then I would start with the least significant sort first and do the major sort last. In this case, I would probably be content to sort by MI(), then FirstName(), then LastName(), so that my final sort order would be by Last Name, First Name, MI.


The problem is, if I tried this with Hash sort and some of the other methods which do not retain prior sort order, the only sort order that carries through is the last one done. If I did my three sorts in the order I named, the Last Names() would be correctly sorted, but all the other arrays would be jumbled up by the last sort. So the chance are that I might find something like:

Tailor, Steve M.
Tailor, Ann E.
Tailor, Carl J.
Tailor, Arthur G.

That is the reason that QuickSort continues to be a favorite, since it does retain memory of prior order.

Now note that the Integer sort itself is not very useful, since it does not tell us very much about any other properties. For instance, what do these integers represent? And how can we use that information?

The other thing is that we classify those properties into different groups, just as I did with the names, phone numbers, and ages above. If we sort one, how do we maintain the relationship of that one to all the other groups?

The answer is that we create what is called an index array, where any index in that array points to the corresponding element in each of the groups. If I have an index of 27, then I am referring to the 27th entry in each of the other groups.

So instead of changing the order of elements in LastName(), I change the order of the indexes in the indexed array to reflect the new order in the LastName() array. So if the last name in LastName(27) was Aabbay, placing this person first in my sort order, then Index(1)=27 would reflect that after the sort. Using LastName(Index(1)) as well as FirstName(Index(1)) and Index(1) for each of the other groups would reflect the same entry (27 in this case), but the index count of 1 would let me know that this was the first person in my sort order.

By combinding these techniques (a sort process that retains prior order, and a secondary index array to track prior order), you can write a sort process that has many applications and uses.

Posted: Fri May 28, 2004 12:32 am
by Froggerprogger
That is the reason that QuickSort continues to be a favorite, since it does retain memory of prior order.
I never thought about that - but that's really a big point for Quicksort.

So instead of changing the order of elements in LastName(), I change the order of the indexes in the indexed array
Yes, the above Sort-algo cannot handle Structures with a kind of sort-key inside, but the 'real' Counting-Sort can do it. It just sorts the indexes and can resort all the original data afterwards.
But anyway the problem of the small value range persists.

btw: I updated the code in the first topic - it works with lists now, too (and this kind of ListSort can still be faster than SortArray() )