All the sorting methods considered thus far used a single array as both the source and the destination of the data. In most cases, a single temporary variable to hold a pivot or a value for swapping was all the extra storage required. Sometimes, however, it may not be practical to do things this way, and additional so data being sorted may be required. Consider, for instance, the problem of combining two already sorted lists of items into a single sorted list.
Combining two sorted lists into a single sorted list is called merging.
The original lists will not do as a destination for the data (they may well be too small for one thing). Even if each were actually large enough to hold all the data from both, an auxiliary store might still be needed to hold data from one while that from the other was being inserted into its proper place.
Merge list 2 with list 1.
list 1:
1,4,5,23,78,90,340,1190,3456,...
list 2:
150,234,250,300,340,456,784,987,1121,...
Because of this, merging is usually done from the two source lists into a third destination one. A counter is set up for each of the three lists, and the smallest of the two items at the current counter positions in the source lists is placed at the counter position in the destination. When one of the two sources is exhausted, the other source is copied into the destination. Here is some pseudocode to express this algorithm:
Merge: set countS1, countS2 and countD all to zero while countS1 < lengthS1 and countS2 < lengthS2 if source1 [countS1] < source2 [countS2] then set dest [countD] to source1 [countS1] increment countS1 else set dest [countD] to source2 [countS2] increment countS2 end while increment countD if countS1 = lengthS1 then while countS2 < lengthS2 set dest [countD] to source2 [countS2] increment countS2 increment countD end while else while countS1 < lengthS1 set dest [countD] to source1 [countS1] increment countS1 increment countD end while end if
This code could be simplified somewhat by introducing sentinels after the end of the two source ranges, provided those positions are available for use, and one is not concerned about a few extra comparisons. In this version, when one list is exhausted, the comparisons are still made, but to the sentinel, and the other list is appended without having to write separate code to do so.
Merge1: set countS1, and countS2 to zero set source1 [lengthS1] and source1 [lengthS2] to maxcard for countD = 0 to lengthS1 + lengthS2 if source1 [countS1] < source2 [countS2] then set dest [countD] to source1 [countS1] increment countS1 else set dest [countD] to source2 [countS2] increment countS2 end while increment countD
Merging is a useful technique for combining two sorted disk files into one. If the cost of the extra space can be borne, however, the method can also be used to sort an array.
One of the difficulties with the quicksort was the fact that it could not be guaranteed to partition the initial list into two (roughly) equal sub-lists, even when the middle item in the initial list is chosen as the pivot. This partitioning could be forced to take place, though it would be done at the expense of not positioning a particular item in the right position when the partition is done. Some other method must then be found to get the items in the right positions.
Suppose upon doing a partition that there is only one item in the sub-list created. Then, the sub-list is already sorted, and the routine can return. If there are two items, they can be compared and swapped if necessary. On a small scale, this gives a left and right sub-list (but not around a placed pivot) that are both sorted in themselves. At this point, the two lists need to be combined into a single sorted entity before returning to the next higher level of the recursive calls.
It is at this point that a merge is performed and some extra storage is required. Since all the sorting routines thus far have sorted open arrays, where the number of elements has not been specified ahead of time, it may not be immediately obvious where the extra storage space is going to come from. One solution is to enclose the working code in a nested procedure with a value parameter to which the array can be copied. The internal procedure can then copy from portions of this copy to the actual array which can remain global to it. Because the two lists being merged are adjacent in the original array, it is not possible to use the trick with the sentinels mentioned above, so the first of the two algorithms is employed here.
PROCEDURE MergeSort (VAR source: ARRAY OF CARDINAL; lBound, uBound : CARDINAL); PROCEDURE Merge (temp : ARRAY OF CARDINAL; left, mid, right : CARDINAL); (* merges lists [left .. mid] and [mid + 1 .. right] to the list [left .. right] copying from the temporary copy to the main array global to this procedure *) VAR countS1, countS2, countD : CARDINAL; BEGIN countS1 := left; countS2 := mid + 1; countD := left; WHILE (countS1 <= mid) AND (countS2 <= right) DO IF (temp [countS1] < temp [countS2]) THEN source [countD] := temp [countS1]; INC (countS1) ELSE source [countD] := temp [countS2]; INC (countS2) END; (* if *) INC (countD); END; (* while *) IF countS1 > mid THEN WHILE countS2 <= right DO source [countD] := temp [countS2]; INC (countS2); INC (countD); END; (* while *) ELSE WHILE countS1 <= mid DO source [countD] := temp [countS1]; INC (countS1); INC (countD); END; (* while *) END; (* if *) END Merge; VAR middle : CARDINAL; BEGIN (* main mergesort procedure *) IF lBound >= uBound THEN RETURN END; middle := (lBound + uBound) DIV 2; (* do the partition *) MergeSort (source, lBound, middle); (* sort the left half *) MergeSort (source, middle + 1, uBound); (* sort the right half *) Merge (source, lBound, middle, uBound); (* merge the two into a single list *) END MergeSort;
Although the merge sort can guarantee that it will do its comparisons and merges using partitions that are equal in size, the additional copying and the extra work in merging the two lists together make this sort uncompetitive with the quicksort.
It is possible to merge two lists that are sub-lists of a larger one in place without using any auxiliary storage. One way this can be done is to scan both lists side-by-side swapping items in the ith positions in each list that are out of order. (This is actually a k-sort with k equal to (list length) DIV 2. At the conclusion of this scan, the least item will be in the lowest position of the first sub-list. If this process is now repeated but with the first list starting at its second item, two items will be in place. When this step is repeated up to the last item in the first list, all its items will be less than or equal to all items in the second part and the entire list will be sorted.
Merge list 2 with list 1.
list 1: 1, 4, 5, 23, 78, 90, 340,1190,3456
list 2: 150, 234, 250, 300, 340, 456, 784, 987,1121
On the first pass, only the items in the last two positions are interchanged, and the first list is considered sorted at the first position and so shortened, producing:
1, 4, 5, 23, 78, 90, 340, 987,1121
150, 234, 250, 300, 340, 456, 784,1190,3456
which is now interpreted as:
1,
4, 5, 23, 78, 90, 340, 987,1121
150, 234, 250, 300, 340, 456, 784,1190,3456
and sorted as:
1,
4, 5, 23, 78, 90, 340, 784,1121
150, 234, 250, 300, 340, 456, 987,1190,3456
and subsequently, as:
1, 4,
5, 23, 78, 90, 340, 456, 987
150, 234, 250, 300, 340, 784,1121,1190,3456
1, 4, 5,
23, 78, 90, 300, 340, 784
150, 234, 250, 340, 456, 987,1121,1190,3456
1, 4, 5, 23,
78, 90, 250, 340, 456
150, 234, 300, 340, 784, 987,1121,1190,3456
1, 4, 5, 23, 78,
90, 234, 300, 340
150, 250, 340, 456, 784, 987,1121,1190,3456
1, 4, 5, 23, 78, 90,
150, 250, 340
234, 300, 340, 456, 784, 987,1121,1190,3456
1, 4, 5, 23, 78, 90, 150,
234, 300
250, 340, 340, 456, 784, 987,1121,1190,3456
1, 4, 5, 23, 78, 90, 150, 234,
250
300, 340, 340, 456, 784, 987,1121,1190,3456
at which point the list is sorted. The code is presented below:
PROCEDURE Merge1 (VAR source: ARRAY OF CARDINAL; left, mid, right : CARDINAL); VAR startS1, countS1, countS2 : CARDINAL; BEGIN startS1 := left; (* set first start point at left *) WHILE startS1 <= mid (* will go through left list *) DO (* set up counters on both sides *) countS1 := startS1; countS2 := mid + 1; WHILE (countS2 <= right) DO IF source [countS1] > source [countS2] THEN Swap (source [countS1], source [countS2]) END; INC (countS1); INC (countS2); END; (* while countS2 *) INC (startS1); (* next starting position in list one *) END; (* while startS1 *) END Merge1;
Yet another variation is to step through the first sorted sub-list, comparing with the first item only of the second sorted sub-list. If something is found in the first list that is larger than the first item of the second list, the two items are swapped. The new item in the first list will be in its correct final place. The new item in the second list may not be the smallest one there, so it must be inserted into its correct place. Using the example above,
Merge list 2 with list 1.
list 1:
1, 4, 5, 23, 78, 90, 340, 1190, 3456
list 2:
150, 234, 250, 300, 340, 456, 784, 987, 1121
The first stopping place in the scan is at 340, where the number 150 from the second list belongs (everything before that in the first list is already in place.) Place 150 into a temporary variable, assign 340 to its position in the second first list, and then insert the number 340 into its correct place in list2 by a series of shifts to the left. This produces:
list 1:
1, 4, 5, 23, 78, 90, 150, 1190, 3456
list 2:
234, 250, 300, 340, 350, 456, 784, 987, 1121
At this point, there are still two sorted lists to merge, but the process can be picked up from where it was left off in the first list. On successive steps, one gets:
list 1:
1, 4, 5, 23, 78, 90, 150, 234, 3456
list 2:
250, 300, 340, 350, 456, 784, 987, 1121, 1190
list 1:
1, 4, 5, 23, 78, 90, 150, 234, 250
list 2:
300, 340, 350, 456, 784, 987, 1121, 1190, 3456
As soon as the end of the first list is reached, the merge is complete, for all the items in the second list are greater than the last one in the first list, and both lists have been kept in order throughout.
Writing this algorithm as a Modula-2 procedure, one obtains:
PROCEDURE Merge2 (VAR source: ARRAY OF CARDINAL; left, mid, right : CARDINAL); VAR countS1, countS2, temp : CARDINAL; BEGIN countS1 := left; (* start through first list *) WHILE countS1 <= mid (* quit when first list exhausted *) DO IF source [countS1] > source [mid + 1] (* compare to first in second list *) THEN temp := source [countS1]; (* set aside first list item *) source [countS1] := source [mid + 1]; (* give its spot up *) (* now find the correct place for that item that was in the first list and must now be in the second one instead *) countS2 := mid + 2; WHILE (countS2 <= right) AND (temp > source [countS2]) DO (* a typical insert loop *) source [countS2 - 1] := source [countS2]; INC (countS2); END; source [countS2 - 1] := temp; (* stick it in its spot *) END; (* if *) INC (countS1); (* continue counting on first list *) END; (* while *) END Merge2;
It has been left as an exercise to the student to determine whether this method of merging gives better results than the one that used additional storage. Notice, however, that many shifts in position are required by this method, and this will reduce its efficiency.
There are other sorting methods than the ones presented in this chapter. However, the ones given here are represent the easiest to understand and implement for the purpose of sorting arrays. Other advanced techniques also have their place in specialized circumstances, for not all data comes in arrays. These other methods will be touched upon later in the text, once the appropriate data structures have been considered.
As indicated in section 12.2.2, another situation where one might wish to sort using additional storage is the one in which the actual data items are large and therefore inconvenient to move around in memory. In such cases, it may be more efficient to use an array of pointers to the storage where the items are kept, and just move the pointers around. This is especially useful if the records are to be sorted sometimes according to one field, and sometimes according to another.
For example, suppose that the basic data types were defined by:
FROM Strings IMPORT Compare, CompareResults; TYPE Data = RECORD studentNumber : CARDINAL; studentName : ARRAY [0..80] OF CHAR; (* many other fields here *) END; DataPoint = POINTER TO Data; CompareDataProc = PROCEDURE (Data, Data) : CompareResults; Adrs = ARRAY [1 .. 13] OF DataPoint; VAR theStuff : ARRAY [1 .. 13] OF Data; theStuffNumAdrs, theStuffNameAdrs : Adrs; count : CARDINAL; CompareData : CompareDataProc;
The purpose of the procedure type is to allow the code for the sorting procedure to be written only once, but called in two different ways from the main program. Here, a test program was written using the following procedures:
PROCEDURE CompareDataNum (first, second : Data) : CompareResults; BEGIN IF first.studentNumber < second.studentNumber (* alter code to suit data type *) THEN RETURN less ELSIF first.studentNumber > second.studentNumber THEN (* here too *) RETURN greater ELSE RETURN equal END; END CompareDataNum; PROCEDURE CompareDataName (first, second : Data) : CompareResults; BEGIN RETURN Compare (first.studentName, second.studentName); END CompareDataName;
First, the main data array theStuff was initialized by giving values to the two fields. This data will remain in place and not be moved throughout the sorting. Next, a simple procedure was written to output the data, and one of the versions of the ShellSort was modified in a single place (highlighted in the code below) to handle the data:
PROCEDURE ShellSort (VAR source: ARRAY OF DataPoint; lBound, uBound : CARDINAL); VAR k, listCount, count, moveCount: CARDINAL; temp : DataPoint; BEGIN k := 1; REPEAT (* get first "k" in sequence *) k := 3 * k + 1 UNTIL k > uBound - lBound; REPEAT k := k DIV 3; (* work backwards on same sequence *) FOR listCount := lBound TO lBound + k - 1 (* Start as many k-lists as possible. *) DO count := listCount + k; WHILE count <= uBound DO (* each k-sort starts here *) temp := source [count]; (* set aside pointer to unsorted one *) moveCount := count; WHILE (moveCount >= lBound + k) AND (* (source [moveCount - k] > temp) *) (* original *) (CompareData (source [moveCount - k]^, temp^) = greater) DO source [moveCount] := source [moveCount - k]; (* move along to make room *) DEC (moveCount, k) (* count backward in k-list *) END; source [moveCount] := temp; (* insert new item *) INC (count, k); END; (* while count *) END; (* for listCount *) UNTIL k = 1 END ShellSort;
In the main program code, two copies were made of the arrays of pointers to this data:
FOR count := 1 TO 13 (* set up addresses *) DO theStuffNumAdrs [count] := ADR (theStuff [count]) END; theStuffNameAdrs := theStuffNumAdrs; (* start both arrays same *)
Finally, the sorting procedures were checked with invocations (among other) including:
CompareData := CompareDataNum; PrintIt (theStuffNumAdrs,13, 4, 5); WriteLn; (* before num sort *) ShellSort (theStuffNumAdrs,0,12); (* whole thing *) PrintIt (theStuffNumAdrs,13, 4, 5); WriteLn; (* after num sort *) CompareData := CompareDataName; PrintIt (theStuffNameAdrs,13, 4, 5); WriteLn; (* before name sort *) ShellSort (theStuffNameAdrs,0,12); (* whole thing *) PrintIt (theStuffNameAdrs,13, 4, 5); WriteLn; (* after name sort *)
This version of the procedure PrintIt is not included here, but the parameters indicate the number of items to print, and some formatting information. The results were:
113 = joe 77 = bob 0 = alice 50 = gerda 113 = richard 114 = allan 900 = fred 113 = freda 15 = donna 300 = rod 13 = don 135 = joe 1 = Alouyicious 0 = alice 1 = Alouyicious 13 = don 15 = donna 50 = gerda 77 = bob 113 = freda 113 = joe 113 = richard 114 = allan 135 = joe 300 = rod 900 = fred 113 = joe 77 = bob 0 = alice 50 = gerda 113 = richard 114 = allan 900 = fred 113 = freda 15 = donna 300 = rod 13 = don 135 = joe 1 = Alouyicious 1 = Alouyicious 0 = alice 114 = allan 77 = bob 13 = don 15 = donna 900 = fred 113 = freda 50 = gerda 113 = joe 135 = joe 113 = richard 300 = rod
Notice that because the actual array is accessed via the pointers (including in the procedure PrintIt) the sorting by numbers has no affect on the sorting by name--the pointers are in different arrays, and the original data is never moved. Of course, something similar could be done with other sorting algorithms, and with much larger data collections.