
Sorting Terabytes

August 6, 2010

Sorting is an essential human activity. Putting things in alphabetical order takes time while it's being done, but it pays off later when you're looking for something. That's why books contain indices. Sorting also helps in decision making. When you have only a certain amount of money to spend in a month, you rank things according to importance and draw a line to separate the necessities from the things lower on your list. Computers have made sorting easier, but at the same time they've allowed the accumulation of so much data that efficient methods of sorting large quantities of data are required.

Haven't we already found the "best" sorting method and discarded all the rest? Well, there's a problem here that involves the computer architecture used to do the sorting. Some sorts run better on some computers than on others. Some fast sorting methods require a lot of extra memory, while slower ones can sort your items "in place"; that is, you don't need any more memory than the memory in which the data already reside. Sometimes the dataset is so huge that it's stored in pieces, which presents a problem of its own. Sometimes the dataset is so small that speed is not an issue - a blink of an eye is as good as a half blink - so the programmer uses the technique that involves the fewest lines of code. I've used the least efficient sorting method, "bubble sort," on many occasions, since I can write the code off the top of my head; a minimal version appears after the list below. Here's a short list of some sorting methods.[1]

• Bubble sort
• Insertion sort
• Shell sort
• Merge sort
• Heapsort
• Quicksort
• Counting sort
• Bucket sort
• Radix sort
• Distribution sort
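
As promised, here's the kind of bubble sort one can write from memory. This is only an illustrative sketch (Python is used here purely for illustration); it sorts a list in place by repeatedly swapping adjacent out-of-order elements, which is why it's both short and slow.

    def bubble_sort(items):
        # Each pass "bubbles" the largest remaining element to the end
        # by swapping adjacent out-of-order pairs. O(n^2) comparisons.
        n = len(items)
        for i in range(n - 1):
            swapped = False
            for j in range(n - 1 - i):
                if items[j] > items[j + 1]:
                    items[j], items[j + 1] = items[j + 1], items[j]
                    swapped = True
            if not swapped:  # no swaps on this pass: already sorted
                break
        return items

    print(bubble_sort([5, 1, 4, 2, 8]))  # prints [1, 2, 4, 5, 8]

Note that it sorts in place: no extra array is needed beyond the one holding the data.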

The winner for most desktop applications is "Quicksort." I won't bore you with the details, which can be found elsewhere, but it's very good. In the past (before Wikipedia), Quicksort was "reinvented" many times over by the uninitiated who thought their names would go down in computer history. Quicksort was invented in 1960 by C. A. R. Hoare. As an example of necessity being the mother of invention, Hoare devised the technique to sort vocabulary words for a Russian-English machine translation program.
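
For the curious, here's a bare-bones sketch of Hoare's idea, again in Python. Production quicksorts partition in place and choose their pivots more carefully; this version trades speed and memory for clarity.

    def quicksort(items):
        # Divide and conquer: pick a pivot, split the data into
        # smaller, equal, and larger parts, then sort the parts.
        if len(items) <= 1:
            return items
        pivot = items[len(items) // 2]
        less = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        greater = [x for x in items if x > pivot]
        return quicksort(less) + equal + quicksort(greater)

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # prints [1, 1, 2, 3, 4, 5, 6, 9]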

(Image: alphabet blocks)

There's a realm of sorting with which few are acquainted; namely, the sorting of terabytes of data. One terabyte is 1,000 gigabytes, or one million megabytes. That's 10^12 bytes of data; it would be roughly equivalent to the records of a total world census, if such existed, or to a few seconds of data from the Large Hadron Collider. A terabyte of data would fill about 40 single-layer Blu-ray discs. Looking ahead, some companies would like to sort petabytes (1,000 terabytes) of data. There's even a contest, of sorts, maintained at http://sortbenchmark.org, with the goal of rapidly sorting terabytes of data arranged as 100-byte records. This year's record, established by a team from the University of California, San Diego, was 1.014 terabytes in one minute. The same team also broke another record by sorting one trillion data records in 172 minutes.
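
How do you sort data that won't fit in memory? The classic answer is an external merge sort: sort manageable chunks in memory, write each sorted chunk (a "run") to disk, then merge all the runs in one streaming pass. The UCSD system is a far more sophisticated, distributed relative of this idea; the single-machine Python sketch below (the file names and chunk size are illustrative, and each input line is assumed to end with a newline) shows only the principle.

    import heapq
    import itertools
    import tempfile

    def external_sort(input_path, output_path, chunk_lines=1_000_000):
        # Phase 1: read the file in chunks that fit in memory, sort
        # each chunk, and spill it to a temporary "run" file.
        runs = []
        with open(input_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Phase 2: a streaming k-way merge of the sorted runs; only
        # one line per run is held in memory at any moment.
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
        for run in runs:
            run.close()

    external_sort("unsorted.txt", "sorted.txt")  # illustrative file names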

The sorting benchmark tests are called the Indy MinuteSort and the Indy GraySort. What's significant is that the UCSD team used just a quarter as many computers as the previous record holder. They were able to do this by carefully balancing resources, such that no resource sat unnecessarily idle. Another set of benchmarks at http://sortbenchmark.org is for minimal-energy sorting, something that's becoming important in today's energy-starved world. Said team member Amin Vahdat, who holds a Chair in Engineering in the Department of Computer Science and Engineering and directs the UCSD Center for Networked Systems, "If you are idling your processors or not using all your RAM, you're burning energy and losing efficiency."[2]

The UCSD system was relatively simple.[3] It was composed of 52 HP ProLiant DL380 G6 servers, each of which had two quad-core Intel Xeon E5520 2.27 GHz processors, 24 GB of RAM, and sixteen 2.5-inch, 500 GB, 7200 RPM SATA hard drives. Each server had a Myricom 10 Gbps network card, and all communications were via standard Ethernet through a Cisco Nexus 5020 10 Gbps switch. Of course, all machines ran the Linux operating system, kernel version 2.6.32.8, with the ext4 filesystem.

References:

  1. Sorting Algorithms on Wikipedia.
  2. Data sorting world record: 1 terabyte, 1 minute, Science Blog (July 27, 2010).
  3. Alexander Rasmussen, Harsha V. Madhyastha, Radhika Niranjan Mysore, Michael Conley, Alexander Pucher, George Porter and Amin Vahdat, "TritonSort," PDF Report.
  4. Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, Software and Systems Division, Information Technology Laboratory.
  5. Sorting Algorithm Animations at sorting-algorithms.com.