Tikalon Header

Computer Hashing

March 9, 2020

The Dilbert comic strip, more enjoyable in the early years than present, is popular because it reinforces the stereotypes about the clueless managers of technology companies. At our corporate research laboratory in the 1980s, we would often joke that the important technology we would be asked to investigate on Wednesday was the topic of an article in the New York Times on the previous day. Presently, the New York Times has degenerated into clickbait headlines in my Google News feed. I avoid clicking on these because of the paywall, and any useful information is available elsewhere at no cost.

At one memorable meeting with a corporate vice president, we were asked to describe our research in a way that even our mothers could understand. One of my colleagues remarked that his mother had a Ph.D. from MIT. Although my mother has just a high school education, she was well-read and had mental clarity until her death at age 91. Although there are many factors that complicate a scientific analysis,[1] there is evidence that mental exercise can maintain cognitive function. Every morning, my mother would do the Jumble word puzzle in her newspaper, and she was better at solving it than I was.

The Jumble puzzle, introduced in 1954, consists of a verbal clue, a drawing that illustrates the clue, and a set of words, the letters of each of which are jumbled by transposition. After these words are unjumbled, there are marked letters that construct the answer, which is often a clever pun. Since these puzzles are copyrighted, I can't show any of these, but a Google image search will produce many examples.

Magic square in Albrecht Durer's Melencolia I, 1514

People who desire mental exercise with numbers, rather than the words, as in the Jumble puzzle, are fans of Sudoku. Sudoku has a 9 x 9 grid in which each column and row, and each 3 x 3 subgrid, must must contain the numbers 1-9. As a starting point, some grid cells are already populated with numbers. Sudoku type puzzles appeared in the 19th century, but the puzzle's popularity surged with the ability of computers to automatically generate such puzzles to produce inexpensive puzzle books for supermarket checkout impulse purchase racks.

Real Mathematicians prefer a different number grid puzzle called the magic square in which the columns, rows, and sometimes diagonals, sum to a particular number. A famous example, shown above, is the 4 x 4 magic square summing to 34 found in Melencolia I, a 1514 engraving by Albrecht Dürer (1471–1528), now located at the Staatliche Kunsthalle Karlsruhe. Not only is 34 found in rows, columns, an diagonals, but also in each of the quadrants and the center four squares.

(Via Wikimedia Commons. Click for larger image.)


When something is too hard, I usually search for a computer solution. In 1998, I wrote a computer program that assisted in unscrambling the Jumble words. I reasoned that a word, even when its letters are scrambled, can be identified by a hash function, implemented as a checksum in which the ASCII values of its letters are summed. As an example, tikalon gives 754 as this sum. Even in 1998, there were quite a few English word lists available on the Internet; so, generating a lookup table was easy. My Linux desktop computer has a 1 megabyte text file of English words at /usr/share/dict/american-english/words that's also linked from /etc/dictionaries-common/words to make access possible for particular programs. Occasionally, there would be hash collisions in which multiple words are found to have the same checksum value, but the proper word can be identified by context.

This simple example of a hash function makes it easy to search through a large data set, but more complex hash functions can be used to verify the authenticity of a document or computer file. One of the first of these was SHA-1 (Secure Hash Algorithm 1), a hash function that produces a 160 bit hash value called a message digest. SHA-1 was developed by the United States National Security Agency in 1995 and issued as a standard by the United States National Institute of Standards and Technology.[2]

By today's standards, SHA-1 is very simple to code, as its pseudocode description on Wikipedia demonstrates. The computers of today are far more powerful than those of 1995; and, as a consequence, SHA-1 was seldom used after 2017, and it's not recommended for use. Instead, similar hash functions with a greater number of bits, such as SHA-256 with 256 bits, are recommended.

Such hash functions can easily discern even the smallest changes in a message. For example, the common phrase, The quick brown fox jumps over the lazy dog is easily distinguished from one in which the initial capital letter is made to be lower case; i.e., the quick brown fox jumps over the lazy dog. The first example gives an SHA-1 hash of
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12,
while the hash value of the second is
16312751ef9307c3fd1afbcb993cdc80464ba0f1.[3]
You can see that a change of just one letter results in a huge difference in the result.

It's always interesting to read about the early days of computing and the pioneers who accomplished so much with their limited electronic resources. The IEEE Spectrum magazine published an article in 2018 by Hallam Stevens, history professor at Nanyang Technological University, Singapore, about IBM engineer, Hans Peter Luhn.[4] Luhn was an early developer of computer hash functions.[4]

As another reminder of the major accomplishments done by the many people who emigrated to the United States, Luhn was born in 1896 (a year after the birth of my maternal grandmother), in Barmen, Germany, which is now merged with other cities into Wuppertal. Luhn was fortunate in being born into an affluent family, but he was at the right age to serve as a German soldier in World War I. After the war, Luhn became involved in the textile industry; and, being an engineer, in 1927 he invented a device, called the Lunometer, to measure the thread count of cloth.[4]

Fig. 1 from US Patent No. 2,011,722, 'Recipe Guide,' by Hans P Luhn, August 20, 1935

Fig. 1 from US Patent No. 2,011,722, "Recipe Guide," by Hans P Luhn, August 20, 1935.

This guide determined what cocktails could be made with available ingredients. The invention was timely, since Luhn filed with the Patent Office on September 16, 1933, and Prohibition in the United States ended on December 5, 1933.

Some of Luhn's early inventions included a device for shaping women's stockings, and a game table.[4]

(Via Google Patents.[5] Click for larger image.)


Luhn joined IBM to pursue inventions in information storage and retrieval technologies more advanced than the cocktail recipe guide of US Patent No. 2,011,722, shown in the above figure. He was a true inventor, producing 70 patents for his employer, IBM.[4] In the late 1940s, Luhn partnered with chemists at MIT, to produce a punch card system to search through chemical compounds.[4] He also invented the check digit for verification of credit card numbers known as the Luhn modulus 10 algorithm,[4,6] as well as the idea of content-addressable memory as a method of search.[4]

Luhn's most significant invention was demonstrated in November, 1958. This was KWIC, an acronym for Key Word in Context.[4] KWIC was able to produce a concordance for text articles up to 5,000 words in length.[4] As Stevens writes in his Spectrum article, "KWIC was basically the era's equivalent of a search engine: It allowed users to speedily locate the information they needed."[4] A concordance shows the frequency of each word that's used in a text, so the most important words of an article can be found. In the 1960s, KWIC was crucial to computerized indexing systems such as CAS and ISI.[4]

Eugene Garfield, May 9, 2007

Eugene Garfield (1925-2017), founder of the Institute for Scientific Information (ISI), and publisher of Current Contents, among my favorite library routing list items in the 1980s.

ISI created the Science Citation Index, which led to a measure of the importance of scientific journals called the impact factor. The impact factor determined that Nature and Science were highly esteemed sources of scientific information.

Garfield's work influenced the creation of the PageRank algorithm that powers Google search.

(Garfield photograph taken on May 9, 2007, when he accepted an award from the Chemical Heritage Foundation. Wikimedia Commons image from the Science History Institute.


In 1958, in an article in the IBM Journal of Research and Development, Luhn proposed a system for automatically generating Abstracts of articles by extracting the most relevant content of the articles.[7] He did this by generating a concordance and finding those sentences that contain the most frequent keywords. The idea was that the sentences containing the most highly used keywords were the most important. Just as in today's many AI systems, the algorithm has no understanding of the article content, but it's able to generate a quick summary using statistical techniques.[4]

References:

  1. Margaret Gatz, "Educating the Brain to Avoid Dementia: Can Mental Exercise Prevent Alzheimer Disease?" PLOS Medicine, January 25, 2005, https://doi.org/10.1371/journal.pmed.0020007.
  2. Federal Information Processing Standards Publication 180-4, Secure Hash Standard (SHS), National Institute of Standards and Technology (Gaithersburg, Maryland, August, 2017), http://dx.doi.org/10.6028/NIST.FIPS.180-4.
  3. On my Linux desktop, these hash values are obtained as follows:
    echo -n "The quick brown fox jumps over the lazy dog" | openssl sha1
    echo -n "the quick brown fox jumps over the lazy dog" | openssl sha1
    A zero length string gives a hash value of
    da39a3ee5e6b4b0d3255bfef95601890afd80709,
    while abc gives a hash value of
    a9993e364706816aba3e25717850c26c9cd0d89d:
    echo -n "" | openssl sha1
    echo -n "abc" | openssl sha1
  4. Hallam Stevens, "Hans Peter Luhn and the Birth of the Hashing Algorithm," IEEE Spectrum, January 30, 2018.
  5. Hans P. Luhn, "Recipe Guide," US Patent No. 2,011,722, August 3, 1935 (Via Google Patents).
  6. H. P. Luhn, "Computer For Verifying Numbers," US Patent No. 2,950,048, August 23, 1960 (Via Google Patents).
  7. H. P. Luhn, "A Business Intelligence System," IBM Journal of Research and Development, vol. 2 , no. 4 (October, 1958), pp. 314-319, DOI: 10.1147/rd.24.0314