Computer Hashing
March 9, 2020
The
Dilbert comic strip, more enjoyable in the early years than present, is
popular because it
reinforces the
stereotypes about the
clueless managers of
technology companies. At our
corporate research laboratory in the
1980s, we would often
joke that the important technology we would be asked to investigate on
Wednesday was the topic of an
article in the
New York Times on the previous day. Presently, the New York Times has degenerated into
clickbait headlines in my
Google News feed. I avoid clicking on these because of the
paywall, and any useful
information is available elsewhere at no
cost.
At one memorable
meeting with a
corporate vice president, we were asked to describe our
research in a way that even our
mothers could understand. One of my
colleagues remarked that his mother had a
Ph.D. from
MIT. Although my mother has just a
high school education, she was
well-read and had
mental clarity until her
death at age 91. Although there are many factors that complicate a
scientific analysis,[1] there is
evidence that
mental exercise can maintain
cognitive function. Every
morning, my mother would do the
Jumble word puzzle in her
newspaper, and she was better at solving it than I was.
The Jumble puzzle, introduced in 1954, consists of a
verbal clue, a
drawing that
illustrates the clue, and a set of
words, the
letters of each of which are
jumbled by
transposition. After these words are unjumbled, there are marked letters that construct the
answer, which is often a
clever pun. Since these puzzles are
copyrighted, I can't show any of these, but a
Google image search will produce many examples.
People who desire mental exercise with numbers, rather than the words, as in the Jumble puzzle, are fans of Sudoku. Sudoku has a 9 x 9 grid in which each column and row, and each 3 x 3 subgrid, must must contain the numbers 1-9. As a starting point, some grid cells are already populated with numbers. Sudoku type puzzles appeared in the 19th century, but the puzzle's popularity surged with the ability of computers to automatically generate such puzzles to produce inexpensive puzzle books for supermarket checkout impulse purchase racks.
Real Mathematicians prefer a different number grid puzzle called the magic square in which the columns, rows, and sometimes diagonals, sum to a particular number. A famous example, shown above, is the 4 x 4 magic square summing to 34 found in Melencolia I, a 1514 engraving by Albrecht Dürer (1471–1528), now located at the Staatliche Kunsthalle Karlsruhe. Not only is 34 found in rows, columns, an diagonals, but also in each of the quadrants and the center four squares.
(Via Wikimedia Commons. Click for larger image.)
When something is too hard, I usually search for a computer solution. In 1998, I wrote a
computer program that assisted in unscrambling the Jumble words. I reasoned that a word, even when its letters are scrambled, can be identified by a
hash function, implemented as a
checksum in which the
ASCII values of its letters are
summed. As an example,
tikalon gives 754 as this sum. Even in 1998, there were quite a few
English word lists available on the
Internet; so, generating a
lookup table was easy. My
Linux desktop computer has a 1
megabyte text file of English words at
/usr/share/dict/american-english/words that's also linked from
/etc/dictionaries-common/words to make access possible for particular programs. Occasionally, there would be
hash collisions in which multiple words are found to have the same checksum value, but the proper word can be identified by context.
This simple example of a hash function makes it easy to search through a large
data set, but more complex hash functions can be used to
verify the authenticity of a
document or
computer file. One of the first of these was
SHA-1 (Secure Hash Algorithm 1), a hash function that produces a 160 bit hash value called a
message digest. SHA-1 was developed by the
United States National Security Agency in 1995 and issued as a standard by the
United States National Institute of Standards and Technology.[2]
By today's standards, SHA-1 is very simple to
code, as its
pseudocode description on Wikipedia demonstrates. The computers of today are far more powerful than those of 1995; and, as a consequence, SHA-1 was seldom used after 2017, and it's not recommended for use. Instead, similar hash functions with a greater number of bits, such as
SHA-256 with 256 bits, are recommended.
Such hash functions can easily discern even the smallest changes in a message. For example, the common
phrase,
The quick brown fox jumps over the lazy dog is easily distinguished from one in which the initial capital letter is made to be lower case; i.e.,
the quick brown fox jumps over the lazy dog. The first example gives an SHA-1 hash of
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12,
while the hash value of the second is
16312751ef9307c3fd1afbcb993cdc80464ba0f1.[3]
You can see that a change of just one letter results in a huge difference in the result.
It's always interesting to read about the
early days of computing and the
pioneers who accomplished so much with their limited
electronic resources. The
IEEE Spectrum magazine published an article in 2018 by
Hallam Stevens,
history professor at
Nanyang Technological University,
Singapore, about
IBM engineer,
Hans Peter Luhn.[4] Luhn was an early developer of computer hash functions.[4]
As another reminder of the major accomplishments done by the many people who
emigrated to the United States, Luhn was born in 1896 (a year after the birth of my
maternal grandmother), in
Barmen, Germany, which is now merged with other
cities into
Wuppertal. Luhn was fortunate in being born into an
affluent family, but he was at the right age to serve as a
German soldier in
World War I. After the war, Luhn became involved in the
textile industry; and, being an engineer, in 1927 he
invented a device, called the Lunometer, to measure the
thread count of
cloth.[4]
Fig. 1 from US Patent No. 2,011,722, "Recipe Guide," by Hans P Luhn, August 20, 1935.
This guide determined what cocktails could be made with available ingredients. The invention was timely, since Luhn filed with the Patent Office on September 16, 1933, and Prohibition in the United States ended on December 5, 1933.
Some of Luhn's early inventions included a device for shaping women's stockings, and a game table.[4]
(Via Google Patents.[5] Click for larger image.)
Luhn joined IBM to pursue inventions in
information storage and retrieval technologies more advanced than the cocktail recipe guide of
US Patent No. 2,011,722, shown in the above figure. He was a true inventor, producing 70 patents for his
employer, IBM.[4] In the late
1940s, Luhn partnered with
chemists at MIT, to produce a
punch card system to search through
chemical compounds.[4] He also invented the
check digit for
verification of
credit card numbers known as the
Luhn modulus 10 algorithm,[4,6] as well as the idea of
content-addressable memory as a method of search.[4]
Luhn's most significant invention was demonstrated in November, 1958. This was
KWIC, an
acronym for
Key Word in Context.[4] KWIC was able to produce a
concordance for text articles up to 5,000 words in length.[4] As Stevens writes in his Spectrum article, "KWIC was basically the era's equivalent of a
search engine: It allowed users to speedily locate the information they needed."[4] A concordance shows the
frequency of each word that's used in a text, so the most important words of an article can be found. In the 1960s, KWIC was crucial to computerized indexing systems such as
CAS and
ISI.[4]
Eugene Garfield (1925-2017), founder of the Institute for Scientific Information (ISI), and publisher of Current Contents, among my favorite library routing list items in the 1980s.
ISI created the Science Citation Index, which led to a measure of the importance of scientific journals called the impact factor. The impact factor determined that Nature and Science were highly esteemed sources of scientific information.
Garfield's work influenced the creation of the PageRank algorithm that powers Google search.
(Garfield photograph taken on May 9, 2007, when he accepted an award from the Chemical Heritage Foundation. Wikimedia Commons image from the Science History Institute.
In 1958, in an article in the
IBM Journal of Research and Development, Luhn proposed a system for automatically generating
Abstracts of articles by extracting the most relevant content of the articles.[7] He did this by generating a concordance and finding those sentences that contain the most frequent keywords. The idea was that the sentences containing the most highly used keywords were the most important. Just as in today's many
AI systems, the
algorithm has no
understanding of the article content, but it's able to generate a quick summary using
statistical techniques.[4]
References:
- Margaret Gatz, "Educating the Brain to Avoid Dementia: Can Mental Exercise Prevent Alzheimer Disease?" PLOS Medicine, January 25, 2005, https://doi.org/10.1371/journal.pmed.0020007.
- Federal Information Processing Standards Publication 180-4, Secure Hash Standard (SHS), National Institute of Standards and Technology (Gaithersburg, Maryland, August, 2017), http://dx.doi.org/10.6028/NIST.FIPS.180-4.
- On my Linux desktop, these hash values are obtained as follows:
echo -n "The quick brown fox jumps over the lazy dog" | openssl sha1
echo -n "the quick brown fox jumps over the lazy dog" | openssl sha1
A zero length string gives a hash value of
da39a3ee5e6b4b0d3255bfef95601890afd80709,
while abc gives a hash value of
a9993e364706816aba3e25717850c26c9cd0d89d:
echo -n "" | openssl sha1
echo -n "abc" | openssl sha1
- Hallam Stevens, "Hans Peter Luhn and the Birth of the Hashing Algorithm," IEEE Spectrum, January 30, 2018.
- Hans P. Luhn, "Recipe Guide," US Patent No. 2,011,722, August 3, 1935 (Via Google Patents).
- H. P. Luhn, "Computer For Verifying Numbers," US Patent No. 2,950,048, August 23, 1960 (Via Google Patents).
- H. P. Luhn, "A Business Intelligence System," IBM Journal of Research and Development, vol. 2 , no. 4 (October, 1958), pp. 314-319, DOI: 10.1147/rd.24.0314