"What's in a name? that which we call a rose
By any other name would smell as sweet..."

Words are built from alphabet characters, and some characters are more common in words than others. This works to the advantage of cryptologists, who use this principle to decipher weak codes. Frequency tables of character occurrence in many languages are available on the internet.[2] These depend somewhat on the corpus analyzed, so there are minor differences between tables for the same language. It will come as no surprise that the letter e is the most common character of the English language (12.7%), followed by t (9.056%), a (8.167%) and o (7.507%). The twelve most frequent characters, which are fewer than half the English alphabet, are used about 80% of the time. It's almost as if we can throw half our alphabet away and still be understood.[3]

Everyone likes to think that their name is unique, and there may be some validity to that idea, as I'll explain. There's been some recent excitement about a discovery at Fermilab that may indicate a new subatomic particle or, possibly, a new type of force.[4] These results are interesting, since the data indicating such a finding are observed at the three-sigma level. Most physicists start to believe in things at the two-sigma level, or about 95% confidence. Three-sigma corresponds to a 99.7% confidence level, which looks like a near certainty. As exciting as all this might be, we'll wait for more data from the Large Hadron Collider before writing an article on this.

I mention this work because the preprint describing the Fermilab result, posted on arXiv, has 507 authors.[5] This number is not unusual for a paper describing an accelerator experiment, but it does give a convenient source of scientist names for analysis. As the figure shows, the frequency distribution of letters in these names differs significantly from that of standard English text.

English letter distribution for general text (blue) and for the names of accelerator physicists (red). All characters of the listed name, including initials, were used, and non-English accented characters (e.g., á and é) were converted to non-accented characters. (Plot via Gnumeric)

As can be seen in the figure, the names of accelerator physicists are deficient in e, t and h, and somewhat enriched in a, m and k. Of special interest are j and z, which are nearly absent from general texts but quite prevalent in these names. I decided to develop a metric, which I call the Fermi Number, that expresses the "Ferminess" of a name; that is, how confidently we can put it in the same bin as the Fermilab authors. This equation, which doesn't use the actual Fermi function, is as follows:

where f_fermilab is the frequency of a character in the Fermilab sample, f_general is the frequency of the same character in general text, and the sum is over all the characters n in the word. Negative Fermi Numbers indicate that a name is not likely to be an author's name on a Fermilab paper, and positive numbers indicate that it might be.

As anecdotal evidence, my own last name has a Fermi Number of nearly zero (0.074074), and I'm not an accelerator physicist.[6] I'm supposedly a materials scientist, so I looked at the Fermi Numbers of 347 authors of recently posted materials science articles on the arXiv preprint server, and also at the Fermi Numbers of the members of the US House of Representatives. Histograms of these numbers appear in the following figure.
Histograms of Fermi Number occurrence in three populations: (1) the cited Fermilab paper; (2) authors of materials science papers on the arXiv preprint server; and (3) members of the current US House of Representatives. The average Fermi Number of each population is 0.509, 0.530 and 0.320, respectively. (Histogram plots via Gnumeric)
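
For readers who want to try the letter counting themselves, here is a minimal Python sketch of the first step described above: strip accents from a list of names, count the letters a to z, and compare the resulting frequencies with general English. The function names and the four sample names are hypothetical, invented for illustration. The only general-English values used are the four quoted in the text (e, t, a and o). The sketch stops at the frequency comparison and does not compute the Fermi Number itself.

```python
import unicodedata
from collections import Counter


def strip_accents(text: str) -> str:
    """Map accented characters to their unaccented base (e.g. 'é' -> 'e').

    Characters with no Unicode decomposition (e.g. 'ø') pass through
    unchanged and are later ignored by the a-z filter.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_frequencies(names):
    """Return the relative frequency of each letter a-z across all names."""
    counts = Counter()
    for name in names:
        for ch in strip_accents(name).lower():
            if "a" <= ch <= "z":  # skip spaces, periods, hyphens, etc.
                counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: counts[ch] / total for ch in sorted(counts)}


# General-English frequencies quoted in the text, as fractions.
GENERAL = {"e": 0.127, "t": 0.09056, "a": 0.08167, "o": 0.07507}

# Hypothetical sample (assumption); a real run would use the 507 author names.
sample_names = ["J. Kowalski", "M. Pérez", "Z. Zhang", "A. Tanaka"]

freqs = letter_frequencies(sample_names)
for letter, general in GENERAL.items():
    observed = freqs.get(letter, 0.0)
    print(f"{letter}: names {observed:.3f} vs general {general:.3f}")
```

Initials are counted along with the rest of each name, as in the first figure. With only a handful of names the frequencies are very noisy, so a sample of several hundred names, like the ones used here, is needed before the comparison means much.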