Frequency of words by first letter in the complete works of William Shakespeare. (Graphing via Gnumeric, from data found at Open Source Shakespeare).
• English - 361 billion
• French - 45 billion
• Spanish - 45 billion
• Russian - 35 billion
• Chinese - 13 billion
• Hebrew - 2 billion

The study used Google word data for English, Spanish, and Hebrew texts for the period 1800-2008. The combined corpus had 10^7 distinct words. That's ten million for three languages, or about 3.3 million words per language. Since the OED has just 0.6 million words, why the roughly five-fold discrepancy? Google's count must include a huge number of misspelled words as distinct words. Most of these likely derive from errors in the optical character recognition process. Also, it appears that every number (e.g., 1234) is listed as a distinct word, including currency values ($12.34). Fortunately, neither the errors nor the inclusion of the numbers affects the paper's analysis or conclusions.

One example of the extinction process involves regular and irregular verb forms. Irregular forms will often become regularized over time, and previous work shows that irregular verbs that are used often are less likely to become regularized. The half-life of an irregular verb was found to scale as the square root of its frequency of usage. An irregular verb that is used a hundred times less frequently regularizes ten times faster. Quantitatively, the rate of irregular verb death scales as 1/√r, where r is the verb's relative use.[7]

This equilibrium between word birth and death has many recent examples. Early in my career, I would often use the words memo and memorandum. I rarely use those words now, but I regularly use blog and email, which are words I never used twenty years ago. Often, shorter words will trump longer words over time, and scientific words will converge on their English form. Both of these ideas are demonstrated in the figure below, which shows how the term Roentgenogram has been displaced by the simpler Xray.
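The square-root scaling for irregular verbs mentioned above is easy to check numerically. What follows is a minimal Python sketch, assuming the regularization rate is exactly proportional to 1/√r and that r is normalized so a reference verb has r = 1. The function names are illustrative and do not come from the paper.

```python
import math


def relative_regularization_rate(relative_use: float) -> float:
    """Regularization rate relative to a reference verb with relative_use = 1.

    Assumes the rate scales exactly as 1/sqrt(r), as stated in the text.
    """
    if relative_use <= 0:
        raise ValueError("relative_use must be positive")
    return 1.0 / math.sqrt(relative_use)


def relative_half_life(relative_use: float) -> float:
    """Half-life relative to the reference verb; the inverse of the rate."""
    if relative_use <= 0:
        raise ValueError("relative_use must be positive")
    return math.sqrt(relative_use)


if __name__ == "__main__":
    # Compare a reference verb with verbs used 10x and 100x less often.
    for r in (1.0, 0.1, 0.01):
        rate = relative_regularization_rate(r)
        half_life = relative_half_life(r)
        print(f"r = {r:g}: rate x{rate:.2f}, half-life x{half_life:.2f}")
```

Under this assumption, a verb used a hundred times less often (r = 0.01) regularizes ten times faster and has one tenth the half-life, which matches the figures quoted in the text.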
"X" marks the spot. The term, Xray, snuffing Roentgenogram to extinction. (Fig. 1 of Ref. 1, via arXiv Preprint Server). |
"Our results support the intriguing concept that a language's lexicon is a generic arena for competition which evolves according to selection laws that are related to social, technological, and political trends... just as firms compete for market share leading to business opportunities, and animals compete for food and shelter leading to reproduction opportunities, words are competing for use among the books that constitute a corpus."