
Computers as Listeners and Speakers

November 4, 2013

It's usually easy to distinguish male from female voices, since most women speak and sing at higher frequencies than most men. A more technical analysis of speech signals shows that all speech information is contained within the audio frequency band below 20 kHz. Further experimentation shows that intelligible speech is carried by frequencies between about 300 and 3,400 Hz, while most of the signal amplitude is concentrated near the voice's fundamental frequency, between roughly 80 and 260 Hz.

The frequency range of intelligible speech was very important in the design of the
telephone system, since you wouldn't want to spend money on high-frequency components that weren't required. Telephone research also led to the first form of digital encoding of speech in a system called the vocoder, patented in 1939 by Bell Labs acoustical engineer Homer Dudley.[1]

In what was a
tour de force in the era of vacuum tube electronics, Dudley used a bank of audio filters to determine the amplitude of the speech signal in each band. These amplitudes were encoded as digital data for transmission to a remote bank of oscillators that reconstructed the signal.
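
Dudley's machine was built from analog filters and vacuum tubes, but the principle translates into a few lines of modern code. Here is a minimal sketch in Python (an illustration of the idea, not a reconstruction of Dudley's design): a bank of band-pass filters measures the slowly varying envelope in each band, and a bank of oscillators at the band centers, scaled by those envelopes, rebuilds an intelligible approximation of the original.

import numpy as np
from scipy.signal import butter, sosfilt

def channel_vocoder(speech, fs=8000, n_bands=10, frame=0.02):
    """Toy channel vocoder: measure band envelopes at a low rate,
    then rebuild the signal from envelope-modulated oscillators."""
    edges = np.geomspace(300.0, 3400.0, n_bands + 1)  # cover the telephone band
    hop = int(frame * fs)                             # one envelope value per 20 ms
    t = np.arange(len(speech)) / fs
    output = np.zeros(len(speech))

    for lo, hi in zip(edges[:-1], edges[1:]):
        # Analysis: band-pass filter, then a crude low-rate amplitude envelope
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfilt(sos, speech)
        n_frames = len(band) // hop
        coarse = np.abs(band[:n_frames * hop]).reshape(n_frames, hop).mean(axis=1)
        envelope = np.pad(np.repeat(coarse, hop), (0, len(speech) - n_frames * hop))
        # Synthesis: an oscillator at the band center, scaled by the envelope
        center = np.sqrt(lo * hi)
        output += envelope * np.sin(2 * np.pi * center * t)
    return output

Only the low-rate envelopes (here, fifty values per second per band) need to be transmitted, which is where the compression comes from; Dudley's system also sent the pitch and a voiced/unvoiced decision, which this sketch omits.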

The vocoder allowed
compression and multiplexing of many voice channels over a single submarine cable. Encryption of the digital data allowed secure voice communications, a technique used during World War II.

The vocoder operated on band-limited signals, independently of their origin. It was overkill as far as speech signals are concerned, since the energy of human speech is concentrated in definite frequency bands called
formants (see figure). Formants arise from the way that human speech is generated: airflow through the larynx produces an excitation signal that excites resonances in the vocal tract.

Spectrograms of vowel sounds
Spectrograms of the average female (left) and male (right) voicing of vowels. These are the English vowel sounds 'eh' (bet), 'ee' (see), 'ah' (father), 'oh' (note), and 'oo' (boot). Note the overall lower frequencies of the male voice, as well as the slower male cadence. (Fig. 1 of ref. 2, licensed under a Creative Commons License.)[2]
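
Formants show up clearly even in a spectrogram of a crudely synthesized vowel. The short Python sketch below is my own illustration (not the code behind the figure above): a pulse train at an assumed pitch of 120 Hz stands in for the glottal source, two two-pole resonators at rough textbook formant frequencies for 'ah' stand in for the vocal tract, and SciPy's spectrogram function displays the result.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import lfilter, spectrogram

fs = 8000                       # sample rate, Hz
f0 = 120                        # assumed pitch (fundamental frequency), Hz
formants = [730, 1090]          # rough first two formant frequencies of 'ah', Hz

# Excitation: a pulse train at the pitch frequency (a crude glottal source)
excitation = np.zeros(fs)       # one second of signal
excitation[::fs // f0] = 1.0

# Vocal tract: a cascade of two-pole resonators, one per formant
signal = excitation
for f in formants:
    r = 0.97                    # pole radius sets the resonance bandwidth
    a = [1.0, -2.0 * r * np.cos(2 * np.pi * f / fs), r * r]
    signal = lfilter([1.0], a, signal)

# Spectrogram: the energy concentrates in bands near the formant frequencies
freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=256)
plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-12), shading='auto')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()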

Knowledge of the way that human speech is created allowed development of a
speech synthesis technique called formant synthesis, which models the physical production of sound in the human vocal tract. This was most successfully developed as linear predictive coding (LPC), implemented by Texas Instruments in its LPC integrated circuits, the chips used in its Speak & Spell toy. My e-book reader has a very good text-to-speech feature with both male and female speakers.
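
LPC models the vocal tract as an all-pole digital filter whose coefficients are re-estimated every few tens of milliseconds from the speech itself. The sketch below shows the core analysis step, the Levinson-Durbin recursion applied to a frame's autocorrelation; it's an illustration of the principle rather than TI's chip-level implementation, and the random 'frame' merely stands in for a real 30 ms slice of recorded speech.

import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    """Estimate all-pole (LPC) coefficients for one speech frame using
    the autocorrelation method and the Levinson-Durbin recursion."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]

    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / error                  # reflection coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        error *= (1.0 - k * k)            # remaining prediction error
    return a, error

# Resynthesis: drive the all-pole vocal-tract model 1/A(z) with a pulse train
fs, f0 = 8000, 120                        # assumed sample rate and pitch
frame = np.random.randn(240)              # stand-in for a real 30 ms speech frame
a, gain = lpc_coefficients(frame, order=10)
excitation = np.zeros(fs)
excitation[::fs // f0] = np.sqrt(max(gain, 1e-9))
synthetic = lfilter([1.0], a, excitation)

Storing only a handful of filter coefficients, a gain, and the pitch for each frame is what let the Speak & Spell fit whole words into a modest read-only memory.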

My favorite talking machine, Robby the Robot, as he appeared at the 2006 San Diego Comic-Con.

Robby was a character in the 1956 movie
Forbidden Planet, which I saw as a nine-year-old child.

A second favorite would be
Bender from Futurama, while my least favorite would be Twiki from Buck Rogers in the 25th Century.

(Photo by
Patty Mooney, via Wikimedia Commons.)

It should come as no surprise that research in artificial speech production has led to methods for
speech recognition. The Wikipedia list of speech recognition software includes quite a few implementations, including the ever-popular Siri, Google Voice Search, and a number of free and open-source software (FOSS) packages.

Some early voice recognition software improved
reliability by having a single user speak words from a selected dictionary to calibrate the system to that user's voice. Modern applications try as much as possible to hide the "computer" part of computing from the user, so this is no longer done. As an episode of The Big Bang Theory shows, such voice recognition has its flaws, even in a single-speaker environment. Is speech recognition of multiple speakers in a conversation even possible with today's technology?

Humans have no trouble with the task of identifying speakers in a group conversation, so how hard would it be for a computer to do the same? A team of
computer scientists in the Spoken Language Systems Group at MIT's Computer Science and Artificial Intelligence Laboratory has tackled this problem, which is termed "speaker diarization."[3-5] Speaker diarization is the automatic determination of how many speakers there are, and which of them speaks when. It would be useful for indexing and annotating audio and video recordings.[4]

A
sonic representation of a single speaker involves the analysis of more than 2,000 different speech sounds, such as the vowel sounds represented in the spectrogram, above. Each of these can be adequately represented by about sixty variables.[4] When several speakers are involved in a conversation, the diarization problem becomes a search of a parameter space of more than 100,000 dimensions (roughly 2,000 sounds times sixty variables). Since you would like to avoid always needing to do diarization on a supercomputer, you need a way to reduce the complexity of the problem.[4]

As an
analogy of how such a simplification is achieved, consider the cumulative miles traveled by a train as a function of time. If we just consider the raw data, we have a two-dimensional graph of miles (y) vs. time (x), represented by a straight line. If we apply a mathematical transformation that rotates the graph so this line lies along the x-axis, then all the variation happens along that single axis, and we have eliminated one of the two dimensions. The MIT research team's approach is to find the "lines" in the parameter space that encode most of the variation.[4]
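
Finding those "lines" is a standard piece of linear algebra. The sketch below is a generic principal-component calculation (not the MIT group's code): it builds noisy miles-versus-time data, rotates it onto its direction of greatest variation, and confirms that essentially all of the variance then lies along a single axis.

import numpy as np

# Synthetic "train" data: cumulative miles vs. time, with a little noise
rng = np.random.default_rng(0)
time = np.linspace(0, 10, 200)                        # hours
miles = 60.0 * time + rng.normal(0, 1.0, time.size)   # roughly 60 mph, plus noise
data = np.column_stack([time, miles])

# Principal components: eigenvectors of the covariance of the centered data
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))

# Rotate so the direction of greatest variation lies along the first axis
order = np.argsort(eigvals)[::-1]
rotated = centered @ eigvecs[:, order]

# Nearly all the variance now lives in the first coordinate; the rest can be dropped
print("fraction of variance along each axis:", eigvals[order] / eigvals.sum())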

A representation of the cluster analysis of multiple speakers.

(Still image by Stephen Shum from a YouTube Video.[5])

Stephen Shum, a graduate student in Electrical Engineering and Computer Science at MIT and lead author of the paper describing the technique, found that a 100-dimensional approximation of the parameter space was an adequate representation. In any given conversation, not all speech sounds are used, so a single recording might need just three variables to classify all of its speakers.[4]

Shum's system starts with the assumption that there are fifteen speakers, and it uses an
iterative process to reduce that number by merging close clusters until the actual number of speakers is reached.[4] The technique was tested on the multi-speaker CallHome telephone corpus.[3]
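
In spirit, this resembles agglomerative clustering: over-cluster first, then merge. The sketch below is my own toy illustration on assumed, well-separated synthetic data, not Shum's algorithm; it starts with fifteen clusters and repeatedly merges the two closest cluster centers until the survivors are far enough apart to count as distinct speakers.

import numpy as np

def merge_clusters(segments, n_init=15, merge_threshold=1.0):
    """Toy diarization-style clustering: over-cluster into n_init groups,
    then iteratively merge the two closest cluster centers until all
    remaining centers are farther apart than merge_threshold."""
    rng = np.random.default_rng(0)
    # Crude initial over-clustering: pick n_init segments as seeds and assign
    # every segment to its nearest seed (a stand-in for a real initialization)
    seeds = segments[rng.choice(len(segments), n_init, replace=False)]
    labels = np.argmin(np.linalg.norm(segments[:, None] - seeds[None, :], axis=-1), axis=1)

    while True:
        ids = np.unique(labels)
        if len(ids) <= 1:
            break
        centers = np.array([segments[labels == i].mean(axis=0) for i in ids])
        dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        if dists[i, j] > merge_threshold:
            break                               # remaining clusters are distinct speakers
        labels[labels == ids[j]] = ids[i]       # merge the two closest clusters
    return labels

# Example: 300 short segments that really come from three speakers
rng = np.random.default_rng(1)
true_speaker = rng.integers(0, 3, size=300)
segments = 3.0 * np.eye(3)[true_speaker] + rng.normal(0, 0.1, size=(300, 3))
print(len(np.unique(merge_clusters(segments))), "speakers found")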

References:

  1. Homer W. Dudley, "Signal transmission," US Patent No. 2,151,091, March 21, 1939.
  2. Daniel E. Re, Jillian J. M. O'Connor, Patrick J. Bennett and David R. Feinberg, "Preferences for Very Low and Very High Voice Pitch in Humans," PLoS ONE, vol. 7, no. 3 (March 5, 2012), Article No. e32719.
  3. Stephen H. Shum, Najim Dehak, Réda Dehak and James R. Glass, "Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach," IEEE Transactions On Audio, Speech, And Language Processing, vol. 21, no. 10 (October 2013), pp. 2015-2028.
  4. Larry Hardesty, "Automatic speaker tracking in audio recordings," MIT Press Release, October 18, 2013.
  5. YouTube Video, Clustering method of Speech Recognition, Stephen Shum, October 8, 2013. The algorithm groups together the points that are associated with a single speaker.
  6. Web Site of MIT Spoken Language Systems Group.