MTB -- A new experience in sound that "puts you there"

You don your headphones and you are suddenly immersed in a conversation. Distant voices are clear, nearby voices are vividly natural. You can turn to face any of the people who are talking, and they keep their positions fixed as you turn your head. A new person comes in and you hear all of the normal environmental sounds as if you were there -- the door closing, the footsteps, the person settling into the chair. You are enjoying the somewhat eerie and new experience of audio telepresence, listening to invisible people talking while you are easily shifting your attention from one to another.

Conventional recordings cannot create such an experience, but MTB can. MTB -- which stands for motion-tracked binaural -- is a new method for capturing, recording and reproducing spatial sound that can immerse you in any remote location. Developed in the course of our spatial sound research, it is a generic audio technology with many applications, such as

  • Remote listening
    • Teleconferencing
    • Surveillance and security
    • Teleoperations
  • Recording
    • Home theater
    • Musical entertainment
    • Home recording
  • Interactive multimedia
    • Computer games
    • Training simulators
    • Virtual and augmented reality

Binaural recording and head motion

As its name implies, MTB is a form of binaural technology. It extends binaural recordings by preserving the powerful dynamic sound localization cues that come from voluntary head motion. Traditional binaural recordings are made by placing two microphones in the ears of a dummy head (see Fig. 1). To the extent that the size and shape of the listener's head matches that of the dummy head, when the microphone signals are heard over high-quality headphones, the sound signals in the listener's ears are the same as they would have been if the listener were at the location of the dummy head. Thus, binaural recordings not only capture the monaural spectral information, but they also capture the binaural differences, faithfully reproducing the various sound sources and the effects of room reverberation.

[Binaural recording]
Fig. 1. Traditional binaural recording

However, consider what happens when listeners turn their heads. Because the dummy head does not turn, the signals reaching the listeners' ears are unchanged. To the listeners, it is as if the acoustic world turned with them. In addition to being an unnatural experience, this has particularly unfortunate consequences for the most common situation, namely, when the dummy head is pointed to face the source of greatest interest. In that situation, the left-ear and right-ear signals are essentially the same, and they remain the same no matter how the listeners turn their heads. The result is the common perception that the source is inside the head, or even somewhat behind the head.

The same "in-the-head" perception is experienced when the source is behind the dummy head, which results in another serious problem -- front/back confusion. By taking head motion into account, MTB greatly reduces or even eliminates these two major limitations of binaural recordings -- lack of frontal externalization and front/back confusion. Combined with the ability to turn to face each source of sound, this leads to the compelling sense of "being there."

How does MTB work?

MTB solves these problems by sampling the sound field around the dummy head, and by using a head tracker to sense the listener's head orientation (see Fig. 2). In particular, the system uses the signal from the head tracker to determine the locations of the listener's ears. In general, the listener's ears will be somewhere between the microphones. The problem is to estimate the output of a virtual microphone at an ear's current location. The MTB system does this by interpolating between the signals from the nearest and the next nearest microphones.

[The basic idea behind MTB]
Fig. 2. The core components of an MTB system
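
The interpolation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes N microphones equally spaced around a horizontal circle with microphone k at azimuth k * 360 / N degrees, ears located 90 degrees to either side of the facing direction, and plain linear interpolation on a single sample. The function names ear_angles and interpolate_ear are hypothetical.

```python
import math


def ear_angles(head_yaw_deg):
    """Return (left, right) ear azimuths in degrees for a given head yaw.

    Assumption: the ears sit 90 degrees to either side of the facing
    direction, with angles increasing counterclockwise seen from above.
    """
    return (head_yaw_deg + 90.0) % 360.0, (head_yaw_deg - 90.0) % 360.0


def interpolate_ear(mic_samples, ear_angle_deg):
    """Estimate the sample a virtual microphone at the ear would record.

    mic_samples holds one sample per microphone; microphone k is assumed
    to sit at azimuth k * 360 / N degrees. The two microphones that
    bracket the ear are the nearest and next-nearest ones.
    """
    n = len(mic_samples)
    spacing = 360.0 / n
    pos = (ear_angle_deg % 360.0) / spacing   # fractional microphone index
    lower = int(math.floor(pos)) % n          # microphone just below the ear
    upper = (lower + 1) % n                   # microphone just above (wraps)
    frac = pos - math.floor(pos)              # 0 at lower, 1 at upper
    return (1.0 - frac) * mic_samples[lower] + frac * mic_samples[upper]
```

With 8 microphones the spacing is 45 degrees, so an ear at 22.5 degrees receives an equal mix of microphones 0 and 1. In a running system the same weights would be applied to every sample of the two microphone signals and updated whenever the head tracker reports a new orientation.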

How many microphones are needed?

From sampling theory, we know that we must use at least two samples per wavelength. At the high-frequency limit of human hearing (around 20 kHz), the wavelength is about 1.7 cm, so adjacent microphones can be no more than about 0.85 cm apart. Because the circumference of an average-size head is about 55 cm, this would seem to require a minimum of 64 microphones. Moreover, it can be shown that simple linear interpolation requires twice that number for accurate results. That is a discouraging result for practical applications.
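
The arithmetic behind this estimate can be checked with a short calculation. The sketch below assumes a speed of sound of 343 m/s (a value not stated in the text) and the 55 cm circumference given above; the function name min_microphones is hypothetical.

```python
def min_microphones(max_freq_hz, circumference_m=0.55, speed_of_sound=343.0):
    """Spatial-sampling bound: two microphones per wavelength around the head.

    speed_of_sound is an assumed value for air at room temperature.
    Returns an unrounded count so the approximation stays visible.
    """
    wavelength_m = speed_of_sound / max_freq_hz   # about 1.7 cm at 20 kHz
    max_spacing_m = wavelength_m / 2.0            # at most half a wavelength
    return circumference_m / max_spacing_m


print(round(min_microphones(20000.0), 1))   # prints 64.1
```

Doubling this figure for simple linear interpolation gives roughly 128 microphones.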

Fortunately, it is not necessary to interpolate through the full audible frequency band. The most powerful cue for sound localization is the interaural time difference (ITD). However, the ITD is a low-frequency cue, and becomes ineffective above about 1.5 kHz. This suggests the approach illustrated in Fig. 3, in which a low-pass filter is used to restrict the interpolation to low frequencies, and the signal from the microphone nearest the ear is used to restore the high frequencies.

[Two-band interpolation]
Fig. 3. Two-band interpolation between the signal xn(t) from the nearest microphone and the signal xnn(t) from the next-nearest microphone

By restricting the interpolation to low frequencies, we can dramatically reduce the number of microphones required. Using this approach, we have found experimentally that 8 microphones produce excellent results for speech, and 16 produce excellent results for music.
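
A minimal sketch of two-band interpolation follows. It is an illustration rather than the filter design used in the actual system: the one-pole low-pass filter, the complementary high band formed by subtraction, the 1.5 kHz cutoff, and the 44.1 kHz sample rate are all assumptions chosen to keep the example short. The names one_pole_lowpass and two_band_interpolate are hypothetical.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """Very simple first-order low-pass filter (an assumed stand-in)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def two_band_interpolate(x_n, x_nn, weight_nn,
                         cutoff_hz=1500.0, sample_rate_hz=44100.0):
    """Combine the nearest (x_n) and next-nearest (x_nn) microphone signals.

    Low band: linear interpolation of the low-passed signals, where
    weight_nn (0 to 0.5) is the share given to the next-nearest microphone.
    High band: taken from the nearest microphone only.
    """
    low_n = one_pole_lowpass(x_n, cutoff_hz, sample_rate_hz)
    low_nn = one_pole_lowpass(x_nn, cutoff_hz, sample_rate_hz)
    out = []
    for xn, ln, lnn in zip(x_n, low_n, low_nn):
        low = (1.0 - weight_nn) * ln + weight_nn * lnn   # interpolated lows
        high = xn - ln                                   # nearest-mic highs
        out.append(low + high)
    return out
```

When the ear coincides with the nearest microphone (weight_nn equal to 0), the output reduces to x_n itself. A practical system would likely use steeper filters than the first-order one assumed here.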

Virtual MTB and augmented audio reality

In addition to being used to capture and reproduce real spatial sounds, MTB can also be used to render computer-generated spatial sounds or surround-sound recordings originally created for loudspeaker playback. The basic concept is simple. Suppose that we want to render a virtual sound, whether it is a computer-generated sound or the sound from one loudspeaker in a surround-sound recording. We can simulate the sound that would be picked up by an MTB array if we know the transfer functions from the sound source to each of the microphones in the MTB array (see Fig. 4). The transfer functions are equivalent to the HRTF (head-related transfer function) for a sphere, for which there are both exact analytical solutions and effective filter approximations. Thus, if there are N microphones, all we have to do is to filter the source signal with N spherical-head HRTFs.

[Virtual MTB]
Fig. 4. Synthesizing virtual sound

It should be noted that the sounds generated by this method will sound very "dry," as if they were recorded in an anechoic chamber. However, with additional computation, simulated room reflections can be introduced to produce a more natural sound. These results can be mixed with real spatial sounds to provide an augmented audio reality.
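
The dry rendering step can be illustrated with a deliberately reduced sketch. Instead of full spherical-head HRTFs, it models only the arrival-time part, using the classical Woodworth-style ray-tracing approximation for a distant source and a rigid sphere; the frequency-dependent head shadow is ignored. The head radius of 8.75 cm (a 55 cm circumference divided by 2 pi), the 343 m/s speed of sound, the 44.1 kHz sample rate, the rounding of delays to whole samples, and the function names are all assumptions made for illustration.

```python
import math


def arrival_delay_s(incidence_rad, head_radius_m=0.0875, speed_of_sound=343.0):
    """Arrival time at a point on a sphere, relative to the sphere's center.

    incidence_rad is the angle between the source direction and the
    direction of the point. Delay-only approximation, no head shadow.
    """
    theta = abs(incidence_rad)
    if theta < math.pi / 2.0:
        # Point faces the source: the wave arrives early.
        return -(head_radius_m / speed_of_sound) * math.cos(theta)
    # Point is in the shadow zone: the wave travels around the sphere.
    return (head_radius_m / speed_of_sound) * (theta - math.pi / 2.0)


def render_virtual_mtb(source, source_az_deg, n_mics, sample_rate_hz=44100.0,
                       head_radius_m=0.0875, speed_of_sound=343.0):
    """Return one delayed copy of source per virtual microphone.

    Microphone k is assumed to sit at azimuth k * 360 / n_mics degrees.
    """
    lead = head_radius_m / speed_of_sound
    max_delay = int(round(lead * (1.0 + math.pi / 2.0) * sample_rate_hz))
    outputs = []
    for k in range(n_mics):
        mic_az = 360.0 * k / n_mics
        # Wrap the angular difference into [-180, 180) degrees.
        diff = (mic_az - source_az_deg + 180.0) % 360.0 - 180.0
        delay = arrival_delay_s(math.radians(diff), head_radius_m, speed_of_sound)
        d = int(round((delay + lead) * sample_rate_hz))   # shift to be >= 0
        outputs.append([0.0] * d + list(source) + [0.0] * (max_delay - d))
    return outputs
```

The N output signals can then be treated exactly like the signals from a physical microphone array. Replacing the pure delays with spherical-head HRTF filters would add the head-shadow effects that this sketch leaves out.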

The MTB listening experience and research issues

The common reaction of people who listen to an MTB recording is that it produces a remarkable sense of presence. This is particularly compelling for recordings made in moderately reverberant environments, where the ability of MTB to capture accurately the effects of room reflections is most readily perceived.

However, the listening experience is known to vary from person to person. Some people will experience to various degrees one or more of the following artifacts:

  • Elevation of sound sources that should be in the horizontal plane.
  • Small but perceptible motion of the sound sources that is correlated with head motion. For some people, the sources follow head motion, and for other people the sources move in the opposite direction.
  • For a few people under some conditions, a source appears to jump from one side of the head to the other when they turn their heads.
  • Some people report that a source seems to lose high-frequency content or "go out of focus" when they turn to face it.

Most of these problems can be traced to the fact that human heads and outer ears exhibit a variety of sizes and shapes, so that the sound field around the surface of an MTB microphone array is only a first approximation to the sound field around any individual listener's head. For optimal results, we need to be able to correct the MTB signals to accommodate the effects of head size, torso size, pinna location, and pinna size and shape. Such corrections are referred to as customizing the system to the listener.

Our current research is directed at understanding and solving these problems. Some complete and some partial solutions are already known. As an example, we have confirmed that the absence of pinnae on the recording surface is the primary source of the most frequent complaint, namely, apparent elevation of sound sources. In the case of virtual MTB, this problem can be fixed by employing an individualized HRTF (see Fig. 5). Alternatively, one could employ a good HRTF model, such as one of our structural HRTF models.

[Virtual sound using individualized HRTFs]
Fig. 5. Synthesizing virtual sound with an individualized HRTF

The top half of Fig. 6 illustrates a particularly simple structural HRTF model, in which a spherical-head model is cascaded with an isolated pinna model. Clearly, if we use a physical MTB array to capture sounds, we are missing half of the model, i.e., we are missing the pinna. We could affix physical pinnae to the microphones, but the results would then be dependent on the particular pinnae used.

[Structural models of HRTFs]
Fig. 6. Structural HRTF models

However, this suggests a simple, approximate alternative, illustrated by the bottom half of Fig. 6. Here we use the physical MTB microphone array to implement the head model, and we filter the outputs of the microphones with individualized pinna models. Of course, to use the pinna model, we need to know the direction to the sound source, and we have no way of knowing that information. However, for many applications, the sound source of greatest interest is directly ahead. By using an individualized pinna model for sound sources that are directly ahead, we greatly improve the perceived elevation for the source of greatest interest.

This alternative is only one of several possible improvements to and extensions of the MTB approach. The completely faithful reproduction of spatial sound remains a challenging problem. However, MTB provides a new approach to its solution, and enables a new level of realism in spatial sound capture and reproduction.

Press Releases on MTB technology:

  • National Science Foundation -- NSF
  • University of California -- News

For more information

V. R. Algazi, R. O. Duda and D. M. Thompson, "Motion-Tracked Binaural Sound," Paper 6015, AES 116th Convention, Berlin, Germany, May 2004.

J. B. Melick, V. R. Algazi, R. O. Duda and D. M. Thompson, "Customization for personalized rendering of motion-tracked binaural sound," Paper 6225, AES 117th Convention, San Francisco, CA, Oct. 2004.

V. R. Algazi, R. O. Duda and D. M. Thompson, "Motion-Tracked Binaural Sound," J. Aud. Eng. Soc., Vol. 52, No. 11, pp. 1142-1156 (November 2004).

V. R. Algazi, R. O. Duda and D. M. Thompson, US patent application 20040076301, "Dynamic binaural sound capture and reproduction" (April 22, 2004).