You don't need a crystal ball to know that immersive audio is in our future, but it won't flourish without standardizing its creation.

At the AES (Audio Engineering Society) 57th International Conference on the Future of Audio Entertainment Technology, held March 6-8, 2015, at the TCL Chinese Theatres in Hollywood, CA, an entire day was devoted to immersive audio. The opening keynote of the day was given by Francis Rumsey, a consultant, journalist, AES Fellow, and chair of the AES Technical Council, who talked about the status and challenges of immersive audio. He started by reviewing the development of immersive audio, including experiments by Harvey Fletcher in the 1930s, Alan Blumlein's work on phantom imaging with two speakers around the same time, Fantasound (developed for Disney's animated classic Fantasia) in 1939, Cinerama in the 1950s, IMAX and Dolby Stereo in the 1970s, digital surround formats in the 1990s, and Dolby Atmos fully immersive audio in 2012.

Rumsey quoted from an article in the August 1941 issue of the SMPTE (Society of Motion Picture and Television Engineers) Journal that remains relevant today: "The public has to hear the difference and then be thrilled by it, if our efforts toward the improvement of sound-picture quality are to be reflected at the box office. Improvements perceptible only through direct A-B comparisons have little box-office value." The same is also generally true in many areas of AV technology—such as UHD and high-res audio—when applied to adoption and sales in the marketplace, not just box-office revenues.

A quote about Fantasound from the same article makes another important point: "While dialog is intelligible and music is satisfactory, no one can claim that we have even approached perfect simulation of concert hall or live entertainment. It might be emphasized that perfect simulation of live entertainment is not our objective. Motion picture entertainment can evolve far beyond the inherent limitations of live entertainment." For many audiophiles and cinephiles, the ultimate objective has long been to reproduce the experience of a live event as closely as possible, but audio reproduction can go way beyond that, offering endless creative possibilities.

Channels, Objects, Scenes

Throughout the rest of the day, it became clear that immersive-audio systems can use one or more of three basic approaches: channel-based, object-based, and scene-based. Channel-based systems are well known—the audio content is mixed to a certain number of channels and delivered to corresponding speakers. This is how our current 5.1 and 7.1 content—not to mention 2-channel stereo—is created, and it has the advantage of a well-defined speaker layout that the content creator and consumer both use.

Object-based systems treat each sound source—helicopter, bird, voice, etc.—as a separate monaural "object," which is placed in the mix according to its position and motion in the soundfield. The sound of the object itself is PCM audio, while the information about its location, motion, apparent size, and other attributes is encoded in metadata that a renderer decodes so it can send the audio to the appropriate speaker(s). In this case, the speaker layout is not well defined; instead, the renderer must "know" the location and capabilities of each speaker in a given system to properly reproduce the soundfield.

Interestingly, an object can also be a particular element of the entire mix—for example, the dialog track or the announcer at a sporting event. In addition, an object can be one static channel in a conventional surround mix; such objects might be used to form a 5.1 or 7.1 "bed" with other individual objects moving around the soundfield. This is how most Dolby Atmos mixes are constructed.
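To make that concrete, here's a minimal Python sketch of how an object (PCM audio plus positional metadata) might be represented and panned across whatever speakers a renderer happens to know about. The AudioObject class and render_gains function are my own illustrative assumptions, not any format's actual metadata schema, and real renderers use far more sophisticated panning laws.

```python
from dataclasses import dataclass
import math

@dataclass
class AudioObject:
    """Hypothetical object metadata: a mono source plus positional attributes."""
    name: str
    position: tuple       # (x, y, z) in a normalized room, each from -1.0 to +1.0
    gain: float = 1.0     # overall level
    size: float = 0.0     # 0 = point source, larger = more diffuse
    is_bed: bool = False  # True if the object is a static channel in a 5.1/7.1 bed

def render_gains(obj, speakers):
    """Distribute an object's signal across whatever speakers exist,
    weighting each speaker by its proximity to the object's position."""
    weights = {}
    for spk_name, spk_pos in speakers.items():
        dist = math.dist(obj.position, spk_pos)
        weights[spk_name] = 1.0 / (dist + 0.1)  # closer speaker gets more signal
    total = sum(weights.values())
    return {name: obj.gain * w / total for name, w in weights.items()}

# The renderer "knows" this particular room's speaker positions.
speakers = {
    "L": (-1, 1, 0), "R": (1, 1, 0), "Ltop": (-1, 0, 1), "Rtop": (1, 0, 1),
}
helicopter = AudioObject("helicopter", position=(0.5, 0.2, 0.9))
print(render_gains(helicopter, speakers))
```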

Finally, scene-based audio attempts to capture and encode an entire soundfield of a "scene" such as a live performance or the ambient sounds of a particular geographic location. The most common method is to use a multi-microphone array that captures sounds coming from various directions. One variation of this approach is called Ambisonics, which is a three-dimensional extension of mid/side (M/S) stereo miking, adding difference channels for height and depth to the difference channel used to encode width with an M/S stereo mic.

The mic array is sometimes characterized by its "order"—the higher the order, the more microphone elements are used, and the more accurately a soundfield can be reproduced. Higher-order Ambisonics (HOA) is required to capture enough spatial resolution to be useful in cinema soundtracks.
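As a rough illustration of what "encoding a direction" means, here is a first-order (B-format) Ambisonic encoder for a mono source, sketched in Python. The function name is my own, the 1/√2 scaling of W follows the traditional B-format convention, and higher orders simply add more components under various normalization schemes.

```python
import numpy as np

def encode_first_order(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order Ambisonics B-format (W, X, Y, Z).
    W is the omnidirectional component; X, Y, and Z are figure-eight
    (difference) components along the front/back, left/right, and up/down axes."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono * (1.0 / np.sqrt(2.0))      # omni component (traditional scaling)
    x = mono * np.cos(az) * np.cos(el)   # front/back
    y = mono * np.sin(az) * np.cos(el)   # left/right
    z = mono * np.sin(el)                # up/down
    return np.stack([w, x, y, z])

# A 1 kHz tone placed 45 degrees to the left and 30 degrees up.
t = np.linspace(0, 1, 48000, endpoint=False)
bformat = encode_first_order(np.sin(2 * np.pi * 1000 * t),
                             azimuth_deg=45, elevation_deg=30)
```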

In this visual representation of different orders of Ambisonics, you can see the components in the combined signal; the polarity of the dark gray areas is inverted compared with the polarity of the light gray areas. This illustration extends to third-order Ambisonics in the bottom row. Notice that the top image corresponds to the pickup pattern of an omnidirectional mic, while the second row resembles the pickup patterns of a figure-eight mic in three different orientations. These images are also called spherical harmonics.

Like object-based systems, scene-based systems do not rely on a specific speaker layout, though one technique is to arrange speakers in a configuration that reflects the mic positions and assign each speaker to the corresponding channel from the mic array. Another approach to scene-based audio reproduction is called wave-field synthesis (WFS), in which many speakers combine to simulate the wavefronts as picked up by the mic array; this is the basis of the Iosono system, now owned by Barco (which also owns Auro).
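In the spirit of that second approach, the toy calculation below gives each speaker in a densely spaced array a delay and gain so the superposed wavefronts roughly mimic a virtual point source. It's a heavily simplified sketch under my own assumptions, not a real WFS driving function, which also involves filtering and amplitude terms omitted here.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second

def wfs_driving_params(source_pos, speaker_positions):
    """For a virtual point source, compute a per-speaker delay and gain so the
    array's combined output approximates the source's wavefront."""
    params = []
    for spk in speaker_positions:
        dist = math.dist(source_pos, spk)
        params.append({
            "delay_s": dist / SPEED_OF_SOUND,  # farther speakers fire later
            "gain": 1.0 / max(dist, 0.1),      # simple distance attenuation
        })
    return params

# A line array of 16 speakers across the front wall, source 3 m behind it.
array = [(x * 0.25 - 2.0, 0.0) for x in range(16)]
print(wfs_driving_params((-1.0, -3.0), array)[:2])
```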

Atmos, Auro 3D, MPEG-H, DTS:X

One session brought together Brett Crockett from Dolby, Wilfried Van Baelen from Auro, Schuyler Quackenbush from the MPEG-H Alliance, and Jean-Marc Jot from DTS to present their respective immersive-audio systems. Most AVS members know Dolby Atmos well—it's an object-based system that started in commercial cinemas in 2012 and became available for the home last year. The home version can support up to 24 speakers around the listener and 10 overhead speakers, though the renderer can decode an Atmos bitstream to whatever speakers are available. It can also support up to 128 objects, some of which are often assigned to static channels in the common 5.1 or 7.1 surround layout as a bed. In Dolby nomenclature, a 7.1 system with four overhead speakers is designated as 7.1.4.
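As a trivial illustration of that naming convention, a layout string can be split into listener-level, LFE, and overhead counts. This little parser is purely my own sketch, not a Dolby tool.

```python
def parse_layout(name):
    """Split an 'x.y.z' layout name into listener-level speakers, LFE/subwoofer
    channels, and overhead speakers. Plain '7.1' is treated as zero overheads."""
    parts = [int(p) for p in name.split(".")]
    overhead = parts[2] if len(parts) > 2 else 0
    return {"ear_level": parts[0], "lfe": parts[1], "overhead": overhead}

print(parse_layout("7.1.4"))    # {'ear_level': 7, 'lfe': 1, 'overhead': 4}
print(parse_layout("24.1.10"))  # the maximum home Atmos speaker complement
```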

This Dolby Atmos speaker layout uses four overhead speakers and a 7.1 surround system; it's known as 7.1.4.

Auro 3D also started life as a cinema-sound format, and it's now available for the home as well. The home version is channel-based, though there are many examples of Auro 3D recordings made with a scene-based microphone array and intended for playback on a speaker system that reflects the positions of those mics. The recommended home-speaker layout includes a standard 5.1 surround system with four speakers above the left and right front and surround speakers at an elevation of 30°; this is called a 9.1 Auro system. In larger rooms, a single speaker can be placed on the ceiling—it's called the Voice of God speaker—to make a 10.1 Auro system. Other layouts are also possible.

In home systems, Auro recommends putting the height speakers above the front and surround left and right speakers at a 30° elevation. Depicted here is a 9.1 Auro system.

MPEG-H is designed primarily for broadcast and streaming content, not commercial cinema, and it supports channel-, object-, and scene-based HOA content. (Actually, MPEG-H also includes 4K video coding and IP delivery elements as well as immersive audio.) In the demos I've heard, objects are often user-controllable tracks, such as a sports announcer, commentary, or alternate-language dialog, which the user can select and whose volume the user can adjust independently of the other elements in the soundtrack. The renderer can compensate for different speaker layouts, even problematic layouts, with psychoacoustic techniques, which can also be used for headphone virtualization of immersive audio.
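One way to picture those user-controllable objects is as a list of selectable elements, each with its own gain, sitting on top of a fixed bed. The sketch below is purely hypothetical, using names of my own invention; it has nothing to do with the actual MPEG-H bitstream syntax.

```python
from dataclasses import dataclass

@dataclass
class SelectableObject:
    """Hypothetical user-facing element of an object-based broadcast mix."""
    label: str
    language: str = "en"
    gain_db: float = 0.0   # user-adjustable level, independent of the rest of the mix
    enabled: bool = True

# An MPEG-H-style broadcast could expose choices like these to the viewer,
# while the bed (crowd, effects, music) stays fixed.
mix_options = [
    SelectableObject("Home announcer", language="en"),
    SelectableObject("Away announcer", language="es", enabled=False),
    SelectableObject("Referee mic", gain_db=-6.0),
]

def turn_up_announcer(options, delta_db=3.0):
    """Raise only the enabled announcer object(s), leaving everything else alone."""
    for obj in options:
        if obj.enabled and "announcer" in obj.label.lower():
            obj.gain_db += delta_db
    return options

print(turn_up_announcer(mix_options))
```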

During the conference, the MPEG-H speaker layout was often illustrated with a diagram that represents NHK's 22.2 layout, but it can use just about any layout.

Then there's DTS:X, the long-awaited immersive-audio codec that we're still waiting for. At CES, we were told that more details would be forthcoming in March, but at the AES conference, Jean-Marc Jot said it would be "four to six weeks," which could push any announcements into April. As we already know, MDA (Multi-Dimensional Audio) is used to create immersive audio, and it supports channel-, object-, and scene-based approaches, while DTS:X is used to deliver that content to home users. The recommended speaker layout is unknown, though we were told that DTS:X offers flexible, scalable channel configurations and supports up to 32 channels/objects. Also, AV receiver brands representing 90% of the market have pledged their support for DTS:X.

Standardizing Immersive Audio Creation

I loved Francis Rumsey's opening statement on immersive-audio day: "The good thing about standards is that there are so many of them." This is obviously a huge challenge for the widespread adoption of immersive audio, especially on the content-creation side. Of particular concern to studios is the need to create and deliver multiple mixes in different formats depending on the system that will play them back.

Sony Pictures sound engineer Brian Vessa moderated a panel on cinema-delivery standards for immersive audio, an area he is working on within SMPTE. He started by giving an overview of how audio is mixed and delivered to commercial cinemas today.

With 5.1 and 7.1 soundtracks, the mix master is prepared separately for commercial cinemas, Blu-ray authoring, and streaming delivery as well as being archived for future audio formats.

Once the DCP (Digital Cinema Package) with a 5.1 or 7.1 surround mix gets to the cinema, it is played in a straightforward manner.

Each immersive-audio system handles audio and metadata differently. Dolby Atmos sends all data from the playout server to the renderer over a proprietary digital link, while Auro sends its audio over AES3 digital-audio connections, and MDA uses both AES3 and Ethernet connections.

Currently, separate mixes must be created for the different immersive-audio systems in commercial cinemas and home systems. This is a nightmare for studios, and it requires different playback systems to monitor each type of mix.

The playback of an immersive soundtrack is straightforward—as long as the theater gets the correct DCP for its particular audio system.

Brian then explained his vision for an interoperable immersive-audio standard to address these problems. The mix would be performed once (with a possible modification for home delivery) and saved in a standardized mix file. That file would be delivered in the DCP to commercial cinemas, which would play it from a standardized server to the renderer of any current or future immersive-audio system.
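To picture what "mix once, render anywhere" might look like, here's a hypothetical sketch of a renderer-agnostic mix container handed to different vendors' renderers. The class names and fields are assumptions for illustration only; this is not the actual format of the SMPTE work Brian described.

```python
from dataclasses import dataclass, field

@dataclass
class InteroperableMix:
    """Hypothetical sketch of one renderer-agnostic immersive mix: a channel bed
    plus objects with standardized metadata, delivered once in the DCP."""
    bed_channels: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    home_adaptation: dict = field(default_factory=dict)  # optional tweaks for home delivery

class GenericRenderer:
    """Stand-in for any vendor's renderer, current or future."""
    def __init__(self, name, speaker_count):
        self.name, self.speaker_count = name, speaker_count
    def render(self, mix):
        return f"{self.name}: rendering {len(mix.objects)} objects to {self.speaker_count} speakers"

# The same mix file feeds every system installed in a given auditorium.
mix = InteroperableMix(bed_channels=["L", "R", "C", "LFE", "Ls", "Rs"],
                       objects=["helicopter", "dialog"])
for renderer in (GenericRenderer("Vendor A", 34), GenericRenderer("Vendor B", 11)):
    print(renderer.render(mix))
```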

In an interoperable immersive-audio workflow, the mix would need to be performed only once, with a possible adaptation for home delivery.

When the interoperable immersive-audio data arrives at the theater, it would be played from a next-generation server and sent to the renderer for any of the current or future immersive-audio formats.

It was often said at the conference that the current state of immersive audio is not unlike the Wild West—few laws and everyone for themselves. Clearly, what's needed is a standard method for creating an immersive soundtrack that can be rendered by any of the systems now or soon to be available. We've lived with multiple codecs in AVRs and pre/pros and on Blu-rays for a long time, and that isn't likely to change as immersive audio becomes more common. So it's up to the content creators to find a way to create content that any of these systems can play effectively. Is that even possible? With such smart people working on it, I have every confidence that it is.