Immersive audio, capture, transport, and rendering: a review

Published online by Cambridge University Press: 16 September 2021

Xuejing Sun*
Affiliation: Twirling Technologies, 18 Suzhou Street, Suite 1606, Beijing, China
*Corresponding author: Xuejing Sun. Email: sunxuejing@twirlingvr.com

Abstract

Immersive audio has received significant attention in the past decade. The emergence of a few groundbreaking systems and events (Dolby Atmos, MPEG-H, VR/AR, AI) has reshaped the landscape of this field and accelerated the mass-market adoption of immersive audio. This review gives a brief recap of immersive audio background and the end-to-end workflow, covering audio capture, compression, and rendering. The technical aspects of object audio and ambisonics are explored, along with related topics such as binauralization, virtual surround, and upmix. Industry trends and applications are also discussed, as user experience ultimately decides the future direction of immersive audio technologies.

Information

Type
Industrial Technology Advances
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press in association with Asia Pacific Signal and Information Processing Association

Fig. 1. Channel-based audio system, courtesy (https://lab.irt.de/demos/object-based-audio/).

Fig. 2. 5.1 loudspeaker placement.

Fig. 3. Conceptual overview of object-based audio production and consumption, figure courtesy (https://lab.irt.de/demos/object-based-audio/).

Fig. 4. Cartesian coordinate system.

Fig. 5. Cube with example coordinates, figure courtesy [2].

Fig. 6. Spherical coordinate system for first-order ambisonics.

Fig. 7. A schematic representation of an FOA soundfield microphone.

Fig. 8. Polar patterns of third-order ambisonic channels.

Fig. 9. Soundfield manipulation, figure courtesy Politis [5].

Fig. 10. 3Dio Omni Pro binaural microphone, courtesy (https://3diosound.com/).

Fig. 11. Eigenmike from M.H. Acoustics.

Fig. 12. Progressive plane wave reconstruction with ambisonic orders M = 1 to 3 (left to right, top to bottom). The boundary of the well-reconstructed area is shown as a constant-error contour, figure courtesy [8].

Table 1. Limit frequencies f_lim of the acoustic reconstruction at a centered listener's ears. Predicted angle α_E of the blur width of the phantom image.

Fig. 13. Noise amplification curves for different HOA orders (http://research.spa.aalto.fi/projects/spharrayproc-lib/spharrayproc.html).

Fig. 14. Mean square error of plane wave reconstruction, courtesy [8].

Table 2. A comparison of HOA microphones.

Fig. 15. Equal segment microphone array (ESMA), courtesy Lee [10].

Fig. 16. Dolby Atmos mixing interface.

Fig. 17. 5.1 encoding structure, courtesy Breebaart et al. [15, 16].

Fig. 18. Multichannel compression based on KLT transform, courtesy Dai [17].

Fig. 19. Spatially Squeezed Surround Audio Coding, courtesy Cheng et al. [18].

Fig. 20. MPEG-H decoding structure, courtesy Herre et al. [21].

Fig. 21. MPEG-H HOA encoding structure, courtesy Sen et al. [23].

Fig. 22. MPEG-H HOA layered decoding, courtesy Sen et al. [23].

Fig. 23. Dolby Atmos overview, courtesy Dolby (https://professional.dolby.com/content-creation/dolby-atmos/2).

Fig. 24. Joint object coding, courtesy Purnhagen et al. [24].

Fig. 25. Spatial coding of object audio, courtesy Breebaart et al. [25].

Fig. 26. SMPTE 2098 bitstream, courtesy SMPTE [27].

Fig. 27. VBAP panning example, courtesy Pulkki [22].

Fig. 28. Ambisonic panning function, courtesy [29].

Fig. 29. Impulse response anatomy, courtesy [38].

Fig. 30. A schematic representation of binaural signal rendering through two loudspeakers, courtesy (http://pcfarina.eng.unipr.it/Aurora/crostalk.htm).

Fig. 31. Building blocks of an XTC system, courtesy [48].
