IoSR Blog : 11 June 2014
Perceptual characteristics of 3D audio
The Institute of Sound Recording is involved in a large Engineering and Physical Sciences Research Council (EPSRC) funded project titled “S3A: Future Spatial Audio”. The project brings together expertise in diverse disciplines—including audio and video signal processing, production, and perception—across four partner institutions: the Universities of Surrey, Southampton, and Salford, and the BBC. The project aims to advance 3D sound reproduction in the home in order to bring immersive listening experiences into day-to-day consumption of audio.
The IoSR is involved in looking at the experience of a listener in a 3D audio replay system. There are many facets to this, but one of the first areas that we'll be looking into is the perception of spatial audio due to different reproduction methods.
Numerous spatial audio replay methods have been introduced since the advent of recorded sound, with differing degrees of commercial uptake. The most common format for music reproduction is two-channel stereo; however, higher channel counts have been used in an attempt to reproduce localizable sound images all around the listener or to improve the sense of envelopment. The well-known 5.1 surround sound format is an example of such a system that has gained significant uptake, but other channel-based systems including 7.1, 9.1, 10.2, and 22.2 systems have also been used. Further spatial audio systems have eschewed channel-based reproduction in favour of trying to reproduce exact sound fields; such systems include binaural recordings reproduced over headphones or loudspeakers (with crosstalk cancellation), ambisonics, or wave field synthesis.
These methods all have their individual quirks, advantages, and disadvantages. A challenge that has concerned audio researchers in recent years is that of trying to identify the perceptual characteristics (or 'attributes') that can be excited by each technique. Understanding these perceptual characteristics is necessary for a thorough evaluation of the reproduction methods, from which we can develop models to predict the quality of spatial audio reproduction systems. These models can then be used to evaluate systems in place of time-consuming listening tests.
A large number of experiments have been performed to elicit attributes that describe perceptual differences between the reproduction systems mentioned above. However, this has produced a long list of attributes (a preliminary literature review revealed upwards of 40 terms, and there are undoubtedly others), and it is very difficult to determine how attributes produced in different studies relate to each other. There is clearly considerable overlap between the sets, but the same word is sometimes used to refer to different percepts, and different words can refer to the same experience.
A number of authors have attempted to consolidate attribute sets. Rumsey's scene-based paradigm separates a spatial audio scene into sources, groups of sources, environments, and global scene parameters, and categorises attributes as 'directional' or 'immersive'. Directional attributes consider aspects such as the width of a single source ('source width'), a group of sources ('ensemble width'), or the environment ('environment width'), whilst immersive attributes include 'presence' (the sense of being present in an environment) and 'envelopment' (both due to the environment and the source positioning).
Other authors have attempted a similar classification using statistical techniques. Berg and Rumsey (2001) obtained ratings of systems on twelve attributes and performed principal component analysis to extract three dimensions that could be broadly described as general factors, source factors, and room factors. Le Bagousse et al. (2010) asked experiment participants to place attribute labels from the literature into groups of related terms and used cluster analysis to determine 'attribute families'. They determined three families: 'defaults' (referring to interfering elements or nuisances); 'space' (spatial impression-related characteristics); and 'timbre' (divided into two subcategories: sound colour and timbre/other).
In order to assess the consistency of such categorisations, I decided to replicate Le Bagousse's experiment with a different set of attributes (those that I'd found in the preliminary literature review). The experiment was run informally as part of an internal seminar day presentation; seventeen participants were asked to sort a list of forty-two attributes into groups, and to name each group. The procedure took approximately 15 minutes to perform.
Participants produced between four and eight groups (with a median of five groups per participant), giving a total of ninety-seven groups. As in Le Bagousse et al. (2010), agglomerative hierarchical clustering was performed using the Ward linkage method. The results of a clustering analysis can be visualized using dendrograms (tree diagrams), in which similar items are joined at lower levels of the plot. A clustergram can also be used to visualize relationships in the data. The clustergram and associated dendrograms for the data obtained in this experiment are shown below.
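To make the analysis step more concrete, here is a minimal sketch of how free-sort data can be fed into Ward-linkage clustering. It is not the code used for the experiment: the six attribute names, the three participants' sorts, and the helper functions `build_indicator_matrix` and `cluster_attributes` are all hypothetical, and it assumes that each attribute is encoded as a row of 0/1 group memberships and that the rows are compared using Euclidean distance (SciPy's default for Ward linkage).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical attribute list and sort data (NOT the data from the experiment).
ATTRIBUTES = ["brightness", "colouration", "warmth",
              "width", "distance", "envelopment"]

# One entry per participant; each entry is that participant's list of groups.
TOY_SORTS = [
    [["brightness", "colouration", "warmth"],
     ["width", "distance", "envelopment"]],
    [["brightness", "colouration"], ["warmth"],
     ["width", "distance"], ["envelopment"]],
    [["brightness", "colouration", "warmth"],
     ["width", "distance"], ["envelopment"]],
]


def build_indicator_matrix(sorts, attributes):
    """Return an (n_attributes x n_groups) 0/1 matrix of group membership."""
    index = {name: i for i, name in enumerate(attributes)}
    # Pool every group from every participant into one list of columns.
    groups = [group for participant in sorts for group in participant]
    matrix = np.zeros((len(attributes), len(groups)))
    for column, group in enumerate(groups):
        for name in group:
            matrix[index[name], column] = 1.0
    return matrix


def cluster_attributes(matrix, n_clusters):
    """Cluster the rows with Ward linkage and cut the tree into flat clusters."""
    tree = linkage(matrix, method="ward")  # Euclidean distance between rows
    return fcluster(tree, t=n_clusters, criterion="maxclust")


membership = build_indicator_matrix(TOY_SORTS, ATTRIBUTES)
labels = cluster_attributes(membership, n_clusters=2)
for name, label in zip(ATTRIBUTES, labels):
    print(f"{name}: cluster {label}")
```

With this toy data, the two-cluster cut puts the three timbre-like terms in one cluster and the three space-like terms in the other. The linkage matrix returned by `linkage` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree, and clustering the transposed matrix groups the participants' groups in place of the attributes.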
At the highest level of the attribute dendrogram (i.e. the top plot of the figure), two clusters stand out. Looking at the attributes that fall in each cluster shows that this primary division is the well-known timbral/spatial split (see, for example, the work of Letowski).
The 'timbral' cluster is fairly well defined, with eleven out of eighteen group labels (seen at the right-hand side of the figure) using the term 'timbre' or some derivative.
The 'spatial' cluster is divided into approximately three groups. The cluster at the far right of the figure seems to relate to the localisation of sources: it contains attributes such as distance, elevation, localisation, depth, and width, and is labeled by participants with terms such as 'localisation', 'geometric measures', and 'source position'. The next cluster that is clearly defined (the leftmost sub-cluster of the 'spatial' cluster) is related to 'naturalness', 'believability', or the sense of 'being there'.
The remaining sub-cluster is less well defined; it can be further broken down into two parts. One group of attributes seemingly describes the feeling of spaciousness (immersion, envelopment, etc.), eliciting labels such as 'spatial quality', 'perceived spatial scene', 'environment', and 'room effect'. The last group is less clear; the attributes include focus/blur, homogeneity, penetration, emphasis, loudness, phase, and stability, and the elicited labels reflect this uncertainty, with terms including 'other', 'not sure', 'objective', 'audio quality', and 'loudness' (an attribute that was often not grouped with any other term).
This analysis is in broad agreement with previous categorizations in the literature, suggesting that there is a basis for grouping similar attributes in a way that does describe listeners' underlying perceptual experience. However, such an analysis cannot tell us whether there is useful information that can be obtained by looking at individual attributes from each group, or about the relative importance of each attribute and how they contribute to the overall listening experience. This will be the focus of future research.
Berg, J., and Rumsey, F., 2001. 'Verification and correlation of attributes used for describing the spatial quality of reproduced sound',
Le Bagousse, S., Paquier, M., and Colomes, C., 2010. 'Families of Sound Attributes for Assessment of Spatial Audio',
by Jon Francombe