Abstract
Background
Despite showing remarkable performance on tasks across modalities, deep neural networks remain opaque, with limited insight into how they internally organize information. Prior work suggests that vision (Zhou et al., 2015) and language (Tenney et al., 2019) models learn hierarchical representations.
Bregman (1990) describes the process of auditory scene analysis (ASA), in which the auditory system abstracts information based on perceptual properties such as pitch and timbre, among others, to construct auditory streams, which are then integrated to form higher-order percepts such as musical genre. This implies that audio representations pass through a series of hierarchical abstractions. While hierarchical representations are indicated in human auditory processing (Kell et al., 2018), the same has not been thoroughly demonstrated in deep learning models for audio.
Aims
We study how convolutional neural networks (CNNs) represent hierarchically organized audio tasks, hypothesizing that representations at earlier layers of a CNN perform better at lower-level tasks, while those at later layers perform better at higher-level tasks.
Methods
On the basis of Bregman's ASA model, we choose tasks arranged in a hierarchy from the domains of speech and music. We select three hierarchical musical tasks: note identification, instrument classification, and genre classification, representative of low-, mid-, and high-level tasks, respectively. For these tasks we use class-balanced subsets of NSynth (Engel et al., 2017) with 1800 instances across 12 classes, Medley-solos-DB (Lostanlen et al., 2019) with 965 instances across 7 classes, and GTZAN (Tzanetakis et al., 2001) with 1000 instances across 10 classes.
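To illustrate the class-balancing step, the following is a minimal sketch that assumes per-instance labels are available; the helper name and sampling scheme are illustrative and not taken from the original pipeline.

```python
# Illustrative sketch: draw an equal number of instances per class
# (e.g. NSynth: 12 classes x 150 instances = 1800). Hypothetical helper.
import random
from collections import defaultdict

def balanced_subset(labels, per_class, seed=0):
    """Return indices of a random subset containing `per_class` instances per label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)
```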
We inspect three models: VGGish (Hershey et al., 2017), CLAP (Elizalde et al., 2023), and MobileNetV3 (Schmid et al., 2023). We use k-Nearest Neighbour classifiers with class-balanced five-fold cross-validation to assess each model's accuracy on each task, using intermediate representations extracted at six equally spaced convolutional blocks and at the first fully-connected layer. We repeat this for three hierarchically related speech tasks: consonant classification, keyword recognition, and speaker count estimation, using similarly class-balanced subsets of PCVC (Malekzadeh et al., 2020) with 1794 instances across 23 classes, Speech Commands (Warden, 2018) with 1750 instances across 35 classes, and LibriCount (Stöter et al., 2018) with 1100 instances across 11 classes, respectively.
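A minimal sketch of the layer-wise probing step is shown below, assuming the intermediate embeddings for one layer have already been extracted and pooled into fixed-length vectors; the function name, the choice of k, and the scikit-learn-based setup are illustrative rather than the exact original implementation.

```python
# Sketch of the kNN probe with class-balanced (stratified) five-fold
# cross-validation over one layer's embeddings. Names are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def probe_layer(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Mean accuracy of a k-NN probe under stratified five-fold cross-validation.

    embeddings: (n_instances, n_features) representations from one layer.
    labels:     (n_instances,) task labels (e.g. note, instrument, or genre).
    """
    probe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(probe, embeddings, labels, cv=cv, scoring="accuracy").mean()

# Toy usage with random features standing in for real model activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 64))       # 120 instances, 64-dimensional embeddings
y = rng.integers(0, 4, size=120)     # 4 hypothetical classes
print(f"probe accuracy: {probe_layer(X, y):.2f}")
```

Repeating this probe per layer and per task yields the layer-versus-accuracy curves summarized in the Results.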
Results
Our results show that, even without explicit training on these tasks, the CNNs' early layers better support low-level tasks and later layers better support high-level tasks. On CLAP, for note identification, the first layer's accuracy (47%) exceeds that of later layers (<30%). For instrument classification, accuracy peaks at the last layer (97%), with a marked jump from the third layer (55%) to the fourth (71%). Genre classification accuracy at the last layer (75%) exceeds that of earlier layers (<66%). These trends hold across all three models, and we observe similar trends for the speech tasks.
Discussion
The results support our hypothesis, providing strong evidence that CNNs implicitly learn a hierarchical representation of sound. This mirrors the hierarchical encoding observed in the human auditory system, and it warrants further investigation into whether other deep learning architectures and training methodologies encode similar hierarchies. We also observe that low-level tasks are learnt implicitly, without task-specific supervision.