To study the role of auditory filter tuning and the neural transformations for representing natural sounds, we analyzed the modulation statistics of natural sound ensembles using a physiologically inspired auditory model. The model consists of two stages. A peripheral filterbank stage models the initial cochlear decomposition of a sound waveform into spectro-temporal components. A second, mid-level modulation filterbank stage decomposes the cochlear spectrogram of each sound into modulation components and is inspired by the modulation decomposition thought to occur in the auditory midbrain [28,29] (Fig 1). Both the peripheral and mid-level model filters are designed to match tuning characteristics observed physiologically and perceptually [8,26,27]. For comparison, we also analyzed natural sounds using the Fourier-based spectrographic and modulation decompositions that are widely used for sound analysis, synthesis, and recognition applications. All of the models were implemented in MATLAB and are available via GitHub (https://doi.org/10.5281/zenodo.7245908).
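As a rough illustration of the Fourier-based comparison analysis only, the following minimal Python sketch computes a short-time Fourier spectrogram of a toy signal and then takes a second Fourier transform along time in one frequency channel to obtain a temporal modulation spectrum. It is not the authors' MATLAB implementation and does not model the physiologically matched cochlear or midbrain filters. The test signal (a 1 kHz tone with 8 Hz amplitude modulation), the sampling rate, the frame length, the hop size, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a real sequence."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len, hop):
    """Hann-windowed magnitude spectrogram: one list of bin magnitudes per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames


def temporal_modulation_spectrum(envelope, frame_rate):
    """Modulation frequencies (Hz) and magnitudes of a mean-removed channel envelope."""
    mean = sum(envelope) / len(envelope)
    mags = dft_magnitudes([e - mean for e in envelope])
    freqs = [k * frame_rate / len(envelope) for k in range(len(mags))]
    return freqs, mags


# Toy input (assumed, not from the paper): 1 kHz carrier, 8 Hz amplitude modulation.
fs = 4000                      # sampling rate in Hz, kept low so the naive DFT stays fast
carrier_hz, mod_hz = 1000.0, 8.0
signal = [
    (1.0 + 0.8 * math.sin(2 * math.pi * mod_hz * t / fs))
    * math.sin(2 * math.pi * carrier_hz * t / fs)
    for t in range(fs)         # one second of audio
]

frame_len, hop = 32, 16        # assumed analysis parameters
spec = spectrogram(signal, frame_len, hop)

# Follow the spectrogram channel closest to the carrier over time.
carrier_bin = round(carrier_hz * frame_len / fs)
envelope = [frame[carrier_bin] for frame in spec]

# Second Fourier transform, along time, gives the temporal modulation spectrum.
freqs, mags = temporal_modulation_spectrum(envelope, frame_rate=fs / hop)
peak_hz = freqs[max(range(1, len(mags)), key=mags.__getitem__)]
print(f"peak modulation frequency: {peak_hz:.2f} Hz")
```

The peak of the modulation spectrum should fall close to the imposed 8 Hz modulation rate, within the frequency resolution set by the number of frames. A full spectro-temporal analysis would also transform along the frequency axis, and the model-based analysis would replace both Fourier stages with the tuned filterbanks described above.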
The sounds were chosen to represent two broad classes: background environmental sounds and animal vocalizations. Sounds within each class were divided into categories representing the specific source of the background sound or the species generating the vocalization. In all, we analyzed 29 sound categories: 10 background sound categories, 18 vocalization categories, and white noise as a reference. Example background sound categories included crackling fire, running water, and wind, while vocalization categories included human speech and parrot and New World monkey vocalizations. Each category contained 3 to 60 sound recordings lasting between 5 s and 203.8 s (average = 38.1 s). The length of each recording was limited by the recorded media, but we required a minimum total duration of 90 s per category to ensure that sufficient averaging could be performed to adequately assess the modulation statistics. In total, we analyzed 457 sound segments comprising 4.8 hours of recordings. All sounds were sampled at 44.1 kHz. The complete list of sound categories and media sources is provided in S1 Table and S1 Text.
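The inclusion criterion and the reported totals can be checked with simple bookkeeping. The sketch below is illustrative only: the 90 s threshold, the category counts, the segment count, and the average duration come from the text, whereas the per-recording durations in `example_recordings` and the function names are hypothetical.

```python
MIN_CATEGORY_SECONDS = 90.0  # minimum total duration per category, from the text


def total_duration_by_category(recordings):
    """Sum recording durations (seconds) for each category name."""
    totals = {}
    for category, seconds in recordings:
        totals[category] = totals.get(category, 0.0) + seconds
    return totals


def split_by_minimum_duration(totals, minimum=MIN_CATEGORY_SECONDS):
    """Separate categories that meet the minimum total duration from those that do not."""
    kept = {c: t for c, t in totals.items() if t >= minimum}
    dropped = {c: t for c, t in totals.items() if t < minimum}
    return kept, dropped


# Hypothetical durations, for illustration only (not the actual data set).
example_recordings = [
    ("crackling fire", 42.0), ("crackling fire", 55.5),
    ("running water", 203.8),
    ("wind", 30.0), ("wind", 25.0),
]

totals = total_duration_by_category(example_recordings)
kept, dropped = split_by_minimum_duration(totals)
print("kept:", sorted(kept))
print("dropped:", sorted(dropped))

# Consistency checks on the figures reported in the text.
n_categories = 10 + 18 + 1               # background + vocalization + white noise
approx_total_hours = 457 * 38.1 / 3600   # segments x average duration in seconds
print(n_categories, round(approx_total_hours, 2))
```

With these hypothetical durations, the wind category would fall below the threshold and need additional recordings. The last two lines confirm that the reported counts are mutually consistent: the category counts sum to 29, and 457 segments averaging 38.1 s amount to roughly 4.8 hours.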