Compression by Content Curation:
An Approach to Audio Summarization Driven by Cognition

Ishwarya Ananthabhotla and David Ramsay

As we move towards an increasingly IoT-enabled ecosystem, it is easier than ever to capture vast amounts of audio data. However, there are many scenarios in which we may seek a "compressed" representation of an audio stream, consisting of an intentional curation of content to achieve a specific presentation: a background soundtrack for studying or working; a summary of salient events over the course of a day; or an aesthetic soundscape that evokes nostalgia for a time and place. In this work, we present a novel, automated approach to the task of content-driven "compression", built upon the tenets of auditory cognition, attention, and memory. We expand upon the experimental findings in our previous work, which demonstrate the relative importance of higher-level gestalt and lower-level spectral principles in determining auditory memory, to design corresponding computational implementations enabled by auditory saliency models, deep neural networks for audio classification, and spectral feature extraction.
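As a rough illustration of the lower-level spectral side of such a feature space, the sketch below computes two common spectral descriptors per segment and ranks segments for inclusion in a short summary. It is not the implementation used in this work: the choice of features (spectral centroid and flatness), the scoring rule, the one-second segment length, and the function names `spectral_features` and `pick_segments` are all assumptions made for the example.

```python
import numpy as np


def spectral_features(segment, sample_rate):
    """Return (spectral centroid in Hz, spectral flatness) for a mono segment."""
    windowed = segment * np.hanning(len(segment))
    # Small floor keeps the log in the flatness computation finite for silence.
    power = np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    centroid = float(np.sum(freqs * power) / np.sum(power))
    # Flatness = geometric mean / arithmetic mean; near 0 for tones, higher for noise.
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    return centroid, flatness


def pick_segments(audio, sample_rate, segment_seconds=1.0, keep=5):
    """Rank fixed-length segments and return the start times (s) of the top `keep`."""
    hop = int(segment_seconds * sample_rate)
    scored = []
    for start in range(0, len(audio) - hop + 1, hop):
        seg = audio[start:start + hop]
        energy = float(np.mean(seg ** 2))
        _, flatness = spectral_features(seg, sample_rate)
        # Hypothetical score: louder, more tonal segments rank higher.
        scored.append((energy * (1.0 - flatness), start / sample_rate))
    top = sorted(scored, reverse=True)[:keep]
    # Return the chosen segments in chronological order.
    return sorted(t for _, t in top)
```

A real system would replace the hypothetical score with outputs from saliency and classification models; the point here is only the shape of the pipeline, in which each segment is reduced to a few features and those features decide what survives into the summary.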

We use our tool to generate a number of 30-second binaural mixes from eight-hour recordings captured in three contrasting locations at the Media Lab, and conduct a qualitative evaluation illustrating the relationship between our feature space and a user's perception of the resulting presentations. Below, you can sample mixes generated by the tool by selecting a location, a feature strategy, and a deck (left or right), and compare the different strategies against their user ratings.

Through this work, we suggest rethinking traditional paradigms of compression in favor of an approach that is goal-oriented and modulated by human cognition.

[Interactive mix player: left and right decks, feature strategy selector, and user ratings]