Musical Onset Detection and the S-Transform

For much of the final year of my master's program, I was searching high and low for a good project. I considered several things: robotics, image processing, or an interesting gadget.

Then I landed on a brilliant idea: why not do something with music, which I absolutely love? After racking my brain, I thought of designing an intelligent musical indexing and retrieval system. I will talk about this in a later post.

After some preliminary work, my supervisor and I realized that this was too much; there were too many things to consider. Musical information retrieval, machine learning and sentiment analysis were some of the topics involved. The first stage of the work seemed sufficient in breadth and depth for a good research project, so we stuck with it.

The research was simple in idea but complicated in operation: design a system to detect musical onsets. A musical onset has a broad range of definitions. Without getting too technical, an onset, as far as this research is concerned, is a single instant in time chosen to represent a transient, which is the short interval during which the signal undergoes a rapid and unpredictable evolution (see the figure below).

Attack, transient, onset, and decay in the case of an isolated musical note
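To make the definition a little more concrete, here is a minimal sketch of a very basic energy-based onset detector run on two synthetic decaying notes. This is not the S-transform method from the research; it is a deliberately simple stand-in, and the function name `energy_onsets`, the frame length and the threshold are all assumptions chosen for illustration.

```python
import numpy as np

def energy_onsets(signal, sample_rate, frame=200, threshold=0.5):
    """Return onset times (seconds) where the frame energy jumps sharply.

    Illustrative only: a plain energy-difference detector, not the
    S-transform based method described in this post.
    """
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    energy = (frames ** 2).sum(axis=1)
    # Keep only increases in energy; decays are not onsets.
    rise = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
    if rise.max() == 0:
        return np.array([])
    rise = rise / rise.max()
    # An onset is a local maximum of the rise that exceeds the threshold.
    peaks = [
        i for i in range(1, n_frames - 1)
        if rise[i] > threshold and rise[i] >= rise[i - 1] and rise[i] > rise[i + 1]
    ]
    return np.array(peaks) * frame / sample_rate

# Two synthetic 440 Hz notes with a sharp attack and an exponential decay.
sample_rate = 8000
signal = np.zeros(sample_rate)  # one second of audio
for start in (0.25, 0.60):
    k = int(round(start * sample_rate))
    local_t = np.arange(sample_rate - k) / sample_rate
    signal[k:] += np.exp(-8.0 * local_t) * np.sin(2 * np.pi * 440.0 * local_t)

onsets = energy_onsets(signal, sample_rate)
print(onsets)  # onset times in seconds, one per note
```

On a clean signal like this the detector reports one instant per note. Real recordings are far messier, which is where the choice of time-frequency analysis starts to matter.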

The detection of musical onsets is a crucial part of many musical information retrieval systems, including beat tracking and matching, indexing, fingerprinting and hashing. The numerous existing methods of onset detection share several common limitations, a primary one being genre-specific accuracy. Onset detection, and subsequent operations like beat tracking, have very high accuracy for genres where the “beat” is prominent, such as dance or rock music, while performing significantly worse for styles such as jazz or ensemble music.

One of the primary components of an onset detection system is time-frequency analysis. The most commonly used method for this is the Short-Time Fourier Transform (STFT). This method works well in the vast majority of cases, but an inherent quality of the STFT is that its resolution is fixed by a constant window size. This may sometimes dampen lower-frequency components, leading to inaccurate onset detection (see the images below).

STFT of a signal with varying window sizes
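The window-size trade-off can be sketched in a few lines of NumPy. The helper names `stft_magnitude` and `frequency_resolution`, the Hann window, and the specific window and hop sizes are assumptions made for this illustration, not the settings used in the research.

```python
import numpy as np

def stft_magnitude(signal, window_size, hop):
    """Magnitude STFT using a Hann window of one fixed length."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + window_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (frames, bins)

def frequency_resolution(sample_rate, window_size):
    """Spacing between adjacent STFT bins, in Hz."""
    return sample_rate / window_size

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second
# Two low-frequency tones, 50 Hz apart.
signal = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 100 * t)

short_win = stft_magnitude(signal, window_size=256, hop=128)
long_win = stft_magnitude(signal, window_size=2048, hop=1024)

# Short window: many time frames, but coarse frequency bins.
print(short_win.shape, frequency_resolution(sample_rate, 256))
# Long window: fine frequency bins, but only a handful of time frames.
print(long_win.shape, frequency_resolution(sample_rate, 2048))
```

With the short window each bin is roughly 31 Hz wide, so the two low tones are likely to blur together. The long window separates them but leaves only a few frames per second, which is too coarse for locating an onset in time. Whichever window is picked, the STFT applies it to every frequency.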

We chose to use the S-transform instead of the commonly used STFT to overcome this limitation. As is evident from the two figures below, the S-transform does a better job of capturing lower frequencies because its window size varies with frequency. The successful capture of lower frequencies is a crucial step in onset detection.

STFT of a discrete signal
S-Transform of the same signal
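For the curious, the frequency-dependent window can be sketched as follows. This is a minimal, unoptimized version of the standard discrete S-transform (Stockwell transform) computed through the FFT; the function name `s_transform` is an assumption, and the code is not the implementation used in the research.

```python
import numpy as np

def s_transform(x):
    """Discrete S-transform of a real 1-D signal.

    Returns a complex array of shape (N // 2 + 1, N): rows are frequency
    indices, columns are time samples. Illustrative sketch only.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    # Signed integer frequency offsets: 0, 1, ..., -2, -1
    m = np.fft.fftfreq(N) * N
    out = np.zeros((N // 2 + 1, N), dtype=complex)
    out[0, :] = x.mean()  # the zero-frequency row is the signal mean
    for n in range(1, N // 2 + 1):
        # Gaussian in the frequency domain whose width grows with n,
        # so the matching time-domain window shrinks as frequency rises.
        gauss = np.exp(-2.0 * np.pi ** 2 * m ** 2 / n ** 2)
        out[n, :] = np.fft.ifft(np.roll(X, -n) * gauss)
    return out

# A pure tone that completes 8 cycles over 64 samples.
N = 64
tone = np.cos(2 * np.pi * 8 * np.arange(N) / N)
S = s_transform(tone)
strongest_row = int(np.argmax(np.abs(S).mean(axis=1)))
print(S.shape, strongest_row)
```

Because the Gaussian is rescaled for every frequency index, low frequencies are analysed with a long time window and high frequencies with a short one, rather than one fixed window for everything as in the STFT.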

This limitation is irrelevant in music where strong beats are present (see the image below), but in some styles of music (such as classical music), the beats are not as prominent, and a successful capture of the lower frequencies is needed.

Waveform of a dance music segment (top), and a classical music segment (bottom) – note the prominence of beats in the top figure

The method we proposed was tested on several publicly available datasets, and the results were good. Although a few methods that rely on other techniques outperformed the proposed method by a narrow margin, the computational performance of the proposed method was better.

Detailed explanations of the experiment, as well as the datasets and results, can be found in the IEEE publication at LINK.