Rainer Kelz, "Exploring Polyphonic Piano Transcription as an Inverse Problem", 8-2022

Original title:

Exploring Polyphonic Piano Transcription as an Inverse Problem

Language of the title:

English

Original abstract:

This thesis concerns itself with the problem of instrument-specific polyphonic transcription: given an audio recording of an instrument playing a polyphonic musical piece, the task is to decompose this audio signal into a symbolic representation of individual notes, inferring start, end, note number and volume.
The specific instrument we will focus on is the piano. This instrument has an extensive tonal and dynamic range, and it can play many different pitches at the same time. Each key is capable of producing easily perceivable differences in volume and timbre between soft and loud notes, depending on how fast it is struck.
We will approach the polyphonic transcription task by formulating different versions of it as supervised machine learning problems, and employ parameterized, nonlinear function approximators, commonly called "neural networks", to obtain approximate solutions to this task. The parameters of these functions will be learned by minimizing a standard objective function for multi-label problems, using several variants of stochastic gradient descent. In other words: we frame polyphonic piano transcription as an instance of a supervised sequence labeling problem, with the somewhat special property that multiple labels overlapping in time are the norm.
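The multi-label objective mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the objective is the mean binary cross-entropy over time frames and piano keys; the abstract does not name the exact loss. The names `NUM_KEYS`, `bce_with_logits`, `multilabel_loss` and `frame_with_keys` are hypothetical and not taken from the thesis.

```python
import math

NUM_KEYS = 88  # hypothetical: one independent output per key of a standard piano


def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy for one logit z and one 0/1 target y."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def multilabel_loss(logits, targets):
    """Mean binary cross-entropy over all frames and all keys.

    Every key is scored independently, so several labels may be
    active in the same frame (a chord)."""
    total, count = 0.0, 0
    for frame_logits, frame_targets in zip(logits, targets):
        for z, y in zip(frame_logits, frame_targets):
            total += bce_with_logits(z, y)
            count += 1
    return total / count


def frame_with_keys(active):
    """Build a 0/1 target vector with the given key indices switched on."""
    return [1.0 if k in active else 0.0 for k in range(NUM_KEYS)]


# Two frames: three simultaneously active keys, then a single held key.
targets = [frame_with_keys({39, 43, 46}), frame_with_keys({39})]

# A model that outputs logit 0 everywhere (probability 0.5 for every key).
uninformed = [[0.0] * NUM_KEYS for _ in targets]

# A model that is confidently correct about every key in every frame.
confident = [[6.0 if y == 1.0 else -6.0 for y in frame] for frame in targets]

print(multilabel_loss(uninformed, targets))
print(multilabel_loss(confident, targets))
```

In this sketch the uninformed predictions score a loss of log 2 per label, and the confident, correct predictions score close to zero. A stochastic gradient descent variant would adjust the network parameters to reduce this quantity over the training set.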
After establishing simple and straightforward baselines, we follow up with analyses of two major problems that are inherent to the nonlinear-model approach and affect every sequence labeling problem with multiple overlapping labels. We will characterize the "Entanglement Problem", which ails nonlinear models that try to solve combinatorial problems, such as polyphonic piano transcription. In the majority of cases, the models simply memorize which input combination is associated with a particular set of output labels, failing to treat the different components (notes) of the mixed input combinations (chords) as separate entities. When the models are confronted with unseen combinations of input components at test time, we observe a wide variety of insertion, deletion and substitution errors.
A somewhat lesser problem with high-capacity nonlinear transcription systems is temporal label noise: even tiny temporal shifts in the annotations can lead to noticeable performance degradation. We deem this the lesser of the two problems, because the fix is straightforward: verify that the annotated data used for training contains as little temporal noise as possible.
Ignoring both of these problems for the time being, we employ models based on the previously established baselines, and couple them with multitask learning and probabilistic sequence modeling techniques. We infuse prior knowledge about the temporal evolution of notes into the model, instead of learning it from data, showing that this improves note-level transcription performance when only relatively little data is available.
Intermittently, we discuss the feasibility of learning transcription models purely from environmental interaction, without supervision, utilizing a standard reinforcement learning approach, and analyze its behavior in an environment with a large action space.
As a follow-up, we will provide arguments in favor of using invertible neural network models for piano transcription. This approach facilitates better understanding of the behavior of the function that could be learned from data, and improves the interpretability of its inferences.
We will conclude with an in-depth discussion of several theoretically appealing, yet practically unsuccessful approaches that aim to alleviate the problems ailing the neural network approach to polyphonic piano transcription.