"Modelling Emotional Expression in Music Using Interpretable and Transferable Perceptual Features"
Emotional expression is one of the most important elements underlying humans' intimate relationship with music, and yet it has remained one of the trickiest attributes of music to model computationally. Its inherent subjectivity and context dependence render most machine learning methods unreliable outside a very narrow domain. Practitioners find it hard to gain confidence in the models they train, which makes deploying these models to user-facing applications (such as the recommendations that drive modern digital streaming platforms) problematic.
One approach to improving trust in models is explainability. For deep end-to-end music emotion models, a fundamental challenge is that it is not clear how explanations for such models might make sense to humans: are they even musically meaningful in any way? We know that humans perceive music across multiple semantic levels, from individual sonic events and sound texture to overall musical structure. Therein lies the motivation for making explanations meaningful using features that represent an intermediate level of musical perception.
This thesis focuses on mid-level perceptual features and their use in modelling and explaining musical emotion. We propose an explainable bottleneck model architecture and show that mid-level features provide an intuitive and effective feature space for predicting perceived emotion in music, as well as for explaining music emotion predictions ("Perceive"). We further demonstrate how these explanations can be extended by using interpretable components from the audio input to explain the mid-level feature values themselves, thereby tracing the predictions of a model back to the input ("Trace"). Next, we use mid-level features to tackle the elusive problem of modelling subtle expressive variations between different interpretations/performances of a set of piano pieces. However, because the original dataset for learning mid-level features contains few solo piano clips, a model trained on it cannot be transferred to piano music directly. To overcome this, we propose an unsupervised domain adaptation pipeline to adapt our model to solo piano pieces ("Transfer"). Compared to other feature sets, we find that mid-level features are better suited to modelling performance-specific variations in emotional expression ("Disentangle"). Finally, we provide a direction for future research in mid-level feature learning by augmenting the feature space with algorithmic analogues of perceptual speed and dynamics, two features that are missing in the present formulation and datasets, and we use a model incorporating these new features to demonstrate emotion prediction on a recording of a well-known musician playing and modifying a melody according to specific intended emotions ("Communicate").
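
To make the bottleneck idea concrete, the following is a minimal sketch and not the architecture used in the thesis. It assumes that the final emotion predictions are a linear function of the mid-level feature values, so that each prediction decomposes into per-feature contributions. The feature names, emotion names, dimensions, and random weights are all hypothetical placeholders; a real system would use a trained network in place of `W_mid`.

```python
import numpy as np

# Hypothetical names, for illustration only (not taken from the abstract).
MID_LEVEL = ["melodiousness", "articulation", "rhythmic_complexity", "dissonance"]
EMOTIONS = ["valence", "energy"]

rng = np.random.default_rng(0)

# Stage 1 (stand-in): maps an 8-dimensional audio embedding to mid-level values.
# Random weights are used here in place of a trained network.
W_mid = rng.normal(size=(8, len(MID_LEVEL)))

# Stage 2: a linear layer from the mid-level bottleneck to emotion ratings.
W_emo = rng.normal(size=(len(MID_LEVEL), len(EMOTIONS)))
b_emo = np.zeros(len(EMOTIONS))


def predict(audio_embedding):
    """Return bottleneck values, per-feature contributions, and emotion predictions."""
    mid = np.tanh(audio_embedding @ W_mid)   # bottleneck activations, shape (4,)
    effects = mid[:, None] * W_emo           # contribution of each feature, shape (4, 2)
    emotions = effects.sum(axis=0) + b_emo   # contributions add up to the prediction
    return mid, effects, emotions


x = rng.normal(size=8)  # placeholder audio embedding
mid, effects, emotions = predict(x)
for j, emo in enumerate(EMOTIONS):
    print(emo, round(float(emotions[j]), 3))
    for i, name in enumerate(MID_LEVEL):
        print("   ", name, round(float(effects[i, j]), 3))
```

Because the second stage is linear in this sketch, the contributions listed under each emotion sum exactly to that emotion's prediction (up to the bias term), which is what makes the bottleneck values usable as an explanation.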