Jan Schlüter, Th. Grill,
"Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks"
Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), October 2015
In computer vision, state-of-the-art object recognition systems rely on label-preserving image transformations such as scaling and rotation to augment the training datasets. The additional training examples help the system learn invariances that are difficult to build into the model, and they improve generalization to unseen data. To the best of our knowledge, this approach has not been systematically explored for music signals. Using singing voice detection with neural networks as an example, we apply a range of label-preserving audio transformations to assess their utility for music data augmentation. In line with recent research in speech recognition, we find pitch shifting to be the most helpful augmentation method. Combined with time stretching and random frequency filtering, it reduces the classification error by between 10% and 30%, reaching the state of the art on two public datasets. We expect audio data augmentation to yield significant gains for several other sequence labelling and event detection tasks in music information retrieval.
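The abstract names three label-preserving audio transformations: pitch shifting, time stretching and random frequency filtering. The following minimal Python sketch shows one way such transformations could be applied to a magnitude spectrogram. It is not the authors' implementation. The spectrogram representation (a list of frames over linearly spaced frequency bins), the linear-interpolation resampling, the Gaussian-shaped filter in dB, the function names `pitch_shift`, `time_stretch`, `random_filter` and `augment`, and all parameter ranges are illustrative assumptions.

```python
import math
import random

# Assumed representation: a list of frames, each a list of non-negative
# magnitudes over linearly spaced frequency bins (e.g. STFT magnitudes).
Spectrogram = list[list[float]]


def _interp(values: list[float], pos: float) -> float:
    """Linearly interpolate `values` at fractional index `pos`; 0.0 outside the range."""
    if pos < 0 or pos > len(values) - 1:
        return 0.0
    lo = int(pos)  # pos >= 0 here, so int() equals floor()
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return values[lo] * (1.0 - frac) + values[hi] * frac


def pitch_shift(spec: Spectrogram, semitones: float) -> Spectrogram:
    """Scale every frame along the frequency axis by 2**(semitones / 12)."""
    factor = 2.0 ** (semitones / 12.0)
    # Output bin k takes the energy that was at input bin k / factor;
    # bins mapped from beyond the top of the spectrum become 0.0.
    return [[_interp(frame, k / factor) for k in range(len(frame))] for frame in spec]


def time_stretch(spec: Spectrogram, factor: float) -> Spectrogram:
    """Resample the frame sequence so the excerpt becomes `factor` times as long."""
    n_out = max(1, round(len(spec) * factor))
    out = []
    for t in range(n_out):
        # Position in the input frame sequence, clamped to the last frame.
        pos = min(t / factor, len(spec) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(spec) - 1)
        frac = pos - lo
        out.append([a * (1.0 - frac) + b * frac for a, b in zip(spec[lo], spec[hi])])
    return out


def random_filter(spec: Spectrogram, rng: random.Random, max_db: float = 10.0) -> Spectrogram:
    """Boost or attenuate a random frequency band with a Gaussian-shaped gain in dB."""
    n_bins = len(spec[0])
    mu = rng.uniform(0.0, n_bins - 1)                 # band centre (bin index)
    sigma = rng.uniform(1.0, max(1.0, n_bins / 4.0))  # band width (bins)
    strength_db = rng.uniform(-max_db, max_db)        # peak gain at the centre
    gains = [
        10.0 ** (strength_db * math.exp(-0.5 * ((k - mu) / sigma) ** 2) / 20.0)
        for k in range(n_bins)
    ]
    return [[m * g for m, g in zip(frame, gains)] for frame in spec]


def augment(spec: Spectrogram, rng: random.Random) -> Spectrogram:
    """Apply all three transformations with randomly drawn parameters."""
    spec = pitch_shift(spec, rng.uniform(-3.0, 3.0))  # hypothetical range: +/- 3 semitones
    spec = time_stretch(spec, rng.uniform(0.8, 1.2))  # hypothetical range: +/- 20 %
    return random_filter(spec, rng, max_db=10.0)      # hypothetical range: +/- 10 dB
```

Because each transformation is meant to preserve the "singing / no singing" label of an excerpt, the label is simply reused for the augmented copy; with frame-wise targets, the label sequence would have to be time-stretched in the same way as the spectrogram. Drawing fresh parameters from `rng` for every training example yields a different variant each epoch.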