Paul Primus, Gerhard Widmer,
"Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers"
Proceedings of the 30th European Signal Processing Conference (EUSIPCO 2022), 7-2022
Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks and on three different benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both compared convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the 'semantic distance' between training and new test classes is large, an effect that will deserve more detailed investigations.
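The idea of predicting unseen classes from class descriptions can be illustrated with a minimal sketch. It assumes that an audio clip and each class description have already been mapped into a shared embedding space, and it scores them with cosine similarity. The scoring function, the function names, and the toy 3-dimensional vectors are all hypothetical and chosen only for illustration; the abstract does not specify the compatibility function used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(audio_embedding, class_embeddings):
    """Pick the class whose description embedding is closest to the audio.

    class_embeddings maps a class name to the embedding of its description.
    New classes can be added at test time by adding entries to this dict,
    without retraining the audio embedding model.
    """
    scores = {
        name: cosine(audio_embedding, emb)
        for name, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up embeddings (assumption: not taken from any real model).
class_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.2, 0.9],
    "violin": [0.1, 0.9, 0.1],
}
audio_embedding = [0.2, 0.8, 0.2]

predicted, scores = zero_shot_predict(audio_embedding, class_embeddings)
print(predicted)
```

In this toy setup, none of the class names needs to have been seen while training the audio encoder; only their description embeddings are required at prediction time.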