How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval
13th International Society for Music Information Retrieval Conference (ISMIR 2012)
The principal goal of the annual Music Information Retrieval
Evaluation eXchange (MIREX) experiments is to
determine which systems perform well and which systems
perform poorly on a range of MIR tasks. However, there
has been no systematic analysis regarding how well these
evaluation results translate into real-world user satisfaction.
For most researchers, reaching statistical significance
in the evaluation results is usually the most important goal,
but in this paper we show that indicators of statistical significance
(i.e., a small p-value) are ultimately of secondary
importance. Researchers who want to predict the real-world
implications of formal evaluations should properly
report practical significance (i.e., a large effect size).
Using data from the 18 systems submitted to the MIREX
2011 Audio Music Similarity and Retrieval task, we ran
an experiment with 100 real-world users that allows us to
explicitly map system performance onto user satisfaction.
Based upon 2,200 judgments, the results show that absolute
system performance needs to be quite large for users
to be satisfied, and differences between systems have to be
very large for users to actually prefer the supposedly better
system. The results also suggest a practical upper bound of
80% on user satisfaction with the current definition of the
task. Reflecting upon these findings, we make some recommendations
for future evaluation experiments and for the
reporting and interpretation of results in peer review.