We cast the combinatorial problem of polyphonic piano transcription as a two-stage
process. A nonlinear denoising stage maps spectrogram representations of arbitrary
piano music with unknown timbral characteristics onto a canonical spectrogram
representation with known timbral characteristics. A subsequent linear demixing
stage aims to exploit the knowledge about the canonical timbral characteristics.
The idea behind this two-stage process is to sidestep any musical bias inherent
in the training dataset, which a single-stage, large-capacity nonlinear (neural)
transcription system easily picks up. The two-stage process avoids forcing the
nonlinear system to solve a combinatorial problem that is better suited to a
linear decomposition method with the superposition property.
Using the simplest setup we could think of, we obtain
rather mixed (pun intended) results on a standard polyphonic piano transcription
dataset: the two-stage process still suffers from generalization problems after the
first stage, for which the second stage is unable to compensate.