We advance the state of the art in polyphonic piano music transcription by
using a deep convolutional and recurrent neural network which is trained to
jointly predict onsets and frames. Our model predicts pitch onset events and
then uses those predictions to condition framewise pitch predictions...
inference, we restrict the predictions from the framewise detector by not
allowing a new note to start unless the onset detector also agrees that an
onset for that pitch is present in the frame. We focus on improving onsets and
offsets together instead of either in isolation as we believe this correlates
better with human musical perception. Our approach results in over a 100%
relative improvement in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized
audio which results in more natural-sounding transcriptions.