Our submissions achieved an average MAE of 5.72 and ranked 5th in the shared task.
With this dataset, we hope to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting 'skeleton' in the sentence embedding space.
Human evaluation of machine translation normally uses sentence-level measures such as relative ranking or adequacy scales.
Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, Aleš Tamchyna