The evaluation of text simplification (TS) systems remains an open challenge.
Because the task shares features with machine translation (MT), TS is often
evaluated using MT metrics such as BLEU. However, such metrics require
high-quality reference data, which is rarely available for TS. Unlike MT, TS is
a monolingual task, which allows the simplified text to be compared directly
with its original version. In this paper, we
compare multiple approaches to reference-less quality estimation of
sentence-level text simplification systems, based on the dataset used for the
QATS 2016 shared task. We distinguish three different dimensions:
grammaticality, meaning preservation and simplicity. We show that n-gram-based
MT metrics such as BLEU and METEOR correlate the most with human judgment of
grammaticality and meaning preservation, whereas simplicity is best evaluated
by basic length-based metrics.
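
To make the reference-less setting concrete, the sketch below scores a simplified sentence directly against its source sentence, with no human reference. It is only an illustration under stated assumptions: `ngram_precision` is a simplified clipped n-gram precision standing in for BLEU-style overlap (it is not BLEU or METEOR), `compression_ratio` is one possible length-based measure, tokenization is naive whitespace splitting, and all function names and the example sentences are hypothetical rather than taken from the paper or the QATS data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(source, output, n=2):
    """Clipped n-gram precision of `output` measured against `source`.

    Assumption: a simplified stand-in for BLEU-style overlap, computed
    against the original sentence instead of a human reference.
    """
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    total = sum(out.values())
    if total == 0:
        return 0.0
    # Each output n-gram counts at most as often as it occurs in the source.
    overlap = sum(min(count, src[gram]) for gram, count in out.items())
    return overlap / total


def compression_ratio(source, output):
    """Length of the output relative to the source, in whitespace tokens."""
    src_len = len(source.split())
    return len(output.split()) / src_len if src_len else 0.0


# Hypothetical sentence pair, for illustration only.
source = "the cat sat on the mat because it was tired"
output = "the cat sat on the mat"

print(ngram_precision(source, output, n=2))
print(compression_ratio(source, output))
```

A high overlap score suggests that the output stays close to the source, which relates to the grammaticality and meaning preservation dimensions, while the length ratio gives a rough signal for simplicity.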