Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness.
However, the task of time and space complexity classification from code has not been extensively explored due to a lack of datasets, with prior endeavors being limited to Java.
We introduce the StatCan Dialogue Dataset, consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables.
Ranked #1 on Table Retrieval on Statcan Dialogue Dataset
We train a neural model with this feedback data that can generate explanations and re-score answer candidates.
Ranked #1 on Overall - Test on FeedbackQA
One of the biggest challenges that prohibits the use of many current NLP methods in clinical settings is the limited availability of public datasets.
Ranked #1 on Mortality Prediction on MIMIC-III (Accuracy metric)