Rapid progress in Neural Machine Translation (NMT) systems over the last few years has focused primarily on improving translation quality and, as a secondary focus, on improving robustness to perturbations (e.g., spelling).
While recent progress in the field of ML has been significant, the reproducibility of cutting-edge results is often lacking, as many submissions omit the information necessary to reproduce them.
In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations?
In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality.
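To make that evaluation paradigm concrete, below is a minimal sketch of a targeted syntactic evaluation under a causal language model; the model (gpt2) and the subject-verb agreement minimal pair are illustrative choices, not the paper's exact setup.

```python
# Score a grammatical/ungrammatical minimal pair under a causal LM and check
# which one the model prefers. Model and sentences are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids gives the mean cross-entropy over predicted tokens
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # total log-probability

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(sentence_logprob(good) > sentence_logprob(bad))  # True if the model prefers the grammatical form
```

Varying the context properties studied in the paper (longer contexts, added grammatical violations) then amounts to prepending different prefixes to both members of the pair.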
no code implementations • 6 Oct 2022 • Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin
We present a taxonomy for characterising and understanding generalisation research in NLP.
Neural Machine Translation systems built on top of Transformer-based architectures are routinely improving the state-of-the-art in translation quality according to word-overlap metrics.
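For reference, a word-overlap metric of the kind alluded to here can be computed with sacrebleu; the hypothesis and reference below are toy strings.

```python
# BLEU, a standard word-overlap metric, via sacrebleu (toy example).
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```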
Gender stereotypes and bias have recently raised significant ethical concerns in natural language processing.
Rapid progress in Neural Machine Translation (NMT) systems over the last few years has been driven primarily by the goal of improving translation quality and, as a secondary focus, improving robustness to input perturbations (e.g., spelling and grammatical mistakes).
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines.
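A hedged sketch of the standard probing recipe behind such claims: freeze the MLM, extract per-token hidden states, and fit a small linear probe on a syntactic labeling task. The model name is a common default; the gold labels (y below) are assumed to exist and are not shown.

```python
# Probe a frozen MLM's representations for syntactic information (sketch).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def token_features(sentence: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, 1:-1]  # drop [CLS] and [SEP]

X = token_features("The keys to the cabinet are on the table.").numpy()
# y = [...]  # gold syntactic labels for these tokens (assumed available)
# probe = LogisticRegression(max_iter=1000).fit(X, y)
# High probe accuracy is read as evidence the property is encoded.
```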
1 code implementation • 13 Jan 2021 • Anuroop Sriram, Matthew Muckley, Koustuv Sinha, Farah Shamout, Joelle Pineau, Krzysztof J. Geras, Lea Azour, Yindalon Aphinyanaphongs, Nafissa Yakubova, William Moore
The first is deterioration prediction from a single image, where our model achieves an area under the receiver operating characteristic curve (AUC) of 0.742 for predicting an adverse event within 96 hours (compared to 0.703 with supervised pretraining) and an AUC of 0.765 for predicting oxygen requirements greater than 6 L a day at 24 hours (compared to 0.749 with supervised pretraining).
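As a reminder of what these numbers measure: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores below are made up for illustration, not the paper's data.

```python
# Toy AUC computation with scikit-learn.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # e.g. adverse event within 96h (toy)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # model risk scores (toy)
print(roc_auc_score(y_true, y_score))      # 0.888... on this toy data
```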
In this work, we study the logical generalization capabilities of GNNs by designing a benchmark suite grounded in first-order logic.
We provide novel evidence that complicates this claim: we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the originals, i.e., they are largely invariant to random word-order permutations.
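A minimal sketch of such a permutation test, assuming an off-the-shelf NLI classifier (roberta-large-mnli is an illustrative choice): shuffle the premise's word order and check whether the predicted label survives.

```python
# Check whether an NLI model's prediction is invariant to word-order permutation.
import random
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def permute(sentence: str, seed: int = 0) -> str:
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

premise, hypothesis = "A man is playing a guitar on stage.", "A man is performing music."
original = nli({"text": premise, "text_pair": hypothesis})[0]["label"]
shuffled = nli({"text": permute(premise), "text_pair": hypothesis})[0]["label"]
print(original, shuffled, original == shuffled)  # invariance if the labels match
```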
Surprisingly, we find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods, despite using a much weaker synchronization mechanism.
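Codistillation here amounts to each replica adding a distillation term toward a peer replica's periodically exchanged (and possibly stale) predictions; the loss below is a minimal sketch under that reading, not the paper's exact recipe.

```python
# One replica's codistillation objective (sketch).
import torch.nn.functional as F

def codistillation_loss(logits, targets, peer_logits, alpha=0.5):
    task = F.cross_entropy(logits, targets)
    # Pull this replica toward the peer's distribution; peer_logits come from
    # the other replica and are treated as constants (detached).
    distill = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(peer_logits.detach(), dim=-1),
                       reduction="batchmean")
    return task + alpha * distill
```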
We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs.
This report documents ideas for improving the field of machine learning, which arose from discussions at the ML Retrospectives workshop at NeurIPS 2019.
Recently, there has been much interest in the question of whether deep natural language understanding models exhibit systematicity: generalizing such that units like words make consistent contributions to the meaning of the sentences in which they appear.
Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue.
Reproducibility, that is, obtaining results similar to those presented in a paper or talk using the same code and data (when available), is a necessary step in verifying the reliability of research findings.
Recent research has highlighted the role of relational inductive biases in building learning agents that can generalize and reason in a compositional manner.
The recent success of natural language understanding (NLU) systems has been troubled by results highlighting the failure of these models to generalize in a systematic and robust way.
Neural networks for natural language reasoning have largely focused on extractive, fact-based question-answering (QA) and common-sense inference.
This article presents in detail the RLLChatbot that participated in the 2017 ConvAI challenge.
Adversarial examples can be defined as inputs to a model that induce a mistake, i.e., cases where the model's output differs from that of an oracle, perhaps in surprising or malicious ways.
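That definition translates directly into a predicate; model, oracle, and distance below are assumed callables for illustration, not a concrete API.

```python
# x_adv is adversarial for `model` if it stays close to x yet makes the model
# disagree with the oracle. All callables here are hypothetical placeholders.
def is_adversarial(x, x_adv, model, oracle, distance, eps):
    close = distance(x, x_adv) <= eps        # small, e.g. imperceptible, change
    mistake = model(x_adv) != oracle(x_adv)  # model output differs from oracle
    return close and mistake
```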
Deep neural networks have displayed superior performance over traditional supervised classifiers in text classification.
The use of dialogue systems as a medium for human-machine interaction is an increasingly prevalent paradigm.