This paper presents QUARTZ (QUality-AwaRe machine Translation), a project led by Unbabel that aims to develop machine translation systems that are more robust and produce fewer critical errors.
We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels.
In this paper we provide a systematic comparison of existing and new document-level neural machine translation solutions.
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2021 Metrics Shared Task.
We present the joint contribution of IST and Unbabel to the WMT 2021 Shared Task on Quality Estimation.
We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels.
We present the joint contribution of IST and Unbabel to the WMT 2020 Shared Task on Quality Estimation.
This paper presents the Multilingual Artificial Intelligence Agent Assistant (MAIA), a project led by Unbabel in collaboration with CMU, INESC-ID and IT Lisbon.
DeepSPIN is a research project funded by the European Research Council (ERC) whose goal is to develop new neural structured prediction methods, models, and algorithms for improving the quality, interpretability, and data-efficiency of natural language processing (NLP) systems, with special emphasis on machine translation and quality estimation applications.
A core ingredient in RLHF's success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs.
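As a rough illustration of how such a reward model is typically trained, the sketch below implements the standard Bradley-Terry pairwise objective on preference pairs; the function name and toy scores are made up for illustration and are not the specific setup used in this work.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # output above the reward of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards produced by a reward model for three preference pairs.
chosen = torch.tensor([1.3, 0.2, 2.1])
rejected = torch.tensor([0.9, 0.5, 1.0])
print(float(pairwise_reward_loss(chosen, rejected)))
```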
Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.
Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output.
Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score.
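For reference, such a sentence-level score can be obtained roughly as follows; this is a sketch assuming the `unbabel-comet` package and the publicly released `Unbabel/wmt22-comet-da` checkpoint, so names and interfaces should be checked against the package documentation.

```python
from comet import download_model, load_from_checkpoint

# Download a public COMET checkpoint and score a single translation hypothesis.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "O gato está em cima da mesa.",
    "mt": "The cat is on the table.",
    "ref": "The cat is on top of the table.",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one sentence-level quality score per example
```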
Split conformal prediction has recently sparked great interest due to its ability to provide formally guaranteed uncertainty sets or intervals for predictions made by black-box neural models, ensuring a predefined probability of containing the actual ground truth.
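To make the coverage guarantee concrete, here is a minimal sketch of split conformal prediction for a regression-style score; the function and the toy calibration data are illustrative, not taken from the paper.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    # Use absolute residuals on a held-out calibration set to build an
    # interval around the test prediction with roughly (1 - alpha) coverage.
    residuals = np.abs(np.asarray(cal_targets) - np.asarray(cal_preds))
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample corrected rank
    q = np.sort(residuals)[min(k, n) - 1]        # conformal quantile
    return test_pred - q, test_pred + q

# Toy usage with synthetic calibration data.
rng = np.random.default_rng(0)
cal_preds = rng.normal(size=500)
cal_targets = cal_preds + rng.normal(scale=0.3, size=500)
print(split_conformal_interval(cal_preds, cal_targets, test_pred=0.4, alpha=0.1))
```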
Our team participated in all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2).
no code implementations • 14 Aug 2023 • Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems.
Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered critical errors, such as deviations in entities and numbers.
Selective rationales and counterfactual examples have emerged as two effective, complementary classes of interpretability methods for analyzing and training NLP models.
Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining.
1 code implementation • 20 May 2023 • Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages.
Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU.
no code implementations • 1 May 2023 • Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. de Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, André F. T. Martins
Many recent advances in natural language generation have been fueled by training large language models on internet-scale data.
Large-scale multilingual machine translation systems have demonstrated remarkable ability to translate directly between numerous languages, making them increasingly appealing for real-world applications.
Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
Neural machine translation (NMT) has become the de-facto standard in real-world machine translation applications.
Current abstractive summarization systems exhibit important weaknesses that prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations).
1 code implementation • 13 Sep 2022 • Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte M. Alves, Alon Lavie, Luisa Coheur, André F. T. Martins
We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE).
no code implementations • 31 Aug 2022 • Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows.
Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground.
Semi-parametric models, which augment generation with retrieval, have led to impressive results in language modeling and machine translation, due to their ability to retrieve fine-grained information from a datastore of examples.
Despite recent progress in machine translation quality estimation and evaluation, decoding in neural machine translation (NMT) remains mostly oblivious to these advances and centers on finding the most probable translation according to the model (MAP decoding), typically approximated with beam search.
On the other hand, semi-parametric models have been shown to successfully perform domain adaptation by retrieving examples from an in-domain datastore (Khandelwal et al., 2021).
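A rough sketch of the retrieval-and-interpolation idea behind such semi-parametric models, in the spirit of kNN-MT: retrieve the nearest datastore entries for the current decoder state and mix the resulting next-token distribution with the parametric model's. All names and the datastore layout below are illustrative assumptions.

```python
import numpy as np

def knn_interpolated_probs(query, datastore_keys, datastore_tokens,
                           model_probs, vocab_size, k=4, temperature=10.0, lam=0.5):
    # Distances from the current decoder state to every stored key.
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    # Build a next-token distribution from the retrieved neighbours.
    knn_probs = np.zeros(vocab_size)
    for w, idx in zip(weights, nearest):
        knn_probs[datastore_tokens[idx]] += w
    # Interpolate with the parametric model's distribution.
    return lam * knn_probs + (1 - lam) * model_probs
```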
In this work, leveraging meta-learning techniques, we extend this idea to improve the quality of the explanations themselves, specifically by optimizing explanations such that student models more effectively learn to simulate the original model.
Trainable evaluation metrics for machine translation (MT) exhibit strong correlation with human judgements, but they are often hard to interpret and might produce unreliable scores under noisy or out-of-domain data.
Recent work has shown promising results in causal discovery by leveraging interventional data with gradient-based methods, even when the intervened variables are unknown.
In this paper, we combine the representational strengths of factor graphs and of neural networks, proposing undirected neural networks (UNNs): a flexible framework for specifying computations that can be performed in any order.
Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax.
Although proper handling of discourse significantly contributes to the quality of machine translation (MT), these improvements are not adequately measured in common translation quality metrics.
Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length.
Ranked #1 on Dialogue Generation on CMU-DoG
In contrast, for finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax, $\alpha$-entmax, and fusedmax) has led to distributions with varying support.
Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context -- context from sentences other than those currently being translated.
Neural networks and other machine learning models compute continuous representations, while humans communicate with discrete symbols.
Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences.
We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE).
Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data.
In this paper, we propose a new training strategy which replaces these estimators by an exact yet efficient marginalization.
Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation).
Ranked #36 on Visual Question Answering (VQA) on VQA v2 test-std
Structured prediction requires manipulating a large number of combinatorial structures, e.g., dependency trees or alignments, either as latent or output variables.
Our quantitative and qualitative analysis finds that heads in different layers learn different sparsity preferences and tend to have more diverse attention distributions than those of softmax Transformers.
Ranked #1 on Machine Translation on IWSLT2017 German-English
The combination of machines and humans for translation is effective, with many studies showing productivity gains when humans post-edit machine-translated output instead of translating from scratch.
We present the contribution of the Unbabel team to the WMT 2019 Shared Task on Quality Estimation.
Named entity recognition (NER) and entity linking (EL) are two fundamentally related tasks, since in order to perform EL, first the mentions to entities have to be detected.
Ranked #9 on Entity Linking on AIDA-CoNLL
In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far rather than only to the last word, so it is not straightforward to apply the scheduled sampling technique.
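A common workaround, sketched below, is a two-pass scheme: run the decoder once with the gold prefix under teacher forcing, then mix gold and model-predicted tokens position-wise and run a second pass on the mixed prefix. The `model` interface and the mixing probability are assumptions for illustration, not the exact recipe from the paper.

```python
import torch

def scheduled_sampling_two_pass(model, src, gold_tgt, mix_prob=0.25):
    # Pass 1: teacher forcing on the gold target to obtain model predictions.
    with torch.no_grad():
        first_pass_logits = model(src, gold_tgt)       # (batch, length, vocab)
        predictions = first_pass_logits.argmax(dim=-1)
    # Replace each gold token by the model's prediction with probability mix_prob.
    replace = torch.rand_like(gold_tgt, dtype=torch.float) < mix_prob
    mixed_tgt = torch.where(replace, predictions, gold_tgt)
    # Pass 2: decode from the mixed prefix; train on these logits against gold_tgt.
    return model(src, mixed_tgt)
```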
Automatic post-editing (APE) seeks to automatically refine the output of a black-box machine translation (MT) system through human post-edits.
Analogously to dual-encoder architectures, we develop a BERT-based encoder-decoder (BED) model in which a single pretrained BERT encoder receives both the source (src) and machine translation (tgt) strings.
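The key mechanism is that both strings are packed into one input, with segment embeddings telling them apart. Below is a minimal sketch of the encoder side using the Hugging Face `transformers` API; the model choice is illustrative of the idea, not the paper's exact configuration.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

src = "O gato está em cima da mesa."
mt = "The cat is on the table."

# src and mt become a single "[CLS] src [SEP] mt [SEP]" input;
# token_type_ids mark which tokens belong to which string.
inputs = tokenizer(src, mt, return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state
print(hidden_states.shape)  # (1, sequence_length, hidden_size)
```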
We present a new neural model for text summarization that first extracts sentences from a document and then compresses them.
Despite the progress made in sentence-level NMT, current systems still fall short at achieving fluent, good quality translation for a full document.
Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction.
Deep NLP models benefit from underlying structures in the data (e.g., parse trees), typically extracted using off-the-shelf parsers.
In this work, we propose the task of translating Bilingual Multi-Speaker Conversations, and explore neural architectures which exploit both source and target-side conversation histories for this task.
This paper studies Fenchel-Young losses, a generic way to construct convex loss functions from a regularization function.
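Concretely, a Fenchel-Young loss has the form $L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$. The sketch below instantiates $\Omega$ as the negative Shannon entropy, which recovers the usual cross-entropy loss; it is a small numerical check for illustration, not code from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def fy_loss_negentropy(theta, y_index):
    # With Omega = negative Shannon entropy, Omega*(theta) = logsumexp(theta)
    # and Omega(y) = 0 for a one-hot y, so the loss reduces to
    # logsumexp(theta) - theta[y_index], i.e. the cross-entropy loss.
    return logsumexp(theta) - theta[y_index]

theta = np.array([2.0, 0.5, -1.0])
print(fy_loss_negentropy(theta, y_index=0))  # equals -log softmax(theta)[0]
```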
2 code implementations • • Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs.
We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities.
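For illustration, here is a minimal NumPy sketch of the sparsemax transformation (the Euclidean projection of the score vector onto the probability simplex), using the closed-form threshold computation; it is not the paper's reference implementation.

```python
import numpy as np

def sparsemax(z):
    # Project z onto the probability simplex; unlike softmax, coordinates with
    # low scores can receive exactly zero probability.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # indices kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max      # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.2, -0.5, -2.0])
print(sparsemax(z))  # [0.9, 0.1, 0.0, 0.0]: a sparse probability vector
```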