1 code implementation • WMT (EMNLP) 2020 • Vilém Zouhar, Tereza Vojtěchová, Ondřej Bojar
For a two-phase annotation experiment, we chose Czech and English documents translated by systems submitted to the WMT20 News Translation Task.
no code implementations • 14 Nov 2024 • Julius Cheng, Maike Züfle, Vilém Zouhar, Andreas Vlachos
Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective method for improving the overall output quality.
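As a minimal illustration of the reranking recipe described above (the scorer and hypotheses below are placeholders, not the paper's actual setup):

```python
# Minimal reranking sketch: score each candidate with an external model
# and return the argmax. `score_fn` stands in for any scoring model,
# e.g. a neural quality-estimation metric.
def rerank(candidates, score_fn):
    return max(candidates, key=score_fn)

# Toy usage with a placeholder scorer (here: prefer shorter hypotheses).
hypotheses = ["Der Hund bellt .", "Der Hund , er bellt .", "Ein Hund bellt ."]
best = rerank(hypotheses, score_fn=lambda hyp: -len(hyp.split()))
print(best)
```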
1 code implementation • 27 Aug 2024 • Vilém Zouhar, Pinzhen Chen, Tsz Kin Lam, Nikita Moghe, Barry Haddow
The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality.
no code implementations • 21 Aug 2024 • Marco Cognetta, Vilém Zouhar, Naoaki Okazaki
Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training.
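For concreteness, a sketch of how subword regularization is commonly applied with the sentencepiece library, sampling a different segmentation of the same sentence at each training step (the model file name is a placeholder, and this is not necessarily the setup used in the paper):

```python
import sentencepiece as spm

# Placeholder model file; any trained unigram SentencePiece model works here.
sp = spm.SentencePieceProcessor(model_file="spm.model")

# With sampling enabled, the same sentence can be segmented differently each
# time, augmenting the corpus and exposing the model to more unique contexts.
for _ in range(3):
    print(sp.encode("subword regularization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```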
1 code implementation • 29 Jul 2024 • Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar
This is the preliminary ranking of WMT24 General MT systems based on automatic metrics.
1 code implementation • 19 Jul 2024 • Peng Cui, Vilém Zouhar, XiaoYu Zhang, Mrinmaya Sachan
However, what makes an active reading question good, what linguistic role these questions play, and what impact they have on human reading remain understudied.
1 code implementation • 18 Jun 2024 • Vilém Zouhar, Tom Kocmi, Mrinmaya Sachan
The recently adopted annotation protocol, Error Span Annotation (ESA), asks annotators to mark erroneous parts of the translation and then assign a final score.
1 code implementation • 17 Jun 2024 • Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, Mariya Shmatova
High-quality Machine Translation (MT) evaluation relies heavily on human judgments.
no code implementations • 23 Apr 2024 • Furui Cheng, Vilém Zouhar, Robin Shing Moon Chan, Daniel Fürst, Hendrik Strobelt, Mennatallah El-Assady
First, the generated textual counterfactuals should be meaningful and readable to users, so that they can be mentally compared to draw conclusions.
1 code implementation • 28 Feb 2024 • Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson
We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain.
no code implementations • 22 Feb 2024 • Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki
In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen.
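A rough sketch of this quantity under one common formulation (Rényi entropy of the unigram token distribution, normalized by the maximum entropy log |V|); the paper's exact definition and recommended order α may differ:

```python
import numpy as np

def renyi_efficiency(token_counts, alpha=2.5):
    # alpha: order of the Rényi entropy (free parameter).
    p = np.asarray(token_counts, dtype=float)
    p /= p.sum()                                   # unigram token distribution
    if np.isclose(alpha, 1.0):
        entropy = -np.sum(p * np.log(p))           # Shannon entropy (α → 1 limit)
    else:
        entropy = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
    return entropy / np.log(len(p))                # normalize by log |V|

# Toy usage: counts of each vocabulary item in a tokenized corpus.
print(renyi_efficiency([500, 300, 150, 40, 10]))
```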
1 code implementation • 14 Feb 2024 • Sankalan Pal Chowdhury, Vilém Zouhar, Mrinmaya Sachan
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation.
1 code implementation • 29 Jan 2024 • Vilém Zouhar
On the machine translation task, we explore (1) whether the choice of vocabulary plays a role in model stealing scenarios and (2) whether it is possible to extract the victim's vocabulary.
2 code implementations • 12 Jan 2024 • Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post
Ten years ago a single metric, BLEU, governed progress in machine translation research.
1 code implementation • 2 Jan 2024 • Vilém Zouhar, Ondřej Bojar
Automatic machine translation metrics typically rely on human translations to determine the quality of system translations.
1 code implementation • 28 Nov 2023 • Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar
The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good.
no code implementations • 28 Nov 2023 • Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady
To address this challenge, we propose an interactive system that helps users gain insight into the reliability of the generated text.
1 code implementation • 20 Oct 2023 • Shehzaad Dhuliawala, Vilém Zouhar, Mennatallah El-Assady, Mrinmaya Sachan
In a human-AI collaboration, users build a mental model of the AI system based on its reliability and on how it presents its decisions, e.g., its presentation of system confidence and its explanation of the output.
1 code implementation • 29 Jun 2023 • Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
Subword tokenization is a key part of many NLP pipelines.
1 code implementation • 29 Jun 2023 • Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell
Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$.
1 code implementation • 20 May 2023 • Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, Elliott Ash
Topic models are used to make sense of large text collections.
1 code implementation • 18 Apr 2023 • Janvijay Singh, Vilém Zouhar, Mrinmaya Sachan
We release the dataset of textbooks with an associated image bank to inspire further research in this area at the intersection of computer vision and NLP for education.
1 code implementation • 5 Apr 2023 • Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nathaniel Carlson, Nathaniel Robinson, Mrinmaya Sachan, David Mortensen
Mapping words into a fixed-dimensional vector space is the backbone of modern NLP.
no code implementations • 20 Mar 2023 • Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar
To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2).
1 code implementation • 21 Jan 2023 • Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jiang, Mrinmaya Sachan
Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference.
1 code implementation • 13 Oct 2022 • Sunit Bhattacharya, Vilém Zouhar, Ondřej Bojar
It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity.
1 code implementation • 4 Aug 2022 • Vilém Zouhar, Marius Mosbach, Dietrich Klakow
We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g., concatenation) to obtain a richer context representation for language modelling.
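A minimal PyTorch sketch of the general idea (dimensions, pooling of the prefix representation, and all names are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class PrefixFusionLM(nn.Module):
    """Autoregressive LSTM LM whose token embeddings are concatenated with a
    fixed-size prefix representation (e.g. pooled from a pretrained masked LM)."""
    def __init__(self, vocab_size, emb_dim=256, prefix_dim=768, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + prefix_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, prefix_vec):
        # tokens: (batch, seq_len); prefix_vec: (batch, prefix_dim)
        emb = self.embed(tokens)
        prefix = prefix_vec.unsqueeze(1).expand(-1, emb.size(1), -1)
        fused = torch.cat([emb, prefix], dim=-1)   # fusion by concatenation
        hidden, _ = self.lstm(fused)
        return self.out(hidden)                    # next-token logits

# Toy usage with random inputs.
lm = PrefixFusionLM(vocab_size=1000)
logits = lm(torch.randint(0, 1000, (2, 16)), torch.randn(2, 768))
```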
1 code implementation • 6 Apr 2022 • Sunit Bhattacharya, Věra Kloudová, Vilém Zouhar, Ondřej Bojar
We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants.
1 code implementation • SpaNLP (ACL) 2022 • Vilém Zouhar, Marius Mosbach, Miaoran Zhang, Dietrich Klakow
Finally, we show that it is possible to combine PCA with using just 1 bit per dimension.
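One plausible reading of that combination, sketched below (a sign-based 1-bit code applied after PCA; the actual procedure in the paper may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_then_1bit(embeddings, n_components=128):
    # Reduce dimensionality first, then keep only the sign of each dimension,
    # i.e. 1 bit per dimension of the reduced embedding.
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    return reduced > 0   # boolean matrix; compare vectors e.g. via Hamming distance

# Toy usage on random "sentence embeddings".
codes = pca_then_1bit(np.random.randn(1000, 768))
print(codes.shape, codes.dtype)
```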
no code implementations • AKBC Workshop CSKB 2021 • Vilém Zouhar, Marius Mosbach, Debanjali Biswas, Dietrich Klakow
Many NLP models gain performance by having access to a knowledge base.
1 code implementation • EMNLP 2021 • Vilém Zouhar, Aleš Tamchyna, Martin Popel, Ondřej Bojar
We test the natural expectation that using MT in professional translation saves human processing time.
1 code implementation • NAACL 2021 • Vilém Zouhar, Michal Novák, Matúš Žilinec, Ondřej Bojar, Mateo Obregón, Robin L. Hill, Frédéric Blain, Marina Fomicheva, Lucia Specia, Lisa Yankovskaya
Translating text into a language unknown to the text's author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility.
1 code implementation • 1 Apr 2021 • Vilém Zouhar
In most neural machine translation distillation or stealing scenarios, the goal is to preserve the performance of the target model (teacher).
no code implementations • 31 Mar 2021 • Vilém Zouhar, Daria Pylypenko
The most common tools for word alignment rely on a large number of parallel sentences, which are then usually processed according to one of the IBM model algorithms.
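For reference, a toy sketch of the simplest such algorithm (IBM Model 1, EM over lexical translation probabilities); this is purely illustrative, not the paper's method:

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    # t[(f, e)]: probability of target word f given source word e,
    # initialized (effectively uniformly) via the defaultdict.
    t = defaultdict(lambda: 1e-3)
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for src, tgt in sentence_pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)    # E-step normalization
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():            # M-step
            t[(f, e)] = c / total[e]
    return t

pairs = [(["the", "house"], ["das", "haus"]),
         (["the", "book"], ["das", "buch"])]
t = ibm_model1(pairs)
```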
1 code implementation • 25 Nov 2019 • Vilém Zouhar, Ondřej Bojar
It is not uncommon for Internet users to have to produce a text in a foreign language of which they have very little knowledge, leaving them unable to verify the translation quality.