Search Results for author: Ondřej Bojar

Found 95 papers, 21 papers with code

UFAL Submissions to the IWSLT 2016 MT Track

no code implementations IWSLT 2016 Ondřej Bojar, Ondřej Cífka, Jindřich Helcl, Tom Kocmi, Roman Sudarikov

We present our submissions to the IWSLT 2016 machine translation task, as our first attempt to translate subtitles and one of our early experiments with neural machine translation (NMT).

Machine Translation NMT +1

Detecting Post-Edited References and Their Effect on Human Evaluation

no code implementations EACL (HumEval) 2021 Věra Kloudová, Ondřej Bojar, Martin Popel

This paper provides a quick overview of possible methods for detecting that reference translations were actually created by post-editing an MT system.

NLPHut’s Participation at WAT2021

no code implementations ACL (WAT) 2021 Shantipriya Parida, Subhadarshi Panda, Ketan Kotwal, Amulya Ratna Dash, Satya Ranjan Dash, Yashvardhan Sharma, Petr Motlicek, Ondřej Bojar

Our submission tops in English→Malayalam Multimodal translation task (text-only translation, and Malayalam caption), and ranks second-best in English→Hindi Multimodal translation task (text-only translation, and Hindi caption).

Image Captioning Translation

FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN

no code implementations ACL (IWSLT) 2021 Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, Matthew Wiesner

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2021) featured this year four shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Multilingual speech translation, (iv) Low-resource speech translation.

Translation

Constrained Decoding for Technical Term Retention in English-Hindi MT

no code implementations ICON 2021 Niyati Bafna, Martin Vastl, Ondřej Bojar

Technical terms may require special handling when the target audience is bilingual, depending on the cultural and educational norms of the society in question.

Machine Translation Sentence +1

CUNI Basque-to-English Submission in IWSLT18

no code implementations IWSLT (EMNLP) 2018 Tom Kocmi, Dušan Variš, Ondřej Bojar

We present our submission to the IWSLT18 Low Resource task focused on the translation from Basque-to-English.

Transfer Learning Translation

A Fine-Grained Analysis of BERTScore

no code implementations WMT (EMNLP) 2021 Michael Hanna, Ondřej Bojar

BERTScore, a recently proposed automatic metric for machine translation quality, uses BERT, a large pre-trained language model to evaluate candidate translations with respect to a gold translation.

Language Modelling Machine Translation +4

Findings of the 2021 Conference on Machine Translation (WMT21)

no code implementations WMT (EMNLP) 2021 Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-Jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, Marcos Zampieri

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers

1 code implementation LREC 2022 Muskan Garg, Seema Wazarkar, Muskaan Singh, Ondřej Bojar

With the development of multimodal systems and natural language generation techniques, the resurgence of multimodal datasets has attracted significant research interests, which aims to provide new information to enrich the representation of textual data.

Text Generation

Results of the WMT20 Metrics Shared Task

no code implementations WMT (EMNLP) 2020 Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, Ondřej Bojar

Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics.

Translation

WMT20 Document-Level Markable Error Exploration

1 code implementation WMT (EMNLP) 2020 Vilém Zouhar, Tereza Vojtěchová, Ondřej Bojar

For an annotation experiment of two phases, we chose Czech and English documents translated by systems submitted to WMT20 News Translation Task.

Machine Translation Sentence +1

Findings of the IWSLT 2022 Evaluation Campaign

no code implementations IWSLT (ACL) 2022 Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, Shinji Watanabe

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation.

Speech-to-Speech Translation Translation

Quality and Quantity of Machine Translation References for Automatic Metrics

no code implementations 2 Jan 2024 Vilém Zouhar, Ondřej Bojar

Automatic machine translation metrics typically rely on human translations to determine the quality of system translations.

Machine Translation Translation

Evaluating Optimal Reference Translations

1 code implementation 28 Nov 2023 Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar

The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good.

Machine Translation Translation

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

no code implementations 22 Oct 2023 Ivana Kvapilíková, Ondřej Bojar

Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge.

Language Modelling Sentence +2

Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

no code implementations 20 Sep 2023 Peter Polák, Ondřej Bojar

On a diverse set of language pairs and in- and out-of-domain data, we show that the proposed approach achieves state-of-the-art quality at no additional computational cost.

Segmentation Translation

Minuteman: Machine and Human Joining Forces in Meeting Summarization

no code implementations 11 Sep 2023 František Kmječ, Ondřej Bojar

The tool provides a live transcript and a live meeting summary to the users, who can edit them in a collaborative manner, enabling correction of ASR errors and imperfect summary points in real time.

Meeting Summarization speech-recognition +1

Character-level NMT and language similarity

no code implementations 8 Aug 2023 Josef Jon, Ondřej Bojar

We explore the effectiveness of character-level neural machine translation using Transformer architecture for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.

Machine Translation NMT +2

Turning Whisper into Real-Time Transcription System

1 code implementation 27 Jul 2023 Dominik Macháček, Raj Dabre, Ondřej Bojar

Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription.

speech-recognition Speech Recognition +1

Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation

no code implementations 30 May 2023 Josef Jon, Ondřej Bojar

With a combination of multiple MT metrics as the fitness function, the proposed method leads to an increase in translation quality as measured by other held-out automatic metrics.

Machine Translation Translation

Robustness of Multi-Source MT to Transcription Errors

no code implementations 26 May 2023 Dominik Macháček, Peter Polák, Ondřej Bojar, Raj Dabre

Automatic speech translation is sensitive to speech recognition errors, but in a multilingual scenario, the same content may be available in various languages via simultaneous interpreting, dubbing or subtitling.

Machine Translation speech-recognition +2

Multimodal Shannon Game with Images

no code implementations 20 Mar 2023 Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar

To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2).

Language Modelling Sentence

CUNI Submission in WMT22 General Task

no code implementations 29 Nov 2022 Josef Jon, Martin Popel, Ondřej Bojar

We evaluate performance of MBR decoding compared to traditional mixed backtranslation training and we show a possible synergy when using both of the techniques simultaneously.

Translation

MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

1 code implementation 16 Nov 2022 Dominik Macháček, Ondřej Bojar, Raj Dabre

There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET.

Machine Translation Translation

Simultaneous Translation for Unsegmented Input: A Sliding Window Approach

no code implementations 18 Oct 2022 Sukanta Sen, Ondřej Bojar, Barry Haddow

In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text.

Sentence Translation

Sentence Ambiguity, Grammaticality and Complexity Probes

1 code implementation 13 Oct 2022 Sunit Bhattacharya, Vilém Zouhar, Ondřej Bojar

It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity.

Sentence Sentence Ambiguity

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

no code implementations LREC 2022 Peter Polák, Muskaan Singh, Anna Nedoluzhko, Ondřej Bojar

To facilitate the research in this area, we present ALIGNMEET, a comprehensive tool for meeting annotation, alignment, and evaluation.

Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

no code implementations LREC 2022 Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello

The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.

Machine Translation Translation

EMMT: A simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios

1 code implementation 6 Apr 2022 Sunit Bhattacharya, Věra Kloudová, Vilém Zouhar, Ondřej Bojar

We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants.

EEG Electroencephalogram (EEG) +2

Short-Term Word-Learning in a Dynamically Changing Environment

no code implementations 29 Mar 2022 Christian Huber, Rishu Kumar, Ondřej Bojar, Alexander Waibel

In this paper we study, a) methods to acquire important words for this memory dynamically and, b) the trade-off between improvement in recognition accuracy of new words and the potential danger of false alarms for those added words.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Comprehension of Subtitles from Re-Translating Simultaneous Speech Translation

no code implementations 4 Mar 2022 Dávid Javorský, Dominik Macháček, Ondřej Bojar

Our results show that the subtitling layout or flicker have little effect on comprehension, in contrast to machine translation itself and individual competence.

Machine Translation Translation

The Reality of Multi-Lingual Machine Translation

no code implementations 25 Feb 2022 Tom Kocmi, Dominik Macháček, Ondřej Bojar

Machine translation is for us a prime example of deep learning applications where human skills and learning capabilities are taken as a benchmark that many try to match and surpass.

Cross-Lingual Transfer Machine Translation +2

CUNI systems for WMT21: Terminology translation Shared Task

no code implementations WMT (EMNLP) 2021 Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondřej Bojar

Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms.

Sentence Translation

Sequence Length is a Domain: Length-based Overfitting in Transformer Models

1 code implementation EMNLP 2021 Dušan Variš, Ondřej Bojar

We demonstrate on a simple string editing task and a machine translation task that the Transformer model performance drops significantly when facing sequences of length diverging from the length distribution in the training data.

L2 Regularization Machine Translation +1

Coarse-To-Fine And Cross-Lingual ASR Transfer

no code implementations 2 Sep 2021 Peter Polák, Ondřej Bojar

End-to-end neural automatic speech recognition systems achieved recently state-of-the-art results, but they require large datasets and extensive computing resources.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages

no code implementations ACL 2021 Josef Jon, João Paulo Aires, Dušan Variš, Ondřej Bojar

Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases.

Machine Translation Sentence +1

Lost in Interpreting: Speech Translation from Source or Interpreter?

no code implementations 17 Jun 2021 Dominik Macháček, Matúš Žilinec, Ondřej Bojar

Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed.

Machine Translation Translation

Backtranslation Feedback Improves User Confidence in MT, Not Quality

1 code implementation NAACL 2021 Vilém Zouhar, Michal Novák, Matúš Žilinec, Ondřej Bojar, Mateo Obregón, Robin L. Hill, Frédéric Blain, Marina Fomicheva, Lucia Specia, Lisa Yankovskaya

Translating text into a language unknown to the text's author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility.

Machine Translation Translation

How Many Pages? Paper Length Prediction from the Metadata

1 code implementation 29 Oct 2020 Erion Çano, Ondřej Bojar

Being able to predict the length of a scientific paper may be helpful in numerous situations.

BIG-bench Machine Learning regression

CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20

no code implementations WMT (EMNLP) 2020 Ivana Kvapilíková, Tom Kocmi, Ondřej Bojar

This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian.

Machine Translation Transfer Learning +1

Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation

no code implementations 19 Oct 2020 Dušan Variš, Ondřej Bojar

In our method, we initialize the weights of the encoder and decoder with two language models that are trained with monolingual data and then fine-tune the model on parallel data using Elastic Weight Consolidation (EWC) to avoid forgetting of the original language modeling tasks.

Language Modelling Machine Translation +2

Presenting Simultaneous Translation in Limited Space

no code implementations 18 Sep 2020 Dominik Macháček, Ondřej Bojar

Furthermore, we propose a way how to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and propose an improved measure for translation latency.

Translation

Automating Text Naturalness Evaluation of NLG Systems

no code implementations 23 Jun 2020 Erion Çano, Ondřej Bojar

Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process by using a human likeliness metric we define and a discrimination procedure based on large pretrained language models with their probability distributions.

Text Generation

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts

no code implementations 5 Jun 2020 Erion Çano, Ondřej Bojar

Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results.

Text Generation

Two Huge Title and Keyword Generation Corpora of Research Articles

no code implementations 11 Feb 2020 Erion Çano, Ondřej Bojar

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora.

Text Summarization Vocal Bursts Valence Prediction

Outbound Translation User Interface Ptakopet: A Pilot Study

1 code implementation 25 Nov 2019 Vilém Zouhar, Ondřej Bojar

It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of and are unable to verify the translation quality.

Translation

Keyphrase Generation: A Multi-Aspect Survey

no code implementations 11 Oct 2019 Erion Çano, Ondřej Bojar

In this survey, we examine various aspects of the extractive keyphrase generation methods and focus mostly on the more recent abstractive methods that are based on neural networks.

Keyphrase Generation Text Summarization

In Search for Linear Relations in Sentence Embedding Spaces

no code implementations 8 Oct 2019 Petra Barančíková, Ondřej Bojar

We present an introductory investigation into continuous-space vector representations of sentences.

Natural Language Inference Sentence +2

Efficiently Reusing Old Models Across Languages via Transfer Learning

no code implementations EAMT 2020 Tom Kocmi, Ondřej Bojar

To show the applicability of our method, we recycle a Transformer model trained by different researchers and use it to seed models for different language pairs.

Machine Translation NMT +2

Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study

no code implementations 14 Sep 2019 Erion Çano, Ondřej Bojar

Using data-driven models for solving text summarization or similar tasks has become very common in the last years.

Text Summarization

SAO WMT19 Test Suite: Machine Translation of Audit Reports

1 code implementation 4 Sep 2019 Tereza Vojtěchová, Michal Novák, Miloš Klouček, Ondřej Bojar

This paper describes a machine translation test set of documents from the auditing domain and its use as one of the "test suites" in the WMT19 News Translation Task for translation directions involving Czech, English and German.

Machine Translation Translation

A Test Suite and Manual Evaluation of Document-Level NMT at WMT19

1 code implementation 8 Aug 2019 Kateřina Rysová, Magdaléna Rysová, Tomáš Musil, Lucie Poláková, Ondřej Bojar

As the quality of machine translation rises and neural machine translation (NMT) is moving from sentence to document level translations, it is becoming increasingly difficult to evaluate the output of translation systems.

Machine Translation NMT +2

CUNI Systems for the Unsupervised News Translation Task in WMT 2019

no code implementations 29 Jul 2019 Ivana Kvapilíková, Dominik Macháček, Ondřej Bojar

In this paper we describe the CUNI translation system used for the unsupervised news shared task of the ACL 2019 Fourth Conference on Machine Translation (WMT19).

Machine Translation Translation

Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation

no code implementations 21 Jul 2019 Shantipriya Parida, Ondřej Bojar, Satya Ranjan Dash

We present "Hindi Visual Genome", a multimodal dataset consisting of text and images suitable for English-Hindi multimodal machine translation task and multimodal research.

Multimodal Machine Translation Translation

Keyphrase Generation: A Text Summarization Struggle

no code implementations 29 Mar 2019 Erion Çano, Ondřej Bojar

Most of the proposed supervised and unsupervised methods for keyphrase generation are unable to produce terms that are valuable but do not appear in the text.

Keyphrase Generation Text Summarization

Sentiment Analysis of Czech Texts: An Algorithmic Survey

no code implementations 9 Jan 2019 Erion Çano, Ondřej Bojar

In the area of online communication, commerce and transactions, analyzing sentiment polarity of texts written in various natural languages has become crucial.

Sentiment Analysis

Trivial Transfer Learning for Low-Resource Neural Machine Translation

no code implementations WS 2018 Tom Kocmi, Ondřej Bojar

We present a simple transfer learning method, where we first train a "parent" model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus.

Low-Resource Neural Machine Translation Transfer Learning +1

SubGram: Extending Skip-gram Word Representation with Substrings

1 code implementation 18 Jun 2018 Tom Kocmi, Ondřej Bojar

Skip-gram (word2vec) is a recent method for creating vector representations of words ("distributed word representations") using a neural network.

Morphological and Language-Agnostic Word Segmentation for NMT

1 code implementation 14 Jun 2018 Dominik Macháček, Jonáš Vidra, Ondřej Bojar

The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity.

Machine Translation NMT +1

Are BLEU and Meaning Representation in Opposition?

no code implementations 16 May 2018 Ondřej Cífka, Ondřej Bojar

One of possible ways of obtaining continuous-space sentence representations is by training neural machine translation (NMT) systems.

General Classification Machine Translation +3

Extracting Parallel Paragraphs from Common Crawl

no code implementations 27 Apr 2018 Jakub Kúdela, Irena Holubová, Ondřej Bojar

Most of the current methods for mining parallel texts from the web assume that web pages of web sites share same structure across languages.

Training Tips for the Transformer Model

4 code implementations 1 Apr 2018 Martin Popel, Ondřej Bojar

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017).

Machine Translation Sentence +1

An Exploration of Word Embedding Initialization in Deep-Learning Tasks

no code implementations WS 2017 Tom Kocmi, Ondřej Bojar

We support this hypothesis by observing the performance in learning lexical relations and by the fact that the network can learn to perform reasonably in its task even with fixed random embeddings.

Word Embeddings

LanideNN: Multilingual Language Identification on Character Window

1 code implementation EACL 2017 Tom Kocmi, Ondřej Bojar

In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text.

Language Identification

CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks

no code implementations WS 2016 Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Pavel Pecina, Ondřej Bojar

Neural sequence to sequence learning recently became a very promising paradigm in machine translation, achieving competitive results with statistical phrase-based systems.

Automatic Post-Editing Multimodal Machine Translation +1
