no code implementations • MTSummit 2021 • Ondřej Bojar, Vojtěch Srdečný, Rishu Kumar, Otakar Smrž, Felix Schneider, Barry Haddow, Phil Williams, Chiara Canton
We describe our experience with providing automatic simultaneous spoken language translation for an event with human interpreters.
no code implementations • IWSLT (EMNLP) 2018 • Tom Kocmi, Dušan Variš, Ondřej Bojar
We present our submission to the IWSLT18 Low Resource task focused on translation from Basque to English.
no code implementations • ICON 2021 • Niyati Bafna, Martin Vastl, Ondřej Bojar
Technical terms may require special handling when the target audience is bilingual, depending on the cultural and educational norms of the society in question.
no code implementations • WMT (EMNLP) 2021 • Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, Ondřej Bojar
Contrary to previous years’ editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM).
no code implementations • WMT (EMNLP) 2021 • Michael Hanna, Ondřej Bojar
BERTScore, a recently proposed automatic metric for machine translation quality, uses BERT, a large pre-trained language model, to evaluate candidate translations with respect to a gold translation.
no code implementations • WMT (EMNLP) 2021 • Petr Gebauer, Ondřej Bojar, Vojtěch Švandelík, Martin Popel
We use the latter for experiments with various backtranslation techniques.
no code implementations • WMT (EMNLP) 2021 • Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-Jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, Marcos Zampieri
This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories.
no code implementations • ACL (IWSLT) 2021 • Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, Matthew Wiesner
The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2021) featured this year four shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Multilingual speech translation, (iv) Low-resource speech translation.
no code implementations • ACL (WAT) 2021 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 8th workshop on Asian translation (WAT2021).
no code implementations • WAT 2022 • Toshiaki Nakazawa, Hideya Mino, Isao Goto, Raj Dabre, Shohei Higashiyama, Shantipriya Parida, Anoop Kunchukuttan, Makoto Morishita, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 9th workshop on Asian translation (WAT2022).
no code implementations • LREC 2022 • Anna Nedoluzhko, Muskaan Singh, Marie Hledíková, Tirthankar Ghosal, Ondřej Bojar
Our dataset, AutoMin, consists of 113 (English) and 53 (Czech) meetings, covering more than 160 hours of meeting content.
no code implementations • IWSLT (ACL) 2022 • Antonios Anastasopoulos, Loïc Barrault, Luisa Bentivogli, Marcely Zanon Boito, Ondřej Bojar, Roldano Cattoni, Anna Currey, Georgiana Dinu, Kevin Duh, Maha Elbayad, Clara Emmanuel, Yannick Estève, Marcello Federico, Christian Federmann, Souhir Gahbiche, Hongyu Gong, Roman Grundkiewicz, Barry Haddow, Benjamin Hsu, Dávid Javorský, Vĕra Kloudová, Surafel Lakew, Xutai Ma, Prashant Mathur, Paul McNamee, Kenton Murray, Maria Nǎdejde, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, John Ortega, Juan Pino, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Yogesh Virkar, Alexander Waibel, Changhan Wang, Shinji Watanabe
The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation.
no code implementations • EACL (HumEval) 2021 • Věra Kloudová, Ondřej Bojar, Martin Popel
This paper provides a quick overview of possible methods for detecting that reference translations were actually created by post-editing an MT system.
no code implementations • ACL (WAT) 2021 • Shantipriya Parida, Subhadarshi Panda, Ketan Kotwal, Amulya Ratna Dash, Satya Ranjan Dash, Yashvardhan Sharma, Petr Motlicek, Ondřej Bojar
Our submission ranks first in the English→Malayalam Multimodal translation task (text-only translation, and Malayalam caption), and second in the English→Hindi Multimodal translation task (text-only translation, and Hindi caption).
no code implementations • EAMT 2020 • Ondřej Bojar, Dominik Macháček, Sangeet Sagar, Otakar Smrž, Jonáš Kratochvíl, Ebrahim Ansari, Dario Franceschini, Chiara Canton, Ivan Simonini, Thai-Son Nguyen, Felix Schneider, Sebastian Stüker, Alex Waibel, Barry Haddow, Rico Sennrich, Philip Williams
The ELITR (European Live Translator) project aims to create a speech translation system for simultaneous subtitling of conferences and online meetings, targeting up to 43 languages.
Automatic Speech Recognition (ASR)
no code implementations • LREC (BUCC) 2022 • Borek Požár, Klára Tauchmanová, Kristýna Neumannová, Ivana Kvapilíková, Ondřej Bojar
We present our submission to the BUCC Shared Task on bilingual term alignment in comparable specialized corpora.
1 code implementation • LREC 2022 • Muskan Garg, Seema Wazarkar, Muskaan Singh, Ondřej Bojar
With the development of multimodal systems and natural language generation techniques, the resurgence of multimodal datasets has attracted significant research interest; such datasets aim to provide new information to enrich the representation of textual data.
no code implementations • IWSLT 2016 • Ondřej Bojar, Ondřej Cífka, Jindřich Helcl, Tom Kocmi, Roman Sudarikov
We present our submissions to the IWSLT 2016 machine translation task, as our first attempt to translate subtitles and one of our early experiments with neural machine translation (NMT).
no code implementations • WMT (EMNLP) 2020 • Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, Ondřej Bojar
Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics.
1 code implementation • WMT (EMNLP) 2020 • Vilém Zouhar, Tereza Vojtěchová, Ondřej Bojar
For a two-phase annotation experiment, we chose Czech and English documents translated by systems submitted to the WMT20 News Translation Task.
no code implementations • AACL (WAT) 2020 • Shantipriya Parida, Petr Motlicek, Amulya Ratna Dash, Satya Ranjan Dash, Debasish Kumar Mallick, Satya Prakash Biswal, Priyanka Pattnaik, Biranchi Narayan Nayak, Ondřej Bojar
We have participated in the English-Hindi Multimodal task and Indic task.
no code implementations • AACL (WAT) 2020 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi
This paper presents the results of the shared tasks from the 7th workshop on Asian translation (WAT2020).
no code implementations • 24 Dec 2024 • Sara Papi, Peter Polák, Ondřej Bojar, Dominik Macháček
Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker's speech, ensuring low latency for better user comprehension.
no code implementations • 7 Nov 2024 • Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Qianqian Dong, Marcello Federico, Barry Haddow, Dávid Javorský, Mateusz Krubiński, Tsz Kin Lam, Xutai Ma, Prashant Mathur, Evgeny Matusov, Chandresh Maurya, John McCrae, Kenton Murray, Satoshi Nakamura, Matteo Negri, Jan Niehues, Xing Niu, Atul Kr. Ojha, John Ortega, Sara Papi, Peter Polák, Adam Pospíšil, Pavel Pecina, Elizabeth Salesky, Nivedita Sethiya, Balaram Sarkar, Jiatong Shi, Claytone Sikasote, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Shinji Watanabe, Patrick Wilken, Petr Zemánek, Rodolfo Zevallos
This paper reports on the shared tasks organized by the 21st IWSLT Conference.
no code implementations • 17 Oct 2024 • Patrik Zavoral, Dušan Variš, Ondřej Bojar
The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length.
no code implementations • 6 Jun 2024 • Matthias Sperber, Ondřej Bojar, Barry Haddow, Dávid Javorský, Xutai Ma, Matteo Negri, Jan Niehues, Peter Polák, Elizabeth Salesky, Katsuhito Sudoh, Marco Turchi
Human evaluation is a critical component in machine translation system development and has received much attention in text translation research.
no code implementations • 22 Apr 2024 • Sunit Bhattacharya, Ondřej Bojar
In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of Large Language Models, examining its architecture, activation patterns, and processing mechanisms across languages.
1 code implementation • 31 Mar 2024 • Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel
Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing.
1 code implementation • 2 Jan 2024 • Vilém Zouhar, Ondřej Bojar
Automatic machine translation metrics typically rely on human translations to determine the quality of system translations.
1 code implementation • 28 Nov 2023 • Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar
The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good.
no code implementations • 22 Oct 2023 • Ivana Kvapilíková, Ondřej Bojar
Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge.
no code implementations • 20 Sep 2023 • Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar
Further, this method lacks mechanisms for controlling the quality vs. latency tradeoff.
no code implementations • 20 Sep 2023 • Peter Polák, Ondřej Bojar
On a diverse set of language pairs and in- and out-of-domain data, we show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
no code implementations • 11 Sep 2023 • František Kmječ, Ondřej Bojar
The tool provides a live transcript and a live meeting summary to the users, who can edit them in a collaborative manner, enabling correction of ASR errors and imperfect summary points in real time.
no code implementations • 8 Aug 2023 • Josef Jon, Ondřej Bojar
We explore the effectiveness of character-level neural machine translation using the Transformer architecture for various levels of language similarity and sizes of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
no code implementations • 7 Aug 2023 • Josef Jon, Dušan Variš, Michal Novák, João Paulo Aires, Ondřej Bojar
This paper explores negative lexical constraining in English to Czech neural machine translation.
1 code implementation • 27 Jul 2023 • Dominik Macháček, Raj Dabre, Ondřej Bojar
Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription.
1 code implementation • 31 May 2023 • Dávid Javorský, Ondřej Bojar, François Yvon
Many NLP tasks require automatically identifying the most significant words in a text.
no code implementations • 30 May 2023 • Josef Jon, Ondřej Bojar
With a combination of multiple MT metrics as the fitness function, the proposed method leads to an increase in translation quality as measured by other held-out automatic metrics.
1 code implementation • 28 May 2023 • Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi
This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language.
no code implementations • 26 May 2023 • Dominik Macháček, Peter Polák, Ondřej Bojar, Raj Dabre
Automatic speech translation is sensitive to speech recognition errors, but in a multilingual scenario, the same content may be available in various languages via simultaneous interpreting, dubbing or subtitling.
no code implementations • 20 Mar 2023 • Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar
To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2).
no code implementations • 29 Nov 2022 • Josef Jon, Martin Popel, Ondřej Bojar
We evaluate performance of MBR decoding compared to traditional mixed backtranslation training and we show a possible synergy when using both of the techniques simultaneously.
1 code implementation • 16 Nov 2022 • Dominik Macháček, Ondřej Bojar, Raj Dabre
There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET.
no code implementations • 18 Oct 2022 • Sukanta Sen, Ondřej Bojar, Barry Haddow
In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text.
1 code implementation • 13 Oct 2022 • Sunit Bhattacharya, Vilém Zouhar, Ondřej Bojar
It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity.
no code implementations • LREC 2022 • Peter Polák, Muskaan Singh, Anna Nedoluzhko, Ondřej Bojar
To facilitate the research in this area, we present ALIGNMEET, a comprehensive tool for meeting annotation, alignment, and evaluation.
no code implementations • LREC 2022 • Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello
The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
no code implementations • IWSLT (ACL) 2022 • Peter Polák, Ngoc-Quan Ngoc, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, Alexander Waibel
In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022.
1 code implementation • 6 Apr 2022 • Sunit Bhattacharya, Věra Kloudová, Vilém Zouhar, Ondřej Bojar
We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants.
no code implementations • 29 Mar 2022 • Christian Huber, Rishu Kumar, Ondřej Bojar, Alexander Waibel
In this paper, we study (a) methods to acquire important words for this memory dynamically and (b) the trade-off between improvement in recognition accuracy of new words and the potential danger of false alarms for those added words.
Automatic Speech Recognition (ASR)
no code implementations • 4 Mar 2022 • Dávid Javorský, Dominik Macháček, Ondřej Bojar
Simultaneous speech translation (SST) can be evaluated on simulated online events where human evaluators watch subtitled videos and continuously express their satisfaction by pressing buttons (so-called Continuous Rating).
no code implementations • 25 Feb 2022 • Tom Kocmi, Dominik Macháček, Ondřej Bojar
Machine translation is for us a prime example of deep learning applications where human skills and learning capabilities are taken as a benchmark that many try to match and surpass.
no code implementations • WMT (EMNLP) 2021 • Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondřej Bojar
Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms.
no code implementations • WMT (EMNLP) 2021 • Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondřej Bojar
This paper describes the Charles University submission to the Multilingual Low-Resource Translation for Indo-European Languages shared task at WMT21.
1 code implementation • EMNLP 2021 • Dušan Variš, Ondřej Bojar
We demonstrate on a simple string editing task and a machine translation task that the Transformer model performance drops significantly when facing sequences of length diverging from the length distribution in the training data.
1 code implementation • EMNLP 2021 • Vilém Zouhar, Aleš Tamchyna, Martin Popel, Ondřej Bojar
We test the natural expectation that using MT in professional translation saves human processing time.
no code implementations • 2 Sep 2021 • Peter Polák, Ondřej Bojar
End-to-end neural automatic speech recognition systems have recently achieved state-of-the-art results, but they require large datasets and extensive computing resources.
Automatic Speech Recognition (ASR)
no code implementations • ACL 2021 • Josef Jon, João Paulo Aires, Dušan Variš, Ondřej Bojar
Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases.
no code implementations • 17 Jun 2021 • Dominik Macháček, Matúš Žilinec, Ondřej Bojar
Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed.
no code implementations • ACL 2020 • Ivana Kvapilikova, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar
Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages.
1 code implementation • NAACL 2021 • Vilém Zouhar, Michal Novák, Matúš Žilinec, Ondřej Bojar, Mateo Obregón, Robin L. Hill, Frédéric Blain, Marina Fomicheva, Lucia Specia, Lisa Yankovskaya
Translating text into a language unknown to the text's author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility.
no code implementations • 17 Feb 2021 • Rudolf Rosa, Tomáš Musil, Ondřej Dušek, Dominik Jurko, Patrícia Schmidtová, David Mareček, Ondřej Bojar, Tom Kocmi, Daniel Hrbek, David Košťák, Martina Kinská, Marie Nováková, Josef Doležal, Klára Vosecká, Tomáš Studeník, Petr Žabka
We present the first version of a system for interactive generation of theatre play scripts.
no code implementations • EMNLP 2020 • Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, Marcos Zampieri
In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories.
1 code implementation • 29 Oct 2020 • Erion Çano, Ondřej Bojar
Being able to predict the length of a scientific paper may be helpful in numerous situations.
no code implementations • WMT (EMNLP) 2020 • Ivana Kvapilíková, Tom Kocmi, Ondřej Bojar
This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian.
no code implementations • 19 Oct 2020 • Dušan Variš, Ondřej Bojar
In our method, we initialize the weights of the encoder and decoder with two language models that are trained with monolingual data and then fine-tune the model on parallel data using Elastic Weight Consolidation (EWC) to avoid forgetting of the original language modeling tasks.
no code implementations • 18 Sep 2020 • Dominik Macháček, Ondřej Bojar
Furthermore, we propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and propose an improved measure for translation latency.
no code implementations • 25 Jun 2020 • Rudolf Rosa, Ondřej Dušek, Tom Kocmi, David Mareček, Tomáš Musil, Patrícia Schmidtová, Dominik Jurko, Ondřej Bojar, Daniel Hrbek, David Košťák, Martina Kinská, Josef Doležal, Klára Vosecká
We present THEaiTRE, a starting project aimed at automatic generation of theatre play scripts.
no code implementations • 23 Jun 2020 • Erion Çano, Ondřej Bojar
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process by using a human likeliness metric we define and a discrimination procedure based on large pretrained language models with their probability distributions.
no code implementations • WS 2020 • Dominik Macháček, Jonáš Kratochvíl, Sangeet Sagar, Matúš Žilinec, Ondřej Bojar, Thai-Son Nguyen, Felix Schneider, Philip Williams, Yuekun Yao
This paper is an ELITR system submission for the non-native speech translation task at IWSLT 2020.
no code implementations • 5 Jun 2020 • Erion Çano, Ondřej Bojar
Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results.
no code implementations • 11 Feb 2020 • Erion Çano, Ondřej Bojar
Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora.
1 code implementation • 25 Nov 2019 • Vilém Zouhar, Ondřej Bojar
It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of and are unable to verify the translation quality.
no code implementations • 24 Oct 2019 • Thuong-Hai Pham, Dominik Macháček, Ondřej Bojar
The data manipulation techniques, recommended in previous works, prove ineffective in large data settings.
no code implementations • 11 Oct 2019 • Erion Çano, Ondřej Bojar
In this survey, we examine various aspects of the extractive keyphrase generation methods and focus mostly on the more recent abstractive methods that are based on neural networks.
no code implementations • 8 Oct 2019 • Petra Barančíková, Ondřej Bojar
We present an introductory investigation into continuous-space vector representations of sentences.
no code implementations • EAMT 2020 • Tom Kocmi, Ondřej Bojar
To show the applicability of our method, we recycle a Transformer model trained by different researchers and use it to seed models for different language pairs.
no code implementations • 14 Sep 2019 • Erion Çano, Ondřej Bojar
Using data-driven models for solving text summarization or similar tasks has become very common in recent years.
1 code implementation • 4 Sep 2019 • Tereza Vojtěchová, Michal Novák, Miloš Klouček, Ondřej Bojar
This paper describes a machine translation test set of documents from the auditing domain and its use as one of the "test suites" in the WMT19 News Translation Task for translation directions involving Czech, English and German.
1 code implementation • 8 Aug 2019 • Kateřina Rysová, Magdaléna Rysová, Tomáš Musil, Lucie Poláková, Ondřej Bojar
As the quality of machine translation rises and neural machine translation (NMT) is moving from sentence to document level translations, it is becoming increasingly difficult to evaluate the output of translation systems.
no code implementations • 2 Aug 2019 • Dominik Macháček, Jonáš Kratochvíl, Tereza Vojtěchová, Ondřej Bojar
We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages.
Automatic Speech Recognition (ASR)
no code implementations • WS 2019 • Martin Popel, Dominik Macháček, Michal Auersperger, Ondřej Bojar, Pavel Pecina
We describe our NMT systems submitted to the WMT19 shared task in English-Czech news translation.
no code implementations • 29 Jul 2019 • Ivana Kvapilíková, Dominik Macháček, Ondřej Bojar
In this paper we describe the CUNI translation system used for the unsupervised news shared task of the ACL 2019 Fourth Conference on Machine Translation (WMT19).
no code implementations • 21 Jul 2019 • Shantipriya Parida, Ondřej Bojar, Satya Ranjan Dash
We present "Hindi Visual Genome", a multimodal dataset consisting of text and images suitable for the English-Hindi multimodal machine translation task and multimodal research.
no code implementations • 29 Mar 2019 • Erion Çano, Ondřej Bojar
Most of the proposed supervised and unsupervised methods for keyphrase generation are unable to produce terms that are valuable but do not appear in the text.
no code implementations • 9 Jan 2019 • Erion Çano, Ondřej Bojar
In the area of online communication, commerce and transactions, analyzing sentiment polarity of texts written in various natural languages has become crucial.
no code implementations • WS 2018 • Tom Kocmi, Ondřej Bojar
We present a simple transfer learning method, where we first train a "parent" model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus.
Low-Resource Neural Machine Translation
1 code implementation • 18 Jun 2018 • Tom Kocmi, Ondřej Bojar
Skip-gram (word2vec) is a recent method for creating vector representations of words ("distributed word representations") using a neural network.
1 code implementation • 14 Jun 2018 • Dominik Macháček, Jonáš Vidra, Ondřej Bojar
The state of the art of handling rich morphology in neural machine translation (NMT) is to break word forms into subword units, so that the overall vocabulary size of these units fits the practical limits given by the NMT model and GPU memory capacity.
no code implementations • 16 May 2018 • Ondřej Cífka, Ondřej Bojar
One possible way of obtaining continuous-space sentence representations is by training neural machine translation (NMT) systems.
no code implementations • 27 Apr 2018 • Jakub Kúdela, Irena Holubová, Ondřej Bojar
Most of the current methods for mining parallel texts from the web assume that web pages of web sites share the same structure across languages.
4 code implementations • 1 Apr 2018 • Martin Popel, Ondřej Bojar
This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017).
no code implementations • WS 2017 • Tom Kocmi, Ondřej Bojar
We support this hypothesis by observing the performance in learning lexical relations and by the fact that the network can learn to perform reasonably in its task even with fixed random embeddings.
1 code implementation • MTSummit 2017 • Matīss Rikters, Ondřej Bojar
Processing of multi-word expressions (MWEs) is a known problem for any natural language processing task.
1 code implementation • EACL 2017 • Tom Kocmi, Ondřej Bojar
In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text.
no code implementations • ACL 2016 • Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, Marcin Junczys-Dowmunt
Discriminative translation models utilizing source context have been shown to help statistical machine translation performance.
no code implementations • WS 2016 • Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Pavel Pecina, Ondřej Bojar
Neural sequence to sequence learning recently became a very promising paradigm in machine translation, achieving competitive results with statistical phrase-based systems.