1 code implementation • 21 Oct 2024 • Michal Novák, Barbora Dohnalová, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman
The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop.
1 code implementation • 3 Oct 2024 • Milan Straka
In this third iteration of the shared task, a novel objective is to also predict empty nodes needed for zero coreference mentions (while the empty nodes were given on input in previous years).
1 code implementation • 16 Sep 2024 • Vojtěch Vančura, Pavel Kordík, Milan Straka
In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data.
1 code implementation • 18 Jun 2024 • Milan Straka, Jana Straková
We present an open-source web service for Czech morphosyntactic analysis.
1 code implementation • 31 May 2024 • Josef Vonášek, Milan Straka, Rostislav Krč, Lenka Lasoňová, Ekaterina Egorova, Jana Straková, Jakub Náplava
We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from search engine logs of Seznam.cz.
1 code implementation • 8 Apr 2024 • Milan Straka, Jana Straková, Federica Gamba
Our system consists of a fine-tuned concatenation of base and large pre-trained LMs, with a dot-product attention head for parsing and softmax classification heads for morphology to jointly learn both dependency parsing and morphological analysis.
1 code implementation • 20 Mar 2024 • Jiří Mayer, Milan Straka, Jan Hajič jr., Pavel Pecina
(c) We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model.
1 code implementation • 24 Nov 2023 • Milan Straka
We present CorPipe, the winning entry to the CRAC 2023 Shared Task on Multilingual Coreference Resolution.
no code implementations • 15 Jun 2023 • David Kubeša, Milan Straka
The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts.
no code implementations • LREC 2022 • Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková, Jan Hajič
This paper presents an analysis of annotation using an automatic pre-annotation for a mid-level annotation complexity task: dependency syntax annotation.
1 code implementation • CRAC (ACL) 2022 • Milan Straka, Jana Straková
We describe the winning submission to the CRAC 2022 Shared Task on Multilingual Coreference Resolution.
no code implementations • 14 Jan 2022 • Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English.
1 code implementation • WNUT (ACL) 2021 • Milan Straka, Jakub Náplava, Jana Straková
We propose a character-based nonautoregressive GEC approach, with automatically generated character transformations.
1 code implementation • WNUT (ACL) 2021 • David Samuel, Milan Straka
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages.
1 code implementation • WNUT (ACL) 2021 • Jakub Náplava, Martin Popel, Milan Straka, Jana Straková
We also compare two approaches to address the performance drop: a) training the NLP models with noised data generated by our framework; and b) reducing the input noise with an external system for natural language correction.
no code implementations • 24 May 2021 • Milan Straka, Jakub Náplava, Jana Straková, David Samuel
We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data.
Ranked #1 on Semantic Parsing on PTG (czech, MRP 2020)
1 code implementation • 24 May 2021 • Jakub Náplava, Milan Straka, Jana Straková
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics.
no code implementations • 18 Feb 2021 • Milan Straka, Lucia Piatriková, Peter van Bokhoven, Ľuboš Buzna
Based on the electric vehicle (EV) arrival times and the duration of EV connection to the charging station, we identify charging patterns and derive groups of charging stations with similar charging patterns applying two approaches.
2 code implementations • 2 Nov 2020 • David Samuel, Milan Straka
PERIN was one of the winners of the shared task.
Ranked #1 on Semantic Parsing on DRG (english, MRP 2020)
1 code implementation • CONLL 2020 • David Samuel, Milan Straka
PERIN was one of the winners of the shared task.
no code implementations • 3 Jul 2020 • Kateřina Macková, Milan Straka
We report that an XLM-RoBERTa model trained on English data and evaluated on Czech achieves very competitive performance, only approximately 2 percentage points worse than a model trained on the translated Czech data.
no code implementations • LREC 2020 • Milan Straka, Jana Straková
We present our contribution to the EvaLatin shared task, which is the first evaluation campaign devoted to the evaluation of NLP tools for Latin.
no code implementations • 5 Jun 2020 • Jan Hajič, Eduard Bejček, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková
We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is - as it has always been the case for the family of the Prague Dependency Treebanks - to serve both as training data for various types of NLP tasks and for linguistically-oriented research.
no code implementations • 2 Jun 2020 • Milan Straka, Rui Carvalho, Gijs van der Poel, Ľuboš Buzna
We identified the most influential features correlated with energy consumption, indicating that the spatial context of the charging infrastructure affects its utilization pattern.
no code implementations • LREC 2020 • Jan Hajič, Eduard Bejček, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková
We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is - as it has always been the case for the family of the Prague Dependency Treebanks - to serve both as training data for various types of NLP tasks and for linguistically-oriented research.
1 code implementation • CONLL 2019 • Milan Straka, Jana Straková
We present a system description of our contribution to the CoNLL 2019 shared task, Cross-Framework Meaning Representation Parsing (MRP 2019).
no code implementations • CONLL 2019 • Stephan Oepen, Omri Abend, Jan Hajic, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, Zdenka Uresova
The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks.
1 code implementation • 24 Oct 2019 • Milan Straka, Jana Straková
We present a system description of our contribution to the CoNLL 2019 shared task, Cross-Framework Meaning Representation Parsing (MRP 2019).
no code implementations • 6 Oct 2019 • Milan Straka, Pasquale De Falco, Gabriella Ferruzzi, Daniela Proto, Gijs van der Poel, Shahab Khormali, Ľuboš Buzna
The availability of charging infrastructure is essential for large-scale adoption of electric vehicles (EV).
1 code implementation • WS 2019 • Jakub Náplava, Milan Straka
Grammatical error correction in English is a long studied problem with many existing systems and datasets.
Ranked #4 on Grammatical Error Correction on Falko-MERLIN (using extra training data)
no code implementations • WS 2019 • Jakub Náplava, Milan Straka
In this paper, we describe our systems submitted to the Building Educational Applications (BEA) 2019 Shared Task (Bryant et al., 2019).
no code implementations • 8 Sep 2019 • Milan Straka, Jana Straková, Jan Hajič
We evaluate two methods for precomputing such embeddings, BERT and Flair, on four Czech text processing tasks: part-of-speech (POS) tagging, lemmatization, dependency parsing and named entity recognition (NER).
no code implementations • 20 Aug 2019 • Milan Straka, Jana Straková, Jan Hajič
We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization, and dependency parsing.
Ranked #1 on Dependency Parsing on Universal Dependencies
no code implementations • WS 2019 • Milan Straka, Jana Straková, Jan Hajič
In the morphological analysis, our system placed a close second: our morphological analysis accuracy was 93.19, the winning system's 93.23.
1 code implementation • ACL 2019 • Jana Straková, Milan Straka, Jan Hajič
We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and also be labeled with more than one label.
Ranked #3 on Nested Mention Recognition on ACE 2005
3 code implementations • IJCNLP 2019 • Dan Kondratyuk, Milan Straka
We present UDify, a multilingual multi-task model capable of accurately predicting universal part-of-speech, morphological features, lemmas, and dependency trees simultaneously for all 124 Universal Dependencies treebanks across 75 languages.
Ranked #2 on Dependency Parsing on French GSD
1 code implementation • EMNLP 2018 • Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, Jan Hajič
We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.
no code implementations • CONLL 2018 • Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, Slav Petrov
Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets.
no code implementations • CONLL 2018 • Milan Straka
UDPipe is a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing.
Ranked #6 on Dependency Parsing on Universal Dependencies
2 code implementations • 10 Aug 2018 • Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, Jan Hajič
We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.
no code implementations • CONLL 2017 • Milan Straka, Jana Straková
A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in the latest release UD 2.0.
no code implementations • CONLL 2017 • Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonça, Tatiana Lando, Rattima Nitisaroj, Josie Li
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets.
no code implementations • WS 2017 • Natalia Klyueva, Antoine Doucet, Milan Straka
In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs).
no code implementations • LREC 2016 • Milan Straka, Jan Hajič, Jana Straková
Automatic natural language processing of large texts often presents recurring challenges in multiple languages: even for the most advanced tasks, the texts are first processed by basic processing steps, from tokenization to parsing.
no code implementations • LREC 2016 • Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra, Adéla Limburská
The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet.