Search Results for author: Marcos Zampieri

Found 90 papers, 13 papers with code

Findings of the 2021 Conference on Machine Translation (WMT21)

no code implementations WMT (EMNLP) 2021 Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-Jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, Marcos Zampieri

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

FBERT: A Neural Transformer for Identifying Offensive Content

no code implementations Findings (EMNLP) 2021 Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media.

Language Identification

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

1 code implementation 1 Sep 2021 Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, Emily Hill

We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names.

Part-Of-Speech Tagging

Handling Extreme Class Imbalance in Technical Logbook Datasets

no code implementations ACL 2021 Farhad Akhbardeh, Cecilia Ovesdotter Alm, Marcos Zampieri, Travis Desell

In this paper we focus on the problem of technical issue classification by considering logbook datasets from the automotive, aviation, and facilities maintenance domains.

Multilingual Offensive Language Identification for Low-resource Languages

no code implementations 12 May 2021 Tharindu Ranasinghe, Marcos Zampieri

We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in HASOC 2019 shared task and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) showing that our approach compares favourably to the best systems submitted to recent shared tasks on these languages.

Language Identification Transfer Learning +1

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

1 code implementation SEMEVAL 2021 Tharindu Ranasinghe, Diptanu Sarkar, Marcos Zampieri, Alexander Ororbia

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms.

Toxic Spans Detection

Domain-specific MT for Low-resource Languages: The case of Bambara-French

no code implementations 31 Mar 2021 Allahsera Auguste Tapo, Michael Leventhal, Sarah Luger, Christopher M. Homan, Marcos Zampieri

Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data.

Machine Translation Translation

Comparing Approaches to Dravidian Language Identification

no code implementations EACL (VarDial) 2021 Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.

Dialect Identification Text Classification

MUDES: Multilingual Detection of Offensive Spans

1 code implementation NAACL 2021 Tharindu Ranasinghe, Marcos Zampieri

The interest in offensive content identification in social media has grown substantially in recent years.

Predicting Lexical Complexity in English Texts

no code implementations 17 Feb 2021 Matthew Shardlow, Richard Evans, Marcos Zampieri

The first step in most text simplification is to predict which words are considered complex for a given target population before carrying out lexical substitution.

Complex Word Identification Text Simplification

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

no code implementations 1 Nov 2020 Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020.

Language Identification Transfer Learning +1

Multilingual Offensive Language Identification with Cross-lingual Embeddings

1 code implementation EMNLP 2020 Tharindu Ranasinghe, Marcos Zampieri

In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources.

Language Identification Transfer Learning +1

MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

no code implementations COLING 2020 Farhad Akhbardeh, Travis Desell, Marcos Zampieri

Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

no code implementations LREC 2020 Matthew Shardlow, Michael Cooper, Marcos Zampieri

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification.

Complex Word Identification Lexical Complexity Prediction

Evaluating Aggression Identification in Social Media

no code implementations LREC 2020 Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English.

Aggression Identification

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

no code implementations Findings (ACL) 2021 Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov

The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression.

Language Identification

Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study

no code implementations 31 Mar 2020 Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan

We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages.

Machine Translation Translation

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

1 code implementation 16 Mar 2020 Matthew Shardlow, Michael Cooper, Marcos Zampieri

With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text.

Complex Word Identification Lexical Complexity Prediction +1

Offensive Language Identification in Greek

1 code implementation LREC 2020 Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc.

Language Identification

Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation

no code implementations WS 2019 Mihaela Vela, Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Josef van Genabith

User feedback revealed that the users preferred using CATaLog Online over existing CAT tools in some respects, especially by selecting the output of the MT system and taking advantage of the color scheme for TM suggestions.

Automatic Post-Editing Translation

UDS-DFKI Submission to the WMT2019 Similar Language Translation Shared Task

no code implementations 16 Aug 2019 Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

UDS-DFKI Submission to the WMT2019 Czech-Polish Similar Language Translation Shared Task

no code implementations WS 2019 Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

A Report on the Third VarDial Evaluation Campaign

no code implementations WS 2019 Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis

Experiments in Cuneiform Language Identification

no code implementations WS 2019 Gustavo Henrique Paetzold, Marcos Zampieri

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world.

Language Identification

UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks

no code implementations SEMEVAL 2019 Gustavo Henrique Paetzold, Shervin Malmasi, Marcos Zampieri

We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

1 code implementation SEMEVAL 2019 Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Language Identification

Classifying Patent Applications with Ensemble Methods

no code implementations ALTA 2018 Fernando Benites, Shervin Malmasi, Marcos Zampieri

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications.

Classification General Classification

Classifier Ensembles for Dialect and Language Variety Identification

no code implementations 14 Aug 2018 Liviu P. Dinu, Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi

In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018.

Dialect Identification

Benchmarking Aggression Identification in Social Media

no code implementations COLING 2018 Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation.

Aggression Identification

Discriminating between Indo-Aryan Languages Using SVM Ensembles

no code implementations COLING 2018 Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Santanu Pal, Liviu P. Dinu

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi.

Language Identification

A Neural Approach to Language Variety Translation

no code implementations COLING 2018 Marta R. Costa-jussà, Marcos Zampieri, Santanu Pal

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language.

Machine Translation Translation

A Portuguese Native Language Identification Dataset

no code implementations WS 2018 Iria del Río, Marcos Zampieri, Shervin Malmasi

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing.

Language Acquisition Native Language Identification +1

Automatic Language Identification in Texts: A Survey

1 code implementation 22 Apr 2018 Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification

Challenges in Discriminating Profanity from Hate Speech

no code implementations 14 Mar 2018 Shervin Malmasi, Marcos Zampieri

In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered.

General Classification

RDF2PT: Generating Brazilian Portuguese Texts from RDF Data

1 code implementation LREC 2018 Diego Moussallem, Thiago Castro Ferreira, Marcos Zampieri, Maria Claudia Cavalcanti, Geraldo Xexéo, Mariana Neves, Axel-Cyrille Ngonga Ngomo

The generation of natural language from Resource Description Framework (RDF) data has recently gained significant attention due to the continuous growth of Linked Data.

LIDIOMS: A Multilingual Linked Idioms Data Set

1 code implementation LREC 2018 Diego Moussallem, Mohamed Ahmed Sherif, Diego Esteves, Marcos Zampieri, Axel-Cyrille Ngonga Ngomo

In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms currently containing five languages: English, German, Italian, Portuguese, and Russian.

Detecting Hate Speech in Social Media

1 code implementation RANLP 2017 Shervin Malmasi, Marcos Zampieri

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity.

General Classification

Compiling and Processing Historical and Contemporary Portuguese Corpora

no code implementations 2 Oct 2017 Marcos Zampieri

This technical report describes the framework used for processing three large Portuguese corpora.

Linguistic Features of Genre and Method Variation in Translation: A Computational Perspective

no code implementations 13 Sep 2017 Ekaterina Lapshinova-Koltunski, Marcos Zampieri

In this paper we describe the use of text classification methods to investigate genre and method variation in an English - German translation corpus.

Classification General Classification +2

Predicting the Law Area and Decisions of French Supreme Court Cases

no code implementations RANLP 2017 Octavia-Maria Sulea, Marcos Zampieri, Mihaela Vela, Josef van Genabith

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court.

General Classification Text Classification

Native Language Identification on Text and Speech

no code implementations WS 2017 Marcos Zampieri, Alina Maria Ciobanu, Liviu P. Dinu

This paper presents an ensemble system combining the output of multiple SVM classifiers for native language identification (NLI).

Native Language Identification

Including Dialects and Language Varieties in Author Profiling

no code implementations 3 Jul 2017 Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Liviu P. Dinu

This paper presents a computational approach to author profiling taking gender and language variety into account.

Arabic Dialect Identification Using iVectors and ASR Transcripts

no code implementations WS 2017 Shervin Malmasi, Marcos Zampieri

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Findings of the VarDial Evaluation Campaign 2017

no code implementations WS 2017 Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL'2017.

Dependency Parsing Dialect Identification

German Dialect Identification in Interview Transcriptions

no code implementations WS 2017 Shervin Malmasi, Marcos Zampieri

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Arabic Dialect Identification in Speech Transcripts

no code implementations WS 2016 Shervin Malmasi, Marcos Zampieri

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus.

Dialect Identification Machine Translation

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task

no code implementations WS 2016 Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial'2016 workshop at COLING'2016.

Dialect Identification General Classification

Discriminating Similar Languages: Evaluations and Explorations

no code implementations LREC 2016 Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties.

Modeling Language Change in Historical Corpora: The Case of Portuguese

no code implementations LREC 2016 Marcos Zampieri, Shervin Malmasi, Mark Dras

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification.

General Classification POS +1

CATaLog Online: Porting a Post-editing Tool to the Web

no code implementations LREC 2016 Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela, Josef van Genabith

The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper.

Machine Translation Translation

VarClass: An Open-source Language Identification Tool for Language Varieties

no code implementations LREC 2014 Marcos Zampieri, Binyam Gebre

This paper presents VarClass, an open-source tool for language identification available both to be downloaded as well as through a graphical user-friendly interface.

Information Retrieval Language Identification +2
