Search Results for author: Marcos Zampieri

Found 117 papers, 20 papers with code

A Report on the VarDial Evaluation Campaign 2020

no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

Transfer Learning Methods for Domain Adaptation in Technical Logbook Datasets

no code implementations • LREC 2022 • Farhad Akhbardeh, Marcos Zampieri, Cecilia Ovesdotter Alm, Travis Desell

Event identification in technical logbooks poses challenges given the limited logbook data available in specific technical domains, the large set of possible classes, and logbook entries typically being in short form and non-standard technical language.

Domain Adaptation Transfer Learning

Findings of the VarDial Evaluation Campaign 2021

no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.

Dialect Identification

Neural Machine Translation for Similar Languages: The Case of Indo-Aryan Languages

no code implementations • WMT (EMNLP) 2020 • Santanu Pal, Marcos Zampieri

In this paper we present the WIPRO-RIT systems submitted to the Similar Language Translation shared task at WMT 2020.

Machine Translation Translation

An Evaluation of Binary Comparative Lexical Complexity Models

no code implementations • NAACL (BEA) 2022 • Kai North, Marcos Zampieri, Matthew Shardlow

Identifying complex words in texts is an important first step in text simplification (TS) systems.

Lexical Complexity Prediction Sentence +1

A Computational Exploration of Pejorative Language in Social Media

no code implementations • Findings (EMNLP) 2021 • Liviu P. Dinu, Ioan-Bogdan Iordache, Ana Sabina Uban, Marcos Zampieri

In this paper we study pejorative language, an under-explored topic in computational linguistics.

Word Sense Disambiguation

Findings of the 2021 Conference on Machine Translation (WMT21)

no code implementations • WMT (EMNLP) 2021 • Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussà, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, Marcos Zampieri

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

Classifying Human-Generated and AI-Generated Election Claims in Social Media

no code implementations • 24 Apr 2024 • Alphaeus Dmonte, Marcos Zampieri, Kevin Lybarger, Massimiliano Albanese, Genya Coulter

In this paper, we present a novel taxonomy for characterizing election-related claims.

Misinformation

A Federated Learning Approach to Privacy Preserving Offensive Language Identification

no code implementations • 17 Apr 2024 • Marcos Zampieri, Damith Premasiri, Tharindu Ranasinghe

Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification.

Federated Learning Language Identification +1

CSEPrompts: A Benchmark of Introductory Computer Science Prompts

no code implementations • 3 Apr 2024 • Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri

Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters.

Multiple-choice

MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts

no code implementations • 22 Mar 2024 • Md Nishat Raihan, Dhiman Goswami, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Amrita Ganguly, Marcos Zampieri

Our paper presents team MasonTigers' submission to SemEval-2024 Task 9, which provides a dataset of puzzles for testing natural language understanding.

Natural Language Understanding Sentence

MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness

no code implementations • 22 Mar 2024 • Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Al Nahian Bin Emran, Amrita Ganguly, Marcos Zampieri

This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness.

Sentence

MultiLS: A Multi-task Lexical Simplification Framework

no code implementations • 22 Feb 2024 • Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset.

Lexical Complexity Prediction Lexical Simplification +1

MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles

no code implementations • 3 Feb 2024 • Amrita Ganguly, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Dhiman Goswami, Marcos Zampieri

The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities.

Event Detection

Health Text Simplification: An Annotated Corpus for Digestive Cancer Education and Novel Strategies for Reinforcement Learning

no code implementations • 26 Jan 2024 • Md Mushfiqur Rahman, Mohammad Sabik Irbaz, Kai North, Michelle S. Williams, Marcos Zampieri, Kevin Lybarger

Our innovative RLHF reward function surpassed existing RL text simplification reward functions in effectiveness.

Domain Adaptation Language Modelling +4

A Text-to-Text Model for Multilingual Offensive Language Identification

no code implementations • 6 Dec 2023 • Tharindu Ranasinghe, Marcos Zampieri

Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish).

Decoder Language Identification +1

Offensive Language Identification in Transliterated and Code-Mixed Bangla

no code implementations • 25 Nov 2023 • Md Nishat Raihan, Umma Hani Tanmoy, Anika Binte Islam, Kai North, Tharindu Ranasinghe, Antonios Anastasopoulos, Marcos Zampieri

Identifying offensive content in social media is vital for creating safe online communities.

Language Identification

nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla

no code implementations • 25 Nov 2023 • Md Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP.

Text Detection Translation

nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis

no code implementations • 25 Nov 2023 • Dhiman Goswami, Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP.

Data Augmentation Sentiment Analysis +2

OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

1 code implementation • 27 Oct 2023 • Dhiman Goswami, Md Nishat Raihan, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.

Language Identification

SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis

1 code implementation • 27 Oct 2023 • Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.

Sentiment Analysis

Findings of the VarDial Evaluation Campaign 2023

no code implementations • 31 May 2023 • Noëmi Aepli, Çağrı Çöltekin, Rob van der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023.

Intent Detection

Deep Learning Approaches to Lexical Simplification: A Survey

no code implementations • 19 May 2023 • Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

To reflect these recent advances, we present a comprehensive survey of papers published between 2017 and 2023 on LS and its sub-tasks with a special focus on deep learning.

Lexical Simplification Sentence +1

Lexical Complexity Prediction: An Overview

no code implementations • 8 Mar 2023 • Kai North, Marcos Zampieri, Matthew Shardlow

Finally, we include brief sections on applications of lexical complexity prediction, such as readability and text simplification, together with related studies on languages other than English.

Lexical Complexity Prediction Reading Comprehension +1

Language Variety Identification with True Labels

1 code implementation • 2 Mar 2023 • Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera

Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety.

Language Identification

Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

no code implementations • 6 Feb 2023 • Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng SHEANG, Matthew Shardlow, Kai North, Marcos Zampieri

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability TSAR-2022 held in conjunction with EMNLP 2022.

Lexical Simplification Text Simplification

Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive

2 code implementations • 29 Jan 2023 • Tharindu Cyril Weerasooriya, Sujan Dutta, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh

For (2), we introduce a first-of-its-kind dataset of vicarious offense.

Language Modelling Large Language Model +1

SOLD: Sinhala Offensive Language Dataset

1 code implementation • 1 Dec 2022 • Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri

SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models.

Language Identification Sentence

Predicting the Type and Target of Offensive Social Media Posts in Marathi

1 code implementation • 22 Nov 2022 • Marcos Zampieri, Tharindu Ranasinghe, Mrinal Chaudhari, Saurabh Gaikwad, Prajwal Krishna, Mayuresh Nene, Shrunali Paygude

We introduce the Marathi Offensive Language Dataset v.2.0 or MOLD 2.0 and present multiple experiments on this dataset.

Language Identification

Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

no code implementations • 18 Nov 2022 • Tharindu Ranasinghe, Kai North, Damith Premasiri, Marcos Zampieri

The widespread presence of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically.

Language Identification

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

no code implementations • COLING 2022 • Kai North, Marcos Zampieri, Tharindu Ranasinghe

To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words.

Lexical Simplification XLM-R

Lexical Simplification Benchmarks for English, Portuguese, and Spanish

2 code implementations • 12 Sep 2022 • Sanja Štajner, Daniel Ferrés, Matthew Shardlow, Kai North, Marcos Zampieri, Horacio Saggion

To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with differing architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performances on our new dataset.

Lexical Simplification

Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

no code implementations • 17 Dec 2021 • Thomas Mandl, Sandip Modha, Gautam Kishore Shahi, Hiren Madhu, Shrey Satapara, Prasenjit Majumder, Johannes Schaefer, Tharindu Ranasinghe, Marcos Zampieri, Durgesh Nandini, Amit Kumar Jaiswal

This paper presents the HASOC subtrack for English, Hindi, and Marathi.

Binary Classification Classification

FBERT: A Neural Transformer for Identifying Offensive Content

no code implementations • Findings (EMNLP) 2021 • Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media.

Language Identification XLM-R

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

1 code implementation • RANLP 2021 • Saurabh Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically.

Language Identification Transfer Learning

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

1 code implementation • 1 Sep 2021 • Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, Emily Hill

We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names.

Attribute Part-Of-Speech Tagging

Handling Extreme Class Imbalance in Technical Logbook Datasets

no code implementations • ACL 2021 • Farhad Akhbardeh, Cecilia Ovesdotter Alm, Marcos Zampieri, Travis Desell

In this paper we focus on the problem of technical issue classification by considering logbook datasets from the automotive, aviation, and facilities maintenance domains.

WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments

no code implementations • GermEval 2021 • Skye Morgan, Tharindu Ranasinghe, Marcos Zampieri

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media.

SemEval-2021 Task 1: Lexical Complexity Prediction

no code implementations • SEMEVAL 2021 • Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, Marcos Zampieri

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction.

Lexical Complexity Prediction Task 2

An Exploratory Analysis of the Relation Between Offensive Language and Mental Health

no code implementations • Findings (ACL) 2021 • Ana-Maria Bucur, Marcos Zampieri, Liviu P. Dinu

In this paper, we analyze the interplay between the use of offensive language and mental health.

Depression Detection Language Identification +1

LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction

no code implementations • SEMEVAL 2021 • Abhinandan Desai, Kai North, Marcos Zampieri, Christopher M. Homan

This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP).

Lexical Complexity Prediction POS +1

Multilingual Offensive Language Identification for Low-resource Languages

no code implementations • 12 May 2021 • Tharindu Ranasinghe, Marcos Zampieri

We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in the HASOC 2019 shared task and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval), showing that our approach compares favourably to the best systems submitted to recent shared tasks on these three languages.

Language Identification Transfer Learning +1

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

1 code implementation • SEMEVAL 2021 • Tharindu Ranasinghe, Diptanu Sarkar, Marcos Zampieri, Alexander Ororbia

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms.

Toxic Spans Detection

Domain-specific MT for Low-resource Languages: The case of Bambara-French

no code implementations • 31 Mar 2021 • Allahsera Auguste Tapo, Michael Leventhal, Sarah Luger, Christopher M. Homan, Marcos Zampieri

Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data.

BIG-bench Machine Learning Machine Translation +1

Comparing Approaches to Dravidian Language Identification

no code implementations • EACL (VarDial) 2021 • Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.

Dialect Identification text-classification +1

MUDES: Multilingual Detection of Offensive Spans

1 code implementation • NAACL 2021 • Tharindu Ranasinghe, Marcos Zampieri

The interest in offensive content identification in social media has grown substantially in recent years.

Predicting Lexical Complexity in English Texts: The CompLex 2.0 Dataset

no code implementations • 17 Feb 2021 • Matthew Shardlow, Richard Evans, Marcos Zampieri

We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0.

Complex Word Identification Lexical Complexity Prediction +1

NLP Tools for Predictive Maintenance Records in MaintNet

no code implementations • Asian Chapter of the Association for Computational Linguistics 2020 • Farhad Akhbardeh, Travis Desell, Marcos Zampieri

Processing maintenance logbook records is an important step in the development of predictive maintenance systems.

Clustering POS +1

Findings of the 2020 Conference on Machine Translation (WMT20)

no code implementations • EMNLP 2020 • Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, Marcos Zampieri

In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara

no code implementations • loresmt (AACL) 2020 • Allahsera Auguste Tapo, Bakary Coulibaly, Sébastien Diarra, Christopher Homan, Julia Kreutzer, Sarah Luger, Arthur Nagashima, Marcos Zampieri, Michael Leventhal

Low-resource languages present unique challenges to (neural) machine translation.

Machine Translation Translation

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

no code implementations • 1 Nov 2020 • Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020.

Language Identification Transfer Learning +1

Multilingual Offensive Language Identification with Cross-lingual Embeddings

1 code implementation • EMNLP 2020 • Tharindu Ranasinghe, Marcos Zampieri

In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with less resources.

Language Identification Transfer Learning +1

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

no code implementations • SEMEVAL 2020 • Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, Çağrı Çöltekin

We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020).

Abusive Language Hate Speech Detection

MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

no code implementations • COLING 2020 • Farhad Akhbardeh, Travis Desell, Marcos Zampieri

Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.

Clustering

Evaluating Aggression Identification in Social Media

no code implementations • LREC 2020 • Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English.

Aggression Identification

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

no code implementations • LREC 2020 • Matthew Shardlow, Michael Cooper, Marcos Zampieri

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification.

Binary Classification Complex Word Identification +1

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

no code implementations • Findings (ACL) 2021 • Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov

The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression.

Language Identification

Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study

no code implementations • 31 Mar 2020 • Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan

We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages.

BIG-bench Machine Learning Machine Translation +1

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

1 code implementation • 16 Mar 2020 • Matthew Shardlow, Michael Cooper, Marcos Zampieri

With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text.

Binary Classification Complex Word Identification +2

Offensive Language Identification in Greek

1 code implementation • LREC 2020 • Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc.

Language Identification

UDS-DFKI Submission to the WMT2019 Similar Language Translation Shared Task

no code implementations • 16 Aug 2019 • Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation

no code implementations • WS 2019 • Mihaela Vela, Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Josef van Genabith

User feedback revealed that the users preferred using CATaLog Online over existing CAT tools in some respects, especially by selecting the output of the MT system and taking advantage of the color scheme for TM suggestions.

Automatic Post-Editing Management +1

Findings of the 2019 Conference on Machine Translation (WMT19)

no code implementations • WS 2019 • Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, Marcos Zampieri

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.

Machine Translation Translation

UDS-DFKI Submission to the WMT2019 Czech-Polish Similar Language Translation Shared Task

no code implementations • WS 2019 • Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

A Report on the Third VarDial Evaluation Campaign

no code implementations • WS 2019 • Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis

Experiments in Cuneiform Language Identification

no code implementations • WS 2019 • Gustavo Henrique Paetzold, Marcos Zampieri

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world.

Language Identification

UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks

no code implementations • SEMEVAL 2019 • Gustavo Henrique Paetzold, Shervin Malmasi, Marcos Zampieri

We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

2 code implementations • SEMEVAL 2019 • Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Language Identification

Predicting the Type and Target of Offensive Posts in Social Media

2 code implementations • NAACL 2019 • Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media.

Language Identification Vocal Bursts Type Prediction

Classifying Patent Applications with Ensemble Methods

no code implementations • ALTA 2018 • Fernando Benites, Shervin Malmasi, Marcos Zampieri

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications.

Classification General Classification

Classifier Ensembles for Dialect and Language Variety Identification

no code implementations • 14 Aug 2018 • Liviu P. Dinu, Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi

In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018.

Dialect Identification

Benchmarking Aggression Identification in Social Media

no code implementations • COLING 2018 • Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation.

Aggression Identification Benchmarking

Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

no code implementations • COLING 2018 • Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Dirk Speelman, Antal Van den Bosch, Ritesh Kumar, Bornini Lahiri, Mayank Jain

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects.

Dependency Parsing Dialect Identification

Paper
Add Code

Discriminating between Indo-Aryan Languages Using SVM Ensembles

no code implementations • COLING 2018 • Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Santanu Pal, Liviu P. Dinu

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi.

Language Identification

Paper
Add Code

A Neural Approach to Language Variety Translation

no code implementations • COLING 2018 • Marta R. Costa-jussà, Marcos Zampieri, Santanu Pal

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language.

Machine Translation Translation

Paper
Add Code

A Portuguese Native Language Identification Dataset

no code implementations • WS 2018 • Iria del Río, Marcos Zampieri, Shervin Malmasi

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing.

Language Acquisition Native Language Identification +1

Paper
Add Code

A Report on the Complex Word Identification Shared Task 2018

no code implementations • WS 2018 • Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo H. Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, Marcos Zampieri

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT'2018.

Binary Classification Classification +2

Paper
Add Code

Automatic Language Identification in Texts: A Survey

1 code implementation • 22 Apr 2018 • Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification

Paper
Code

Challenges in Discriminating Profanity from Hate Speech

no code implementations • 14 Mar 2018 • Shervin Malmasi, Marcos Zampieri

In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered.

Clustering General Classification

Paper
Add Code

LIDIOMS: A Multilingual Linked Idioms Data Set

1 code implementation • LREC 2018 • Diego Moussallem, Mohamed Ahmed Sherif, Diego Esteves, Marcos Zampieri, Axel-Cyrille Ngonga Ngomo

In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms currently containing five languages: English, German, Italian, Portuguese, and Russian.

Paper
Code

RDF2PT: Generating Brazilian Portuguese Texts from RDF Data

1 code implementation • LREC 2018 • Diego Moussallem, Thiago Castro Ferreira, Marcos Zampieri, Maria Claudia Cavalcanti, Geraldo Xexéo, Mariana Neves, Axel-Cyrille Ngonga Ngomo

The generation of natural language from Resource Description Framework (RDF) data has recently gained significant attention due to the continuous growth of Linked Data.

Paper
Code

Detecting Hate Speech in Social Media

1 code implementation • RANLP 2017 • Shervin Malmasi, Marcos Zampieri

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity.

General Classification

Paper
Code

Exploring the Use of Text Classification in the Legal Domain

no code implementations • 25 Oct 2017 • Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, Josef van Genabith

In this paper, we investigate the application of text classification methods to support law professionals.

General Classification text-classification +1

Paper
Add Code

Complex Word Identification: Challenges in Data Annotation and System Performance

no code implementations • WS 2017 • Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, Lucia Specia

This paper revisits the problem of complex word identification (CWI), following up on the SemEval CWI shared task.

Complex Word Identification General Classification

Paper
Add Code

Compiling and Processing Historical and Contemporary Portuguese Corpora

no code implementations • 2 Oct 2017 • Marcos Zampieri

This technical report describes the framework used for processing three large Portuguese corpora.

Paper
Add Code

Linguistic Features of Genre and Method Variation in Translation: A Computational Perspective

no code implementations • 13 Sep 2017 • Ekaterina Lapshinova-Koltunski, Marcos Zampieri

In this paper we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus.

General Classification text-classification +2

Paper
Add Code

Predicting the Law Area and Decisions of French Supreme Court Cases

no code implementations • RANLP 2017 • Octavia-Maria Sulea, Marcos Zampieri, Mihaela Vela, Josef van Genabith

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court.

General Classification text-classification +1

Paper
Add Code

Native Language Identification on Text and Speech

no code implementations • WS 2017 • Marcos Zampieri, Alina Maria Ciobanu, Liviu P. Dinu

This paper presents an ensemble system that combines the output of multiple SVM classifiers for native language identification (NLI).

Native Language Identification

Paper
Add Code

Including Dialects and Language Varieties in Author Profiling

no code implementations • 3 Jul 2017 • Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Liviu P. Dinu

This paper presents a computational approach to author profiling taking gender and language variety into account.

Paper
Add Code

Arabic Dialect Identification Using iVectors and ASR Transcripts

no code implementations • WS 2017 • Shervin Malmasi, Marcos Zampieri

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Paper
Add Code

German Dialect Identification in Interview Transcriptions

no code implementations • WS 2017 • Shervin Malmasi, Marcos Zampieri

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Paper
Add Code

Findings of the VarDial Evaluation Campaign 2017

no code implementations • WS 2017 • Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL'2017.

Dependency Parsing Dialect Identification

Paper
Add Code

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task

no code implementations • WS 2016 • Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial'2016 workshop at COLING'2016.

Dialect Identification General Classification +1

Paper
Add Code

Arabic Dialect Identification in Speech Transcripts

no code implementations • WS 2016 • Shervin Malmasi, Marcos Zampieri

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus.

Dialect Identification Machine Translation

Paper
Add Code

CATaLog Online: A Web-based CAT Tool for Distributed Translation with Data Capture for APE and Translation Process Research

no code implementations • COLING 2016 • Santanu Pal, Sudip Kumar Naskar, Marcos Zampieri, Tapas Nayak, Josef van Genabith

We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators.

Automatic Post-Editing Translation

Paper
Add Code

Discriminating Similar Languages: Evaluations and Explorations

no code implementations • LREC 2016 • Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties.

BIG-bench Machine Learning

Paper
Add Code

Modeling Language Change in Historical Corpora: The Case of Portuguese

no code implementations • LREC 2016 • Marcos Zampieri, Shervin Malmasi, Mark Dras

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification.

General Classification POS +2

Paper
Add Code

Findings of the 2016 Conference on Machine Translation

no code implementations • WS 2016 • Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri

Automatic Post-Editing Multimodal Machine Translation +1

Paper
Add Code

USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing

no code implementations • WS 2016 • Santanu Pal, Marcos Zampieri, Josef van Genabith

Automatic Post-Editing Word Alignment

Paper
Add Code

LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles

no code implementations • SEMEVAL 2016 • Shervin Malmasi, Mark Dras, Marcos Zampieri

Complex Word Identification Language Modelling +3

Paper
Add Code

MAZA at SemEval-2016 Task 11: Detecting Lexical Complexity Using a Decision Stump Meta-Classifier

no code implementations • SEMEVAL 2016 • Shervin Malmasi, Marcos Zampieri

Complex Word Identification Lexical Simplification +2

Paper
Add Code

MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification

no code implementations • SEMEVAL 2016 • Marcos Zampieri, Liling Tan, Josef van Genabith

Complex Word Identification Lexical Simplification +1

Paper
Add Code

Predicting Post Severity in Mental Health Forums

no code implementations • WS 2016 • Shervin Malmasi, Marcos Zampieri, Mark Dras

Paper
Add Code

CATaLog Online: Porting a Post-editing Tool to the Web

no code implementations • LREC 2016 • Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela, Josef van Genabith

The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper.

Machine Translation Management +1

Paper
Add Code

CATaLog: New Approaches to TM and Post Editing Interfaces

no code implementations • WS 2015 • Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, Josef van Genabith

Machine Translation

Paper
Add Code

Overview of the DSL Shared Task 2015

no code implementations • WS 2015 • Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Preslav Nakov

Language Identification

Paper
Add Code

Comparing Approaches to the Identification of Similar Languages

no code implementations • WS 2015 • Marcos Zampieri, Binyam Gebrekidan Gebre, Hernani Costa, Josef van Genabith

Language Identification

Paper
Add Code

AMBRA: A Ranking Approach to Temporal Text Classification

no code implementations • SEMEVAL 2015 • Marcos Zampieri, Alina Maria Ciobanu, Vlad Niculae, Liviu P. Dinu

General Classification Information Retrieval +2

Paper
Add Code

Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation

no code implementations • WS 2015 • Carolina Scarton, Marcos Zampieri, Mihaela Vela, Josef van Genabith, Lucia Specia

Machine Translation Translation

Paper
Add Code

Can Translation Memories afford not to use paraphrasing?

no code implementations • WS 2015 • Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, Josef van Genabith

Semantic Textual Similarity Translation

Paper
Add Code

A Report on the DSL Shared Task 2014

no code implementations • WS 2014 • Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann

Language Identification

Paper
Add Code

VarClass: An Open-source Language Identification Tool for Language Varieties

no code implementations • LREC 2014 • Marcos Zampieri, Binyam Gebre

This paper presents VarClass, an open-source tool for language identification available both as a download and through a user-friendly graphical interface.

Information Retrieval Language Identification +2