Search Results for author: Marcos Zampieri

Found 112 papers, 20 papers with code

Transfer Learning Methods for Domain Adaptation in Technical Logbook Datasets

no code implementations LREC 2022 Farhad Akhbardeh, Marcos Zampieri, Cecilia Ovesdotter Alm, Travis Desell

Event identification in technical logbooks poses challenges given the limited logbook data available in specific technical domains, the large set of possible classes, and logbook entries typically being in short form and non-standard technical language.

Domain Adaptation Transfer Learning

Findings of the 2021 Conference on Machine Translation (WMT21)

no code implementations WMT (EMNLP) 2021 Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-Jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, Marcos Zampieri

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

A Text-to-Text Model for Multilingual Offensive Language Identification

no code implementations 6 Dec 2023 Tharindu Ranasinghe, Marcos Zampieri

Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish).

Language Identification XLM-R

nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla

no code implementations 25 Nov 2023 Md Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP.

Text Detection Translation

nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis

no code implementations 25 Nov 2023 Dhiman Goswami, Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP.

Data Augmentation Sentiment Analysis +2

Deep Learning Approaches to Lexical Simplification: A Survey

no code implementations 19 May 2023 Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

To reflect these recent advances, we present a comprehensive survey of papers published between 2017 and 2023 on LS and its sub-tasks with a special focus on deep learning.

Lexical Simplification Sentence +1

Lexical Complexity Prediction: An Overview

no code implementations 8 Mar 2023 Kai North, Marcos Zampieri, Matthew Shardlow

Finally, we include brief sections on applications of lexical complexity prediction, such as readability and text simplification, together with related studies on languages other than English.

Lexical Complexity Prediction Reading Comprehension +1

Language Variety Identification with True Labels

1 code implementation 2 Mar 2023 Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera

Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety.

Language Identification

Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

no code implementations 6 Feb 2023 Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng SHEANG, Matthew Shardlow, Kai North, Marcos Zampieri

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability TSAR-2022 held in conjunction with EMNLP 2022.

Lexical Simplification Text Simplification

SOLD: Sinhala Offensive Language Dataset

1 code implementation 1 Dec 2022 Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri

SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models.

Language Identification Sentence

Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

no code implementations 18 Nov 2022 Tharindu Ranasinghe, Kai North, Damith Premasiri, Marcos Zampieri

The spread of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically.

Language Identification

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

no code implementations COLING 2022 Kai North, Marcos Zampieri, Tharindu Ranasinghe

To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words.

Lexical Simplification XLM-R

Lexical Simplification Benchmarks for English, Portuguese, and Spanish

2 code implementations 12 Sep 2022 Sanja Stajner, Daniel Ferres, Matthew Shardlow, Kai North, Marcos Zampieri, Horacio Saggion

To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with differing architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performances on our new dataset.

Lexical Simplification

FBERT: A Neural Transformer for Identifying Offensive Content

no code implementations Findings (EMNLP) 2021 Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media.

Language Identification XLM-R

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

1 code implementation 1 Sep 2021 Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, Emily Hill

We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names.

Attribute Part-Of-Speech Tagging

Handling Extreme Class Imbalance in Technical Logbook Datasets

no code implementations ACL 2021 Farhad Akhbardeh, Cecilia Ovesdotter Alm, Marcos Zampieri, Travis Desell

In this paper we focus on the problem of technical issue classification by considering logbook datasets from the automotive, aviation, and facilities maintenance domains.

Multilingual Offensive Language Identification for Low-resource Languages

no code implementations 12 May 2021 Tharindu Ranasinghe, Marcos Zampieri

We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in HASOC 2019 shared task and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) showing that our approach compares favourably to the best systems submitted to recent shared tasks on these three languages.

Language Identification Transfer Learning +1

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

1 code implementation SEMEVAL 2021 Tharindu Ranasinghe, Diptanu Sarkar, Marcos Zampieri, Alexander Ororbia

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms.

Toxic Spans Detection

Comparing Approaches to Dravidian Language Identification

no code implementations EACL (VarDial) 2021 Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.

Dialect Identification text-classification +1

MUDES: Multilingual Detection of Offensive Spans

1 code implementation NAACL 2021 Tharindu Ranasinghe, Marcos Zampieri

The interest in offensive content identification in social media has grown substantially in recent years.

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

no code implementations 1 Nov 2020 Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020.

Language Identification Transfer Learning +1

Multilingual Offensive Language Identification with Cross-lingual Embeddings

1 code implementation EMNLP 2020 Tharindu Ranasinghe, Marcos Zampieri

In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources.

Language Identification Transfer Learning +1

MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

no code implementations COLING 2020 Farhad Akhbardeh, Travis Desell, Marcos Zampieri

Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.

Clustering

CompLex --- A New Corpus for Lexical Complexity Prediction from Likert Scale Data

no code implementations LREC 2020 Matthew Shardlow, Michael Cooper, Marcos Zampieri

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification.

Binary Classification Complex Word Identification +1

Evaluating Aggression Identification in Social Media

no code implementations LREC 2020 Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English.

Aggression Identification

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

no code implementations Findings (ACL) 2021 Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov

The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression.

Language Identification

Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study

no code implementations 31 Mar 2020 Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan

We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages.

BIG-bench Machine Learning Machine Translation +1

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

1 code implementation 16 Mar 2020 Matthew Shardlow, Michael Cooper, Marcos Zampieri

With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text.

Binary Classification Complex Word Identification +2

Offensive Language Identification in Greek

1 code implementation LREC 2020 Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc.

Language Identification

UDS–DFKI Submission to the WMT2019 Similar Language Translation Shared Task

no code implementations 16 Aug 2019 Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation

no code implementations WS 2019 Mihaela Vela, Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Josef van Genabith

User feedback revealed that the users preferred using CATaLog Online over existing CAT tools in some respects, especially by selecting the output of the MT system and taking advantage of the color scheme for TM suggestions.

Automatic Post-Editing Management +1

UDS–DFKI Submission to the WMT2019 Czech–Polish Similar Language Translation Shared Task

no code implementations WS 2019 Santanu Pal, Marcos Zampieri, Josef van Genabith

The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish.

Translation

A Report on the Third VarDial Evaluation Campaign

no code implementations WS 2019 Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019.

Dialect Identification Morphological Analysis

Experiments in Cuneiform Language Identification

no code implementations WS 2019 Gustavo Henrique Paetzold, Marcos Zampieri

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world.

Language Identification

UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks

no code implementations SEMEVAL 2019 Gustavo Henrique Paetzold, Shervin Malmasi, Marcos Zampieri

We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

2 code implementations SEMEVAL 2019 Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Language Identification

Classifying Patent Applications with Ensemble Methods

no code implementations ALTA 2018 Fernando Benites, Shervin Malmasi, Marcos Zampieri

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications.

Classification General Classification

Classifier Ensembles for Dialect and Language Variety Identification

no code implementations 14 Aug 2018 Liviu P. Dinu, Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi

In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018.

Dialect Identification

Benchmarking Aggression Identification in Social Media

no code implementations COLING 2018 Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Marcos Zampieri

For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation.

Aggression Identification Benchmarking

Discriminating between Indo-Aryan Languages Using SVM Ensembles

no code implementations COLING 2018 Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Santanu Pal, Liviu P. Dinu

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi.

Language Identification

A Neural Approach to Language Variety Translation

no code implementations COLING 2018 Marta R. Costa-jussà, Marcos Zampieri, Santanu Pal

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language.

Machine Translation Translation

A Portuguese Native Language Identification Dataset

no code implementations WS 2018 Iria del Río, Marcos Zampieri, Shervin Malmasi

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing.

Language Acquisition Native Language Identification +1

Automatic Language Identification in Texts: A Survey

1 code implementation 22 Apr 2018 Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.

Language Identification

Challenges in Discriminating Profanity from Hate Speech

no code implementations 14 Mar 2018 Shervin Malmasi, Marcos Zampieri

In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered.

Clustering General Classification

LIDIOMS: A Multilingual Linked Idioms Data Set

1 code implementation LREC 2018 Diego Moussallem, Mohamed Ahmed Sherif, Diego Esteves, Marcos Zampieri, Axel-Cyrille Ngonga Ngomo

In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms currently containing five languages: English, German, Italian, Portuguese, and Russian.

RDF2PT: Generating Brazilian Portuguese Texts from RDF Data

1 code implementation LREC 2018 Diego Moussallem, Thiago Castro Ferreira, Marcos Zampieri, Maria Claudia Cavalcanti, Geraldo Xexéo, Mariana Neves, Axel-Cyrille Ngonga Ngomo

The generation of natural language from Resource Description Framework (RDF) data has recently gained significant attention due to the continuous growth of Linked Data.

Detecting Hate Speech in Social Media

1 code implementation RANLP 2017 Shervin Malmasi, Marcos Zampieri

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity.

General Classification

Compiling and Processing Historical and Contemporary Portuguese Corpora

no code implementations 2 Oct 2017 Marcos Zampieri

This technical report describes the framework used for processing three large Portuguese corpora.

Linguistic Features of Genre and Method Variation in Translation: A Computational Perspective

no code implementations 13 Sep 2017 Ekaterina Lapshinova-Koltunski, Marcos Zampieri

In this paper we describe the use of text classification methods to investigate genre and method variation in an English - German translation corpus.

General Classification text-classification +2

Predicting the Law Area and Decisions of French Supreme Court Cases

no code implementations RANLP 2017 Octavia-Maria Sulea, Marcos Zampieri, Mihaela Vela, Josef van Genabith

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court.

General Classification text-classification +1

Native Language Identification on Text and Speech

no code implementations WS 2017 Marcos Zampieri, Alina Maria Ciobanu, Liviu P. Dinu

This paper presents an ensemble system combining the output of multiple SVM classifiers to native language identification (NLI).

Native Language Identification

Including Dialects and Language Varieties in Author Profiling

no code implementations 3 Jul 2017 Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Liviu P. Dinu

This paper presents a computational approach to author profiling taking gender and language variety into account.

German Dialect Identification in Interview Transcriptions

no code implementations WS 2017 Shervin Malmasi, Marcos Zampieri

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Findings of the VarDial Evaluation Campaign 2017

no code implementations WS 2017 Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL'2017.

Dependency Parsing Dialect Identification

Arabic Dialect Identification Using iVectors and ASR Transcripts

no code implementations WS 2017 Shervin Malmasi, Marcos Zampieri

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017.

Dialect Identification Machine Translation

Arabic Dialect Identification in Speech Transcripts

no code implementations WS 2016 Shervin Malmasi, Marcos Zampieri

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus.

Dialect Identification Machine Translation

Discriminating Similar Languages: Evaluations and Explorations

no code implementations LREC 2016 Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties.

BIG-bench Machine Learning

Modeling Language Change in Historical Corpora: The Case of Portuguese

no code implementations LREC 2016 Marcos Zampieri, Shervin Malmasi, Mark Dras

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification.

General Classification POS +2

CATaLog Online: Porting a Post-editing Tool to the Web

no code implementations LREC 2016 Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela, Josef van Genabith

The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper.

Machine Translation Management +1

VarClass: An Open-source Language Identification Tool for Language Varieties

no code implementations LREC 2014 Marcos Zampieri, Binyam Gebre

This paper presents VarClass, an open-source tool for language identification available both to be downloaded as well as through a graphical user-friendly interface.

Information Retrieval Language Identification +2
