Search Results for author: Nizar Habash

Found 159 papers, 15 papers with code

ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus

no code implementations LREC 2022 Nizar Habash, David Palfreyman

We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates.

Lemmatization Part-Of-Speech Tagging +2

The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic

no code implementations LREC 2022 Dana Abdulrahim, Go Inoue, Latifa Shamsan, Salam Khalifa, Nizar Habash

Our objective is to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to different genres (folktales, comedy shows, plays, cooking shows, etc.).

Camel Treebank: An Open Multi-genre Arabic Dependency Treebank

no code implementations LREC 2022 Nizar Habash, Muhammed AbuOdeh, Dima Taji, Reem Faraj, Jamila El Gizuli, Omar Kallas

We present the Camel Treebank (CAMELTB), a 188K word open-source dependency treebank of Modern Standard and Classical Arabic.

A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models

no code implementations COLING (WANLP) 2020 Ali Shazal, Aiza Usman, Nizar Habash

While online Arabic is primarily written using the Arabic script, a Roman-script variety called Arabizi is often seen on social media.

Transliteration

Gender-Aware Reinflection using Linguistically Enhanced Neural Models

1 code implementation GeBNLP (COLING) 2020 Bashar Alhafni, Nizar Habash, Houda Bouamor

In this paper, we present an approach for sentence-level gender reinflection using linguistically enhanced sequence-to-sequence models.

Grammatical Error Correction Sentence

A Cloud-based User-Centered Time-Offset Interaction Application

no code implementations SIGDIAL (ACL) 2021 Alberto Chierici, Tyeece Kiana Fredorcia Hensley, Wahib Kamran, Kertu Koss, Armaan Agrawal, Erin Meekhof, Goffredo Puccetti, Nizar Habash

Time-offset interaction applications (TOIA) allow simulating conversations with people who have previously recorded relevant video utterances, which are played in response to their interacting user.

A View From the Crowd: Evaluation Challenges for Time-Offset Interaction Applications

no code implementations EACL (HumEval) 2021 Alberto Chierici, Nizar Habash

Our contributions include the annotated dataset, which we make publicly available, and the proposal of Success Rate @k as an evaluation metric that is more appropriate than traditional QA and information retrieval metrics.

Question Answering

Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals

no code implementations 1 Feb 2024 Christian Khairallah, Reham Marzouk, Salam Khalifa, Mayar Nassar, Nizar Habash

Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously.

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching

1 code implementation 30 Jan 2024 Kurt Micallef, Nizar Habash, Claudia Borg, Fadhl Eryani, Houda Bouamor

Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data.

Cross-Lingual Transfer Transliteration

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

no code implementations 23 Oct 2023 Injy Hamed, Nizar Habash, Ngoc Thang Vu

Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, with both approaches achieving similar results.

Data Augmentation Machine Translation +2

Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

1 code implementation 24 May 2023 Bashar Alhafni, Go Inoue, Christian Khairallah, Nizar Habash

We also define the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED.

Grammatical Error Detection

Camelira: An Arabic Multi-Dialect Morphological Disambiguator

no code implementations 30 Nov 2022 Ossama Obeid, Go Inoue, Nizar Habash

We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine.

Dialect Identification Morphological Disambiguation

The Shared Task on Gender Rewriting

no code implementations 22 Oct 2022 Bashar Alhafni, Nizar Habash, Houda Bouamor, Ossama Obeid, Sultan Alrowili, Daliyah AlZeer, Khawlah M. Alshanqiti, Ahmed ElBakry, Muhammad ElNokrashy, Mohamed Gabr, Abderrahmane Issam, Abdelrahim Qaddoumi, K. Vijay-Shanker, Mahmoud Zyate

In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop.

Sentence

The User-Aware Arabic Gender Rewriter

no code implementations 14 Oct 2022 Bashar Alhafni, Ossama Obeid, Nizar Habash

We introduce the User-Aware Arabic Gender Rewriter, a user-centric web-based system for Arabic gender rewriting in contexts involving two users.

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

no code implementations 25 May 2022 Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

UniMorph 4.0: Universal Morphology

no code implementations LREC 2022 Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.

Morphological Inflection

User-Centric Gender Rewriting

1 code implementation NAACL 2022 Bashar Alhafni, Nizar Habash, Houda Bouamor

In this paper, we define the task of gender rewriting in contexts involving two users (I and/or You) - first and second grammatical persons with independent grammatical gender preferences.

AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization

no code implementations 21 Mar 2022 Moussa Kamal Eddine, Nadi Tomeh, Nizar Habash, Joseph Le Roux, Michalis Vazirgiannis

Like most natural language understanding and generation tasks, state-of-the-art models for summarization are transformer-based sequence-to-sequence architectures that are pretrained on large corpora.

Abstractive Text Summarization Natural Language Understanding

Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects

1 code implementation Findings (ACL) 2022 Go Inoue, Salam Khalifa, Nizar Habash

We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models.

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

1 code implementation EACL (WANLP) 2021 Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar Habash

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models.

Language Modelling

NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task

1 code implementation EACL (WANLP) 2021 Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, Nizar Habash

This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2).

Dialect Identification

Multitask Easy-First Dependency Parsing: Exploiting Complementarities of Different Dependency Representations

no code implementations COLING 2020 Yash Kankanampati, Joseph Le Roux, Nadi Tomeh, Dima Taji, Nizar Habash

In this paper we present a parsing model for projective dependency trees that takes advantage of complementary dependency annotations, as is the case in Arabic with the availability of the CATiB and UD treebanks.

Dependency Parsing

Utilizing Subword Entities in Character-Level Sequence-to-Sequence Lemmatization Models

no code implementations COLING 2020 Nasser Zalmout, Nizar Habash

In addition to generic n-gram embeddings (using FastText), we experiment with concatenative (stems) and templatic (roots and patterns) morphological subwords.

LEMMA Lemmatization

An Online Readability Leveled Arabic Thesaurus

no code implementations COLING 2020 Zhengyang Jiang, Nizar Habash, Muhamed Al Khalil

This demo paper introduces the online Readability Leveled Arabic Thesaurus interface.

NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task

no code implementations COLING (WANLP) 2020 Muhammad Abdul-Mageed, Chiyu Zhang, Houda Bouamor, Nizar Habash

The data for the shared task cover a total of 100 provinces from 21 Arab countries and are collected from the Twitter domain.

Dialect Identification

The Paradigm Discovery Problem

1 code implementation ACL 2020 Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, Nizar Habash

Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm.

Clustering Word Embeddings

The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems

no code implementations LREC 2020 Alberto Chierici, Nizar Habash, Margarita Bicec

The first challenges are to define a sensible methodology for data collection and to create useful data sets for training the system to retrieve the best answer to a user's question.

Question Answering Retrieval

A Spelling Correction Corpus for Multiple Arabic Dialects

no code implementations LREC 2020 Fadhl Eryani, Nizar Habash, Houda Bouamor, Salam Khalifa

In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form.

Spelling Correction

Adversarial Multitask Learning for Joint Multi-Feature and Multi-Dialect Morphological Modeling

no code implementations ACL 2019 Nasser Zalmout, Nizar Habash

In this paper we explore the use of multitask learning and adversarial training to address morphological richness and dialectal variations in the context of full morphological tagging.

Morphological Tagging Transfer Learning

Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging

no code implementations ACL 2020 Nasser Zalmout, Nizar Habash

Semitic languages can be highly ambiguous, having several interpretations of the same surface forms, and morphologically rich, having many morphemes that realize several morphological features.

Lemmatization Morphological Tagging

The MADAR Shared Task on Arabic Fine-Grained Dialect Identification

no code implementations WS 2019 Houda Bouamor, Sabit Hassan, Nizar Habash

In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification.

Dialect Identification

A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance

no code implementations WS 2019 Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, Houda Bouamor

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language specific input.

Automatic Gender Identification and Reinflection in Arabic

no code implementations WS 2019 Nizar Habash, Houda Bouamor, Christine Chung

The impressive progress in many Natural Language Processing (NLP) applications has increased the awareness of some of the biases these NLP systems have with regards to gender identities.

Machine Translation Translation

Simple Automatic Post-editing for Arabic-Japanese Machine Translation

no code implementations 14 Jul 2019 Ella Noll, Mai Oudah, Nizar Habash

A common bottleneck for developing machine translation (MT) systems for some language pairs is the lack of direct parallel translation data sets, in general and in certain domains.

Automatic Post-Editing Translation

The Effectiveness of Simple Hybrid Systems for Hypernym Discovery

no code implementations ACL 2019 William Held, Nizar Habash

Hypernymy modeling has largely been separated according to two paradigms, pattern-based methods and distributional methods.

Hypernym Discovery

ADIDA: Automatic Dialect Identification for Arabic

no code implementations NAACL 2019 Ossama Obeid, Mohammad Salameh, Houda Bouamor, Nizar Habash

This demo paper describes ADIDA, a web-based system for automatic dialect identification for Arabic text.

Dialect Identification

An Arabic Dependency Treebank in the Travel Domain

no code implementations 29 Jan 2019 Dima Taji, Jamila El Gizuli, Nizar Habash

In this paper we present a dependency treebank of travel domain sentences in Modern Standard Arabic.

Translation

An Arabic Morphological Analyzer and Generator with Copious Features

no code implementations WS 2018 Dima Taji, Salam Khalifa, Ossama Obeid, Fadhl Eryani, Nizar Habash

We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more.

Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

no code implementations EMNLP 2018 Daniel Watson, Nasser Zalmout, Nizar Habash

We show that providing the model with word-level features bridges the gap for the neural network approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.

Word Embeddings

Fine-Grained Arabic Dialect Identification

no code implementations COLING 2018 Mohammad Salameh, Houda Bouamor, Nizar Habash

Previous work on the problem of Arabic Dialect Identification typically targeted coarse-grained five dialect classes plus Standard Arabic (6-way classification).

Classification Dialect Identification +3

Improving Domain Independent Question Parsing with Synthetic Treebanks

no code implementations COLING 2018 Halim-Antoine Boukaram, Nizar Habash, Micheline Ziadee, Majd Sakr

Automatic syntactic parsing for question constructions is a challenging task due to the paucity of training examples in most treebanks.

Addressing Noise in Multidialectal Word Embeddings

no code implementations ACL 2018 Alexander Erdmann, Nasser Zalmout, Nizar Habash

Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling.

Sentence Transliteration +1

Noise-Robust Morphological Disambiguation for Dialectal Arabic

no code implementations NAACL 2018 Nasser Zalmout, Alexander Erdmann, Nizar Habash

User-generated text tends to be noisy with many lexical and orthographic inconsistencies, making natural language processing (NLP) tasks more challenging.

Lexical Normalization Morphological Analysis +3

Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic

no code implementations MTSummit 2017 Alexander Erdmann, Nizar Habash, Dima Taji, Houda Bouamor

We present the second ever evaluated Arabic dialect-to-dialect machine translation effort, and the first to leverage external resources beyond a small parallel corpus.

Machine Translation Translation

Don't Throw Those Morphological Analyzers Away Just Yet: Neural Morphological Disambiguation for Arabic

no code implementations EMNLP 2017 Nasser Zalmout, Nizar Habash

We make use of the resulting morphological models for scoring and ranking the analyses of the morphological analyzer for morphological disambiguation.

Feature Engineering Language Modelling +3

OMAM at SemEval-2017 Task 4: English Sentiment Analysis with Conditional Random Fields

no code implementations SEMEVAL 2017 Chukwuyem Onyibe, Nizar Habash

We describe a supervised system that uses optimized Conditional Random Fields and lexical features to predict the sentiment of a tweet.

Opinion Mining Sentiment Analysis +1

Robust Dictionary Lookup in Multiple Noisy Orthographies

no code implementations WS 2017 Lingliang Zhang, Nizar Habash, Godfried Toussaint

We present the MultiScript Phonetic Search algorithm to address the problem of language learners looking up unfamiliar words that they heard.

Transliteration

CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation

no code implementations COLING 2016 Anas Shahrour, Salam Khalifa, Dima Taji, Nizar Habash

In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features.

Dependency Parsing Morphological Analysis +2

Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine

no code implementations COLING 2016 Ramy Eskander, Nizar Habash, Owen Rambow, Arfath Pasha

Arabic dialects present a special problem for natural language processing because there are few resources, they have no standard orthography, and have not been studied much.

Morphological Analysis

Morphological Constraints for Phrase Pivot Statistical Machine Translation

no code implementations 12 Sep 2016 Ahmed El Kholy, Nizar Habash

One common solution is to pivot through a third language for which there exist parallel corpora with the source and target languages.

Machine Translation Translation

A Large Scale Corpus of Gulf Arabic

no code implementations LREC 2016 Salam Khalifa, Nizar Habash, Dana Abdulrahim, Sara Hassan

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World.

First Result on Arabic Neural Machine Translation

no code implementations 8 Jun 2016 Amjad Almahairi, Kyunghyun Cho, Nizar Habash, Aaron Courville

Neural machine translation has become a major alternative to widely used phrase-based statistical machine translation.

Machine Translation Translation

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation

no code implementations LREC 2016 Wajdi Zaghouani, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor, Kemal Oflazer

We present our guidelines and annotation procedure to create a human-corrected, post-edited machine translation corpus for Modern Standard Arabic.

Machine Translation Translation

DALILA: The Dialectal Arabic Linguistic Learning Assistant

no code implementations LREC 2016 Salam Khalifa, Houda Bouamor, Nizar Habash

Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP).

Applying the Cognitive Machine Translation Evaluation Approach to Arabic

no code implementations LREC 2016 Irina Temnikova, Wajdi Zaghouani, Stephan Vogel, Nizar Habash

The goal of the cognitive machine translation (MT) evaluation approach is to build classifiers which assign post-editing effort scores to new texts.

Machine Translation Translation

Arabic Corpora for Credibility Analysis

no code implementations LREC 2016 Ayman Al Zaatari, Rim El Ballouli, Shady ELbassouni, Wassim El-Hajj, Hazem Hajj, Khaled Shaban, Nizar Habash, Emad Yahya

We focus on Arabic due to the recent popularity of blogs and microblogs in the Arab World and due to the lack of any such public corpora in Arabic.

BIG-bench Machine Learning General Classification

Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development

no code implementations LREC 2014 Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash, Ramy Eskander

This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA).

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

no code implementations LREC 2014 Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, Ryan Roth

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007).

Chunking Lemmatization +5

Large Scale Arabic Error Annotation: Guidelines and Framework

no code implementations LREC 2014 Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani, Kemal Oflazer

Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.

Machine Translation

LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual

no code implementations 22 Sep 2013 Mona Diab, Nizar Habash, Owen Rambow, Ryan Roth

The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research.
