HateBR Dataset | Papers With Code

The **HateBR dataset** is an expert-annotated corpus for offensive language and hate speech detection in Brazilian Portuguese. Key details:

1. **Collection and Annotation**:
   - The corpus was **collected from comments on Brazilian politicians' Instagram accounts**.
   - Each comment was **manually annotated by specialists**.
   - The dataset consists of **7,000 documents** (one comment per document).

2. **Annotation Layers**:
   - Each comment is annotated in three layers:
     - **Binary Classification**: Comments are labeled as either **offensive** or **non-offensive**.
     - **Offensiveness Levels**: Offensive comments are graded as **highly**, **moderately**, or **slightly offensive**.
     - **Hate Speech Targets**: Comments containing hate speech are labeled by target, across **nine** categories:
       - Xenophobia
       - Racism
       - Homophobia
       - Sexism
       - Religious intolerance
       - Partyism
       - Apology for the dictatorship
       - Antisemitism
       - Fatphobia

3. **Inter-Annotator Agreement**:
   - Each comment was annotated by **three different annotators** to ensure reliability.
   - The dataset achieved **high inter-annotator agreement**.

4. **Baseline Performance**:
   - Baseline machine learning models trained on HateBR reached an **F1-score of 85%**, outperforming previously reported baselines on Portuguese-language hate speech datasets.

5. **Corpus and Models**:
   - The GitHub repository distributes the annotated **corpus** together with the **best models** presented in the associated research paper.

6. **File Format**:
   - The `HateBr.csv` file provides four columns:
     - 1st column: Instagram comments.
     - 2nd column: Offensive language classification (offensive vs. non-offensive).
     - 3rd column: Offensiveness level (highly, moderately, slightly offensive).
     - 4th column: Hate speech classification (nine different targets).
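
The four-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch, not the authors' loader: the `HateBRRow` class, its field names, and the `read_rows` and `label_counts` helpers are hypothetical names introduced here for illustration. It assumes the file has a header row and keeps every label as the raw string found in the file, since the label encodings are not described here.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class HateBRRow:
    """One CSV record; field names are illustrative, not the file's headers."""
    comment: str      # 1st column: Instagram comment text
    offensive: str    # 2nd column: offensive vs. non-offensive (raw value)
    level: str        # 3rd column: offensiveness level (raw value)
    hate_target: str  # 4th column: hate speech target (raw value)


def read_rows(lines: Iterable[str], has_header: bool = True) -> Iterator[HateBRRow]:
    """Yield rows by column position, so the header names do not matter."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # assumption: the first line is a header
    for record in reader:
        if len(record) < 4:
            continue  # skip blank or incomplete lines
        yield HateBRRow(*(field.strip() for field in record[:4]))


def label_counts(rows: Iterable[HateBRRow]) -> Counter:
    """Count rows per raw value of the binary (2nd column) label."""
    return Counter(row.offensive for row in rows)


if __name__ == "__main__":
    # Made-up placeholder rows, not taken from the dataset.
    sample = io.StringIO(
        "comment,offensive,level,target\n"
        '"placeholder comment, with a comma",1,2,-1\n'
        "another placeholder,0,0,-1\n"
    )
    print(label_counts(read_rows(sample)))
```

To read the real file, pass an open file handle instead of the sample, for example `open("HateBr.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption and may need adjusting.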

Source: Conversation with Bing, 3/16/2024
(1) HateBR - Offensive Language and Hate Speech Dataset in ... - GitHub. https://github.com/franciellevargas/HateBR.
(2) ruanchaves/hatebr · Datasets at Hugging Face. https://huggingface.co/datasets/ruanchaves/hatebr.
(3) Papers with Code - HateBR: Large expert annotated corpus of Brazilian .... https://paperswithcode.com/paper/hatebr-large-expert-annotated-corpus-of.

---

Similar Datasets

LeNER-Br

BRWAC
