Text Classification
1104 papers with code • 150 benchmarks • 148 datasets
Text Classification is the task of assigning an appropriate category to a sentence or document. The categories depend on the chosen dataset and can range from broad topics (e.g., news categories) to fine-grained labels such as sentiment or citation intent.
Text Classification problems include emotion classification, news classification, and citation intent classification, among others. Benchmark datasets for evaluating text classification capabilities include GLUE and AGNews.
In recent years, deep learning models such as XLNet and RoBERTa have produced some of the largest performance gains on text classification benchmarks.
(Image credit: Text Classification Algorithms: A Survey)
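As a concrete illustration of the task, here is a minimal sketch of classifying text with a pretrained model via the Hugging Face `transformers` pipeline API; the checkpoint named below is just one publicly available example:

```python
# Minimal text classification sketch using the Hugging Face pipeline API.
# The checkpoint is a sentiment classifier; any text-classification
# checkpoint could be substituted.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The new benchmark results are surprisingly strong."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```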
Subtasks
- Topic Models
- Document Classification
- Sentence Classification
- Emotion Classification
- Multi-Label Text Classification
- Few-Shot Text Classification
- Text Categorization
- Semi-Supervised Text Classification
- Coherence Evaluation
- Toxic Comment Classification
- Citation Intent Classification
- Cross-Domain Text Classification
- Unsupervised Text Classification
- Satire Detection
- Hierarchical Text Classification of Blurbs (GermEval 2019)
- Variable Detection
Latest papers
PEACH: Pretrained-embedding Explanation Across Contextual and Hierarchical Structure
In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), that can explain how text-based documents are classified by using any pretrained contextual embeddings in a tree-based human-interpretable manner.
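As background for the general idea of pairing pretrained embeddings with tree-based, human-readable decisions, here is a generic sketch (not the PEACH algorithm itself; the model and package names are assumptions):

```python
# Generic sketch: fit an interpretable decision tree on top of pretrained
# sentence embeddings. Assumes the `sentence-transformers` and
# `scikit-learn` packages are installed.
from sentence_transformers import SentenceTransformer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]  # toy sentiment labels

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
tree = DecisionTreeClassifier(max_depth=2).fit(embeddings, labels)
print(export_text(tree))  # human-readable decision rules over embedding dims
```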
IMO: Greedy Layer-Wise Sparse Representation Learning for Out-of-Distribution Text Classification with Pre-trained Models
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains.
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalization than morphological tokenization for the semantic compositionality of word meanings.
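To see the kind of subword splits at issue, one can inspect a WordPiece tokenizer directly (a minimal sketch assuming the `transformers` library; the exact pieces depend on the model's learned vocabulary):

```python
# Inspect how a WordPiece tokenizer decomposes words into subwords.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Some splits align with morphemes, others do not; output varies by vocabulary.
for word in ["unaffable", "tokenization"]:
    print(word, "->", tokenizer.tokenize(word))
```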
VI-OOD: A Unified Representation Learning Framework for Textual Out-of-distribution Detection
Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications.
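For context on the problem setting, the classic baseline is the maximum-softmax-probability score; the sketch below shows that generic baseline, not the VI-OOD method:

```python
# Maximum-softmax-probability (MSP) OOD baseline: a confident prediction
# (peaked softmax) gets a low OOD score, an uncertain one gets a high score.
import torch
import torch.nn.functional as F

def msp_ood_score(logits: torch.Tensor) -> torch.Tensor:
    """Lower max softmax probability -> higher out-of-distribution score."""
    return 1.0 - F.softmax(logits, dim=-1).max(dim=-1).values

logits = torch.tensor([[4.0, 0.5, 0.2],   # confident -> low OOD score
                       [1.1, 1.0, 0.9]])  # uncertain -> high OOD score
print(msp_ood_score(logits))
```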
Multi-Task Learning for Features Extraction in Financial Annual Reports
For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information.
Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human Rationales
By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility.
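To make the trade-off concrete, a simple (hypothetical) alternative to a full multi-objective optimizer is to sweep a scalarization weight between the two losses; each weight corresponds to one trained model, and the resulting (performance, plausibility) pairs trace a frontier:

```python
# Sketch: sweeping a weight between a task loss and a plausibility loss.
# The loss values are placeholders; the paper's actual objectives and
# multi-objective algorithm are not reproduced here.
import numpy as np

def combined_loss(task_loss: float, plausibility_loss: float, alpha: float) -> float:
    # alpha = 0 optimizes task performance only; alpha = 1 plausibility only.
    return (1 - alpha) * task_loss + alpha * plausibility_loss

for alpha in np.linspace(0.0, 1.0, 5):
    print(f"alpha={alpha:.2f} ->", combined_loss(0.3, 0.8, alpha))
```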
Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations
Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss.
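As a rough illustration of cross-lingual prediction (an NLI-based zero-shot classifier rather than the paper's one-shot demonstration method; the checkpoint name is one public example):

```python
# Classify non-English text with a multilingual NLI-based zero-shot pipeline.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
result = clf(
    "Este es un artículo sobre economía.",           # Spanish input
    candidate_labels=["economy", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])
```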
FPT: Feature Prompt Tuning for Few-shot Readability Assessment
Our proposed method establishes a new architecture for prompt tuning that sheds light on how linguistic features can be easily adapted to linguistics-related tasks.
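For readers unfamiliar with the underlying mechanism, here is a generic soft-prompt-tuning sketch in PyTorch: learnable prompt embeddings are prepended to the model's input embeddings (this illustrates prompt tuning in general, not FPT's feature-prompt architecture):

```python
# Generic soft prompt tuning: a small learnable prompt matrix is prepended
# to the (typically frozen) backbone's token embeddings.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

x = torch.randn(2, 10, 768)           # toy token embeddings
print(SoftPrompt(5, 768)(x).shape)    # torch.Size([2, 15, 768])
```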
DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation
To address this issue, we propose a novel text dataset distillation approach, called Distilling dataset into Language Model (DiLM), which trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples.
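The general idea of having a language model emit synthetic training text can be sketched as follows (a hypothetical label-conditioned prompt with GPT-2, not DiLM's distillation training loop):

```python
# Generate synthetic training samples with a causal language model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Review (positive): "   # hypothetical label-conditioned prompt
samples = generator(prompt, max_new_tokens=20, num_return_sequences=2, do_sample=True)
for s in samples:
    print(s["generated_text"])
```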
HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification
Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning and rely heavily on human-designed augmentation rules to generate contrastive samples, which can corrupt or distort the original information.
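The contrastive setup being critiqued typically uses an InfoNCE-style objective over augmented pairs; the sketch below shows that generic objective, not HILL's loss:

```python
# Generic InfoNCE contrastive loss: row i of `positives` is the augmented
# view of row i of `anchors`; all other rows serve as in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.1):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / tau                # scaled cosine similarities
    targets = torch.arange(a.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

print(info_nce(torch.randn(8, 64), torch.randn(8, 64)))
```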