Text Clustering
32 papers with code • 3 benchmarks • 5 datasets
Grouping a set of texts so that texts in the same group (called a cluster) are more similar (in some sense) to each other than to texts in other groups (clusters). (Source: Adapted from Wikipedia)
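The definition above can be sketched in a few lines of plain Python: represent each text as a bag-of-words vector, compare vectors with cosine similarity, and greedily group texts whose similarity exceeds a threshold. The greedy single-pass scheme and the 0.3 threshold are illustrative assumptions, not a method from any of the papers listed below.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: attach each text to the first
    cluster whose seed text is similar enough, else start a new cluster."""
    vectors = [Counter(t.lower().split()) for t in texts]
    clusters = []  # each cluster is a list of indices into `texts`
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, vectors[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return [[texts[i] for i in c] for c in clusters]

docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell on weak earnings today",
]
print(cluster(docs))  # two clusters: the cat texts and the market texts
```

Real systems replace the bag-of-words vectors with learned sentence embeddings (as in several papers below) and the greedy pass with k-means or similar, but the grouping-by-similarity idea is the same.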
Latest papers
More Discriminative Sentence Embeddings via Semantic Graph Smoothing
This paper explores an empirical approach to learning more discriminative sentence representations in an unsupervised fashion.
Elastic deep autoencoder for text embedding clustering by an improved graph regularization
This jointly trained, end-to-end deep learning model achieves better representations and text clustering results, with high accuracy on common datasets compared to existing methods.
Large Language Models Enable Few-Shot Clustering
In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
ClusterLLM: Large Language Models as a Guide for Text Clustering
First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions <does A better correspond to B than C>, where A, B, and C are similar data points that belong to different clusters according to a small embedder.
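The triplet form <does A better correspond to B than C> described above can be turned into a prompt string with a small helper. Only the triplet structure comes from the snippet; the exact wording, function name, and example texts here are illustrative assumptions, not ClusterLLM's actual prompt.

```python
def triplet_prompt(anchor: str, choice_b: str, choice_c: str) -> str:
    """Build a hard triplet question: does A better correspond to B than C?
    The phrasing is illustrative, not the paper's actual prompt template."""
    return (
        f"Does the text:\n  A: {anchor}\n"
        f"better correspond to\n  B: {choice_b}\n"
        f"than to\n  C: {choice_c}\n"
        "Answer B or C."
    )

# Hypothetical intent-classification texts used only for illustration.
print(triplet_prompt(
    "refund my order",
    "request a refund",
    "track my shipment",
))
```

The answers to such questions can then serve as supervision for adjusting the small embedder's notion of similarity.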
Robust Representation Learning with Reliable Pseudo-labels Generation via Self-Adaptive Optimal Transport for Short Text Clustering
To tackle the above issues, we propose a Robust Short Text Clustering (RSTC) model to improve robustness against imbalanced and noisy data.
Influence of various text embeddings on clustering performance in NLP
For example, a three-star rating (out of five) may be incongruous with the review text, which may be more suitable for a five-star review.
DeepLens: Interactive Out-of-distribution Data Detection in NLP Models
In this work, we propose DeepLens, an interactive system that helps users detect and explore OOD issues in massive text corpora.
Very Large Language Model as a Unified Methodology of Text Mining
Text data mining is the process of deriving essential information from language text.
MTEB: Massive Text Embedding Benchmark
MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.