Text Clustering

32 papers with code • 3 benchmarks • 5 datasets

Text clustering is the task of grouping a set of texts so that texts in the same group (called a cluster) are more similar, under some chosen measure, to each other than to texts in other clusters. (Source: Adapted from Wikipedia)
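As a rough illustration of this definition, the sketch below groups a few toy sentences by bag-of-words cosine similarity, linking any two texts whose similarity reaches a threshold. The helper names (`bag_of_words`, `cosine`, `cluster_texts`), the threshold of 0.3, and the sample sentences are all assumptions made for the example; none of them comes from the papers listed on this page, which use learned representations rather than raw word counts.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_texts(texts, threshold=0.3):
    """Single-link clustering (hypothetical helper): texts whose similarity
    reaches the threshold are linked, and each connected component of the
    resulting graph becomes one cluster of text indices."""
    vectors = [bag_of_words(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy input, invented for this example.
texts = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock markets fell today",
]
clusters = cluster_texts(texts)
print(clusters)
```

Here "similar (in some sense)" is made concrete as cosine similarity over word counts. Swapping in a different representation or distance changes which texts end up together, which is the design space the papers below explore.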

Most implemented papers

Dissimilarity Mixture Autoencoder for Deep Clustering

larajuse/DMAE 15 Jun 2020

The dissimilarity mixture autoencoder (DMAE) is a neural network model for feature-based clustering that incorporates a flexible dissimilarity function and can be integrated into any kind of deep learning architecture.

Discovering New Intents with Deep Aligned Clustering

thuiar/DeepAligned-Clustering 16 Dec 2020

In this work, we propose an effective method, Deep Aligned Clustering, to discover new intents with the aid of the limited known intent data.

Supporting Clustering with Contrastive Learning

amazon-research/sccl NAACL 2021

Unsupervised clustering aims at discovering the semantic categories of data according to some distance measured in the representation space.

Proposition-Level Clustering for Multi-Document Summarization

oriern/procluster NAACL 2022

Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition.

MTEB: Massive Text Embedding Benchmark

embeddings-benchmark/mteb 13 Oct 2022

MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.

Clustering Urdu News Using Headlines

SyedMuhammadFaheem/Urdu-News-Clustering 23 2015

This paper proposes and evaluates a new algorithm to automatically cluster Urdu news from different news agencies.

Self-Taught Convolutional Neural Networks for Short Text Clustering

jacoxu/STC2 1 Jan 2017

Short text clustering is a challenging problem due to its sparseness of text representation.

ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

elki-project/elki 10 Feb 2019

We will first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version.

On the Use of ArXiv as a Dataset

mattbierbaum/arxiv-public-datasets 30 Apr 2019

We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles.