Efficient Vector Representation for Documents through Corruption

8 Jul 2017 · Minmin Chen

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of the embeddings of its words, and the training objective ensures that a representation built this way captures the semantic meaning of the document. The framework includes a corruption model that introduces a data-dependent regularization, favoring informative or rare words while forcing the embeddings of common, non-discriminative words toward zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms: despite its simple architecture, Doc2VecC matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine, and the model is equally efficient at generating representations of unseen documents at test time.
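To make the mechanism concrete, below is a minimal NumPy sketch of the core idea as stated in the abstract: a document vector is the average of its word embeddings, and during training the document is corrupted by randomly dropping words, with rescaling so the corrupted average remains an unbiased estimate of the clean one. This is an illustration written from the abstract, not the paper's released implementation; the actual model also conditions each prediction on local context words, and the names used here (`Q`, `corrupted_doc_vector`, `train_step`) are hypothetical.

```python
import numpy as np

# A toy corpus; in practice Doc2VecC is trained on large text collections.
corpus = [
    "the movie was great and the acting was great".split(),
    "the movie was dull and the plot was thin".split(),
]
vocab = sorted({w for doc in corpus for w in doc})
word2id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)

DIM = 16    # embedding dimension
Q = 0.5     # corruption rate: probability of dropping each document word
LR = 0.05   # learning rate
NEG = 3     # negative samples per target word

U = rng.normal(scale=0.1, size=(len(vocab), DIM))  # input word embeddings
V = np.zeros((len(vocab), DIM))                    # output embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupted_doc_vector(doc_ids):
    """Document vector from a corrupted copy of the document: each word
    is dropped with probability Q and the survivors are rescaled by
    1/(1-Q), making the result an unbiased estimate of the clean average."""
    mask = rng.random(len(doc_ids)) >= Q
    if not mask.any():                       # keep at least one word
        mask[rng.integers(len(doc_ids))] = True
    kept = np.asarray(doc_ids)[mask]
    return U[kept].sum(axis=0) / (len(doc_ids) * (1.0 - Q)), kept

def train_step(doc_ids):
    """Predict each word of the document from the corrupted document
    vector, using a word2vec-style negative-sampling objective."""
    for target in doc_ids:
        h, kept = corrupted_doc_vector(doc_ids)
        samples = [(target, 1.0)] + [(n, 0.0)
                                     for n in rng.integers(len(vocab), size=NEG)]
        grad_h = np.zeros(DIM)
        for w, label in samples:
            g = sigmoid(h @ V[w]) - label
            grad_h += g * V[w]
            V[w] -= LR * g * h
        # The gradient flows back only to words that survived corruption,
        # which is what pushes non-discriminative words toward zero.
        U[kept] -= LR * grad_h / (len(doc_ids) * (1.0 - Q))

for _ in range(200):
    for doc in corpus:
        train_step([word2id[w] for w in doc])

def doc_vector(doc):
    """At test time a document is the plain average of its word embeddings."""
    return U[[word2id[w] for w in doc]].mean(axis=0)

print(doc_vector(corpus[0])[:4])
```

Note how the test-time representation is just an average, with no inference step per document, which is consistent with the efficiency claim in the abstract.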


Results from the Paper

| Task                | Dataset | Model    | Metric               | Value  | Global Rank |
|---------------------|---------|----------|----------------------|--------|-------------|
| Sentiment Analysis  | IMDb    | Doc2VecC | Accuracy             | 88.3   | #37         |
| Semantic Similarity | SICK    | Doc2VecC | MSE                  | 0.3053 | #5          |
| Semantic Similarity | SICK    | Doc2VecC | Pearson Correlation  | 0.8381 | #5          |
| Semantic Similarity | SICK    | Doc2VecC | Spearman Correlation | 0.7621 | #5          |
