Regularly varying representation for sentence embedding

The dominant approaches to sentence representation in natural language rely on learning embeddings on massive corpuses. The obtained embeddings have desirable properties such as compositionality and distance preservation (sentences with similar meanings have similar representations). In this paper, we develop a novel method for learning an embedding enjoying a dilation invariance property. We propose two algorithms: Orthrus, a classification algorithm, constrains the distribution of the embedded variable to be regularly varying, i.e. multivariate heavy-tail. and uses Extreme Value Theory (EVT) to tackle the classification task on two separate regions: the tail and the bulk. Hydra, a text generation algorithm for dataset augmentation, leverages the invariance property of the embedding learnt by Orthrus to generate coherent sentences with controllable attribute, e.g. positive or negative sentiment. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here