Weight Tying Explained | Papers With Code

**Weight Tying** improves the performance of language models by tying (sharing) the weights of the embedding and [softmax](https://paperswithcode.com/method/softmax) layers. Because a single vocabulary-sized matrix is stored instead of two, the method also reduces the total number of parameters, often substantially when the vocabulary is large.
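To make the parameter saving concrete, here is a minimal sketch that counts only the embedding and softmax matrices (biases and all intermediate layers are ignored). The function name and the sizes `vocab_size=10_000` and `d_model=512` are illustrative assumptions, not figures from either paper.

```python
def embedding_param_counts(vocab_size: int, d_model: int) -> dict:
    """Count weights in the embedding and softmax matrices (biases ignored)."""
    # Untied: one (vocab_size x d_model) matrix for the embedding
    # and a second one of the same shape for the softmax layer.
    untied = 2 * vocab_size * d_model
    # Tied: a single shared (vocab_size x d_model) matrix.
    tied = vocab_size * d_model
    return {"untied": untied, "tied": tied, "saved": untied - tied}


# Hypothetical sizes chosen only for illustration.
counts = embedding_param_counts(vocab_size=10_000, d_model=512)
print(counts)
```

The saving equals one full `vocab_size x d_model` matrix, so it grows with the vocabulary size; how large it is relative to the whole model depends on the size of the remaining layers.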

Language models typically consist of an embedding layer, followed by a number of [Transformer](https://paperswithcode.com/method/transformer) or [LSTM](https://paperswithcode.com/method/lstm) layers, followed in turn by a softmax layer. Embedding layers learn word representations such that words with similar meanings are represented by vectors that are near each other (in cosine distance). [Press & Wolf, 2016](https://paperswithcode.com/paper/using-the-output-embedding-to-improve) showed that the softmax matrix, in which every word also has a vector representation, exhibits the same property. This led them to propose sharing the softmax and embedding matrices, a practice that is now common in language models.
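The architecture described above can be sketched as a toy model in plain Python. This is an illustration under stated assumptions, not the implementation from either paper: the class name `TiedLM` is hypothetical, the intermediate Transformer or LSTM layers are replaced by an identity step, and biases are omitted.

```python
import math
import random


class TiedLM:
    """Toy language model whose embedding and softmax layers share one matrix."""

    def __init__(self, vocab_size: int, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        # The single shared weight matrix, shape (vocab_size, d_model).
        self.weight = [
            [rng.gauss(0.0, 0.1) for _ in range(d_model)]
            for _ in range(vocab_size)
        ]

    def embed(self, token_id: int) -> list:
        # Embedding layer: look up row `token_id` of the shared matrix.
        return self.weight[token_id]

    def next_token_probs(self, hidden: list) -> list:
        # Softmax layer: the logit for each word is the dot product of the
        # hidden state with that word's row in the same shared matrix.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]
        max_logit = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - max_logit) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = TiedLM(vocab_size=5, d_model=4)
hidden = model.embed(2)  # stand-in for the output of the intermediate layers
probs = model.next_token_probs(hidden)
print(probs)
```

Because both layers read the same `weight` object, any update to a word's vector changes its input and output representation together. In frameworks such as PyTorch, the same effect is commonly obtained by assigning the embedding's weight parameter to the output projection.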

This method was independently introduced by [Press & Wolf, 2016](https://paperswithcode.com/paper/using-the-output-embedding-to-improve) and [Inan et al., 2016](https://paperswithcode.com/paper/tying-word-vectors-and-word-classifiers-a).

Additionally, the Press & Wolf paper proposes Three-way Weight Tying, a method for NMT models in which the embedding matrix for the source language, the embedding matrix for the target language, and the softmax matrix for the target language are all tied. That method has been adopted by the Transformer model of Attention Is All You Need and by many other neural machine translation models.
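A minimal sketch of three-way tying follows. It assumes that the source and target languages share one joint vocabulary (for example, a shared subword vocabulary), so that a single matrix can index tokens of both languages. The class `ThreeWayTiedNMT` and its method names are hypothetical, and the encoder and decoder are left out entirely.

```python
import random


class ThreeWayTiedNMT:
    """Toy NMT weights: source embedding, target embedding and target
    softmax matrix are all the same object."""

    def __init__(self, joint_vocab_size: int, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        # One matrix over the joint source/target vocabulary (an assumption).
        self.shared = [
            [rng.gauss(0.0, 0.1) for _ in range(d_model)]
            for _ in range(joint_vocab_size)
        ]

    def embed_source(self, token_id: int) -> list:
        return self.shared[token_id]

    def embed_target(self, token_id: int) -> list:
        return self.shared[token_id]

    def target_logits(self, decoder_state: list) -> list:
        # Unnormalized scores over the joint vocabulary, using the same rows.
        return [
            sum(w * h for w, h in zip(row, decoder_state))
            for row in self.shared
        ]


nmt = ThreeWayTiedNMT(joint_vocab_size=6, d_model=4)
logits = nmt.target_logits(nmt.embed_target(1))
print(len(logits))
```

With this layout the model stores one vocabulary-sized matrix where an untied NMT model would store three.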

Collection: Parameter Sharing

| Task | Papers | Share of papers |
| --- | --- | --- |
| Language Modelling | 23 | 18.11% |
| General Classification | 15 | 11.81% |
| Text Classification | 13 | 10.24% |
| Classification | 8 | 6.30% |
| Sentiment Analysis | 8 | 6.30% |
| Translation | 7 | 5.51% |
| Machine Translation | 5 | 3.94% |
| Language Identification | 4 | 3.15% |
| Hate Speech Detection | 3 | 2.36% |
