Electric is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.
Specifically, like BERT, Electric also models $p_{\text {data }}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right)$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\mathbf{x}=\left[x_{1}, \ldots, x_{n}\right]$ into contextualized vector representations $\mathbf{h}(\mathbf{x})=\left[\mathbf{h}_{1}, \ldots, \mathbf{h}_{n}\right]$ using a transformer network. The model assigns a given position $t$ an energy score
$$ E(\mathbf{x})_{t}=\mathbf{w}^{T} \mathbf{h}(\mathbf{x})_{t} $$
using a learned weight vector $\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as
$$ p_{\theta}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right)=\frac{\exp \left(-E(\mathbf{x})_{t}\right)}{Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)}=\frac{\exp \left(-E(\mathbf{x})_{t}\right)}{\sum_{x^{\prime} \in \mathcal{V}} \exp \left(-E\left(\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)\right)_{t}\right)} $$
where $\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)$ denotes replacing the token at position $t$ with $x^{\prime}$ and $\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike BERT, which produces probabilities for all possible tokens $x^{\prime}$ with a single softmax layer, Electric must pass each candidate $x^{\prime}$ into the transformer as input. As a result, computing $p_{\theta}$ exactly is prohibitively expensive: the partition function $Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)$ requires running the transformer $|\mathcal{V}|$ times. Unlike most EBMs, the intractability of $Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)$ is due more to the expensive scoring function than to a large sample space.
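To make this concrete, here is a minimal sketch (in PyTorch, not the authors' implementation) of the energy head $E(\mathbf{x})_{t}=\mathbf{w}^{T} \mathbf{h}(\mathbf{x})_{t}$ and of the brute-force normalization over REPLACE candidates. The `encoder` argument is a hypothetical stand-in for any token-level transformer, and the loop makes explicit why the exact distribution costs $|\mathcal{V}|$ forward passes:

```python
import torch
import torch.nn as nn


class EnergyHead(nn.Module):
    """Scalar energy E(x)_t = w^T h(x)_t for every position t."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # learned weight vector w; no bias, matching the formula above
        self.w = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] -> energies: [batch, seq_len]
        return self.w(hidden_states).squeeze(-1)


def cloze_distribution(encoder, energy_head, token_ids, t, vocab_size):
    """Brute-force p_theta(x_t | x_{-t}): one encoder pass per candidate token.

    `encoder` is a hypothetical transformer mapping [1, seq_len] token ids to
    [1, seq_len, hidden_size] contextualized vectors; `token_ids` is a 1-D
    LongTensor. Only feasible for toy vocabularies -- which is the point.
    """
    neg_energies = []
    for cand in range(vocab_size):                  # |V| transformer runs
        replaced = token_ids.clone()
        replaced[t] = cand                          # REPLACE(x, t, x')
        h = encoder(replaced.unsqueeze(0))          # re-encode the full sequence
        neg_energies.append(-energy_head(h)[0, t])  # -E(REPLACE(x, t, x'))_t
    neg_energies = torch.stack(neg_energies)
    # softmax over the negative energies = exp(-E) / Z(x_{-t})
    return torch.softmax(neg_energies, dim=0)
```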
Source: Pre-Training Transformers as Energy-Based Cloze Models