SimAdapter is a module that explicitly learns knowledge from adapters. During fine-tuning, it uses an attention mechanism over the adapters to learn the similarities between the source and target languages.
The detailed composition of SimAdapter is shown in the Figure. Taking the language-agnostic representations from the backbone model as the query, and the language-specific outputs from multiple adapters as the keys and values, the final SimAdapter output is computed via attention as follows (for notational simplicity, we omit the layer index $l$ below):
$$ \operatorname{SimAdapter}\left(\mathbf{z}, \mathbf{a}_{\left(S_{1}, S_{2}, \ldots, S_{N}\right)}\right)=\sum_{i=1}^{N} \operatorname{Attn}\left(\mathbf{z}, \mathbf{a}_{S_{i}}\right) \cdot\left(\mathbf{a}_{S_{i}} \mathbf{W}_{V}\right) $$
where $\operatorname{SimAdapter}(\cdot)$ and $\operatorname{Attn}(\cdot)$ denote the SimAdapter and attention operations, respectively. Specifically, the attention operation is computed as:
$$ \operatorname{Attn}(\mathbf{z}, \mathbf{a})=\operatorname{Softmax}\left(\frac{\left(\mathbf{z} \mathbf{W}_{Q}\right)\left(\mathbf{a} \mathbf{W}_{K}\right)^{\top}}{\tau}\right) $$
where $\tau$ is the temperature coefficient and $\mathbf{W}_{Q}, \mathbf{W}_{K}, \mathbf{W}_{V}$ are the attention matrices. Note that while $\mathbf{W}_{Q}$ and $\mathbf{W}_{K}$ are initialized randomly, $\mathbf{W}_{V}$ is initialized with ones on the diagonal and small weights ($10^{-6}$) elsewhere, so that the adapter representations are retained. Furthermore, a regularization term is introduced to avoid drastic feature changes:
$$ \mathcal{L}_{\mathrm{reg}}=\sum_{i, j}\left(\left(\mathbf{I}_{V}\right)_{i, j}-\left(\mathbf{W}_{V}\right)_{i, j}\right)^{2} $$
where $\mathbf{I}_{V}$ is the identity matrix with the same size as $\mathbf{W}_{V}$.
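The three formulas above can be illustrated with a minimal pure-Python sketch for a single time step. It is not the authors' implementation. It assumes that $\mathbf{z}$ and each $\mathbf{a}_{S_{i}}$ are row vectors of the same dimension, that the softmax is taken over the $N$ adapters, and that $\mathbf{W}_{Q}$ and $\mathbf{W}_{K}$ are drawn from a small Gaussian. The function names (`sim_adapter`, `init_value_matrix`, `reg_loss`) and the default values are hypothetical.

```python
import math
import random


def matvec(x, W):
    """Row vector times matrix, i.e. x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_random_matrix(d, rng, scale=0.1):
    """Random init for W_Q and W_K (Gaussian with this scale is an assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(d)]


def init_value_matrix(d, off_diag=1e-6):
    """W_V init: ones on the diagonal, small weights (1e-6) elsewhere."""
    return [[1.0 if i == j else off_diag for j in range(d)] for i in range(d)]


def sim_adapter(z, adapter_outputs, W_Q, W_K, W_V, tau=1.0):
    """Fuse N adapter outputs for one time step.

    z is the backbone representation (query); adapter_outputs holds the
    N language-specific vectors a_{S_i} (keys and values).
    Returns the fused vector and the attention weights over adapters.
    """
    q = matvec(z, W_Q)
    scores = []
    for a in adapter_outputs:
        k = matvec(a, W_K)
        scores.append(sum(qi * ki for qi, ki in zip(q, k)) / tau)
    weights = softmax(scores)  # assumed: normalized across the N adapters

    fused = [0.0] * len(W_V[0])
    for w, a in zip(weights, adapter_outputs):
        v = matvec(a, W_V)
        fused = [f + w * vi for f, vi in zip(fused, v)]
    return fused, weights


def reg_loss(W_V):
    """Sum of squared differences between the identity matrix and W_V."""
    return sum(
        ((1.0 if i == j else 0.0) - W_V[i][j]) ** 2
        for i in range(len(W_V))
        for j in range(len(W_V[0]))
    )


rng = random.Random(0)
d, n_adapters = 4, 3
z = [rng.gauss(0.0, 1.0) for _ in range(d)]
adapters = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_adapters)]
W_Q = init_random_matrix(d, rng)
W_K = init_random_matrix(d, rng)
W_V = init_value_matrix(d)

fused, weights = sim_adapter(z, adapters, W_Q, W_K, W_V, tau=1.0)
print("attention over adapters:", weights)
print("fused output:", fused)
print("regularization at init:", reg_loss(W_V))
```

Under these assumptions, the near-identity initialization of $\mathbf{W}_{V}$ means the fused output starts out as an attention-weighted average of the raw adapter outputs, and the regularization term starts close to zero and grows only as $\mathbf{W}_{V}$ drifts from the identity.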
Source: Exploiting Adapters for Cross-lingual Low-resource Speech Recognition
Task | Papers | Share |
---|---|---|
General Knowledge | 1 | 33.33% |
Meta-Learning | 1 | 33.33% |
Speech Recognition | 1 | 33.33% |