no code implementations • ACL 2021 • Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong
In this paper, we propose to efficiently increase the capacity for multilingual NMT by increasing the cardinality.
no code implementations • ACL 2021 • Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong, Meng Zhang
This has to be computed n times for a sequence of length n, and the linear transformations involved in the LSTM gate and state computations are the major cost factor.
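As a rough illustration (a generic sketch, not the paper's parallelized decoder), the fused gate projection below is the linear transformation that has to be evaluated sequentially, once per time step:

```python
# Minimal sketch of a plain LSTM step in PyTorch, highlighting that the
# gate/state linear transformations run once for each of the n time steps.
import torch
import torch.nn as nn


class LSTMStep(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # One fused linear map producing input/forget/output gates and the cell candidate.
        self.gates = nn.Linear(2 * d_model, 4 * d_model)

    def forward(self, x_t, h_prev, c_prev):
        # This linear transformation dominates the cost and cannot be
        # parallelized across time, since h_prev depends on the previous step.
        i, f, o, g = self.gates(torch.cat([x_t, h_prev], dim=-1)).chunk(4, dim=-1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t
```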
no code implementations • Findings (EMNLP) 2021 • Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
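For context, a minimal scaled dot-product attention sketch (generic, not the paper's specific model) shows why the mechanism parallelizes easily: all positions are processed by batched matrix products, with no step-by-step recurrence.

```python
# Minimal sketch: scaled dot-product attention over a whole sequence at once.
import math
import torch

def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim); every position is attended
    # to in one batched matrix product, so all time steps run in parallel.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v
```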
no code implementations • 13 Jul 2020 • Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong
Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance.
no code implementations • ACL 2020 • Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, Jingyi Zhang
Considering that modeling phrases instead of words significantly improved Statistical Machine Translation (SMT) through the use of larger translation blocks ("phrases") and phrase reordering, modeling NMT at the phrase level is an intuitive proposal to help the model capture long-distance relationships.
no code implementations • ACL 2020 • Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu
We propose to automatically and dynamically determine batch sizes by accumulating gradients of mini-batches and performing an optimization step just when the gradient direction starts to fluctuate.
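The sketch below only illustrates the general idea; the fluctuation test (cosine similarity between the accumulated gradient and the newest mini-batch gradient, with a hypothetical threshold) is an assumption for illustration, not necessarily the criterion used in the paper.

```python
# Illustrative sketch of dynamic gradient accumulation: keep accumulating
# mini-batch gradients and step once the newest gradient no longer agrees
# with the accumulated direction (assumed cosine-similarity test).
import torch

def train_with_dynamic_batching(model, optimizer, loss_fn, batches, threshold=0.0):
    accumulated = None  # running sum of mini-batch gradients (flattened)
    optimizer.zero_grad()
    for inputs, targets in batches:
        loss_fn(model(inputs), targets).backward()
        grad = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        # Gradient contributed by the newest mini-batch alone.
        step_grad = grad if accumulated is None else grad - accumulated
        if accumulated is not None and torch.cosine_similarity(
                accumulated, step_grad, dim=0) < threshold:
            # Direction starts to fluctuate: take an optimization step and reset.
            optimizer.step()
            optimizer.zero_grad()
            accumulated = None
        else:
            accumulated = grad
```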
no code implementations • NAACL 2021 • Hongfei Xu, Josef van Genabith, Qiuhui Liu, Deyi Xiong
Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently through probing-based approaches.
no code implementations • ACL 2020 • Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong, Jingyi Zhang
In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can significantly ease the optimization of deep Transformers.
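As a rough illustration of the two orderings (commonly called post-norm and pre-norm), assuming a standard feed-forward sub-layer:

```python
# Minimal sketch of the two residual / layer-normalization orderings for a
# Transformer sub-layer (illustrative; see the paper for the exact setup).
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

def post_norm(x):
    # Original Transformer order: add the residual, then normalize.
    return norm(x + ffn(x))

def pre_norm(x):
    # Reordered variant: normalize first, then add the residual; this change
    # of computation order eases the optimization of deep Transformers.
    return x + ffn(norm(x))
```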
no code implementations • WS 2019 • Hongfei Xu, Qiuhui Liu, Josef van Genabith
In this paper, we describe our submission to the English-German APE shared task at WMT 2019.
2 code implementations • 18 Mar 2019 • Hongfei Xu, Qiuhui Liu
The Transformer translation model is easier to parallelize and performs better than recurrent seq2seq models, which makes it popular in both industry and the research community.