Search Results for author: Xiaozhe Ren

Found 8 papers, 4 papers with code

One Student Knows All Experts Know: From Sparse to Dense

no code implementations • 26 Jan 2022 • Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You

We investigate this task by proposing a general training framework comprising knowledge gathering and knowledge distillation.

Knowledge Distillation
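
The snippet for this paper describes a knowledge-gathering plus knowledge-distillation pipeline for training a dense student. As an illustration only, and not the paper's actual framework, the following minimal PyTorch sketch shows a standard distillation loss such a student could be trained with; the temperature and alpha weighting are assumed hyperparameters.

```python
# Minimal knowledge-distillation sketch (illustrative only; not the exact
# "knowledge gathering + knowledge distillation" framework of the paper).
# A dense student is trained to match a teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-target KL loss and hard-label cross-entropy."""
    # Soften both distributions with the temperature, then measure KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random tensors standing in for model outputs.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```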

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

no code implementations • 1 Nov 2021 • Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You

Deep learning has achieved promising results on a wide spectrum of AI applications.

EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

1 code implementation • Findings (EMNLP) 2021 • Chenhe Dong, Guangrun Wang, Hang Xu, Jiefeng Peng, Xiaozhe Ren, Xiaodan Liang

In this paper, we present a critical insight: improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA), since the computational cost of the FFN is 2~3 times larger than that of the MHA.

Data Augmentation • Knowledge Distillation
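
To make the FFN-versus-MHA cost claim in the entry above concrete, here is a rough back-of-the-envelope MAC count for a single Transformer layer under assumed BERT-base dimensions (d_model = 768, d_ff = 3072, sequence length 128). The exact 2~3x ratio quoted in the abstract depends on how attention-map matmuls and sequence length are accounted for, so treat this as a sketch rather than the paper's own accounting.

```python
# Back-of-the-envelope per-token MAC count for one Transformer layer,
# counting only dense matrix multiplications (an assumed simplification;
# the paper's 2~3x figure may use a different accounting).
d_model, d_ff, seq_len = 768, 3072, 128  # assumed BERT-base settings

# MHA: Q/K/V/output projections, plus the QK^T and attention-times-V matmuls.
mha_proj = 4 * d_model * d_model          # four projection matrices
mha_attn = 2 * seq_len * d_model          # score and context matmuls, per token
mha_total = mha_proj + mha_attn

# FFN: two dense layers, d_model -> d_ff -> d_model.
ffn_total = 2 * d_model * d_ff

print(f"MHA MACs/token: {mha_total:,}")
print(f"FFN MACs/token: {ffn_total:,}")
print(f"FFN / MHA ratio: {ffn_total / mha_total:.2f}")
```

Under these assumptions the FFN already dominates the per-layer cost; counting only the four MHA projections (and ignoring the attention-map matmuls) pushes the ratio toward the range quoted in the abstract.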

NumGPT: Improving Numeracy Ability of Generative Pre-trained Models

no code implementations • 7 Sep 2021 • Zhihua Jin, Xin Jiang, Xingbo Wang, Qun Liu, Yong Wang, Xiaozhe Ren, Huamin Qu

However, those models do not consider the numerical properties of numbers and cannot perform robustly on numerical reasoning tasks (e.g., math word problems and measurement estimation).

AutoBERT-Zero: Evolving BERT Backbone from Scratch

no code implementations • 15 Jul 2021 • Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, Philip L. H. Yu, Xiaodan Liang, Xin Jiang, Zhenguo Li

Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance in various natural language processing (NLP) tasks.

Language Modelling • Neural Architecture Search

SparseBERT: Rethinking the Importance Analysis in Self-attention

1 code implementation • 25 Feb 2021 • Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok

A surprising result is that the diagonal elements of the attention map are the least important of all attention positions.
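
The observation above suggests that token-to-itself (diagonal) attention can be dropped with little loss. The following minimal PyTorch sketch masks the diagonal of a scaled dot-product attention map; it is an assumption-based demo for illustration, not SparseBERT's actual sparsification scheme.

```python
# Minimal sketch of dropping the diagonal of a self-attention map,
# illustrating the "diagonal elements are least important" observation.
# Not SparseBERT's exact masking scheme; an illustrative assumption only.
import torch
import torch.nn.functional as F

def attention_without_diagonal(q, k, v):
    """Scaled dot-product attention with self-to-self (diagonal) positions masked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    seq_len = scores.size(-1)
    diag = torch.eye(seq_len, dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(diag, float("-inf"))     # remove token-to-itself attention
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch of 2, sequence length 16, head dimension 64.
q = k = v = torch.randn(2, 16, 64)
out = attention_without_diagonal(q, k, v)
```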

NEZHA: Neural Contextualized Representation for Chinese Language Understanding

2 code implementations • 31 Aug 2019 • Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu

Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora.

Named Entity Recognition • Natural Language Inference • +3
