no code implementations • 25 Oct 2024 • Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren
In this paper, we propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory.
no code implementations • 19 Oct 2024 • Hao Wu, Donglin Bai, Shiqi Jiang, Qianxi Zhang, Yifan Yang, Ting Cao, Fengyuan Xu
Video understanding has become increasingly important with the rise of multimodal applications.
no code implementations • 17 Oct 2024 • Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang
Attention is the cornerstone of modern Large Language Models (LLMs).
1 code implementation • 25 Sep 2024 • Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits).
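To make the idea concrete, here is a minimal sketch of post-training vector quantization of a weight matrix at roughly 2 bits per weight; the vector length, codebook size, and plain k-means fitting are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def vq_weights(W, vec_len=4, bits_per_weight=2, iters=10, seed=0):
    """Toy vector quantization of a weight matrix.

    Stores one codebook index per length-`vec_len` vector; with
    2**(vec_len * bits_per_weight) centroids this averages
    `bits_per_weight` bits per weight (ignoring codebook storage).
    """
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, vec_len)
    k = 2 ** (vec_len * bits_per_weight)          # e.g. 256 centroids
    codebook = vecs[rng.choice(len(vecs), k, replace=False)].copy()
    for _ in range(iters):                        # plain k-means
        dists = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1)
        idx = dists.argmin(1)
        for c in range(k):
            members = idx == c
            if members.any():
                codebook[c] = vecs[members].mean(0)
    return codebook, idx

W = np.random.randn(256, 256).astype(np.float32)
codebook, idx = vq_weights(W)
W_hat = codebook[idx].reshape(W.shape)            # dequantized weights
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```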
no code implementations • 12 Aug 2024 • Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
LUT Tensor Core then proposes a hardware design featuring an elongated tiling shape to enhance table reuse and a bit-serial design to support various precision combinations in mpGEMM.
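As a software analogue of the idea, the toy sketch below performs LUT-based mpGEMM for floating-point activations times 1-bit (±1) weights: each activation chunk's dot products with all sign patterns are precomputed once, so the inner loop is table lookups and adds. The chunk size `g` and the 1-bit weight format are assumptions for illustration, not the hardware design itself.

```python
import numpy as np

def lut_mpgemm_1bit(A, Wb, g=4):
    """Toy LUT-based mpGEMM: fp activations x {-1,+1} 1-bit weights."""
    M, K = A.shape
    K2, N = Wb.shape
    assert K == K2 and K % g == 0
    # all 2**g sign patterns, shape (2**g, g), entries in {-1, +1}
    patterns = 1 - 2 * ((np.arange(2 ** g)[:, None] >> np.arange(g)) & 1)
    # tables[m, c, p] = dot(A[m, chunk c], pattern p) -- built once per A
    chunks = A.reshape(M, K // g, g)
    tables = np.einsum('mcg,pg->mcp', chunks, patterns)
    # encode each weight chunk as a pattern index (sign bit per weight)
    wbits = (Wb.reshape(K // g, g, N) < 0).astype(np.int64)
    codes = (wbits * (1 << np.arange(g))[None, :, None]).sum(1)
    # inner loop: gather from the table and accumulate, no multiplies
    out = np.zeros((M, N), dtype=A.dtype)
    for c in range(K // g):
        out += tables[:, c, codes[c]]
    return out

A = np.random.randn(2, 16).astype(np.float32)
Wb = np.sign(np.random.randn(16, 8)).astype(np.float32)
print(np.allclose(lut_mpgemm_1bit(A, Wb), A @ Wb, atol=1e-4))
```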
no code implementations • 25 Jul 2024 • Shenghong Dai, Shiqi Jiang, Yifan Yang, Ting Cao, Mo Li, Suman Banerjee, Lili Qiu
To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies.
1 code implementation • 25 Jun 2024 • Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
Weight quantization is crucial for reducing the memory footprint of LLMs on devices.
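As a back-of-the-envelope illustration of why this matters, the snippet below computes the weight-only footprint of a hypothetical 7B-parameter model at several precisions (ignoring activations, the KV cache, and quantization metadata such as scales):

```python
# Weight-only memory footprint of a hypothetical 7B-parameter model
# at different precisions: bytes = params * bits / 8.
params = 7e9
for bits in (16, 8, 4, 2):
    print(f"{bits:2d}-bit: {params * bits / 8 / 2**30:6.2f} GiB")
# 16-bit ~ 13 GiB exceeds most phones' RAM; 2-bit ~ 1.6 GiB fits.
```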
2 code implementations • 16 Feb 2024 • Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu
The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges.
no code implementations • 8 Feb 2024 • QiPeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu
The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively.
1 code implementation • 3 Nov 2023 • Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting Cao, Ningyi Xu
Large language models (LLMs) show great performance in various tasks, but face deployment challenges from limited memory capacity and bandwidth.
no code implementations • 16 Sep 2023 • Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang
Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent.
1 code implementation • 23 Aug 2023 • Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang
To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales model size without proportionally scaling up computational requirements.
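For readers unfamiliar with MoE, here is a minimal top-k routing sketch showing why compute stays roughly flat as experts are added: only the selected experts run per token. The router, gating, and expert definitions are illustrative assumptions, not the paper's system.

```python
import numpy as np

def moe_layer(x, experts, router_W, top_k=2):
    """Minimal MoE forward pass: route each token to its top_k experts."""
    logits = x @ router_W                        # (tokens, n_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]
    # softmax over the selected experts' logits only
    sel = np.take_along_axis(logits, top, 1)
    gates = np.exp(sel - sel.max(1, keepdims=True))
    gates /= gates.sum(1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                  # per-token dispatch
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * experts[e](x[t])
    return out

d, n_experts = 8, 4
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(np.random.randn(d, d))
           for _ in range(n_experts)]
x = np.random.randn(3, d)
router_W = np.random.randn(d, n_experts)
print(moe_layer(x, experts, router_W).shape)     # (3, 8)
```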
1 code implementation • 26 Jun 2023 • Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang
Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length.
no code implementations • 31 May 2023 • Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu
In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.
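A generic knowledge-distillation objective of the kind such pipelines use, sketched here under the assumption of a temperature-softened KL term blended with hard-label cross-entropy (the paper's exact loss for the Conformer ASR model may differ):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft KL term pulls the pruned student toward the dense teacher;
    the hard cross-entropy term keeps it anchored to the labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                    # rescale to keep gradient magnitudes
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```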
no code implementations • 31 May 2023 • Yijia Zhang, Yibo Han, Shijie Cao, Guohao Dai, Youshan Miao, Ting Cao, Fan Yang, Ningyi Xu
We find that previous gradient accumulation reduces activation memory but cannot be combined with gradient-memory reduction, because accumulation must preserve gradients across micro-batches while the reduction requires releasing them.
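To see the contradiction, consider plain gradient accumulation: activations are freed after each micro-batch's backward pass, but the accumulated `.grad` buffers must persist until the optimizer step, so gradient memory is never released. A minimal PyTorch sketch of this baseline:

```python
import torch

def train_step(model, optimizer, loss_fn, micro_batches):
    """Plain gradient accumulation over several micro-batches."""
    optimizer.zero_grad()
    n = len(micro_batches)
    for x, y in micro_batches:
        loss = loss_fn(model(x), y) / n   # average over micro-batches
        loss.backward()                   # activations freed here, but
                                          # grads accumulate in .grad
    optimizer.step()                      # grads releasable only now
```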
no code implementations • 21 May 2023 • Yijia Zhang, Lingran Zhao, Shijie Cao, WenQiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu
In this study, we conduct a comparative analysis of INT and FP quantization at the same bit-width, revealing that the optimal quantization format varies across layers due to the complexity and diversity of tensor distributions.
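A toy experiment conveys the intuition: a uniform INT4 grid and a non-uniform FP4 (E2M1-style) grid incur different errors depending on the tensor's distribution. The per-tensor scaling and round-to-nearest scheme below are simplifying assumptions, not the paper's methodology.

```python
import numpy as np

def int4_error(x):
    """Symmetric INT4 round-to-nearest, one scale per tensor."""
    s = np.abs(x).max() / 7.0                   # int4 range: [-8, 7]
    q = np.clip(np.round(x / s), -8, 7)
    return ((x - q * s) ** 2).mean()

def fp4_error(x):
    """FP4 (E2M1-style) round-to-nearest over its 16-value grid."""
    grid = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])   # positive values
    grid = np.concatenate([-grid[::-1], grid])
    s = np.abs(x).max() / 6.0                   # map tensor max to fp4 max
    q = grid[np.abs(x[:, None] / s - grid[None]).argmin(1)]
    return ((x - q * s) ** 2).mean()

gaussian = np.random.randn(4096)
outliers = gaussian * np.random.choice([1, 8], 4096, p=[0.99, 0.01])
for name, t in [("gaussian", gaussian), ("outlier-heavy", outliers)]:
    print(name, "INT4:", int4_error(t), "FP4:", fp4_error(t))
# The non-uniform FP4 grid tends to fare better on outlier-heavy tensors.
```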
1 code implementation • ICCV 2023 • Chen Tang, Li Lyna Zhang, Huiqiang Jiang, Jiahang Xu, Ting Cao, Quanlu Zhang, Yuqing Yang, Zhi Wang, Mao Yang
However, prior supernet training methods that rely on uniform sampling suffer from the gradient conflict issue: the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G FLOPs), leading to different optimization directions and inferior performance.
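The conflict can be made tangible with a toy weight-sharing experiment: slice two subnets of very different width from one shared weight matrix and compare their gradients on the shared slice. The widths, loss, and data here are arbitrary illustrations, not the paper's setup.

```python
import torch

# Two width-sliced subnets share the first columns of one weight
# matrix; their gradients on that shared slice can point in
# different directions, which is the gradient conflict in miniature.
torch.manual_seed(0)
W = torch.randn(64, 64, requires_grad=True)
x = torch.randn(32, 64)
y = torch.randn(32, 64)

def subnet_loss(width):
    out = x[:, :width] @ W[:width, :width]      # width-sliced subnet
    return ((out - y[:, :width]) ** 2).mean()

grads = []
for width in (8, 64):                           # small vs. large subnet
    W.grad = None
    subnet_loss(width).backward()
    grads.append(W.grad[:8, :8].flatten().clone())  # shared slice
cos = torch.nn.functional.cosine_similarity(grads[0], grads[1], dim=0)
print(f"cosine similarity of shared-weight gradients: {cos:.3f}")
```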
1 code implementation • ICCV 2023 • Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, Mao Yang
The combination of Neural Architecture Search (NAS) and quantization has proven successful in automatically designing low-FLOPs INT8 quantized neural networks (QNNs).
no code implementations • 7 Feb 2023 • Xiaohu Tang, Yang Wang, Ting Cao, Li Lyna Zhang, Qi Chen, Deng Cai, Yunxin Liu, Mao Yang
On-device Deep Neural Network (DNN) inference consumes significant computing resources and development efforts.
no code implementations • 30 Aug 2022 • Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen
Remarkably, under our latency requirement of 1900us on CPU, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset.
no code implementations • 29 Jun 2022 • Yan Lu, Shiqi Jiang, Ting Cao, Yuanchao Shu
Edge computing is being widely used for video analytics.
1 code implementation • 15 Jun 2022 • Rongjie Yi, Ting Cao, Ao Zhou, Xiao Ma, Shangguang Wang, Mengwei Xu
DNNs are ubiquitous on edge devices nowadays.
no code implementations • 30 Nov 2021 • Ting Cao, Mohammad Ali Armin, Simon Denman, Lars Petersson, David Ahmedt-Aristizabal
Medical applications have benefited greatly from the rapid advancement in computer vision.
no code implementations • 8 Feb 2021 • Fangzhou Zhao, Ting Cao, Steven G. Louie
Graphene nanoribbons (GNRs) possess distinct symmetry-protected topological phases.