no code implementations • 23 Feb 2025 • Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha
The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks.
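A minimal numpy sketch of the next-token prediction (causal language modeling) loss the abstract refers to; the toy random embedding table and LM head are illustrative stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, d_model = 100, 8, 16

token_ids = rng.integers(0, vocab_size, size=seq_len)   # toy input sequence
embeddings = rng.normal(size=(vocab_size, d_model))      # stand-in embedding table
output_proj = rng.normal(size=(d_model, vocab_size))     # stand-in LM head

hidden = embeddings[token_ids]            # (seq_len, d_model)
logits = hidden @ output_proj             # (seq_len, vocab_size)

# Position t is trained to predict token t+1: drop the last logit row and the
# first target token, then average the negative log-likelihood.
logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
targets = token_ids[1:]
nll = -log_probs[np.arange(seq_len - 1), targets]
print("next-token prediction loss:", nll.mean())
```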
no code implementations • 2 Sep 2024 • Soumajyoti Sarkar, Leonard Lausen, Volkan Cevher, Sheng Zha, Thomas Brox, George Karypis
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
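A minimal sketch of the sparse top-k expert routing that makes SMoE layers scalable: each token activates only k of the experts. The expert count, k, and toy weights below are illustrative assumptions, not the configuration studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d_model, n_tokens = 4, 2, 16, 5

x = rng.normal(size=(n_tokens, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

gate_logits = x @ gate_w                              # (n_tokens, n_experts)
topk = np.argsort(gate_logits, axis=-1)[:, -k:]       # k best experts per token
out = np.zeros_like(x)
for t in range(n_tokens):
    sel = gate_logits[t, topk[t]]
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()                          # softmax over selected experts only
    for w, e in zip(weights, topk[t]):
        out[t] += w * (x[t] @ experts[e])             # only k experts run per token
print(out.shape)
```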
no code implementations • 21 Jun 2024 • Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha
Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models.
2 code implementations • 28 Feb 2024 • Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, George Karypis
The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection.
no code implementations • 27 Feb 2024 • Vyas Raina, Samson Tan, Volkan Cevher, Aditya Rawal, Sheng Zha, George Karypis
Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify.
no code implementations • 20 Nov 2023 • Zhiqi Bu, Justin Chiu, Ruixuan Liu, Sheng Zha, George Karypis
Deep learning with large models has achieved great success in a wide range of domains.
no code implementations • 30 Oct 2023 • Zhiqi Bu, Ruixuan Liu, Yu-Xiang Wang, Sheng Zha, George Karypis
Recent advances have substantially improved the accuracy, memory cost, and training speed of differentially private (DP) deep learning, especially on large vision and language models with millions to billions of parameters.
no code implementations • 19 Oct 2023 • Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao
Transformer models leverage the attention mechanism to capture long- and short-range dependencies in the sequence.
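A minimal numpy sketch of the scaled dot-product self-attention the abstract refers to: single head, toy dimensions, no masking; not the paper's specific long-range variant.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ w_q, x @ w_k, x @ w_v
scores = q @ k.T / np.sqrt(d_model)                    # pairwise token affinities
scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = attn @ v                                         # each token mixes information
print(out.shape)                                       # from every other token
```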
no code implementations • 2 Oct 2023 • Ruixuan Liu, Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, George Karypis
The success of large neural networks is crucially determined by the availability of data.
1 code implementation • NeurIPS 2023 • Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, George Karypis
Language models pretrained on large collections of tabular data have demonstrated their effectiveness in several downstream tasks.
1 code implementation • NeurIPS 2023 • Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis
We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs.
no code implementations • 1 Jun 2023 • Hengzhi Pei, Jinman Zhao, Leonard Lausen, Sheng Zha, George Karypis
To better solve this task, we query a program analyzer for information relevant to a given function call, and consider ways to provide the analyzer results to different code completion models during inference and training.
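A hedged sketch of one way analyzer output could be surfaced to a code completion model at inference time: prepend the retrieved signature and documentation as a comment block before the partial call site. The `analyzer_result` dictionary and prompt format are hypothetical, not the paper's exact interface.

```python
# Hypothetical prompt construction; the keys and layout are assumptions.
def build_prompt(partial_code: str, analyzer_result: dict) -> str:
    hints = [f"# {key}: {value}" for key, value in analyzer_result.items()]
    return "\n".join(hints) + "\n" + partial_code

analyzer_result = {
    "signature": "requests.get(url, params=None, **kwargs) -> Response",
    "doc": "Sends a GET request.",
}
partial_code = "resp = requests.get("
print(build_prompt(partial_code, analyzer_result))
```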
no code implementations • 8 Nov 2022 • Soumajyoti Sarkar, Kaixiang Lin, Sailik Sengupta, Leonard Lausen, Sheng Zha, Saab Mansour
While prior studies have tried to adapt these multilingual models for dialectal variants of Arabic, it remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data for such dialectal variants.
2 code implementations • 30 Sep 2022 • Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, George Karypis
Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at almost the same memory cost (<1% overhead), BK has 1.03X the time complexity of the standard training (0.83X training speed in practice), and 0.61X the time complexity of the most efficient DP implementation (1.36X training speed in practice).
2 code implementations • 30 Sep 2022 • Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, George Karypis
We study the problem of differentially private (DP) fine-tuning of large pre-trained models -- a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data.
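A minimal numpy sketch of the generic DP-SGD step (per-example gradient clipping plus calibrated Gaussian noise) that these DP papers build on; it is not the papers' book-keeping or bias-term-only methods, and the clip norm and noise multiplier below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 8, 10
per_example_grads = rng.normal(size=(batch, dim))
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.1        # illustrative values

norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads * np.minimum(1.0, clip_norm / norms)   # clip each example
noise = rng.normal(scale=noise_multiplier * clip_norm, size=dim)   # calibrated Gaussian noise
private_grad = (clipped.sum(axis=0) + noise) / batch

params = np.zeros(dim)
params -= lr * private_grad                            # one private update step
print(params)
```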
no code implementations • NAACL 2022 • Vishakh Padmakumar, Leonard Lausen, Miguel Ballesteros, Sheng Zha, He He, George Karypis
Recent work has found that multi-task training with a large number of diverse tasks can uniformly improve downstream performance on unseen target tasks.
1 code implementation • ACL 2022 • Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, He He
The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples.
1 code implementation • 24 Jun 2020 • Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li
Using the proposed LANS method and the learning rate scheme, we scaled up the mini-batch sizes to 96K and 33K in phases 1 and 2 of BERT pretraining, respectively.
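A simplified numpy illustration of the layer-wise trust-ratio rescaling idea behind LAMB/LANS-style large-batch optimizers: each layer's update is scaled by the ratio of its weight norm to its update norm so layers of very different magnitudes stay balanced. This is a conceptual sketch, not the full LANS update (which additionally normalizes gradients per block and uses Nesterov momentum).

```python
import numpy as np

rng = np.random.default_rng(0)
layers = {"embed": rng.normal(size=100) * 5.0, "head": rng.normal(size=20) * 0.1}
grads = {name: rng.normal(size=w.shape) for name, w in layers.items()}
base_lr = 0.01

for name, w in layers.items():
    update = grads[name]                                   # stand-in for the adaptive update
    trust = np.linalg.norm(w) / (np.linalg.norm(update) + 1e-6)
    w -= base_lr * trust * update                          # per-layer adaptive step
    print(name, "trust ratio:", round(float(trust), 3))
```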
1 code implementation • WS 2019 • He He, Sheng Zha, Haohan Wang
We first learn a biased model that only uses features that are known to relate to dataset bias.
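A hedged numpy sketch of a product-of-experts-style debiasing step in which the frozen biased model's log-probabilities are added to the main model's logits, so the training signal reaching the main model comes mostly from examples the biased model cannot already explain. This is illustrative only; the paper's exact residual-fitting formulation may differ.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

rng = np.random.default_rng(0)
n, classes = 4, 3
main_logits = rng.normal(size=(n, classes))
biased_logp = log_softmax(rng.normal(size=(n, classes)))    # frozen biased model
labels = rng.integers(0, classes, size=n)

ensemble_logp = log_softmax(main_logits + biased_logp)      # product of experts in log space
loss = -ensemble_logp[np.arange(n), labels].mean()          # gradients reach the main model only
print("ensemble loss:", loss)
```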
2 code implementations • 9 Jul 2019 • Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu
We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating).
2 code implementations • 26 Apr 2019 • Haibin Lin, Hang Zhang, Yifei Ma, Tong He, Zhi Zhang, Sheng Zha, Mu Li
One difficulty we observe is that the noise in the stochastic momentum estimation is accumulated over time and will have delayed effects when the batch size changes.
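A toy numpy simulation of this observation: gradient noise (variance roughly proportional to 1/batch size) accumulated in the momentum buffer persists for on the order of 1/(1-beta) steps after the batch size changes, so a sudden batch-size increase has a delayed effect. The numbers and schedule are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, true_grad = 0.9, 1.0
momentum_noisy, momentum_clean = 0.0, 0.0
batch_schedule = [8] * 50 + [256] * 50                 # batch size jumps at step 50

for step, bs in enumerate(batch_schedule):
    g = true_grad + rng.normal(scale=1.0 / np.sqrt(bs))   # noisier gradient at small batch
    momentum_noisy = beta * momentum_noisy + g
    momentum_clean = beta * momentum_clean + true_grad
    if step in (49, 52, 60, 99):
        gap = momentum_noisy - momentum_clean          # stale small-batch noise decays slowly
        print(f"step {step:3d}  batch {bs:3d}  noise in momentum {gap:+.3f}")
```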
no code implementations • ECCV 2018 • Yang Shi, Tommaso Furlanello, Sheng Zha, Animashree Anandkumar
Visual Question Answering (VQA) requires integrating feature maps with drastically different structures and focusing on the correct regions.