1 code implementation • 27 Nov 2024 • Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format.
no code implementations • 20 Nov 2024 • Zichen Wen, Dadi Guo, Huishuai Zhang
As large language models (LLMs) rapidly advance and integrate into daily life, the privacy risks they pose are attracting increasing attention.
1 code implementation • 2 Sep 2024 • Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao
To analyze image representations while avoiding the influence of all factors other than the image representation itself, we propose a parameter-free representation alignment metric (Pfram) that can measure the similarity between any two representation systems without requiring additional training parameters.
1 code implementation • 28 Aug 2024 • Danlong Yuan, Jiahao Liu, Bei Li, Huishuai Zhang, Jingang Wang, Xunliang Cai, Dongyan Zhao
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models.
no code implementations • 27 Aug 2024 • Haowei Du, Huishuai Zhang, Dongyan Zhao
To address hallucination in generative question answering (GQA), where the answer cannot be derived from the document, we propose EATQA, a novel evidence-enhanced triplet generation framework that encourages the model to predict all combinations of the (Question, Evidence, Answer) triplet by flipping the source pair and the target label to learn their logical relationships, i.e., predicting the Answer (A), Question (Q), and Evidence (E) given the QE, EA, and QA pairs, respectively.
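As a reading aid, the sketch below builds the three prediction directions described above (A from QE, Q from EA, E from QA) as plain source/target pairs; the prompt templates and field layout are hypothetical, not taken from EATQA.

```python
# Illustrative sketch only: turns one (question, evidence, answer) triple into the
# three seq2seq training directions mentioned above. Templates are hypothetical.

def build_triplet_examples(question: str, evidence: str, answer: str):
    return [
        {"source": f"question: {question} evidence: {evidence}", "target": answer},    # QE -> A
        {"source": f"evidence: {evidence} answer: {answer}",     "target": question},  # EA -> Q
        {"source": f"question: {question} answer: {answer}",     "target": evidence},  # QA -> E
    ]

examples = build_triplet_examples(
    question="Who wrote the report?",
    evidence="The report was authored by the audit team in 2021.",
    answer="The audit team",
)
```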
no code implementations • 9 Jul 2024 • Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan
More interestingly, with a fixed parameter budget, MoM-large enables an over 38% increase in the depth of computation graphs compared to GPT-2-large, resulting in absolute gains of 1.4 on GLUE and 1 on XSUM.
no code implementations • 21 Jun 2024 • Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution.
1 code implementation • 26 May 2024 • Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
Recent AI systems based on large language models (LLMs) have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation.
1 code implementation • 22 May 2024 • Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, Dongyan Zhao
This paper introduces xRAG, an innovative context compression method tailored for retrieval-augmented generation.
no code implementations • 18 Apr 2024 • Chao Zhou, Huishuai Zhang, Jiang Bian, Weiming Zhang, Nenghai Yu
To mitigate this, we propose the ©Plug-in Authorization framework, introducing three operations: addition, extraction, and combination.
no code implementations • 22 Mar 2024 • Bohan Wang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Wei Chen
This paper aims to clearly distinguish between Stochastic Gradient Descent with Momentum (SGDM) and Adam in terms of their convergence rates.
2 code implementations • 4 Mar 2024 • Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models.
no code implementations • 14 Dec 2023 • Kai Qiu, Huishuai Zhang, Zhirong Wu, Stephen Lin
However, model robustness, a critical aspect of safety, is often optimized for each specific task rather than at the pretraining stage.
no code implementations • 25 Nov 2023 • Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun
Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive.
1 code implementation • NeurIPS 2023 • Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian
This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization.
no code implementations • 27 Oct 2023 • Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, Wei Chen
Recently, Arjevani et al. [1] established a lower bound on the iteration complexity of first-order optimization under an $L$-smoothness condition and a bounded noise variance assumption.
no code implementations • 9 Jul 2023 • Zihao Jiang, Yunkai Dang, Dong Pang, Huishuai Zhang, Weiran Huang
Few-shot learning aims to train models that can generalize to novel classes with only a few samples.
no code implementations • 15 Jun 2023 • Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng
When comparing SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD reach almost the same empirical training loss; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD.
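For context, one common convention relates the effective learning rate of SGDM to its step size and momentum coefficient as written below; this is an assumption for illustration and may not match the paper's exact definition of $\eta_{ef}$.

```latex
% A common convention (assumed here; the paper may define \eta_{ef} differently):
% with step size \eta and momentum coefficient \beta,
m_t = \beta\, m_{t-1} + g_t, \qquad
\theta_{t+1} = \theta_t - \eta\, m_t, \qquad
\eta_{ef} = \frac{\eta}{1-\beta}.
```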
1 code implementation • 3 Jun 2023 • Hangting Ye, Zhining Liu, Xinyi Shen, Wei Cao, Shun Zheng, Xiaofan Gui, Huishuai Zhang, Yi Chang, Jiang Bian
This is a challenging task given the heterogeneous model structures and assumptions adopted by existing UAD methods.
no code implementations • 29 May 2023 • Bohan Wang, Huishuai Zhang, Zhi-Ming Ma, Wei Chen
We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions.
1 code implementation • 23 May 2023 • Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz Religa, Jian Yin, Huishuai Zhang
In this work, we show that careful pre-training on a subset of the public dataset, guided by the private dataset, is crucial for training small language models with differential privacy.
1 code implementation • 28 Apr 2023 • Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan
In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN and inherits their advantages while avoiding their limitations.
no code implementations • 3 Dec 2022 • Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, Jiang Bian
To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization.
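A minimal sketch of the per-layer clipping idea, assuming a microbatch-of-one loop for per-example gradients; the noise calibration, privacy accounting, and the interleaving with backpropagation described in the paper are omitted.

```python
# Illustrative per-layer clipping for DP-style training. Thresholds, noise scale,
# and the microbatch-of-one loop are simplifying assumptions, not the paper's code.
import torch

def dp_step_per_layer(model, loss_fn, xs, ys, optimizer, clip_per_layer=1.0, noise_mult=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):  # per-example gradients via microbatches of size one
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        for acc, p in zip(summed, params):
            scale = torch.clamp(clip_per_layer / (p.grad.norm() + 1e-12), max=1.0)
            acc.add_(p.grad * scale)  # clip this example's gradient, layer by layer

    model.zero_grad()
    for p, acc in zip(params, summed):
        noise = noise_mult * clip_per_layer * torch.randn_like(acc)
        p.grad = (acc + noise) / len(xs)  # noisy average of clipped per-example gradients
    optimizer.step()
```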
1 code implementation • 10 Oct 2022 • Quanlin Wu, Hang Ye, Yuntian Gu, Huishuai Zhang, LiWei Wang, Di He
In this paper, we propose a new self-supervised method, which is called Denoising Masked AutoEncoders (DMAE), for learning certified robust classifiers of images.
no code implementations • 21 Aug 2022 • Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, Wei Chen
We present the first convergence analysis of RR Adam without the bounded smoothness assumption.
no code implementations • 27 Jun 2022 • Xiaodong Yang, Huishuai Zhang, Wei Chen, Tie-Yan Liu
By ensuring differential privacy in the learning algorithms, one can rigorously mitigate the risk of large models memorizing sensitive training data.
no code implementations • 9 Jun 2022 • Huishuai Zhang, Da Yu, Yiping Lu, Di He
Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks.
1 code implementation • 6 Jun 2022 • Da Yu, Gautam Kamath, Janardhan Kulkarni, Tie-Yan Liu, Jian Yin, Huishuai Zhang
Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning.
no code implementations • 22 May 2022 • Jingwei Yi, Fangzhao Wu, Huishuai Zhang, Bin Zhu, Tao Qi, Guangzhong Sun, Xing Xie
Federated learning (FL) enables multiple clients to collaboratively train models without sharing their local data, and has become an important privacy-preserving machine learning framework.
1 code implementation • 1 Nov 2021 • Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
We are the first to unveil an important population property of the perturbations of these attacks: they are almost linearly separable when assigned the target labels of the corresponding samples, and hence can serve as shortcuts for the learning objective.
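One quick way to probe this property is sketched below, assuming the clean samples, their poisoned counterparts, and the target labels are available as arrays; the check (fit a linear classifier on the perturbations alone) is a generic diagnostic, not the paper's code.

```python
# Sketch of a linear-separability check on attack perturbations (illustrative only).
import numpy as np
from sklearn.svm import LinearSVC

def linear_separability_score(poisoned: np.ndarray, clean: np.ndarray, labels: np.ndarray) -> float:
    deltas = (poisoned - clean).reshape(len(labels), -1)   # per-sample perturbations
    clf = LinearSVC(C=1.0, max_iter=10000).fit(deltas, labels)
    return clf.score(deltas, labels)   # near 1.0 => perturbations are (almost) linearly separable
```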
no code implementations • NeurIPS 2021 • Bohan Wang, Huishuai Zhang, Jieyu Zhang, Qi Meng, Wei Chen, Tie-Yan Liu
We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance when both the prior and the posterior are jointly optimized.
2 code implementations • ICLR 2022 • Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang
For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $\epsilon = 6.7$.
no code implementations • 8 Oct 2021 • Bohan Wang, Qi Meng, Huishuai Zhang, Ruoyu Sun, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
The momentum acceleration technique is widely adopted in many optimization algorithms.
no code implementations • 29 Sep 2021 • Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama
Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
no code implementations • 29 Sep 2021 • Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan Liu
In contextual bandits, one major challenge is to develop theoretically solid and empirically efficient algorithms for general function classes.
no code implementations • 29 Jun 2021 • Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan Liu
However, it is in general unknown how to derive efficient and effective EE trade-off methods for non-linear, complex tasks, such as contextual bandits with a deep neural network as the reward function.
1 code implementation • 17 Jun 2021 • Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
We propose a reparametrization scheme to address the challenges of applying differentially private SGD to large neural networks, namely 1) the huge memory cost of storing individual gradients and 2) the added noise suffering from a notorious dimensional dependence.
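The generic low-rank reparametrization below conveys the memory-saving idea (only small factors carry gradients); the paper's actual gradient-carrier construction is more involved, so treat the names and shapes here as assumptions.

```python
# Generic low-rank reparametrization sketch (illustrative only; not the paper's
# gradient-carrier design). Only the small factors U and V receive gradients.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.frozen = nn.Linear(d_in, d_out, bias=False)
        self.frozen.weight.requires_grad_(False)             # large matrix: no per-example grads stored
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.V = nn.Parameter(torch.zeros(rank, d_in))        # small factors: the only trainable params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.frozen(x) + x @ self.V.t() @ self.U.t()  # W_frozen x + (U V) x
```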
1 code implementation • CVPR 2022 • Tianyu Pang, Huishuai Zhang, Di He, Yinpeng Dong, Hang Su, Wei Chen, Jun Zhu, Tie-Yan Liu
Along with this routine, we find that confidence and a rectified confidence (R-Con) can form two coupled rejection metrics, which could provably distinguish wrongly classified inputs from correctly classified ones.
2 code implementations • ICLR 2021 • Da Yu, Huishuai Zhang, Wei Chen, Tie-Yan Liu
The privacy leakage of a model about its training data can be bounded via the differential privacy mechanism.
no code implementations • 8 Jan 2021 • Mingyang Yi, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
However, it has been pointed out that the usual definitions of sharpness, which consider either the maximum or the integral of the loss over a $\delta$-ball of parameters around minima, cannot give a consistent measurement for scale-invariant neural networks, e.g., networks with batch normalization layers.
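For reference, two common formalizations of sharpness over a $\delta$-ball are written out below; the paper's exact definitions may differ in constants and normalization.

```latex
% Two common sharpness formalizations around a minimum \theta^* (for reference only;
% the paper's precise definitions may differ):
S_{\max}(\theta^*) = \max_{\|\epsilon\| \le \delta} \big( L(\theta^* + \epsilon) - L(\theta^*) \big),
\qquad
S_{\mathrm{avg}}(\theta^*) = \frac{1}{\mathrm{Vol}(B_\delta)} \int_{\|\epsilon\| \le \delta}
  \big( L(\theta^* + \epsilon) - L(\theta^*) \big)\, d\epsilon .
```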
no code implementations • 1 Jan 2021 • Huishuai Zhang, Da Yu, Wei Chen, Tie-Yan Liu
More importantly, we propose a new design, "STAM aggregation", that is guaranteed to STAbilize the forward/backward process of Multi-branch networks irrespective of the number of branches.
1 code implementation • 21 Jul 2020 • Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu
Furthermore, we show that the proposed approach can achieve higher MI attack success rates on models trained with certain data augmentations than existing methods achieve on models trained without data augmentation.
1 code implementation • 29 Jun 2020 • Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama
Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and minima selection.
9 code implementations • ICML 2020 • Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Li-Wei Wang, Tie-Yan Liu
This motivates us to remove the warm-up stage for the training of Pre-LN Transformers.
no code implementations • 26 Nov 2019 • Da Yu, Huishuai Zhang, Wei Chen, Tie-Yan Liu, Jian Yin
By using the expected curvature, we show that gradient perturbation can achieve a significantly improved utility guarantee, which theoretically justifies the advantage of gradient perturbation over other perturbation methods.
no code implementations • 25 Sep 2019 • Mingyang Yi, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
It has been widely shown that adversarial training (Madry et al., 2018) is empirically effective at defending against adversarial attacks.
no code implementations • 25 Sep 2019 • Huishuai Zhang, Da Yu, Mingyang Yi, Wei Chen, Tie-Yan Liu
We show that for the standard initialization used in practice, $\tau = 1/\Omega(\sqrt{L})$ is a sharp value for characterizing the stability of the forward/backward process of ResNet, where $L$ is the number of residual blocks.
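A minimal sketch of a residual block with the scaling factor $\tau$ applied to the residual branch, taking $\tau = 1/\sqrt{L}$ up to constants; the layer shapes and block body are placeholders, not the paper's architecture.

```python
# Illustrative residual block with a scaled residual branch (tau = 1/sqrt(L) up to constants).
import math
import torch.nn as nn

class ScaledResBlock(nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.tau = 1.0 / math.sqrt(num_blocks)   # L = total number of residual blocks
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.tau * self.body(x)        # scale the residual branch, keep the identity path
```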
no code implementations • 29 May 2019 • Shicong Cen, Huishuai Zhang, Yuejie Chi, Wei Chen, Tie-Yan Liu
Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary.
no code implementations • ICLR 2019 • Qi Meng, Shuxin Zheng, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
Then, a natural question is: can we construct a new vector space that is positively scale-invariant and sufficient to represent ReLU neural networks, so as to better facilitate the optimization process?
no code implementations • ICLR 2019 • Mingyang Yi, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
Optimization on manifolds has been widely used in machine learning to handle optimization problems with constraints.
1 code implementation • 17 Mar 2019 • Huishuai Zhang, Da Yu, Mingyang Yi, Wei Chen, Tie-Yan Liu
Moreover, for ResNets with normalization layers, adding such a factor $\tau$ also stabilizes training and obtains significant performance gains for deep ResNets.
no code implementations • ICLR 2019 • Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, Vahid Tarokh
Stochastic gradient descent (SGD) has been found to be surprisingly effective in training a variety of deep neural networks.
no code implementations • NeurIPS 2018 • Huishuai Zhang, Wei Chen, Tie-Yan Liu
We study the Hessian of the local back-matching loss (local Hessian) and connect it to the efficiency of BP.
no code implementations • 19 Sep 2018 • Shuxin Zheng, Qi Meng, Huishuai Zhang, Wei Chen, Nenghai Yu, Tie-Yan Liu
Motivated by this, we propose a new norm, the Basis-path Norm, based on a group of linearly independent paths, to measure the capacity of neural networks more accurately.
no code implementations • 27 Feb 2018 • Huishuai Zhang, Wei Chen, Tie-Yan Liu
This inconsistency of gradient magnitude across different layers makes optimizing a deep neural network with a single learning rate problematic.
no code implementations • 19 Feb 2018 • Yi Zhou, Yingbin Liang, Huishuai Zhang
With strongly convex regularizers, we further establish generalization error bounds for nonconvex loss functions under proximal SGD with a high-probability guarantee, i.e., exponential concentration in probability.
no code implementations • 11 Feb 2018 • Qi Meng, Shuxin Zheng, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu
Then, a natural question is: can we construct a new vector space that is positively scale-invariant and sufficient to represent ReLU neural networks, so as to better facilitate the optimization process?
no code implementations • ICLR 2018 • Huishuai Zhang, Caiming Xiong, James Bradbury, Richard Socher
Second-order methods for neural network optimization have several advantages over methods based on first-order gradient descent, including better scaling to large mini-batch sizes and fewer updates needed for convergence.
no code implementations • 23 Sep 2017 • Yuanxin Li, Yuejie Chi, Huishuai Zhang, Yingbin Liang
Recent work has demonstrated the effectiveness of gradient descent for directly recovering the factors of low-rank matrices from random linear measurements in a globally convergent manner when initialized properly.
no code implementations • NeurIPS 2016 • Huishuai Zhang, Yingbin Liang
In contrast to the smooth loss function used in WF, we adopt a nonsmooth but lower-order loss function, and design a gradient-like algorithm (referred to as reshaped-WF).
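For reference, the lower-order amplitude-based loss alluded to above is written below alongside the smooth intensity-based WF loss, up to constants and notation; see the paper for the precise form and the measurement model $y_i = |a_i^\top x|$.

```latex
% Amplitude-based (nonsmooth, lower-order) loss vs. intensity-based WF loss,
% written up to constants and notation for reference:
\ell_{\mathrm{RWF}}(z) = \frac{1}{2m}\sum_{i=1}^{m}\big(|a_i^{\top} z| - y_i\big)^2,
\qquad
\ell_{\mathrm{WF}}(z) = \frac{1}{4m}\sum_{i=1}^{m}\big(|a_i^{\top} z|^2 - y_i^2\big)^2 .
```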
1 code implementation • 25 May 2016 • Huishuai Zhang, Yi Zhou, Yingbin Liang, Yuejie Chi
We further develop the incremental (stochastic) reshaped Wirtinger flow (IRWF) and show that IRWF converges linearly to the true signal.
no code implementations • 11 Mar 2016 • Huishuai Zhang, Yuejie Chi, Yingbin Liang
This paper investigates the phase retrieval problem, which aims to recover a signal from the magnitudes of its linear measurements.
no code implementations • NeurIPS 2015 • Huishuai Zhang, Yi Zhou, Yingbin Liang
We investigate the robust PCA problem of decomposing an observed matrix into the sum of a low-rank matrix and a sparse error matrix via the convex program Principal Component Pursuit (PCP).