2 code implementations • 5 Dec 2024 • Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy.
Ranked #5 on
Video Question Answering
on NExT-QA
no code implementations • 5 Dec 2024 • An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, Xiaolong Wang
This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to command but also allows the robot to navigate through more challenging and cluttered scenes.
no code implementations • 19 Nov 2024 • Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myronenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yee Man Law, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Stephen Aylward, Marc Edgar, Michael Zephyr, Pavlo Molchanov, Baris Turkbey, Holger Roth, Daguang Xu
In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models.
no code implementations • 28 Oct 2024 • Shih-Yang Liu, Huck Yang, Chien-Yi Wang, Nai Chit Fung, Hongxu Yin, Charbel Sakr, Saurav Muralidharan, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e. g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats.
1 code implementation • 26 Sep 2024 • Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang
This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks.
1 code implementation • 6 Sep 2024 • Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation.
2 code implementations • 28 Aug 2024 • Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs).
1 code implementation • 19 Aug 2024 • Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing.
Ranked #8 on
Video Question Answering
on NExT-QA
no code implementations • 24 Jul 2024 • Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin
In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and then retrains from scratch leveraging refined data.
Ranked #63 on
Visual Question Answering
on MM-Vet
no code implementations • 11 Jun 2024 • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov
Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical.
no code implementations • 6 Jun 2024 • Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jose M. Alvarez
First, a novel feature regularization (FeatReg) to retain and refine knowledge from existing checkpoints; Second, we propose adaptive knowledge distillation (AdaKD), a novel approach to forget mitigation and knowledge transfer.
1 code implementation • 3 Jun 2024 • An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks.
no code implementations • 29 May 2024 • Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin
We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
1 code implementation • 27 Mar 2024 • De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task.
no code implementations • CVPR 2024 • Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu
Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions.
5 code implementations • 14 Feb 2024 • Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen
By employing \ours, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead.
Ranked #2 on
parameter-efficient fine-tuning
on WinoGrande
(using extra training data)
3 code implementations • CVPR 2024 • Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
Visual language models (VLMs) rapidly progressed with the recent success of large language models.
Ranked #4 on
Zero-Shot Video Question Answer
on MSVD-QA
no code implementations • 2 Oct 2023 • Jingwei Sun, Ziyue Xu, Hongxu Yin, Dong Yang, Daguang Xu, Yiran Chen, Holger R. Roth
However, applying FL to finetune PLMs is hampered by challenges, including restricted model parameter access, high computational requirements, and communication overheads.
no code implementations • 29 Aug 2023 • Yazhou Xing, Amrita Mazumdar, Anjul Patney, Chao Liu, Hongxu Yin, Qifeng Chen, Jan Kautz, Iuri Frosio
We present a learning-based system to reduce these artifacts without resorting to complex acquisition mechanisms like alternating exposures or costly processing that are typical of high dynamic range (HDR) imaging.
no code implementations • 25 Jun 2023 • Anna Bair, Hongxu Yin, Maying Shen, Pavlo Molchanov, Jose Alvarez
Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world.
no code implementations • CVPR 2023 • Divyam Madaan, Hongxu Yin, Wonmin Byeon, Jan Kautz, Pavlo Molchanov
We propose a novel framework and a solution to tackle the continual learning (CL) problem with changing network architectures.
2 code implementations • 9 Jun 2023 • Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
At a high level, global self-attentions enable the efficient cross-window communication at lower costs.
Ranked #188 on
Image Classification
on ImageNet
1 code implementation • CVPR 2023 • Paul Micaelli, Arash Vahdat, Hongxu Yin, Jan Kautz, Pavlo Molchanov
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the challenging WFLW facial landmark dataset, reaching $3. 92$ NME with fewer parameters and a training memory cost of $\mathcal{O}(1)$ in the number of recurrent modules.
Ranked #2 on
Face Alignment
on WFLW
1 code implementation • 13 Oct 2022 • Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez
We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on targeting device.
8 code implementations • 20 Jun 2022 • Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov
Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation using MS COCO and ADE20K datasets outperform prior work consistently.
Ranked #135 on
Semantic Segmentation
on ADE20K
no code implementations • CVPR 2022 • Ali Hatamizadeh, Hongxu Yin, Holger Roth, Wenqi Li, Jan Kautz, Daguang Xu, Pavlo Molchanov
In this work we demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks.
no code implementations • 14 Feb 2022 • Ali Hatamizadeh, Hongxu Yin, Pavlo Molchanov, Andriy Myronenko, Wenqi Li, Prerna Dogra, Andrew Feng, Mona G. Flores, Jan Kautz, Daguang Xu, Holger R. Roth
Federated learning (FL) allows the collaborative training of AI models without needing to share raw data.
1 code implementation • CVPR 2022 • Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov
A-ViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
Ranked #34 on
Efficient ViTs
on ImageNet-1K (with DeiT-S)
no code implementations • CVPR 2022 • Maying Shen, Pavlo Molchanov, Hongxu Yin, Jose M. Alvarez
Through extensive experiments on ImageNet, we show that EPI empowers a quick tracking of early training epochs suitable for pruning, offering same efficacy as an otherwise ``oracle'' grid-search that scans through epochs and requires orders of magnitude more compute.
1 code implementation • 20 Oct 2021 • Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez
We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget.
1 code implementation • CVPR 2023 • Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, Jan Kautz
This work aims on challenging the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage, where we redistribute the parameters both across transformer blocks and between different structures within the block via the first systematic attempt on global structural pruning.
no code implementations • 29 Sep 2021 • Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, Arash Vahdat
In the second phase, it solves the combinatorial selection of efficient operations using a novel constrained integer linear optimization approach.
no code implementations • 13 Jul 2021 • Xin Dong, Hongxu Yin, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov, H. T. Kung
Prior works usually assume that SC offers privacy benefits as only intermediate features, instead of private data, are shared from devices to the cloud.
no code implementations • 12 Jul 2021 • Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, Arash Vahdat
We analyze three popular network architectures: EfficientNetV1, EfficientNetV2 and ResNeST, and achieve accuracy improvement for all models (up to $3. 0\%$) when compressing larger models to the latency level of smaller models.
no code implementations • CVPR 2021 • Yerlan Idelbayev, Pavlo Molchanov, Maying Shen, Hongxu Yin, Miguel A. Carreira-Perpinan, Jose M. Alvarez
We study the problem of quantizing N sorted, scalar datapoints with a fixed codebook containing K entries that are allowed to be rescaled.
1 code implementation • CVPR 2021 • Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
In this work, we introduce GradInversion, using which input images from a larger batch (8 - 48 images) can also be recovered for large networks such as ResNets (50 layers), on complex datasets such as ImageNet (1000 classes, 224x224 px).
no code implementations • 20 Feb 2021 • Shayan Hassantabar, Joe Zhang, Hongxu Yin, Niraj K. Jha
At the patient level, MHDeep DNNs achieve an accuracy of 100%, 100%, and 90. 0% for the three mental health disorders, respectively.
no code implementations • 29 Jul 2020 • Wenhan Xia, Hongxu Yin, Xiaoliang Dai, Niraj K. Jha
Modern deep neural networks are powerful and widely applicable models that extract task-relevant information through multi-level abstraction.
no code implementations • 18 Apr 2020 • Wenhan Xia, Hongxu Yin, Niraj K. Jha
These large, deep models are often unsuitable for real-world applications, due to their massive computational cost, high memory bandwidth, and long latency.
2 code implementations • CVPR 2020 • Hongxu Yin, Pavlo Molchanov, Zhizhong Li, Jose M. Alvarez, Arun Mallya, Derek Hoiem, Niraj K. Jha, Jan Kautz
We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network.
no code implementations • 11 Oct 2019 • Hongxu Yin, Bilal Mukadam, Xiaoliang Dai, Niraj K. Jha
For server (edge) side inference, we achieve a 96. 3% (95. 3%) accuracy in classifying diabetics against healthy individuals, and a 95. 7% (94. 6%) accuracy in distinguishing among type-1/type-2 diabetic, and healthy individuals.
no code implementations • 27 May 2019 • Xiaoliang Dai, Hongxu Yin, Niraj K. Jha
Deep neural networks (DNNs) have become a widely deployed model for numerous machine learning applications.
no code implementations • 30 Jan 2019 • Hongxu Yin, Guoyang Chen, Yingmin Li, Shuai Che, Weifeng Zhang, Niraj K. Jha
In this work, we propose a hardware-guided symbiotic training methodology for compact, accurate, yet execution-efficient inference models.
1 code implementation • CVPR 2019 • Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, Niraj K. Jha
We formulate platform-aware NN architecture search in an optimization framework and propose a novel algorithm to search for optimal architectures aided by efficient accuracy and resource (latency and/or energy) predictors.
no code implementations • 30 May 2018 • Xiaoliang Dai, Hongxu Yin, Niraj K. Jha
To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to LSTM's original one level non-linear control gates.
no code implementations • 6 Nov 2017 • Xiaoliang Dai, Hongxu Yin, Niraj K. Jha
To address these problems, we introduce a network growth algorithm that complements network pruning to learn both weights and compact DNN architectures during training.