no code implementations • 30 Sep 2024 • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, BoWen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, ZiRui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.
Ranked #55 on Visual Question Answering on MM-Vet
no code implementations • 29 Jul 2024 • Tom Gunter, ZiRui Wang, Chong Wang, Ruoming Pang, Aonan Zhang, BoWen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute.
1 code implementation • 23 May 2024 • Xianzhi Du, Tom Gunter, Xiang Kong, Mark Lee, ZiRui Wang, Aonan Zhang, Nan Du, Ruoming Pang
In this work, we revisit the settings by adopting step time as a more accurate measure of model complexity, and by determining the total compute budget under the Chinchilla compute-optimal settings.
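As a hedged illustration of this accounting, the sketch below contrasts a fixed compute budget under the common Chinchilla approximation (C ≈ 6·N·D) with wall-clock cost measured by step time; the helper names and example numbers are ours, not the paper's.

```python
# Illustrative sketch: compare models by measured step time under a fixed
# total compute budget, using the common Chinchilla-style C = 6 * N * D rule.
# All names and constants here are assumptions for illustration.

def chinchilla_budget(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via C = 6 * N * D."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(total_flops: float, n_params: float) -> float:
    """Tokens trainable under a fixed FLOP budget for a given model size."""
    return total_flops / (6.0 * n_params)

def wall_clock_cost(step_time_s: float, n_steps: int) -> float:
    """Wall-clock training cost: step time as the complexity measure."""
    return step_time_s * n_steps

# Example: a 1B-parameter model at the Chinchilla-optimal ~20x tokens.
budget = chinchilla_budget(1e9, 20e9)
print(f"budget: {budget:.3e} FLOPs")
print(f"tokens for a 2B model under the same budget: {tokens_for_budget(budget, 2e9):.3e}")
```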
no code implementations • 14 Mar 2024 • Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, BoWen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, ZiRui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons.
Ranked #72 on Visual Question Answering on MM-Vet
2 code implementations • 11 Oct 2023 • Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, BoWen Zhang, ZiRui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.
1 code implementation • 11 Oct 2023 • Zhengfeng Lai, Haotian Zhang, BoWen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao
For example, VeCLIP achieves up to +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting.
1 code implementation • 2 Oct 2023 • Ajay Jaiswal, Zhe Gan, Xianzhi Du, BoWen Zhang, Zhangyang Wang, Yinfei Yang
Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline.
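For orientation, here is a minimal sketch of training-free magnitude pruning at 50% unstructured sparsity, the style of compression the abstract references; it is illustrative only, not the paper's method.

```python
# Minimal sketch of training-free magnitude pruning at a given unstructured
# sparsity level. Illustrative only; not the paper's procedure.
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight.mul_(mask)

w = torch.randn(4096, 4096)
magnitude_prune_(w, sparsity=0.5)
print(f"sparsity: {(w == 0).float().mean().item():.2%}")
```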
2 code implementations • 29 Sep 2023 • Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan
Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
no code implementations • 8 Sep 2023 • Erik Daxberger, Floris Weers, BoWen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
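A toy top-1 sparse MoE layer conveys the mechanism of activating only part of the parameter space per token; the expert sizes and routing details below are assumptions, not the Mobile V-MoE design.

```python
# Toy sparse MoE layer with top-1 routing: each token activates a single
# expert, so compute scales with tokens rather than total parameters.
# Sizes and routing are illustrative assumptions.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); route each token to its single best expert.
        gates = self.router(x).softmax(dim=-1)   # (tokens, num_experts)
        weight, idx = gates.max(dim=-1)          # top-1 gate per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                # Only the selected expert runs for these tokens.
                out[sel] = weight[sel].unsqueeze(1) * expert(x[sel])
        return out

layer = SparseMoE(dim=64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```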
1 code implementation • 13 Jun 2023 • Wentao Wu, Aleksei Timofeev, Chen Chen, BoWen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang
Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image.
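A hedged sketch of that pipeline, using off-the-shelf spaCy NER and OpenAI CLIP as stand-ins; the paper's exact models, prompts, and thresholds may differ.

```python
# Sketch of the labeling pipeline: extract candidate entities from alt-text
# with an NER model, then keep the entities CLIP scores highest against the
# paired image. Model choices here are illustrative assumptions.
import spacy
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

nlp = spacy.load("en_core_web_sm")
model, preprocess = clip.load("ViT-B/32", device="cpu")

def entity_labels(image_path: str, alt_text: str, keep: int = 3) -> list[str]:
    entities = [ent.text for ent in nlp(alt_text).ents]
    if not entities:
        return []
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = clip.tokenize([f"a photo of {e}" for e in entities])
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        scores = logits_per_image.softmax(dim=-1).squeeze(0)
    ranked = sorted(zip(entities, scores.tolist()), key=lambda p: -p[1])
    return [e for e, _ in ranked[:keep]]
```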
1 code implementation • 8 May 2023 • Liangliang Cao, BoWen Zhang, Chen Chen, Yinfei Yang, Xianzhi Du, Wencong Zhang, Zhiyun Lu, Yantao Zheng
In this paper, we discuss two effective approaches to improve the efficiency and robustness of CLIP training: (1) augmenting the training dataset while maintaining the same number of optimization steps, and (2) filtering out samples that contain text regions in the image.
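A minimal sketch of approach (2), with pytesseract as an assumed OCR detector; the paper does not prescribe this tool.

```python
# Drop image-text pairs whose image contains detected text regions.
# pytesseract is an illustrative stand-in for a text detector.
import pytesseract
from PIL import Image

def has_text_region(image_path: str, min_conf: int = 60) -> bool:
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return any(
        int(conf) >= min_conf and word.strip()
        for conf, word in zip(data["conf"], data["text"])
    )

pairs = [("img0.jpg", "a dog on grass"), ("img1.jpg", "sale poster")]
filtered = [(img, txt) for img, txt in pairs if not has_text_region(img)]
```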
1 code implementation • ICCV 2023 • Tianlong Chen, Xuxi Chen, Xianzhi Du, Abdullah Rashwan, Fan Yang, Huizhong Chen, Zhangyang Wang, Yeqing Li
Instead of compressing multiple tasks' knowledge into a single model, MoE separates the parameter space and activates only the relevant model pieces given the task type and its input, which stabilizes MTL training and enables ultra-efficient inference.
no code implementations • 11 Aug 2022 • Xianzhi Du, Wei-Chih Hung, Tsung-Yi Lin
This paper summarizes model improvements and inference-time optimizations for popular anchor-based detectors in autonomous driving scenes.
1 code implementation • ICLR 2022 • Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou
The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; and 2) the tremendous computational cost of training ViTs, which is much heavier than that of their convolutional counterparts.
1 code implementation • 24 Feb 2022 • Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang
In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary of feature vectors.
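For context, the standard mini-batch contrastive (InfoNCE) loss below is the baseline objective whose batch-size dependence the paper analyzes; it is not the paper's proposed optimizer.

```python
# Standard symmetric InfoNCE loss over a mini-batch of paired embeddings.
# Negatives come only from the current batch, hence the usual need for
# large batches that this line of work targets.
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.1):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                      # (B, B) similarities
    labels = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```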
3 code implementations • 17 Dec 2021 • Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.
no code implementations • 14 Dec 2021 • Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
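A generic knowledge-distillation loss of the kind referenced above; the paper's exact distillation targets and gradient-masking schedule are not reproduced here.

```python
# Standard temperature-scaled KL distillation loss: the student matches the
# teacher's softened output distribution. Generic formulation, not the
# paper's specific recipe.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```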
5 code implementations • 3 Sep 2021 • Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello
A recent work by Bello shows that training and scaling strategies may matter more than model architecture for visual recognition.
Ranked #40 on Action Classification on Kinetics-600
1 code implementation • 30 Jun 2021 • Xianzhi Du, Barret Zoph, Wei-Chih Hung, Tsung-Yi Lin
We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
Ranked #57 on Object Detection on COCO minival
no code implementations • 23 Mar 2021 • Abdullah Rashwan, Xianzhi Du, Xiaoqi Yin, Jing Li
Scale-permuted networks have shown promising results on object bounding box detection and instance segmentation.
Ranked #5 on Semantic Segmentation on PASCAL VOC 2012 val
3 code implementations • NeurIPS 2021 • Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
no code implementations • ECCV 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc Le, Xiaodan Song
Furthermore, SpineNet is built with a uniform resource distribution over operations.
13 code implementations • CVPR 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Ranked #14 on Image Classification on iNaturalist
no code implementations • 11 Jun 2019 • Mostafa El-Khamy, Haoyu Ren, Xianzhi Du, Jungwon Lee
In this paper, we introduce the problem of estimating the real-world depth of elements in a scene captured by two cameras with different fields of view: the first field of view (FOV) is a wide FOV (WFOV) captured by a wide-angle lens, and the second FOV is contained within the first and captured by a tele zoom lens.
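For background, rectified two-camera depth follows the classic relation Z = f·B/d; the paper's asymmetric-FOV setting complicates this, but the sketch shows the baseline geometry.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic rectified-stereo relation: depth Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Example: 1000 px focal length, 10 cm baseline, 25 px disparity -> 4 m.
print(depth_from_disparity(1000.0, 0.10, 25.0))
```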
no code implementations • 19 Apr 2019 • Xianzhi Du, Mostafa El-Khamy, Jungwon Lee
A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume which allows for estimating the disparity with high accuracy at multiple scales.
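An illustrative stacked atrous (dilated) convolution block in the spirit of the abstract; channel counts and dilation rates are assumptions, not the paper's configuration.

```python
# Parallel atrous (dilated) convolutions at several rates aggregate context
# at multiple scales without losing resolution; branches are concatenated
# and fused with a 1x1 convolution. Sizes are illustrative.
import torch
import torch.nn as nn

class AtrousMultiscaleBlock(nn.Module):
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # padding == dilation keeps spatial size constant across branches.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

block = AtrousMultiscaleBlock(32)
print(block(torch.randn(1, 32, 64, 64)).shape)  # (1, 32, 64, 64)
```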
no code implementations • 2 May 2018 • Xianzhi Du, Mostafa El-Khamy, Vlad I. Morariu, Jungwon Lee, Larry Davis
The classification system further classifies the generated candidates based on the opinions of multiple deep verification networks and a fusion network, which uses a novel soft-rejection fusion method to adjust the confidence in the detection results.
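A hedged sketch of soft-rejection fusion: instead of hard-rejecting a candidate that a verification network scores low, each network rescales the detector's confidence; the threshold and floor below are illustrative constants, not necessarily the paper's.

```python
# Soft-rejection fusion sketch: verifier confidence above the acceptance
# threshold boosts the detection score; confidence below attenuates it,
# but never to zero (soft rather than hard rejection).
def soft_rejection_fuse(det_conf: float, verifier_probs: list[float],
                        accept_thresh: float = 0.7, floor: float = 0.1) -> float:
    fused = det_conf
    for p in verifier_probs:
        fused *= max(p / accept_thresh, floor)
    return fused

print(soft_rejection_fuse(0.9, [0.8, 0.5]))  # boosted by one verifier, attenuated by the other
```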
no code implementations • 22 Dec 2017 • Xianzhi Du, Xiaolong Wang, Dawei Li, Jingwen Zhu, Serafettin Tasci, Cameron Upright, Stephen Walsh, Larry Davis
Compared to the general semantic segmentation problem, portrait segmentation has a higher precision requirement in boundary areas.
no code implementations • 11 Oct 2016 • Xianzhi Du, Mostafa El-Khamy, Jungwon Lee, Larry S. Davis
A single-shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusion levels.
Ranked #21 on Pedestrian Detection on Caltech
no code implementations • CVPR 2015 • Xianzhi Du, David Doermann, Wael Abd-Almageed
In this paper, we present a novel partial signature matching method using graphical models.