Search Results for author: Xiyang Dai

Found 54 papers, 28 papers with code

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

no code implementations6 Jun 2024 Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang

The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer.

Language Modelling Large Language Model

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

1 code implementation22 Apr 2024 Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, ZiYi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

We introduce phi-3-mini, a 3. 8 billion parameter language model trained on 3. 3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3. 5 (e. g., phi-3-mini achieves 69% on MMLU and 8. 38 on MT-bench), despite being small enough to be deployed on a phone.

Language Modelling

Efficient Modulation for Vision Networks

1 code implementation29 Mar 2024 Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan

We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which is considered the essential building block for our networks.

Rewrite the Stars

2 code implementations CVPR 2024 Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, Yun Fu

Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design.

Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search

1 code implementation15 Mar 2024 Hongyuan Yu, Cheng Wan, Mengchen Liu, Dongdong Chen, Bin Xiao, Xiyang Dai

Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly overhead in memory to maintain high resolution.

Autonomous Driving Image Segmentation +2

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

no code implementations CVPR 2024 Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.

Multi-Task Learning object-detection +1

On the Hidden Waves of Image

no code implementations19 Oct 2023 Yinpeng Chen, Dongdong Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Zicheng Liu, Youzuo Lin

We term this phenomenon hidden waves, as it reveals that, although the speeds of the set of wave equations and autoregressive coefficient matrices are latent, they are both learnable and shared across images.

LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following

1 code implementation18 Oct 2023 Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, Kai-Wei Chang

Additional analysis shows that the contrastive objective and meta-actions are complementary in achieving the best results, and the resulting agent better aligns its states with corresponding instructions, making it more suitable for real-world embodied agents.

Contrastive Learning Instruction Following

Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical Invariance

no code implementations25 May 2023 Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Youzuo Lin

This paper introduces a novel mathematical property applicable to diverse images, referred to as FINOLA (First-Order Norm+Linear Autoregressive).

Decoder Image Classification +4

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

no code implementations27 Apr 2023 Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios.

Video Understanding

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

no code implementations CVPR 2023 Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Chuanxin Tang, Xiyang Dai, Yucheng Zhao, Yujia Xie, Lu Yuan, Yu-Gang Jiang

Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.

Instance Segmentation Segmentation +3

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

4 code implementations CVPR 2023 Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang

For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.

Action Classification Representation Learning +1

Self-Supervised Learning based on Heat Equation

no code implementations23 Nov 2022 Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Youzuo Lin

When transferring to object detection with frozen backbone, QB-Heat outperforms MoCo-v2 and supervised pre-training on ImageNet by 7. 9 and 4. 5 AP respectively.

Image Classification object-detection +2

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

no code implementations25 Aug 2022 Rui Wang, Zuxuan Wu, Dongdong Chen, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Luowei Zhou, Lu Yuan, Yu-Gang Jiang

To avoid significant computational cost incurred by computing self-attention between the large number of local patches in videos, we propose to use very few global tokens (e. g., 6) for a whole video in Transformers to exchange information with 3D-CNNs with a cross-attention mechanism.

Video Recognition

Should All Proposals be Treated Equally in Object Detection?

1 code implementation7 Jul 2022 Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Pei Yu, Jing Yin, Lu Yuan, Zicheng Liu, Nuno Vasconcelos

We formulate this as a learning problem where the goal is to assign operators to proposals, in the detection head, so that the total computational cost is constrained and the precision is maximized.

Object Object Detection

GLIPv2: Unifying Localization and Vision-Language Understanding

1 code implementation12 Jun 2022 Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e. g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e. g., VQA, image captioning).

 Ranked #1 on Phrase Grounding on Flickr30k Entities Test (using extra training data)

Contrastive Learning Image Captioning +7

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

no code implementations3 Jun 2022 Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, Michael Zeng

Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e. g., image tags, object attributes / locations, captions) as a structured textual prompt, called visual clues, using a vision foundation model.

Image Paragraph Captioning Language Modelling +1

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code implementations22 Apr 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Question Answering Visual Commonsense Reasoning +2

Residual Mixture of Experts

no code implementations20 Apr 2022 Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, Lu Yuan

In this paper, we propose Residual Mixture of Experts (RMoE), an efficient training pipeline for MoE vision transformers on downstream tasks, such as segmentation and detection.

object-detection Object Detection

Focal Modulation Networks

6 code implementations22 Mar 2022 Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2. 4, and beats Swin at multi-scale (50. 5 v. s.

Ranked #8 on Object Detection on COCO minival (using extra training data)

Image Classification Object Detection +2

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code implementations15 Jan 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51. 9%) and domain-shifted (up to 71. 3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-art performance on VCR compared to other single models that are pretrained with image-text data only.

Question Answering Visual Commonsense Reasoning +2

RegionCLIP: Region-based Language-Image Pretraining

1 code implementation CVPR 2022 Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.

Ranked #12 on Open Vocabulary Object Detection on MSCOCO (using extra training data)

Image Classification Object +3

BEVT: BERT Pretraining of Video Transformers

1 code implementation CVPR 2022 Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan

This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often times computationally-intensive if trained from scratch; 2) discriminative clues, i. e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations.

Action Recognition Representation Learning

Focal Attention for Long-Range Interactions in Vision Transformers

1 code implementation NeurIPS 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal attention, we propose a new variant of Vision Transformer models, called Focal Transformers, which achieve superior performance over the state-of-the-art (SoTA) Vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification object-detection +2

Florence: A New Foundation Model for Computer Vision

1 code implementation22 Nov 2021 Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.

Action Classification Action Recognition In Videos +12

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

no code implementations19 Nov 2021 JianFeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang

In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e. g., image or language) or multimodal inputs (e. g., the concatenation of the image and the question), for vision-language (VL) representation learning.

Image Captioning Image-text matching +9

MicroNet: Improving Image Recognition with Extremely Low FLOPs

1 code implementation ICCV 2021 Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, Nuno Vasconcelos

This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e. g. 5M FLOPs on ImageNet classification).

Focal Self-attention for Local-Global Interactions in Vision Transformers

3 code implementations1 Jul 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification Instance Segmentation +3

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

3 code implementations ICCV 2021 Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques.

Image Classification Instance Segmentation +2

CvT: Introducing Convolutions to Vision Transformers

14 code implementations ICCV 2021 Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.

Ranked #3 on Image Classification on Flowers-102 (using extra training data)

Image Classification

Revisiting Dynamic Convolution via Matrix Decomposition

1 code implementation ICLR 2021 Yunsheng Li, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Ye Yu, Lu Yuan, Zicheng Liu, Mei Chen, Nuno Vasconcelos

It has two limitations: (a) it increases the number of convolutional weights by K-times, and (b) the joint optimization of dynamic attention and static convolution kernels is challenging.

Dimensionality Reduction

Stronger NAS with Weaker Predictors

1 code implementation NeurIPS 2021 Junru Wu, Xiyang Dai, Dongdong Chen, Yinpeng Chen, Mengchen Liu, Ye Yu, Zhangyang Wang, Zicheng Liu, Mei Chen, Lu Yuan

We propose a paradigm shift from fitting the whole architecture space using one strong predictor, to progressively fitting a search path towards the high-performance sub-space through a set of weaker predictors.

Neural Architecture Search

Dynamic DETR: End-to-End Object Detection With Dynamic Attention

no code implementations ICCV 2021 Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang

To mitigate the second limitation of learning difficulty, we introduce a dynamic decoder by replacing the cross-attention module with a ROI-based dynamic attention in the Transformer decoder.

Decoder object-detection +1

Weak NAS Predictor Is All You Need

no code implementations1 Jan 2021 Junru Wu, Xiyang Dai, Dongdong Chen, Yinpeng Chen, Mengchen Liu, Ye Yu, Zhangyang Wang, Zicheng Liu, Mei Chen, Lu Yuan

Rather than expecting a single strong predictor to model the whole space, we seek a progressive line of weak predictors that can connect a path to the best architecture, thus greatly simplifying the learning task of each predictor.

Neural Architecture Search

MicroNet: Towards Image Recognition with Extremely Low FLOPs

no code implementations24 Nov 2020 Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, Nuno Vasconcelos

In this paper, we present MicroNet, which is an efficient convolutional neural network using extremely low computational cost (e. g. 6 MFLOPs on ImageNet classification).

DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search

no code implementations ECCV 2020 Xiyang Dai, Dong-Dong Chen, Mengchen Liu, Yinpeng Chen, Lu Yuan

One common way is searching on a smaller proxy dataset (e. g., CIFAR-10) and then transferring to the target task (e. g., ImageNet).

Neural Architecture Search

Dynamic ReLU

2 code implementations ECCV 2020 Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dong-Dong Chen, Lu Yuan, Zicheng Liu

Rectified linear units (ReLU) are commonly used in deep neural networks.

Dynamic Convolution: Attention over Convolution Kernels

5 code implementations CVPR 2020 Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dong-Dong Chen, Lu Yuan, Zicheng Liu

Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and the width (number of channels) of CNNs, resulting in limited representation capability.

Image Classification Keypoint Detection

TAN: Temporal Aggregation Network for Dense Multi-label Action Recognition

no code implementations14 Dec 2018 Xiyang Dai, Bharat Singh, Joe Yue-Hei Ng, Larry S. Davis

We present Temporal Aggregation Network (TAN) which decomposes 3D convolutions into spatial and temporal aggregation blocks.

Action Recognition Temporal Action Localization

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

no code implementations CVPR 2019 Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, Larry S. Davis

In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network.

Moment Retrieval Natural Language Moment Retrieval +1

Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection

no code implementations7 Aug 2018 Da Zhang, Xiyang Dai, Yuan-Fang Wang

(3) We further exploit the temporal context of activities by appropriately fusing multi-scale feature maps, and demonstrate that both local and global temporal contexts are important.

Action Detection Activity Detection +2

S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Networks

1 code implementation21 Jul 2018 Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang

In this paper, we present a novel Single Shot multi-Span Detector for temporal activity detection in long, untrimmed videos using a simple end-to-end fully three-dimensional convolutional (Conv3D) network.

Action Detection Activity Detection

Deep Motion Boundary Detection

no code implementations13 Apr 2018 Xiaoqing Yin, Xiyang Dai, Xinchao Wang, Maojun Zhang, DaCheng Tao, Larry Davis

In this paper, we propose the first dedicated end-to-end deep learning approach for motion boundary detection, which we term as MoBoNet.

Boundary Detection Optical Flow Estimation

Temporal Context Network for Activity Localization in Videos

no code implementations ICCV 2017 Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S. Davis, Yan Qiu Chen

For each temporal segment inside a proposal, features are uniformly sampled at a pair of scales and are input to a temporal convolutional neural network for classification.

General Classification Temporal Localization

FASON: First and Second Order Information Fusion Network for Texture Recognition

no code implementations CVPR 2017 Xiyang Dai, Joe Yue-Hei Ng, Larry S. Davis

We then build a multi-level deep architecture to exploit the first and second order information within different convolutional layers.

Cannot find the paper you are looking for? You can Submit a new open access paper.