Search Results for author: Zuxuan Wu

Found 100 papers, 47 papers with code

Adaptive Retention & Correction for Continual Learning

no code implementations • 23 May 2024 • Haoran Chen, Micah Goldblum, Zuxuan Wu, Yu-Gang Jiang

A common problem in continual learning is the classification layer's bias towards the most recent task.

PoseAnimate: Zero-shot high fidelity pose controllable character animation

no code implementations • 21 Apr 2024 • Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, Guo-Jun Qi

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity with the source image. However, existing approaches suffer from character appearance inconsistency and poor preservation of fine details.

Learning to Rank Patches for Unbiased Image Redundancy Reduction

1 code implementation • 31 Mar 2024 • Yang Luo, Zhineng Chen, Peng Zhou, Zuxuan Wu, Xieping Gao, Yu-Gang Jiang

The results demonstrate that LTRP outperforms both supervised and other self-supervised methods due to the fair assessment of image content.

Image Reconstruction Inductive Bias +1

OmniVid: A Generative Framework for Universal Video Understanding

1 code implementation • 26 Mar 2024 • Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.

Action Recognition Decoder +5

FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model

no code implementations • 15 Mar 2024 • Qijun Feng, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang

Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available.

3D Reconstruction

MouSi: Poly-Visual-Expert Vision-Language Models

1 code implementation • 30 Jan 2024 • Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs.

Ranked #43 on Visual Question Answering on MM-Vet

Image Segmentation Image-text matching +4

Secrets of RLHF in Large Language Models Part II: Reward Modeling

1 code implementation • 11 Jan 2024 • Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data.

Contrastive Learning Meta-Learning +1

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

no code implementations • 4 Dec 2023 • Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu

Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection.

3D Object Detection Decoder +4

MotionEditor: Editing Video Motion via Content-Aware Diffusion

1 code implementation • 30 Nov 2023 • Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance.

Video Editing

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

no code implementations • 30 Nov 2023 • Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang

Our model can edit and translate the desired results within seconds based on user instructions.

Semantic Segmentation Video Editing +3

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

1 code implementation • 30 Nov 2023 • Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance.

Attribute Compositional Zero-Shot Learning

VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model

1 code implementation • 29 Nov 2023 • Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang

Identity-consistent video generation seeks to synthesize videos that are guided by both textual prompts and reference images of entities.

Ranked #1 on Video Generation on MSR-VTT

Denoising Image to Video Generation +1

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

1 code implementation • 24 Nov 2023 • Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang

In-context segmentation aims to segment novel images using a few labeled example images, termed "in-context examples", by exploring content similarities between the examples and the target.

Meta-Learning One-Shot Segmentation +3

AdaDiff: Adaptive Step Selection for Fast Diffusion

no code implementations • 24 Nov 2023 • Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang

Diffusion models, a type of generative model, have achieved impressive results in generating images and videos conditioned on text.

Denoising Image Generation +1

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

2 code implementations • 13 Nov 2023 • Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data.

Ranked #36 on Visual Question Answering on MM-Vet

Instruction Following Visual Question Answering

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

no code implementations • 25 Oct 2023 • Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu

In this way, temporal consistency is preserved by the video LDM while the high fidelity of the image LDM is also exploited.

Denoising Video Editing

A Survey on Video Diffusion Models

1 code implementation • 16 Oct 2023 • Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews on their application in the video domain.

Image Generation Video Editing +2

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

1 code implementation • 8 Oct 2023 • Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang

Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.

Action Recognition Continual Learning +5

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

no code implementations • 7 Sep 2023 • Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu

Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process.

Action Recognition Decoder +4

SimDA: Simple Diffusion Adapter for Efficient Video Generation

no code implementations • 18 Aug 2023 • Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang

In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.

Transfer Learning Video Editing +2

On the Importance of Spatial Relations for Few-shot Action Recognition

no code implementations • 14 Aug 2023 • Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information.

Few-Shot action recognition Few Shot Action Recognition +1

Prompting Large Language Models to Reformulate Queries for Moment Localization

no code implementations • 6 Jun 2023 • Wenfeng Yan, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang

The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query.

Moment Queries Natural Language Queries

BMB: Balanced Memory Bank for Imbalanced Semi-supervised Learning

no code implementations • 22 May 2023 • Wujian Peng, Zejia Weng, Hengduo Li, Zuxuan Wu

Exploring a substantial amount of unlabeled data, semi-supervised learning (SSL) boosts the recognition performance when only a limited number of labels are provided.

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

no code implementations • 27 Apr 2023 • Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios.

Video Understanding

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

1 code implementation • ICCV 2023 • Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang

While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already captures enough of the essence without temporal attention.

Ranked #16 on Action Classification on Kinetics-400

Action Classification Action Recognition +1

Towards Scalable Neural Representation for Diverse Videos

no code implementations • CVPR 2023 • Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava

Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV).

Action Recognition Video Compression

OmniTracker: Unifying Object Tracking by Tracking-with-Detection

no code implementations • 21 Mar 2023 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang

Object tracking (OT) aims to estimate the positions of target objects in a video sequence.

Object Object Tracking

DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection

1 code implementation • 15 Mar 2023 • Hui Zhang, Zheng Wang, Zuxuan Wu, Yu-Gang Jiang

Anomaly detection has garnered extensive applications in real industrial manufacturing due to its remarkable effectiveness and efficiency.

Ranked #1 on Anomaly Detection on VisA

Denoising Unsupervised Anomaly Detection

PromptFusion: Decoupling Stability and Plasticity for Continual Learning

no code implementations • 13 Mar 2023 • Haoran Chen, Zuxuan Wu, Xintong Han, Menglin Jia, Yu-Gang Jiang

Such a trade-off is referred to as the stability-plasticity dilemma and is a more general and challenging problem for continual learning.

Class Incremental Learning Incremental Learning

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

1 code implementation • 1 Feb 2023 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization.

Action Recognition Continual Learning +2

Vision Transformers Are Good Mask Auto-Labelers

no code implementations • CVPR 2023 • Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.

Instance Segmentation Segmentation +1

Resolving Task Confusion in Dynamic Expansion Architectures for Class Incremental Learning

1 code implementation • 29 Dec 2022 • Bingchen Huang, Zhineng Chen, Peng Zhou, Jiayin Chen, Zuxuan Wu

The dynamic expansion architecture is becoming popular in class incremental learning, mainly due to its advantages in alleviating catastrophic forgetting.

Ranked #1 on Incremental Learning on CIFAR-100 - 50 classes + 2 steps of 25 classes

Class Incremental Learning Incremental Learning +2

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

no code implementations • CVPR 2023 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Chuanxin Tang, Xiyang Dai, Yucheng Zhao, Yujia Xie, Lu Yuan, Yu-Gang Jiang

Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.

Ranked #1 on Semi-Supervised Video Object Segmentation on Long Video Dataset (using extra training data)

Instance Segmentation Segmentation +3

Fighting Malicious Media Data: A Survey on Tampering Detection and Deepfake Detection

no code implementations • 12 Dec 2022 • Junke Wang, Zhenxin Li, Chao Zhang, Jingjing Chen, Zuxuan Wu, Larry S. Davis, Yu-Gang Jiang

Online media data, in the forms of images and videos, are becoming mainstream communication channels.

DeepFake Detection Face Swapping

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

4 code implementations • CVPR 2023 • Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang

For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.

Ranked #1 on Self-Supervised Action Recognition on HMDB51

Action Classification Representation Learning +1

Prototypical Residual Networks for Anomaly Detection and Localization

no code implementations • CVPR 2023 • Hui Zhang, Zuxuan Wu, Zheng Wang, Zhineng Chen, Yu-Gang Jiang

Anomaly detection and localization are widely used in industrial manufacturing for their efficiency and effectiveness.

Ranked #2 on Supervised Anomaly Detection on MVTec AD (using extra training data)

Supervised Anomaly Detection

ResFormer: Scaling ViTs with Multi-Resolution Training

1 code implementation • CVPR 2023 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang

We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of mostly unseen testing resolutions.

Action Recognition Image Classification +4

SVFormer: Semi-supervised Video Transformer for Action Recognition

1 code implementation • CVPR 2023 • Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

In this paper, we investigate the use of transformer models under the SSL setting for action recognition.

Action Recognition Semi-Supervised Image Classification +1

Semi-Supervised Single-View 3D Reconstruction via Prototype Shape Priors

1 code implementation • 30 Sep 2022 • Zhen Xing, Hengduo Li, Zuxuan Wu, Yu-Gang Jiang

In particular, we introduce an attention-guided prototype shape prior module for guiding realistic object reconstruction.

3D Reconstruction Object Reconstruction +2

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

no code implementations • 15 Sep 2022 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.

Ranked #4 on Cross-Modal Retrieval on Flickr30k (using extra training data)

Action Classification Action Recognition +13

Enhancing the Self-Universality for Transferable Targeted Attacks

1 code implementation • CVPR 2023 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

Our new attack method is proposed based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks.

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

no code implementations • 25 Aug 2022 • Rui Wang, Zuxuan Wu, Dongdong Chen, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Luowei Zhou, Lu Yuan, Yu-Gang Jiang

To avoid the significant computational cost incurred by computing self-attention between the large number of local patches in videos, we propose to use very few global tokens (e.g., 6) for a whole video in Transformers to exchange information with 3D-CNNs with a cross-attention mechanism.

Video Recognition

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

no code implementations • CVPR 2023 • Lingchen Meng, Xiyang Dai, Yinpeng Chen, Pengchuan Zhang, Dongdong Chen, Mengchen Liu, JianFeng Wang, Zuxuan Wu, Lu Yuan, Yu-Gang Jiang

Detection Hub further achieves SoTA performance on the UODB benchmark with a wide variety of datasets.

Object object-detection +1

Deeper Insights into the Robustness of ViTs towards Common Corruptions

no code implementations • 26 Apr 2022 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang

With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, recent literature has proposed various variants of vanilla ViTs to achieve better efficiency and efficacy.

Benchmarking Data Augmentation

ObjectFormer for Image Manipulation Detection and Localization

no code implementations • CVPR 2022 • Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, Yu-Gang Jiang

Recent advances in image editing techniques have posed serious challenges to the trustworthiness of multimedia data, which drives the research of image tampering detection.

Image Manipulation Image Manipulation Detection

Rethinking Nearest Neighbors for Visual Classification

1 code implementation • 15 Dec 2021 • Menglin Jia, Bor-Chun Chen, Zuxuan Wu, Claire Cardie, Serge Belongie, Ser-Nam Lim

In this paper, we investigate k-Nearest-Neighbor (k-NN) classifiers, a classical model-free learning method from the pre-deep learning era, as an augmentation to modern neural network based approaches.

Classification

Cross-Modal Transferable Adversarial Attacks from Images to Videos

no code implementations • CVPR 2022 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

This paper investigates the transferability of adversarial perturbation across different modalities, i.e., leveraging adversarial perturbation generated on white-box image models to attack black-box video models.

Video Recognition

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

no code implementations • 10 Dec 2021 • Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang

Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model, and a feasible way to improve both tasks is to use more data.

Image-text matching Language Modelling +8

BEVT: BERT Pretraining of Video Transformers

1 code implementation • CVPR 2022 • Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan

This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are oftentimes computationally intensive if trained from scratch; 2) discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations.

Ranked #8 on Action Recognition on Diving-48

Action Recognition Representation Learning

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

no code implementations • CVPR 2022 • Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, Ser-Nam Lim

To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve inference efficiency of vision transformers with a minimal drop of accuracy for image recognition.

Efficient Video Transformers with Spatial-Temporal Token Selection

1 code implementation • 23 Nov 2021 • Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang

Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost.

Video Recognition

Semi-Supervised Vision Transformers

1 code implementation • 22 Nov 2021 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

Surprisingly, we show Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available.

Ranked #17 on Semi-Supervised Image Classification on ImageNet - 10% labeled data

Inductive Bias Semi-Supervised Image Classification

Attacking Video Recognition Models with Bullet-Screen Comments

1 code implementation • 29 Oct 2021 • Kai Chen, Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

On both the UCF-101 and HMDB-51 datasets, our BSC attack method achieves a fooling rate of about 90% when attacking three mainstream video recognition models, while occluding less than 8% of the area in the video.

Adversarial Attack Adversarial Attack on Video Classification +2

Boosting the Transferability of Video Adversarial Examples via Temporal Translation

1 code implementation • 18 Oct 2021 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models.

Adversarial Attack Translation +1

Self-supervised Learning for Semi-supervised Temporal Language Grounding

no code implementations • 23 Sep 2021 • Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.

Contrastive Learning Pseudo Label +2

Towards Transferable Adversarial Attacks on Vision Transformers

2 code implementations • 9 Sep 2021 • Zhipeng Wei, Jingjing Chen, Micah Goldblum, Zuxuan Wu, Tom Goldstein, Yu-Gang Jiang

We evaluate the transferability of attacks on state-of-the-art ViTs, CNNs and robustly trained CNNs.

A Multimodal Framework for Video Ads Understanding

no code implementations • 29 Aug 2021 • Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively.

Marketing Optical Character Recognition +5

FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting

no code implementations • 10 Aug 2021 • Junke Wang, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang

Blind face inpainting refers to the task of reconstructing visual contents without explicitly indicating the corrupted regions in a face image.

Facial Inpainting

Cross-domain Contrastive Learning for Unsupervised Domain Adaptation

1 code implementation • 10 Jun 2021 • Rui Wang, Zuxuan Wu, Zejia Weng, Jingjing Chen, Guo-Jun Qi, Yu-Gang Jiang

Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a fully-labeled source domain to a different unlabeled target domain.

Clustering Contrastive Learning +3

Rethinking Pseudo Labels for Semi-Supervised Object Detection

no code implementations • 1 Jun 2021 • Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, Larry S. Davis

In this paper, we introduce certainty-aware pseudo labels tailored for object detection, which can effectively estimate the classification and localization quality of derived pseudo labels.

Ranked #8 on Semi-Supervised Object Detection on COCO 100% labeled data (using extra training data)

Classification Image Classification +4

VideoLT: Large-scale Long-tailed Video Recognition

1 code implementation • ICCV 2021 • Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, Larry Davis

In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition.

Image Classification Video Recognition

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

1 code implementation • 24 Apr 2021 • Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, Dinesh Manocha

We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids.

Ranked #1 on 3D Object Detection on KITTI Cars Hard val

3D Object Detection object-detection +1

HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

no code implementations • 20 Apr 2021 • Zejia Weng, Zuxuan Wu, Hengduo Li, Jingjing Chen, Yu-Gang Jiang

Conventional video recognition pipelines typically fuse multimodal features for improved performance.

Video Recognition

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

2 code implementations • 20 Apr 2021 • Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang Jiang

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images.

DeepFake Detection Face Swapping +1

Exploring Visual Engagement Signals for Representation Learning

1 code implementation • ICCV 2021 • Menglin Jia, Zuxuan Wu, Austin Reiter, Claire Cardie, Serge Belongie, Ser-Nam Lim

Visual engagement in social media platforms comprises interactions with photo posts including comments, shares, and likes.

Bias Detection Emotion Recognition +2

THAT: Two Head Adversarial Training for Improving Robustness at Scale

no code implementations • 25 Mar 2021 • Zuxuan Wu, Tom Goldstein, Larry S. Davis, Ser-Nam Lim

Many variants of adversarial training have been proposed, with most research focusing on problems with relatively few classes.

Vocal Bursts Valence Prediction

Deep Video Inpainting Detection

no code implementations • 26 Jan 2021 • Peng Zhou, Ning Yu, Zuxuan Wu, Larry S. Davis, Abhinav Shrivastava, Ser-Nam Lim

This paper studies video inpainting detection, which localizes an inpainted region in a video both spatially and temporally.

Decoder Video Inpainting

2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition

no code implementations • CVPR 2021 • Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, Larry S. Davis

Then, only frames and convolutions that are selected by the selection network are used in the 3D model to generate predictions.

Ranked #11 on Action Recognition on ActivityNet

Action Recognition Policy Gradient Methods +1

GTA: Global Temporal Attention for Video Action Understanding

no code implementations • 15 Dec 2020 • Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava

To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.

Action Recognition Action Understanding +1

Intentonomy: a Dataset and Study towards Human Intent Understanding

1 code implementation • CVPR 2021 • Menglin Jia, Zuxuan Wu, Austin Reiter, Claire Cardie, Serge Belongie, Ser-Nam Lim

Based on our findings, we conduct further study to quantify the effect of attending to object and context classes as well as textual information in the form of hashtags when training an intent classifier.

Robust Optimization as Data Augmentation for Large-scale Graphs

3 code implementations • CVPR 2022 • Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, Tom Goldstein

Data augmentation helps neural networks generalize better by enlarging the training set, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks).

Ranked #1 on Graph Property Prediction on ogbg-ppa

Data Augmentation Graph Classification +4

Prepare for the Worst: Generalizing across Domain Shifts with Adversarial Batch Normalization

no code implementations • 28 Sep 2020 • Manli Shu, Zuxuan Wu, Micah Goldblum, Tom Goldstein

Adversarial training is the industry standard for producing models that are robust to small adversarial perturbations.

Semantic Segmentation

Encoding Robustness to Image Style via Adversarial Feature Perturbations

1 code implementation • NeurIPS 2021 • Manli Shu, Zuxuan Wu, Micah Goldblum, Tom Goldstein

We adapt adversarial training by directly perturbing feature statistics, rather than image pixels, to produce models that are robust to various unseen distributional shifts.

Data Augmentation Semantic Segmentation

Recognizing Instagram Filtered Images with Feature De-stylization

2 code implementations • 30 Dec 2019 • Zhe Wu, Zuxuan Wu, Bharat Singh, Larry S. Davis

Deep neural networks have been shown to suffer from poor generalization when small perturbations are added (like Gaussian noise), yet little work has been done to evaluate their robustness to more natural image transformations like photo filters.

Style Transfer

Learning from Noisy Anchors for One-stage Object Detection

1 code implementation • CVPR 2020 • Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, Larry S. Davis

State-of-the-art object detectors rely on regressing and classifying an extensive list of possible anchors, which are divided into positive and negative samples based on their intersection-over-union (IoU) with corresponding groundtruth objects.

Classification General Classification +3

LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition

no code implementations • NeurIPS 2019 • Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, Larry S. Davis

This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios.

Video Recognition

Paper
Add Code

Making an Invisibility Cloak: Real World Adversarial Attacks on Object Detectors

2 code implementations • ECCV 2020 • Zuxuan Wu, Ser-Nam Lim, Larry Davis, Tom Goldstein

We present a systematic study of adversarial attacks on state-of-the-art object detection frameworks.

Object object-detection +1

Paper
Code

Efficient Object Embedding for Spliced Image Retrieval

no code implementations • CVPR 2021 • Bor-Chun Chen, Zuxuan Wu, Larry S. Davis, Ser-Nam Lim

Detecting spliced images is one of the emerging challenges in computer vision.

Content-Based Image Retrieval General Classification +4

Paper
Add Code

ACE: Adapting to Changing Environments for Semantic Segmentation

no code implementations • ICCV 2019 • Zuxuan Wu, Xin Wang, Joseph E. Gonzalez, Tom Goldstein, Larry S. Davis

However, neural classifiers are often extremely brittle when confronted with domain shift: changes in the input distribution that occur over time.

Meta-Learning Semantic Segmentation

Paper
Add Code

An Analysis of Pre-Training on Object Detection

no code implementations • 11 Apr 2019 • Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, Larry S. Davis

We analyze how well their features generalize to tasks like image classification, semantic segmentation and object detection on small datasets like PASCAL-VOC, Caltech-256, SUN-397, Flowers-102, etc.

Avg Classification +6

Paper
Add Code

M2KD: Multi-model and Multi-level Knowledge Distillation for Incremental Learning

no code implementations • 3 Apr 2019 • Peng Zhou, Long Mai, Jianming Zhang, Ning Xu, Zuxuan Wu, Larry S. Davis

Instead of sequentially distilling knowledge only from the last model, we directly leverage all previous model snapshots.

Incremental Learning Knowledge Distillation

Paper
Add Code

The Regretful Navigation Agent for Vision-and-Language Navigation

1 code implementation • CVPR 2019 (Oral) • Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, Zsolt Kira

As deep learning continues to make progress for challenging perception tasks, there is increased interest in combining vision, language, and decision-making.

Decision Making Vision and Language Navigation +2

123

Paper
Code

The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation

3 code implementations • CVPR 2019 • Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, Zsolt Kira

As deep learning continues to make progress for challenging perception tasks, there is increased interest in combining vision, language, and decision-making.

Ranked #115 on Vision and Language Navigation on VLN Challenge

Decision Making Vision and Language Navigation +2

123

Paper
Code

Compatible and Diverse Fashion Image Inpainting

no code implementations • 4 Feb 2019 • Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R. Scott, Larry S. Davis

The latent representations are jointly optimized with the corresponding generation network to condition the synthesis process, encouraging a diverse set of generated results that are visually compatible with existing fashion garments.

Fashion Synthesis Image Inpainting

Paper
Add Code

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

2 code implementations • ICLR 2019 • Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments.

Ranked #115 on Vision and Language Navigation on VLN Challenge

Natural Language Visual Grounding Vision and Language Navigation +2

117

Paper
Code

AdaFrame: Adaptive Frame Selection for Fast Video Recognition

no code implementations • CVPR 2019 • Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, Larry S. Davis

We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition.

Policy Gradient Methods Video Recognition

Paper
Add Code

DCAN: Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation

no code implementations • ECCV 2018 • Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gökhan Uzunbas, Tom Goldstein, Ser-Nam Lim, Larry S. Davis

In particular, given an image from the source domain and unlabeled samples from the target domain, the generator synthesizes new images on-the-fly to resemble samples from the target domain in appearance and the segmentation network further refines high-level features before predicting semantic maps, both of which leverage feature statistics of sampled images from the target domain.

Segmentation Semantic Segmentation

Paper
Add Code

BlockDrop: Dynamic Inference Paths in Residual Networks

1 code implementation • CVPR 2018 • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, Rogerio Feris

Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications.

139

Paper
Code

VITON: An Image-based Virtual Try-on Network

6 code implementations • CVPR 2018 • Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, Larry S. Davis

We present an image-based Virtual Try-On Network (VITON) without using 3D information in any form, which seamlessly transfers a desired clothing item onto the corresponding region of a person using a coarse-to-fine strategy.

Descriptive Virtual Try-on

505

Paper
Code

Automatic Spatially-aware Fashion Concept Discovery

1 code implementation • ICCV 2017 • Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S. Davis

This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites.

Ranked #8 on Image Retrieval with Multi-Modal Query on Fashion200k

Attribute Clustering +2

Paper
Code

Learning Fashion Compatibility with Bidirectional LSTMs

2 code implementations • 18 Jul 2017 • Xintong Han, Zuxuan Wu, Yu-Gang Jiang, Larry S. Davis

To this end, we propose to jointly learn a visual-semantic embedding and the compatibility relationships among fashion items in an end-to-end fashion.

Attribute

159

Paper
Code

Aggregating Frame-level Features for Large-Scale Video Classification

no code implementations • 4 Jul 2017 • Shaoxiang Chen, Xi Wang, Yongyi Tang, Xinpeng Chen, Zuxuan Wu, Yu-Gang Jiang

This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered a multi-label classification problem defined on top of the large-scale YouTube-8M Dataset.

Classification General Classification +3

Paper
Add Code

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

no code implementations • 14 Jun 2017 • Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, Shih-Fu Chang

More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features.

General Classification Video Classification

Paper
Add Code

Weakly-Supervised Spatial Context Networks

no code implementations • 10 Apr 2017 • Zuxuan Wu, Larry S. Davis, Leonid Sigal

In particular, we propose spatial context networks that learn to predict a representation of one image patch from another image patch, within the same image, conditioned on their real-valued relative spatial offset.

Object Object Categorization

Paper
Add Code

Deep Learning for Video Classification and Captioning

1 code implementation • 22 Sep 2016 • Zuxuan Wu, Ting Yao, Yanwei Fu, Yu-Gang Jiang

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data.

Classification General Classification +3

Paper
Code

Harnessing Object and Scene Semantics for Large-Scale Video Understanding

no code implementations • CVPR 2016 • Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, Leonid Sigal

Large-scale action recognition and video categorization are important problems in computer vision.

Action Recognition Clustering +4

Paper
Add Code

Fusing Multi-Stream Deep Networks for Video Classification

no code implementations • 21 Sep 2015 • Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, Jun Wang

A multi-stream framework is proposed to fully utilize the rich multimodal information in videos.

Classification General Classification +1

Paper
Add Code

Evaluating Two-Stream CNN for Video Classification

no code implementations • 8 Apr 2015 • Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang, Xiangyang Xue

In this paper, we conduct an in-depth study to investigate important implementation options that may affect the performance of deep nets on video classification.

Classification General Classification +2

Paper
Add Code

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

1 code implementation • 7 Apr 2015 • Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue

In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos.

Classification General Classification +1

Paper
Code

Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks

no code implementations • 25 Feb 2015 • Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, Shih-Fu Chang

In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event.

Paper
Add Code
