Search Results for author: Zhaoyang Zeng

Found 27 papers, 13 papers with code

Mind the Discriminability: Asymmetric Adversarial Domain Adaptation

no code implementations ECCV 2020 Jianfei Yang, Han Zou, Yuxun Zhou, Zhaoyang Zeng, Lihua Xie

Adversarial domain adaptation has achieved tremendous success by learning domain-invariant feature representations.

Domain Adaptation

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

1 code implementation 21 Mar 2024 Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang

Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning.
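
As a rough sketch of what synergizing the two prompt types through contrastive learning could look like, the loss below pulls matched text-prompt and visual-prompt embeddings together, CLIP-style; the function, its inputs, and the choice of a symmetric loss are illustrative assumptions, not T-Rex2's actual implementation.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_loss(text_emb, visual_emb, temperature=0.07):
    """Hypothetical CLIP-style symmetric contrastive loss that pulls
    matching text-prompt and visual-prompt embeddings together.
    text_emb, visual_emb: (N, D) tensors, row i of each describes
    the same object category."""
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = text_emb @ visual_emb.t() / temperature   # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; average both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```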

Contrastive Learning Descriptive +3

TAPTR: Tracking Any Point with Transformers as Detection

no code implementations 19 Mar 2024 Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang

Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP.

Object Detection +2

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

1 code implementation 25 Jan 2024 Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang

We introduce Grounded SAM, which combines Grounding DINO, an open-set object detector, with the Segment Anything Model (SAM).
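
A minimal sketch of that two-stage pipeline, with `detect` and `segment` as hypothetical callables standing in for Grounding DINO and SAM respectively (the real Grounded SAM API differs):

```python
def grounded_segmentation(image, text_prompt, detect, segment):
    """Hypothetical glue code for a detect-then-segment pipeline.
    detect(image, text_prompt) -> (boxes, scores, phrases): an open-set,
    text-prompted detector in the Grounding DINO mold.
    segment(image, boxes) -> masks: a box-prompted segmenter like SAM."""
    # Stage 1: ground the text prompt to boxes.
    boxes, scores, phrases = detect(image, text_prompt)
    # Stage 2: feed each box to the segmenter as a prompt, one mask per box.
    masks = segment(image, boxes)
    return [{"box": b, "score": s, "phrase": p, "mask": m}
            for b, s, p, m in zip(boxes, scores, phrases, masks)]
```

The design point is that neither model is retrained: the detector supplies box prompts the segmenter already understands, so any text-promptable detector and any box-promptable segmenter can be swapped in.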

Segmentation

T-Rex: Counting by Visual Prompting

no code implementations 22 Nov 2023 Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang

Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects.

Object Object Counting +4

DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

no code implementations ICCV 2023 Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, Lei Zhang

Existing feature lifting approaches are either Lift-Splat-based, using estimated depth to obtain pseudo-LiDAR features and splat them into a 3D space (a one-pass operation without feature refinement), or 2D attention-based, ignoring depth and lifting features with 2D attention mechanisms, which achieves finer semantics but suffers from a depth ambiguity problem.
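
For reference, the one-pass Lift-Splat-style lifting the abstract contrasts with attention-based lifting amounts to an outer product between 2D image features and a predicted per-pixel depth distribution; a minimal sketch with illustrative shapes:

```python
import torch

def lift_splat(feat_2d, depth_prob):
    """One-pass depth-based lifting in the Lift-Splat mold (shapes illustrative).
    feat_2d:    (B, C, H, W) image features
    depth_prob: (B, D, H, W) softmax over D depth bins per pixel
    returns:    (B, C, D, H, W) pseudo-LiDAR frustum features
    Each 2D feature is weighted by its depth distribution; once placed,
    the features receive no further refinement."""
    return feat_2d.unsqueeze(2) * depth_prob.unsqueeze(1)
```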

3D Object Detection Object Detection

detrex: Benchmarking Detection Transformers

1 code implementation 12 Jun 2023 Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang

To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.

Benchmarking Object Detection +2

A Strong and Reproducible Object Detector with Only Public Datasets

2 code implementations 25 Apr 2023 Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang

This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters, without any test-time augmentation.

Ranked #5 on Object Detection on COCO minival (using extra training data)

Object Detection

Detection Transformer with Stable Matching

1 code implementation ICCV 2023 Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang

We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR.
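
For context, the one-to-one matching in question is a Hungarian assignment over a query-to-ground-truth cost matrix; a toy example with SciPy (the cost values are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# DETR-style one-to-one matching: each ground-truth object is assigned to
# exactly one query by minimizing the total cost globally. Because the
# assignment is a global argmin, small cost changes between training
# iterations can flip which query owns which object -- the instability
# the paper targets.
cost = np.array([[0.9, 0.1, 0.5],   # rows: queries, cols: ground-truth objects
                 [0.4, 0.8, 0.2],
                 [0.3, 0.6, 0.7]])
query_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(query_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 2), (2, 0)]
```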

Position

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

7 code implementations 9 Mar 2023 Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
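
A structural sketch of those three phases; every submodule, shape, and the query count below are stand-ins, not the actual Grounding DINO code, and the number of image tokens is assumed to exceed `num_queries`:

```python
import torch
import torch.nn as nn

class TightFusionSketch(nn.Module):
    """Illustrative skeleton of the three fusion phases named in the abstract."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # Stand-ins for the feature enhancer and cross-modality decoder.
        self.enhancer = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.num_queries = num_queries

    def forward(self, img_tokens, text_tokens):
        # Phase 1: feature enhancer -- image tokens attend to text tokens.
        fused, _ = self.enhancer(img_tokens, text_tokens, text_tokens)
        # Phase 2: language-guided query selection -- keep the image tokens
        # most similar to any text token as decoder queries.
        sim = (fused @ text_tokens.transpose(1, 2)).max(dim=-1).values  # (B, N)
        idx = sim.topk(self.num_queries, dim=1).indices                 # (B, K)
        queries = fused.gather(1, idx.unsqueeze(-1).expand(-1, -1, fused.size(-1)))
        # Phase 3: cross-modality decoder -- refine queries against both modalities.
        return self.decoder(queries, torch.cat([fused, text_tokens], dim=1))
```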

Referring Expression Referring Expression Comprehension +2

Contrastive Learning of Global and Local Video Representations

no code implementations NeurIPS 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +4

CLIP4Caption ++: Multi-CLIP for Video Caption

no code implementations 11 Oct 2021 Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li

We employ an advanced encoder-decoder architecture, X-Transformer, as the main framework of the proposed CLIP4Caption++ and make the following improvements: 1) we utilize three strong pre-trained CLIP models to extract text-related appearance visual features.

Sentence

Multi-modal Representation Learning for Video Advertisement Content Structuring

no code implementations 4 Sep 2021 Daya Guo, Zhaoyang Zeng

Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style.

Representation Learning Re-Ranking +1

Reference-based Defect Detection Network

no code implementations 10 Aug 2021 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao

To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, namely the concentric bigger box of each region proposal, to perform more accurate region classification and regression.
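
The concentric bigger box itself is simple to compute; a minimal sketch assuming a 2x enlargement factor (the paper's exact factor may differ):

```python
def concentric_context_box(box, scale=2.0):
    """Context reference as described: a bigger box sharing the proposal's
    center. `scale` is an assumed enlargement factor. Box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2       # shared center
    w, h = (x2 - x1) * scale, (y2 - y1) * scale  # enlarged extent
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(concentric_context_box((10, 10, 30, 20)))  # (0.0, 5.0, 40.0, 25.0)
```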

Defect Detection Object Detection +2

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

3 code implementations CVPR 2021 Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Representation Learning Retrieval +3

Contrastive Learning of Global-Local Video Representations

1 code implementation 7 Apr 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +6

Suppressing Mislabeled Data via Grouping and Self-Attention

1 code implementation ECCV 2020 Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, Yu Qiao

Specifically, this plug-and-play AFM first leverages a group-to-attend module to construct groups and assign attention weights for group-wise samples, and then uses a mixup module with the attention weights to interpolate massive noisy-suppressed samples.
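
A simplified sketch of the attention-weighted mixup step, assuming consecutive batch samples form pairs and that `attn_logits` comes from the group-to-attend module; both are simplifications of the paper's design:

```python
import torch
import torch.nn.functional as F

def attentive_pair_mixup(features, labels, attn_logits):
    """Interpolate paired samples with learned mixing weights.
    features:    (2P, D) batch features, consecutive rows form a pair
    labels:      (2P, C) one-hot labels
    attn_logits: (P, 2) per-pair attention logits from the
                 (assumed) group-to-attend module."""
    a, b = features[0::2], features[1::2]
    ya, yb = labels[0::2], labels[1::2]
    w = F.softmax(attn_logits, dim=-1)   # each pair's weights sum to 1
    wa, wb = w[:, :1], w[:, 1:]
    # Mislabeled samples should receive low weight, suppressing their effect.
    mixed_x = wa * a + wb * b
    mixed_y = wa * ya + wb * yb
    return mixed_x, mixed_y
```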

Image Classification

Active Contrastive Learning of Audio-Visual Video Representations

1 code implementation ICLR 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.
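
The bound in question is the standard InfoNCE result: with one positive pair and N - 1 negatives per batch, minimizing the contrastive loss maximizes a lower bound on the MI between views,

```latex
% InfoNCE lower bound on mutual information between two views v_1, v_2
% (encoder f, temperature \tau, batch size N):
I(v_1; v_2) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}},
\quad
\mathcal{L}_{\mathrm{NCE}} =
-\,\mathbb{E}\left[ \log
\frac{\exp\!\big(f(v_1)^{\top} f(v_2)/\tau\big)}
     {\sum_{j=1}^{N} \exp\!\big(f(v_1)^{\top} f(v_2^{(j)})/\tau\big)} \right]
```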

Contrastive Learning Representation Learning +1

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

1 code implementation 2 Apr 2020 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision and language methods do.

Image-text matching Language Modelling +7

Learning Rich Image Region Representation for Visual Question Answering

no code implementations 29 Oct 2019 Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu

We propose to boost VQA by leveraging more powerful feature extractors, improving the representation ability of both visual and text features, and ensembling models.

Language Modelling Question Answering +1

SMP Challenge: An Overview of Social Media Prediction Challenge 2019

no code implementations 4 Oct 2019 Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, Jiebo Luo

In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting future interaction or attractiveness (in terms of clicks, views, likes, etc.)

Multimedia recommendation

WSOD^2: Learning Bottom-up and Top-down Objectness Distillation for Weakly-supervised Object Detection

1 code implementation 11 Sep 2019 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang

We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations.

Object Object Detection +3

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations 11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a 9.91 METEOR score on the challenge testing set.

Dense Captioning Dense Video Captioning
