Search Results for author: Zhaoyang Zeng

Found 27 papers, 13 papers with code

Mind the Discriminability: Asymmetric Adversarial Domain Adaptation

no code implementations ECCV 2020 Jianfei Yang, Han Zou, Yuxun Zhou, Zhaoyang Zeng, Lihua Xie

Adversarial domain adaptation has achieved tremendous success by learning domain-invariant feature representations.

Domain Adaptation

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

1 code implementation 21 Mar 2024 Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang

Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning.
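
As a rough sketch of what synergizing the two prompt types through contrastive learning could look like, the loss below pulls matched text-prompt and visual-prompt embeddings together, CLIP-style; the function, its inputs, and the choice of a symmetric loss are illustrative assumptions, not T-Rex2's actual implementation.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_loss(text_emb, visual_emb, temperature=0.07):
    """Hypothetical CLIP-style symmetric contrastive loss that pulls
    matching text-prompt and visual-prompt embeddings together.
    text_emb, visual_emb: (N, D) tensors, row i of each describes
    the same object category."""
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = text_emb @ visual_emb.t() / temperature   # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; average both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```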

Contrastive Learning Descriptive +3

TAPTR: Tracking Any Point with Transformers as Detection

no code implementations 19 Mar 2024 Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang

Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP.

Object Detection +2

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

1 code implementation 25 Jan 2024 Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang

We introduce Grounded SAM, which combines Grounding DINO, an open-set object detector, with the Segment Anything Model (SAM).
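
A minimal sketch of that two-stage pipeline, with `detect` and `segment` as hypothetical callables standing in for Grounding DINO and SAM respectively (the real Grounded SAM API differs):

```python
def grounded_segmentation(image, text_prompt, detect, segment):
    """Hypothetical glue code for a detect-then-segment pipeline.
    detect(image, text_prompt) -> (boxes, scores, phrases): an open-set,
    text-prompted detector in the Grounding DINO mold.
    segment(image, boxes) -> masks: a box-prompted segmenter like SAM."""
    # Stage 1: ground the text prompt to boxes.
    boxes, scores, phrases = detect(image, text_prompt)
    # Stage 2: feed each box to the segmenter as a prompt, one mask per box.
    masks = segment(image, boxes)
    return [{"box": b, "score": s, "phrase": p, "mask": m}
            for b, s, p, m in zip(boxes, scores, phrases, masks)]
```

The design point is that neither model is retrained: the detector supplies box prompts the segmenter already understands, so any text-promptable detector and any box-promptable segmenter can be swapped in.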

Segmentation

T-Rex: Counting by Visual Prompting

no code implementations 22 Nov 2023 Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang

Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects.

Object Object Counting +4

DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

no code implementations ICCV 2023 Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, Lei Zhang

Existing feature lifting approaches are either Lift-Splat-based, using estimated depth to obtain pseudo-LiDAR features and splat them into a 3D space (a one-pass operation without feature refinement), or 2D attention-based, ignoring depth and lifting features with 2D attention mechanisms, which achieves finer semantics but suffers from a depth ambiguity problem.
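
For reference, the one-pass Lift-Splat-style lifting the abstract contrasts with attention-based lifting amounts to an outer product between 2D image features and a predicted per-pixel depth distribution; a minimal sketch with illustrative shapes:

```python
import torch

def lift_splat(feat_2d, depth_prob):
    """One-pass depth-based lifting in the Lift-Splat mold (shapes illustrative).
    feat_2d:    (B, C, H, W) image features
    depth_prob: (B, D, H, W) softmax over D depth bins per pixel
    returns:    (B, C, D, H, W) pseudo-LiDAR frustum features
    Each 2D feature is weighted by its depth distribution; once placed,
    the features receive no further refinement."""
    return feat_2d.unsqueeze(2) * depth_prob.unsqueeze(1)
```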

3D Object Detection Object Detection

detrex: Benchmarking Detection Transformers

1 code implementation 12 Jun 2023 Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang

To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.

Benchmarking Object Detection +2

A Strong and Reproducible Object Detector with Only Public Datasets

2 code implementations 25 Apr 2023 Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang

This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters, without any test-time augmentation.

Ranked #5 on Object Detection on COCO minival (using extra training data)

Object Detection

Detection Transformer with Stable Matching

1 code implementation ICCV 2023 Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang

We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR.
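
For context, the one-to-one matching in question is a Hungarian assignment over a query-to-ground-truth cost matrix; a toy example with SciPy (the cost values are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# DETR-style one-to-one matching: each ground-truth object is assigned to
# exactly one query by minimizing the total cost globally. Because the
# assignment is a global argmin, small cost changes between training
# iterations can flip which query owns which object -- the instability
# the paper targets.
cost = np.array([[0.9, 0.1, 0.5],   # rows: queries, cols: ground-truth objects
                 [0.4, 0.8, 0.2],
                 [0.3, 0.6, 0.7]])
query_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(query_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 2), (2, 0)]
```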

Position

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

7 code implementations 9 Mar 2023 Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
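
A structural sketch of those three phases; every submodule, shape, and the query count below are stand-ins, not the actual Grounding DINO code, and the number of image tokens is assumed to exceed `num_queries`:

```python
import torch
import torch.nn as nn

class TightFusionSketch(nn.Module):
    """Illustrative skeleton of the three fusion phases named in the abstract."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # Stand-ins for the feature enhancer and cross-modality decoder.
        self.enhancer = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.num_queries = num_queries

    def forward(self, img_tokens, text_tokens):
        # Phase 1: feature enhancer -- image tokens attend to text tokens.
        fused, _ = self.enhancer(img_tokens, text_tokens, text_tokens)
        # Phase 2: language-guided query selection -- keep the image tokens
        # most similar to any text token as decoder queries.
        sim = (fused @ text_tokens.transpose(1, 2)).max(dim=-1).values  # (B, N)
        idx = sim.topk(self.num_queries, dim=1).indices                 # (B, K)
        queries = fused.gather(1, idx.unsqueeze(-1).expand(-1, -1, fused.size(-1)))
        # Phase 3: cross-modality decoder -- refine queries against both modalities.
        return self.decoder(queries, torch.cat([fused, text_tokens], dim=1))
```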

Referring Expression Referring Expression Comprehension +2

Contrastive Learning of Global and Local Video Representations

no code implementations NeurIPS 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +4

CLIP4Caption ++: Multi-CLIP for Video Caption

no code implementations 11 Oct 2021 Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li

We employ an advanced encoder-decoder architecture, X-Transformer, as the main framework of the proposed CLIP4Caption++ and make the following improvements: 1) we utilize three strong pre-trained CLIP models to extract text-related appearance visual features.

Sentence

Multi-modal Representation Learning for Video Advertisement Content Structuring

no code implementations 4 Sep 2021 Daya Guo, Zhaoyang Zeng

Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style.

Representation Learning Re-Ranking +1

Reference-based Defect Detection Network

no code implementations 10 Aug 2021 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao

To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, namely the concentric bigger box of each region proposal, to perform more accurate region classification and regression.
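
The concentric bigger box itself is simple to compute; a minimal sketch assuming a 2x enlargement factor (the paper's exact factor may differ):

```python
def concentric_context_box(box, scale=2.0):
    """Context reference as described: a bigger box sharing the proposal's
    center. `scale` is an assumed enlargement factor. Box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2       # shared center
    w, h = (x2 - x1) * scale, (y2 - y1) * scale  # enlarged extent
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(concentric_context_box((10, 10, 30, 20)))  # (0.0, 5.0, 40.0, 25.0)
```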

Defect Detection Object Detection +2

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

3 code implementations CVPR 2021 Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Representation Learning Retrieval +3

Contrastive Learning of Global-Local Video Representations

1 code implementation 7 Apr 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).

Classification Contrastive Learning +6

Suppressing Mislabeled Data via Grouping and Self-Attention

1 code implementation ECCV 2020 Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, Yu Qiao

Specifically, this plug-and-play AFM first leverages a group-to-attend module to construct groups and assign attention weights for group-wise samples, and then uses a mixup module with the attention weights to interpolate massive noisy-suppressed samples.
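
A simplified sketch of the attention-weighted mixup step, assuming consecutive batch samples form pairs and that `attn_logits` comes from the group-to-attend module; both are simplifications of the paper's design:

```python
import torch
import torch.nn.functional as F

def attentive_pair_mixup(features, labels, attn_logits):
    """Interpolate paired samples with learned mixing weights.
    features:    (2P, D) batch features, consecutive rows form a pair
    labels:      (2P, C) one-hot labels
    attn_logits: (P, 2) per-pair attention logits from the
                 (assumed) group-to-attend module."""
    a, b = features[0::2], features[1::2]
    ya, yb = labels[0::2], labels[1::2]
    w = F.softmax(attn_logits, dim=-1)   # each pair's weights sum to 1
    wa, wb = w[:, :1], w[:, 1:]
    # Mislabeled samples should receive low weight, suppressing their effect.
    mixed_x = wa * a + wb * b
    mixed_y = wa * ya + wb * yb
    return mixed_x, mixed_y
```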

Image Classification

Active Contrastive Learning of Audio-Visual Video Representations

1 code implementation ICLR 2021 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.
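
The bound in question is the standard InfoNCE result: with one positive pair and N - 1 negatives per batch, minimizing the contrastive loss maximizes a lower bound on the MI between views,

```latex
% InfoNCE lower bound on mutual information between two views v_1, v_2
% (encoder f, temperature \tau, batch size N):
I(v_1; v_2) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}},
\quad
\mathcal{L}_{\mathrm{NCE}} =
-\,\mathbb{E}\left[ \log
\frac{\exp\!\big(f(v_1)^{\top} f(v_2)/\tau\big)}
     {\sum_{j=1}^{N} \exp\!\big(f(v_1)^{\top} f(v_2^{(j)})/\tau\big)} \right]
```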

Contrastive Learning Representation Learning +1

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

1 code implementation 2 Apr 2020 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision and language methods do.

Image-text matching Language Modelling +7

Learning Rich Image Region Representation for Visual Question Answering

no code implementations 29 Oct 2019 Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu

We propose to boost VQA by leveraging more powerful feature extractors, improving the representation ability of both visual and text features, and ensembling models.

Language Modelling Question Answering +1

SMP Challenge: An Overview of Social Media Prediction Challenge 2019

no code implementations 4 Oct 2019 Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, Jiebo Luo

In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting future interaction or attractiveness (in terms of clicks, views, likes, etc.)

Multimedia recommendation

WSOD^2: Learning Bottom-up and Top-down Objectness Distillation for Weakly-supervised Object Detection

1 code implementation 11 Sep 2019 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang

We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations.

Object Object Detection +3

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations 11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a 9.91 METEOR score on the challenge testing set.

Dense Captioning Dense Video Captioning
