no code implementations • ECCV 2020 • Jianfei Yang, Han Zou, Yuxun Zhou, Zhaoyang Zeng, Lihua Xie
Adversarial domain adaptation has made tremendous success by learning domain-invariant feature representations.
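A minimal sketch of the gradient reversal trick commonly used in adversarial domain adaptation (this is a generic DANN-style illustration, not this paper's specific method; names and shapes are assumptions):

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the
    backward pass, so the feature extractor learns to fool the domain
    discriminator and thus produces domain-invariant features."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def domain_adversarial_loss(features, domain_labels, discriminator, lambd=1.0):
    # Reverse gradients before the discriminator: the discriminator learns
    # to separate source/target domains, the features learn to be invariant.
    reversed_feats = GradReverse.apply(features, lambd)
    logits = discriminator(reversed_feats)
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits.squeeze(-1), domain_labels.float()
    )
```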
no code implementations • 27 Nov 2024 • Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, Lei Zhang
In this paper, we present TAPTRv3, which is built upon TAPTRv2 to improve its point tracking robustness in long videos.
1 code implementation • 27 Nov 2024 • Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset which possesses multiple granularities to support the joint training of perception and understanding.
1 code implementation • 21 Nov 2024 • Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang
DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding.
no code implementations • 23 Jul 2024 • Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Tianhe Ren, Bohan Li, Lei Zhang
In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for solving the Tracking Any Point (TAP) task.
no code implementations • 17 May 2024 • Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, Jiebo Luo
The research progress analysis provides an overview of the solutions and trends of recent years.
3 code implementations • 16 May 2024 • Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang
Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining 54.3 AP on the COCO detection benchmark and 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection.
Ranked #1 on Zero-Shot Object Detection on MSCOCO (AP metric, using extra training data)
2 code implementations • 21 Mar 2024 • Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang
Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning.
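A minimal sketch of aligning text-prompt and visual-prompt embeddings with a symmetric InfoNCE-style loss, in the spirit of T-Rex2's contrastive learning between the two prompt modalities (the dimensions and helper names are assumptions, not the paper's actual code):

```python
import torch
import torch.nn.functional as F


def prompt_contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """text_emb, visual_emb: (N, D) embeddings of N matched prompt pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = text_emb @ visual_emb.t() / temperature  # (N, N) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Pull matched text/visual prompts together, push mismatched ones apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```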
no code implementations • 19 Mar 2024 • Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang
Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP.
4 code implementations • 25 Jan 2024 • Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang
We introduce Grounded SAM, which combines Grounding DINO, used as an open-set object detector, with the Segment Anything Model (SAM).
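A minimal sketch of the detect-then-segment pipeline described above: an open-set detector produces boxes from a text prompt, and SAM turns each box into a mask. `grounding_dino_detect` and `sam_segment` are hypothetical wrappers, not the actual API of either repository:

```python
def grounded_sam(image, text_prompt, box_threshold=0.35):
    # 1) Grounding DINO: text-conditioned open-set detection.
    #    (hypothetical wrapper returning boxes, phrase labels, scores)
    boxes, labels, scores = grounding_dino_detect(
        image, text_prompt, box_threshold=box_threshold
    )
    # 2) SAM: promptable segmentation, using each detected box as a prompt.
    #    (hypothetical wrapper returning a binary mask per box)
    masks = [sam_segment(image, box=b) for b in boxes]
    return list(zip(labels, scores, boxes, masks))
```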
no code implementations • 22 Nov 2023 • Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang
Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects.
no code implementations • ICCV 2023 • Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, Lei Zhang
Existing feature lifting approaches fall into two camps: Lift-Splat-based methods use estimated depth to obtain pseudo-LiDAR features and then splat them into 3D space, a one-pass operation without feature refinement, while 2D attention-based methods ignore depth and lift features via 2D attention, achieving finer semantics but suffering from a depth ambiguity problem.
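A minimal sketch of the Lift-Splat-style "lift" step contrasted above: per-pixel image features are weighted by a predicted depth distribution to form frustum (pseudo-LiDAR) features before splatting (shapes are illustrative assumptions):

```python
import torch


def lift_features(feats, depth_logits):
    """feats: (B, C, H, W) image features;
    depth_logits: (B, D, H, W) per-pixel depth-bin scores."""
    depth_prob = depth_logits.softmax(dim=1)                 # (B, D, H, W)
    # Outer product: each depth bin gets a depth-weighted copy of the feature.
    frustum = depth_prob.unsqueeze(2) * feats.unsqueeze(1)   # (B, D, C, H, W)
    return frustum  # to be "splatted" into a BEV/3D grid in a second pass
```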
1 code implementation • 12 Jun 2023 • Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang
To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.
3 code implementations • 25 Apr 2023 • Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang
This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without any test-time augmentation.
Ranked #5 on Object Detection on COCO test-dev
2 code implementations • ICCV 2023 • Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang
We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR.
10 code implementations • 9 Mar 2023 • Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang
To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
Ranked #2 on Zero Shot Segmentation on Segmentation in the Wild
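A minimal sketch of the language-guided query selection stage named above: image tokens are scored by their maximum similarity to any text token, and the top-k are used to initialize decoder queries (the shapes and function name are assumptions, not Grounding DINO's actual code):

```python
import torch


def language_guided_query_selection(img_tokens, text_tokens, num_queries=900):
    """img_tokens: (N_img, D); text_tokens: (N_text, D)."""
    sim = img_tokens @ text_tokens.t()        # (N_img, N_text) similarities
    scores = sim.max(dim=-1).values           # best-matching text token per image token
    topk = scores.topk(num_queries).indices   # most language-relevant image tokens
    return img_tokens[topk]                   # initial content queries for the decoder
```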
no code implementations • CVPR 2022 • Zhaoyang Zeng, Yongsheng Luo, Zhenhua Liu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen
In this paper, we propose the Tencent-MVSE dataset, which is the first benchmark dataset for the multi-modal video similarity evaluation task.
Automatic Speech Recognition (ASR) +4
no code implementations • NeurIPS 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).
no code implementations • 11 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li
We make the following improvements in CLIP4Caption++: we employ the advanced encoder-decoder architecture X-Transformer as our main framework, and 1) we utilize three strong pre-trained CLIP models to extract text-related appearance visual features.
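A minimal sketch of extracting frame-level appearance features with one pre-trained CLIP image encoder, as the entry above describes (the checkpoint name and frame-sampling details are illustrative; the paper ensembles three CLIP models):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def encode_frames(frame_paths):
    """Encode a list of sampled video-frame image paths into CLIP features."""
    batch = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)              # (T, 512) frame features
    return feats / feats.norm(dim=-1, keepdim=True)    # unit-normalized
```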
no code implementations • 4 Sep 2021 • Daya Guo, Zhaoyang Zeng
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style.
no code implementations • 10 Aug 2021 • Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao
To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, i.e., the concentric, larger box around each region proposal, to perform more accurate region classification and regression.
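A minimal sketch of building the "context reference" described above: a concentric box that enlarges each region proposal by a fixed ratio (the ratio and clipping behavior are illustrative assumptions):

```python
def concentric_context_box(box, ratio=1.8, img_w=None, img_h=None):
    """box: (x1, y1, x2, y2). Returns an enlarged box with the same center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    nx1, ny1, nx2, ny2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    if img_w is not None and img_h is not None:  # keep the box inside the image
        nx1, ny1 = max(0, nx1), max(0, ny1)
        nx2, ny2 = min(img_w, nx2), min(img_h, ny2)
    return nx1, ny1, nx2, ny2
```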
3 code implementations • CVPR 2021 • Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
Ranked #5 on Visual Entailment on SNLI-VE val
no code implementations • 1 Jan 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
Contrastive self-supervised learning has delivered impressive results in many audio-visual recognition tasks.
1 code implementation • ECCV 2020 • Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, Yu Qiao
Specifically, this plug-and-play AFM first leverages a group-to-attend module to construct groups and assign attention weights for group-wise samples, and then uses a mixup module with the attention weights to interpolate massive noisy-suppressed samples.
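A minimal sketch of attention-weighted mixup in the spirit of the AFM described above: samples in a group are interpolated using learned attention weights rather than a fixed Beta-sampled coefficient (the group construction and scoring network are simplified assumptions):

```python
import torch


def attentive_mixup(group_feats, group_labels, attn_scorer):
    """group_feats: (K, D) features of one group;
    group_labels: (K, C) one-hot labels;
    attn_scorer: small network mapping (K, D) -> (K, 1) scores."""
    weights = attn_scorer(group_feats).softmax(dim=0)  # (K, 1) attention over group
    mixed_feat = (weights * group_feats).sum(dim=0)    # attention-weighted feature
    mixed_label = (weights * group_labels).sum(dim=0)  # mix labels consistently
    return mixed_feat, mixed_label
```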
1 code implementation • ICLR 2021 • Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance.
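A minimal sketch of the InfoNCE objective, whose expectation lower-bounds the mutual information between two views of an instance, as the abstract above notes (variable names are illustrative):

```python
import torch
import torch.nn.functional as F


def info_nce(z_a, z_v, temperature=0.1):
    """z_a, z_v: (N, D) audio/visual embeddings of the same N instances.
    I(a; v) >= log(N) - InfoNCE loss."""
    z_a, z_v = F.normalize(z_a, dim=-1), F.normalize(z_v, dim=-1)
    logits = z_a @ z_v.t() / temperature           # (N, N) cross-view similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)        # positive pairs on the diagonal
```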
1 code implementation • 2 Apr 2020 • Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision-and-language methods do.
no code implementations • 29 Oct 2019 • Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu
We propose to boost VQA by leveraging more powerful feature extractors by improving the representation ability of both visual and text features and the ensemble of models.
no code implementations • 4 Oct 2019 • Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, Jiebo Luo
In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting future interaction or attractiveness (in terms of clicks, views, likes, etc.)
1 code implementation • ICCV 2019 • Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang
We study on weakly-supervised object detection (WSOD) which plays a vital role in relieving human involvement from object-level annotations.
Ranked #12 on Weakly Supervised Object Detection on PASCAL VOC 2007
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves the state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.