no code implementations • ECCV 2020 • Sibei Yang, Guanbin Li, Yizhou Yu
Phrase level visual grounding aims to locate in an image the corresponding visual regions referred to by multiple noun phrases in a given sentence.
no code implementations • 2 Dec 2024 • Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang
In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning over complex user intentions and then decomposing them into a series of segmentation maps.
1 code implementation • 14 Jul 2024 • Cheng Shi, Yuchen Zhu, Sibei Yang
Recent advancements in large-scale foundational models have sparked widespread interest in training highly proficient large vision models.
1 code implementation • 14 Jul 2024 • Cheng Shi, Yulin Zhang, Bin Yang, Jiajin Tang, Yuexin Ma, Sibei Yang
By training Hi-Mask3D on the objects and object parts extracted from Part2Object, we achieve consistent and superior performance compared to state-of-the-art models in various settings, including unsupervised instance segmentation, data-efficient fine-tuning, and cross-dataset generalization.
1 code implementation • 18 Apr 2024 • Cheng Shi, Sibei Yang
Foundation models, pre-trained on large amounts of data, have demonstrated impressive zero-shot capabilities in various downstream tasks.
no code implementations • CVPR 2024 • Qiyuan Dai, Sibei Yang
Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet it typically relies on cost-intensive mask annotations.
no code implementations • 21 Feb 2024 • Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, Xuming He, Yuexin Ma
In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data.
no code implementations • CVPR 2024 • Han Liang, Jiacheng Bao, Ruichi Zhang, Sihan Ren, Yuecheng Xu, Sibei Yang, Xin Chen, Jingyi Yu, Lan Xu
At the subsequent fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information, through a trainable copy of the pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block.
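The "trainable copy" pattern described above follows the general ControlNet recipe: a frozen pretrained branch plus a trainable duplicate that consumes the conditioning signal, joined through a zero-initialised projection so that training starts as an exact no-op. A minimal sketch of that pattern (names and shapes are illustrative; this is not the paper's actual MoC block):

```python
import numpy as np

def controlnet_step(x, cond, frozen_w, copy_w, zero_w):
    """ControlNet-style step: frozen pretrained pathway plus a trainable
    copy that sees the conditioning signal, merged via a zero-initialised
    projection so the conditioned branch contributes nothing at init."""
    base = frozen_w @ x           # frozen pretrained pathway
    ctrl = copy_w @ (x + cond)    # trainable copy sees the condition
    return base + zero_w @ ctrl   # zero_w starts at 0: output == base

d = 4
x, cond = np.ones(d), np.full(d, 0.5)
frozen_w = np.eye(d)              # stand-in for frozen pretrained weights
copy_w = np.eye(d)                # stand-in for the trainable copy
zero_w = np.zeros((d, d))         # zero-init: conditioning has no effect yet
y = controlnet_step(x, cond, frozen_w, copy_w, zero_w)
```

At initialisation `y` equals the frozen branch's output exactly; only as `zero_w` moves away from zero during fine-tuning does the conditioning signal start to steer generation.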
1 code implementation • 30 Oct 2023 • Meng Lou, Hong-Yu Zhou, Sibei Yang, Yizhou Yu
Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels.
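To make the contrast with a static kernel concrete, here is a minimal sketch (my simplification, not the paper's module) of an input-dependent 1D convolution: the kernel is generated from a global summary of the features, so representations produced upstream (e.g. by self-attention) can reshape the kernel rather than meeting a fixed one.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dynamic_conv1d(x, w_gen):
    """Input-conditioned 1D convolution: the kernel is generated from a
    global summary of x, so the convolution adapts to features produced
    by earlier layers instead of applying a static kernel."""
    summary = x.mean()                 # global average pool of the features
    kernel = softmax(w_gen * summary)  # input-dependent kernel weights
    return np.convolve(x, kernel, mode="same"), kernel

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w_gen = np.array([1.0, 0.0, -1.0])     # hypothetical kernel-generator weights
y, k = dynamic_conv1d(x, w_gen)
```

Feeding a different input produces a different kernel, which is exactly the adaptivity a static convolution lacks.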
2 code implementations • NeurIPS 2023 • Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang
Text-to-video is a rapidly growing research area that aims to generate a semantically faithful, identity-consistent, and temporally coherent sequence of frames that accurately aligns with the input text prompt.
no code implementations • ICCV 2023 • Jiajin Tang, Ge Zheng, Sibei Yang
Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries.
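A collection-then-distribution step of this flavour can be sketched with plain attention: the global referent token first pools information from the object queries (collection), and the updated referent is then injected back into each query (distribution). This is a hedged simplification, not the paper's exact mechanism:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def collect_distribute(referent, queries):
    """One collection-distribution round between a global referent token
    (shape (d,)) and N object queries (shape (N, d))."""
    w = softmax(queries @ referent)             # referent's attention over queries
    referent = referent + w @ queries           # collection: pool queries into referent
    g = softmax(queries @ referent)             # re-score with the updated referent
    queries = queries + np.outer(g, referent)   # distribution: inject referent back
    return referent, queries

queries = np.eye(3)                  # three toy object queries
referent = np.array([1.0, 0.0, 0.0])
r2, q2 = collect_distribute(referent, queries)
```

Stacking such rounds over time steps is one way to let a single referent token mediate spatio-temporal reasoning across per-frame object queries.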
no code implementations • ICCV 2023 • Jiajin Tang, Ge Zheng, Jingyi Yu, Sibei Yang
Its challenge lies in the fact that the object categories relevant to the task are too diverse to be limited to the closed vocabulary of traditional object detection.
no code implementations • ICCV 2023 • Cheng Shi, Sibei Yang
Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks.
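One widely used form of prompt engineering for CLIP-style zero-shot classifiers is prompt ensembling: embed several hand-written templates per class name and average the results. A minimal sketch, with a deterministic stand-in for the real text encoder (the templates and the `embed` function are illustrative assumptions):

```python
import hashlib
import numpy as np

TEMPLATES = ["a photo of a {}.", "a cropped photo of a {}.", "a drawing of a {}."]

def embed(text):
    """Stand-in text encoder: a deterministic pseudo-random unit vector.
    A real system would call a pretrained text encoder such as CLIP's here."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=8)
    return v / np.linalg.norm(v)

def class_embedding(name):
    """Prompt ensembling: embed each filled-in template, average, re-normalise."""
    vecs = [embed(t.format(name)) for t in TEMPLATES]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)

e_cat = class_embedding("cat")
```

Averaging over templates smooths out the sensitivity of the downstream task to any single phrasing, which is why hand-crafted or learned prompts can noticeably move zero-shot accuracy.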
no code implementations • ICCV 2023 • Cheng Shi, Sibei Yang
Vision-language models such as CLIP have boosted the performance of open-vocabulary object detection, where the detector is trained on base categories but is required to detect novel categories.
1 code implementation • CVPR 2023 • Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang
Referring image segmentation aims to segment the target referent in an image conditioned on a natural language expression.
no code implementations • ICCV 2023 • Yu Wu, Yana Wei, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He
This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models.
1 code implementation • 12 Apr 2023 • Zhenxiang Lin, Xidong Peng, Peishan Cong, Ge Zheng, Yujin Sun, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma
We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online captured multi-modal visual data, including 2D images and 3D LiDAR point clouds.
1 code implementation • 2 Jan 2023 • Hong-Yu Zhou, Chixiang Lu, Chaoqi Chen, Sibei Yang, Yizhou Yu
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative: the goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views.
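The comparative objective behind most such methods is an InfoNCE-style contrastive loss over the two siamese views. A minimal numpy sketch (a generic contrastive loss, not this paper's specific objective):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over two batches of embeddings from siamese views.
    z1[i] and z2[i] are a positive pair; all other pairs are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives on the diagonal

z = np.eye(4)  # four toy embeddings; each view matches itself
aligned_loss = info_nce(z, z)
```

When matched views agree (diagonal dominates), the loss is near zero; mismatching the pairs drives it up, which is the "invariant yet discriminative" pressure the snippet above refers to.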
no code implementations • 27 Sep 2022 • Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, Yizhou Yu
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few.
2 code implementations • ICCV 2021 • Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Xiaoguang Han, Yizhou Yu
From this perspective, we introduce Preservational Learning to reconstruct diverse image contexts in order to preserve more information in learned representations.
no code implementations • 11 Aug 2021 • Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu
Vision transformers have attracted much attention from computer vision researchers as they are not restricted to the spatial inductive bias of ConvNets.
1 code implementation • CVPR 2021 • Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, Yizhou Yu
In this paper, we tackle the challenge by jointly performing compositional visual reasoning and accurate segmentation in a single stage via the proposed novel Bottom-Up Shift (BUS) and Bidirectional Attentive Refinement (BIAR) modules.
1 code implementation • CVPR 2020 • Sibei Yang, Guanbin Li, Yizhou Yu
The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression.
no code implementations • ICCV 2019 • Sibei Yang, Guanbin Li, Yizhou Yu
In this paper, we explore the problem of referring expression comprehension from the perspective of language-driven visual reasoning, and propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression.
1 code implementation • CVPR 2019 • Sibei Yang, Guanbin Li, Yizhou Yu
Unfortunately, existing work on grounding referring expressions fails to accurately extract multi-order relationships from the referring expression and associate them with the objects and their related contexts in the image.
no code implementations • CVPR 2019 • Sibei Yang, Guanbin Li, Yizhou Yu
A feasible solution for grounding referring expressions not only needs to extract all the necessary information (i.e., objects and the relationships among them) from both the image and the referring expression, but also to compute and represent multimodal contexts from the extracted information.
no code implementations • 27 Apr 2019 • Xiang He, Sibei Yang, Guanbin Li, Haofeng Li, Huiyou Chang, Yizhou Yu
In this paper, we discover that global spatial dependencies and global contextual information in a biomedical image can be exploited to defend against adversarial attacks.
no code implementations • CVPR 2018 • Weifeng Ge, Sibei Yang, Yizhou Yu
In this paper, we propose a novel weakly supervised curriculum learning pipeline for multi-label object recognition, detection and semantic segmentation.
Ranked #21 on Weakly Supervised Object Detection on PASCAL VOC 2007
no code implementations • 20 Aug 2014 • Sibei Yang, Liangde Tao, Bingchen Gong
Data clustering is the process of identifying natural groupings or clusters within multidimensional data based on some similarity measure.
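The standard k-means (Lloyd's) algorithm is one concrete instance of clustering under a similarity measure, here Euclidean distance. A minimal sketch with naive initialisation (real implementations prefer k-means++ or random restarts):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid update. Naive init: the first k points."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(pts, 2)
```

On the two well-separated blobs above, the algorithm recovers the natural grouping within a few iterations.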