Since the rise of vision-language navigation (VLN), great progress has been made in instruction following -- building a follower to navigate environments under the guidance of instructions.
Prevalent semantic segmentation solutions, despite their different network designs (FCN based or attention based) and mask decoding strategies (parametric softmax based or pixel-query based), can be placed in one category, by considering the softmax weights or query vectors as learnable class prototypes.
In this paper, we instead address hierarchical semantic segmentation (HSS), which aims at structured, pixel-wise description of visual observation in terms of a class hierarchy.
Our target is to learn visual correspondence from unlabeled videos.
In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining abductive reasoning ability of machine intelligence in everyday visual situations.
In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
We extensively evaluate the effectiveness of our framework under different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy.
Video segmentation, i. e., partitioning video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, from enhancing visual effects in movie, to understanding scenes in autonomous driving, to virtual background creation in video conferencing, just to name a few.
As a fundamental problem for Artificial Intelligence, multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques.
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, it also inevitably introduces misaligned spatial information from adjacent frames, which confuses features of the target frame and yields inaccurate segmentation.
Ranked #5 on Referring Expression Segmentation on J-HMDB
To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner.
Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i. e., entailing an agent to navigate 3D environments through following linguistic instructions.
Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
3D data that contains rich geometry information of objects and scenes is valuable for understanding 3D physical world.
no code implementations • 17 Oct 2020 • Yunchao Wei, Shuai Zheng, Ming-Ming Cheng, Hang Zhao, LiWei Wang, Errui Ding, Yi Yang, Antonio Torralba, Ting Liu, Guolei Sun, Wenguan Wang, Luc van Gool, Wonho Bae, Junhyug Noh, Jinhwan Seo, Gunhee Kim, Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang, Chuangchuang Tan, Tao Ruan, Guanghua Gu, Shikui Wei, Yao Zhao, Mariia Dobko, Ostap Viniavskyi, Oles Dobosevych, Zhendong Wang, Zhenyuan Chen, Chen Gong, Huanqing Yan, Jun He
The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate the research in developing novel approaches that would harness the imperfect data and improve the data-efficiency during training.
This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes, associated with a few precisely labeled object instances.
Vision-language navigation (VLN) is the task of entailing an agent to carry out navigational instructions inside photo-realistic environments.
How to make a segmentation model efficiently adapt to a specific video and to online target appearance variations are fundamentally crucial issues in the field of video object segmentation.
Moreover, our approach ranked 1st place in the Weakly-Supervised Semantic Segmentation Track of CVPR2020 Learning from Imperfect Data Challenge.
In this paper, we propose a novel MOT framework that unifies object motion and affinity model into a single network, named UMA, in order to learn a compact feature that is discriminative for both object motion and affinity measure.
As human bodies are underlying hierarchically structured, how to model human structures is the central theme in this task.
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods which rely heavily on extensive annotated data.
The interaction recognition network has two crucial parts: a relation ranking module for high-quality HOI proposal selection and a triple-stream classifier for relation prediction.
The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively.
This paper proposes a human-aware deblurring model that disentangles the motion blur between foreground (FG) humans and background (BG).
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view.
Ranked #1 on Unsupervised Video Object Segmentation on YouTube
Through parametric message passing, AGNN is able to efficiently capture and mine much richer and higher-order relations between video frames, thus enabling a more complete understanding of video content and more accurate foreground estimation.
To alleviate this problem, we propose an NMT approach that heightens the adequacy in machine translation by transferring the semantic knowledge learned from bilingual sentence alignment.
This paper addresses a new problem of understanding human gaze communication in social videos from both atomic-level and event-level, which is significant for studying human social interactions.
The top-down process is used for coarse-to-fine saliency estimation, where high-level saliency is gradually integrated with finer lower-layer features to obtain a fine-grained result.
The first is the exploitation of an essential pyramid attention structure for salient object detection.
This paper conducts a systematic study on the role of visual attention in Unsupervised Video Object Segmentation (UVOS) tasks.
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation from a monocular RGB image.
Ranked #5 on 3D Human Pose Estimation on HumanEva-I
As an essential problem in computer vision, salient object detection (SOD) has attracted an increasing amount of research attention over the years.
This paper proposes a fast video salient object detection model, based on a novel recurrent network architecture, named Pyramid Dilated Bidirectional ConvLSTM (PDB-ConvLSTM).
Ranked #1 on Video Salient Object Detection on UVSD (using extra training data)
For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels.
Ranked #20 on Human-Object Interaction Detection on V-COCO
We collect a new dataset VideoCoAtt from public TV show videos, containing 380 complex video sequences with more than 492, 000 frames that include diverse social scenes for shared attention study.
This paper proposes a knowledge-guided fashion network to solve the problem of visual fashion analysis, e. g., fashion landmark localization and clothing category classification.
Salient object detection is then viewed as fine-grained object-level saliency segmentation and is progressively optimized with the guidance of the fixation map in a top-down manner.
Hyperparameters are numerical presets whose values are assigned prior to the commencement of the learning process.
By reformulating the standard F-measure we propose the relaxed F-measure which is differentiable w. r. t the posterior and can be easily appended to the back of CNNs as the loss function.
This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns.
Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments.
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation.
We introduce a novel semi-supervised video segmentation approach based on an efficient video representation, called as "super-trajectory".
Building on the observation that foreground areas are surrounded by the regions with high spatiotemporal edge values, geodesic distance provides an initial estimation for foreground and background.
Ranked #5 on Video Salient Object Detection on DAVSOD-Difficult20 (using extra training data)