Off-the-shelf single-stage multi-person pose regression methods generally leverage the instance score (i.e., the confidence of instance localization) to indicate pose quality when selecting pose candidates.
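As a minimal illustration of this selection step (function and variable names are hypothetical, not from any specific method), candidates can be ranked by their instance confidence and the top-k kept:

```python
# Sketch: rank regressed pose candidates by instance confidence
# and keep the top-k. Names and data are illustrative only.

def select_pose_candidates(candidates, k=2):
    """candidates: list of (instance_score, keypoints) tuples."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return ranked[:k]

poses = [
    (0.91, [(10, 20), (12, 25)]),  # high-confidence instance
    (0.35, [(50, 60), (52, 66)]),  # low-confidence instance
    (0.78, [(30, 40), (31, 44)]),
]
top = select_pose_candidates(poses, k=2)
```

The instance score here acts purely as a proxy for pose quality, which is exactly the assumption such regression methods make.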
Multi-person pose estimation methods generally follow top-down or bottom-up paradigms, both of which can be considered two-stage approaches and thus suffer from high computational cost and low efficiency.
For emerging content-based feature fusion, most existing matting methods focus only on local features, which lack the guidance of a global feature carrying strong semantic information about the object of interest.
ByteTrack also achieves state-of-the-art performance on the MOT20, HiEve, and BDD100K tracking benchmarks.
Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing and challenging task in computer vision, aiming to assign pre-defined semantic labels to the pixels of all frames in a given video.
To address this, we propose to mine contextual information beyond individual images to further augment the pixel representations.
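A minimal sketch of this idea, under the assumption (ours, not necessarily the paper's exact design) that cross-image context is kept as class-wise running-mean features in a memory bank and concatenated onto each pixel feature:

```python
# Illustrative cross-image context mining: pixel features are
# augmented with class-wise feature centroids accumulated over
# many images. All names and shapes are hypothetical.

memory = {}  # class label -> running-mean feature vector

def update_memory(label, feature, momentum=0.9):
    """Accumulate per-class context across images with a running mean."""
    if label not in memory:
        memory[label] = list(feature)
    else:
        memory[label] = [momentum * m + (1 - momentum) * f
                         for m, f in zip(memory[label], feature)]

def augment(feature, label):
    """Concatenate a pixel feature with its class's global context."""
    context = memory.get(label, [0.0] * len(feature))
    return feature + context

update_memory("person", [1.0, 0.0])
update_memory("person", [0.0, 1.0])
aug = augment([0.5, 0.5], "person")
```

The augmented representation thus carries information from pixels of the same class seen in other images, not just the current frame.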
In this work, we present a single-stage model, Body Meshes as Points (BMP), to simplify the pipeline and lift both efficiency and performance.
We extract a task-level degradation prior with the proposed ConditionNet, which is then used to adapt the parameters of the basic SR network (BaseNet).
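One simple way such conditional adaptation can work (a sketch under our own assumptions, not the authors' implementation) is for the condition network to map the degradation prior to per-channel scale factors that modulate the base network's weights:

```python
# Hypothetical sketch of parameter adaptation: a condition network
# turns a degradation prior into per-channel scales applied to the
# base SR network's weights. Shapes and names are illustrative.

def condition_net(degradation_prior):
    """Map a degradation-prior vector to per-channel scale factors."""
    # Toy mapping: 1 + prior, so a zero prior leaves weights unchanged.
    return [1.0 + p for p in degradation_prior]

def adapt_weights(base_weights, scales):
    """Scale each channel's weights by its condition-derived factor."""
    return [[w * s for w in channel]
            for channel, s in zip(base_weights, scales)]

base = [[0.5, -0.2], [0.1, 0.3]]    # two channels of BaseNet weights
scales = condition_net([0.0, 1.0])  # prior says channel 2 is degraded
adapted = adapt_weights(base, scales)
```

The key property is that a single BaseNet can specialize to different degradations at inference time without retraining, since only the modulation changes.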
Specifically, our proposed network consists of three main parts: a Siamese Encoder Module, a Center Guiding Appearance Diffusion Module, and a Dynamic Information Fusion Module.
To alleviate these issues, we propose a novel Spatial Preserve and Content-aware Network (SPCNet), which includes two effective modules: a Dilated Hourglass Module (DHM) and a Selective Information Module (SIM).
Semi-supervised video object segmentation is an interesting yet challenging task in machine learning.
In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and a Spatial Aware Path, is proposed to obtain effective features that carry both context and spatial information.
In this work, we propose a mask propagation network that treats video segmentation as a guided instance segmentation problem.
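A common way to realize such guidance (a minimal sketch under our own assumptions, not necessarily this network's exact design) is to stack the previous frame's mask with the current frame, so the segmentation model is steered toward the target instance:

```python
# Illustrative mask propagation: the previous frame's mask is paired
# with the current frame's pixels as an extra guidance channel.
# Data and names are hypothetical.

def stack_guidance(frame, prev_mask):
    """Return per-pixel (intensity, guidance) pairs for the network."""
    return [[(px, m) for px, m in zip(row, mask_row)]
            for row, mask_row in zip(frame, prev_mask)]

frame = [[0.2, 0.8], [0.4, 0.9]]
prev_mask = [[0, 1], [0, 1]]  # instance occupied the right column
guided = stack_guidance(frame, prev_mask)
```

At each new frame, the predicted mask becomes the guidance for the next one, which is what makes the process a propagation rather than independent per-frame segmentation.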