Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS).
Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations.
This paper studies the probability of error associated with the social machine learning framework, which involves an independent training phase followed by a cooperative decision-making phase over a graph.
Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and masked ``ground truth" targets can potentially be unreliable in this case.
Placing and orienting a camera to compose aesthetically meaningful shots of a scene is not only a key objective in real-world photography and cinematography but also for virtual content creation.
Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task with many real-world applications.
Motion-based video frame interpolation commonly relies on optical flow to warp pixels from the inputs to the desired interpolation instant.
Ranked #1 on Video Frame Interpolation on Xiph-4K (Crop)
Frame interpolation is an essential video processing technique that adjusts the temporal resolution of an image sequence.
LDET leads to significant improvements on many datasets in the open-world instance segmentation task, outperforming baselines on cross-category generalization on COCO, as well as cross-dataset evaluation on UVO and Cityscapes.
Recyclable waste detection poses a unique computer vision challenge as it requires detection of highly deformable and often translucent objects in cluttered scenes without the kind of context information usually present in human-centric datasets.
Recently, most ZSS methods focus on learning the visual-semantic correspondence to transfer knowledge from seen classes to unseen classes at the pixel level.
The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism and captures the same rich spatial context at a small fraction of the computational cost, by changing the order of operations.
Ranked #32 on Semantic Segmentation on DensePASS
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation.
Ranked #2 on Video Semantic Segmentation on Cityscapes val
Learning from a few examples is a challenging task for machine learning.
Then, the proposed Cascaded Refinement Network(CRN) takes the coarse segmentation as guidance to generate an accurate segmentation of full resolution.
Hence we propose a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in the action sequence with the assistance of global contextual information.
Ranked #7 on One-Shot 3D Action Recognition on NTU RGB+D 120
Our method drives the network to learn a Level Set function for salient objects so it can output more accurate boundaries and compact saliency.