Unlike previous integral gradient methods, our FAIG aims to find the most discriminative filters, rather than input pixels/features, for degradation removal in blind SR networks.
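To make the idea of attributing to filters rather than to inputs concrete, here is a minimal sketch of integrated gradients computed in parameter space. It averages the loss gradient along a straight line from a baseline set of filter weights to a tuned set, multiplies by the weight change, and sums absolute values per filter. The toy ReLU "network", the squared-error loss, and all function names are hypothetical stand-ins chosen for illustration; this is a generic sketch in the spirit of FAIG, not the paper's exact formulation.

```python
from typing import List, Tuple

Filters = List[List[float]]  # one weight vector per (toy) filter


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def forward(filters: Filters, x: List[float]) -> float:
    # Toy network (hypothetical): sum of the ReLU responses of every filter.
    return sum(max(0.0, dot(w, x)) for w in filters)


def loss(filters: Filters, x: List[float], target: float) -> float:
    # Squared error against a clean target value.
    return (forward(filters, x) - target) ** 2


def loss_grad(filters: Filters, x: List[float], target: float) -> Filters:
    # Analytic gradient of the squared error w.r.t. every filter weight;
    # a filter whose ReLU is inactive receives zero gradient.
    err = 2.0 * (forward(filters, x) - target)
    return [[err * xj if dot(w, x) > 0 else 0.0 for xj in x] for w in filters]


def filter_integrated_gradients(
    baseline: Filters, tuned: Filters, x: List[float], target: float, steps: int = 200
) -> Tuple[List[float], Filters]:
    # Average the gradient along the straight line baseline -> tuned
    # using a midpoint Riemann sum over `steps` points.
    avg_grad = [[0.0] * len(w) for w in tuned]
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [[b + alpha * (t - b) for b, t in zip(wb, wt)]
                 for wb, wt in zip(baseline, tuned)]
        for k, row in enumerate(loss_grad(point, x, target)):
            for j, g in enumerate(row):
                avg_grad[k][j] += g / steps
    # Per-parameter attribution: averaged gradient times parameter change.
    per_param = [[g * (t - b) for g, b, t in zip(gk, wb, wt)]
                 for gk, wb, wt in zip(avg_grad, baseline, tuned)]
    # Per-filter score: sum of absolute attributions over the filter's weights.
    scores = [sum(abs(v) for v in row) for row in per_param]
    return scores, per_param
```

For example, with `baseline = [[0.5, 0.5], [0.2, 0.1], [0.3, 0.3]]`, `tuned = [[1.0, 0.5], [0.2, 0.1], [0.3, 0.3]]`, input `[1.0, 2.0]` and target `3.0`, only filter 0 changes, so it receives a score of about 0.05 while the other two score 0. The attributions sum to `loss(tuned) - loss(baseline)`, which is the completeness property that makes integrated gradients attractive for ranking filters.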
In this work, we argue that features extracted by a pretrained extractor, e.g., I3D, are not specific to the WS-TAL task, so feature re-calibration is needed to reduce task-irrelevant information redundancy.
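One common way to re-calibrate frozen features is squeeze-and-excitation-style channel gating; the sketch below assumes that generic form and is not necessarily the module used in this work. It pools a T x C sequence of snippet features over time, maps the pooled vector through one linear layer and a sigmoid to get a gate per channel, and rescales every channel by its gate. The function names and the hand-set weights are hypothetical; in practice the gating weights would be learned.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def recalibrate(features: Matrix, weight: Matrix, bias: List[float]) -> Tuple[Matrix, List[float]]:
    # features: T x C snippet features from a frozen extractor (e.g., I3D).
    T, C = len(features), len(features[0])
    # Squeeze: average each channel over time.
    pooled = [sum(row[c] for row in features) / T for c in range(C)]
    # Excitation: one C x C linear layer plus a sigmoid gives one gate per channel.
    gates = [sigmoid(sum(weight[c][k] * pooled[k] for k in range(C)) + bias[c])
             for c in range(C)]
    # Re-calibration: rescale every channel of every snippet by its gate.
    return [[row[c] * gates[c] for c in range(C)] for row in features], gates
```

With zero weights and `bias = [10.0, -10.0]`, channel 0 passes through almost unchanged while channel 1 is driven close to zero, which is the kind of suppression of task-irrelevant channels the re-calibration is meant to learn.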
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images.
Recently, query-based deep networks have attracted considerable attention owing to their end-to-end pipelines and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation.
The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, as well as the one-to-one correspondence between mask RoI features and object queries in the same stage.
For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to localize the same instance at the pixel level in other frames.
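As a rough illustration of using an instance feature from one frame to localize that instance pixel by pixel in another frame, the sketch below assumes the simplest possible form: a 1x1 dynamic filter, i.e. a dot product between the instance embedding and each pixel embedding of the other frame, followed by a sigmoid and a threshold. The function names, the threshold of 0.5, and the toy feature values are hypothetical; the actual crossover scheme may use a richer mask head.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def localize_in_frame(
    instance_feat: Vector, pixel_feats: List[List[Vector]], threshold: float = 0.5
) -> Tuple[List[List[float]], List[List[bool]]]:
    # pixel_feats: H x W x D feature map of a *different* frame.
    # Each pixel's score is the dot product with the instance embedding
    # (a 1x1 dynamic convolution), squashed to a probability.
    probs = [[sigmoid(sum(a * b for a, b in zip(instance_feat, px))) for px in row]
             for row in pixel_feats]
    # Binary mask of pixels assigned to the instance.
    mask = [[p >= threshold for p in row] for row in probs]
    return probs, mask
```

For an instance feature `[2.0, -1.0]` and a 2x2 frame with pixel features `[[1, 0], [0, 1]]` in the first row and `[[1, -1], [-1, 0]]` in the second, the dot products are 2, -1, 3 and -2, so the left column is marked as the instance. During training, the same comparison can be supervised with the instance's ground-truth mask in the other frame.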
Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.
Short-video applications such as TikTok and Kwai have become hugely popular recently.
Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details.
Motivated by this observation, in this paper we propose to investigate the interaction detection problem from a novel topological perspective by analyzing the connectivity in neural networks.
Detecting statistical interactions between input features is a crucial and challenging task.
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
Video object detection is a challenging task because video sequences captured in complex environments often have degraded quality.
Building on top of the powerful concept of semantic learning, this paper proposes a Recurrent Binary Embedding (RBE) model that learns compact representations for real-time retrieval.
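To give a feel for how a recurrently refined binary code can stay compact yet accurate, the sketch below uses greedy residual binarization: at each step it takes the sign of the current residual as a {-1, +1} code, picks the least-squares scale for that code (the mean absolute residual), and subtracts the result. This is a non-learned stand-in chosen for illustration; RBE learns its recurrent refinement end to end, and the function names here are hypothetical. Zero residual entries are mapped to +1 so every code stays strictly binary.

```python
from typing import List, Tuple

Code = Tuple[float, List[int]]  # (scale, binary vector in {-1, +1})


def binarize_residuals(v: List[float], steps: int) -> List[Code]:
    codes: List[Code] = []
    residual = list(v)
    for _ in range(steps):
        # Binary code for what is still unexplained (sign of the residual).
        bits = [1 if r >= 0 else -1 for r in residual]
        # Least-squares scale for a sign code: mean absolute residual.
        scale = sum(abs(r) for r in residual) / len(residual)
        codes.append((scale, bits))
        # Refine: remove the part captured by this binary step.
        residual = [r - scale * b for r, b in zip(residual, bits)]
    return codes


def reconstruct(codes: List[Code], dim: int) -> List[float]:
    # Real-valued approximation: sum of scaled binary vectors.
    out = [0.0] * dim
    for scale, bits in codes:
        for i, b in enumerate(bits):
            out[i] += scale * b
    return out


def approx_dot(query: List[float], codes: List[Code]) -> float:
    # Retrieval score computed directly from the codes without decoding:
    # sum over steps of scale_k * (query . b_k).
    return sum(scale * sum(q * b for q, b in zip(query, bits)) for scale, bits in codes)
```

For `v = [0.9, -0.3, 0.4, -1.2]`, the squared reconstruction error drops from 2.5 (no code) to 0.54 after one binary step and to 0.05 after two, illustrating how each extra binary vector trades a few more bits for accuracy while scoring still needs only sign-vector dot products.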
Built on a representative DNN model, Deep Crossing, and two forest/tree-based models, XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is shown to achieve performance on par with or slightly better than its DNN counterpart, at only a fraction of the serving time on conventional hardware.