By contrast, a discriminative classifier models only the conditional distribution of labels given inputs, but benefits from effective optimization owing to its succinct structure.
When smartphone cameras are used to take photos of digital screens, moiré patterns often appear, severely degrading photo quality.
It is a challenging problem since (1) the identification process is susceptible to over-fitting with limited samples of an object, and (2) the sample imbalance between a base (known knowledge) category and a novel category easily biases the recognition results.
To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis.
Conventional few-shot classification aims at learning a model on a large labeled base dataset and rapidly adapting it to a target dataset drawn from the same distribution as the base dataset.
Recent advances in text-to-3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to supervise 3D generation.
We propose to integrate the effectiveness of gamma correction with the strong modeling capacity of deep networks, which enables the correction factor gamma to be learned in a coarse-to-fine manner by adaptively perceiving the deviated illumination.
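For context, classic gamma correction itself is a one-line power law. The sketch below is purely illustrative and is not the paper's adaptive network: it applies `I_out = I_in ** gamma` to normalized intensities and naively searches a candidate list for the scalar gamma that best matches an assumed target mean brightness.

```python
# Illustrative only (not the paper's learned model): classic gamma
# correction plus a naive scalar fit toward a target mean intensity.

def gamma_correct(pixels, gamma):
    """Apply I_out = I_in ** gamma to intensities in [0, 1]."""
    return [p ** gamma for p in pixels]

def fit_gamma(pixels, target_mean, candidates):
    """Pick the candidate gamma whose output best matches target_mean."""
    def mean(xs):
        return sum(xs) / len(xs)
    return min(candidates,
               key=lambda g: abs(mean(gamma_correct(pixels, g)) - target_mean))

# A dim, under-exposed image: all intensities well below 0.5.
dim = [0.05, 0.10, 0.15, 0.20, 0.25]
best = fit_gamma(dim, target_mean=0.5, candidates=[g / 10 for g in range(1, 21)])
bright = gamma_correct(dim, best)
```

Because the image is under-exposed, the fitted gamma comes out below 1, which brightens the intensities; a learned approach replaces the candidate search with a network that predicts gamma from the image itself.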
To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence.
The stereo event-intensity camera setup is widely applied to leverage the advantages of both event cameras with low latency and intensity cameras that capture accurate brightness and texture information.
Recent diffusion-based methods outperform traditional models on many image restoration tasks, but they suffer from long inference times.
The text consists of a category name and a fixed number of learnable parameters, which are selected from our designed attribute word bank and serve as attributes.
However, with only a few training images, there exist two crucial problems: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) the alignment between the visual and language feature distributions is difficult.
Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models.
In recent years, videos and images in 720p (HD), 1080p (FHD) and 4K (UHD) resolution have become more popular for display devices such as TVs, mobile phones and VR.
1 code implementation • 26 Apr 2023 • Bingqian Lin, Zicong Chen, Mingjie Li, Haokun Lin, Hang Xu, Yi Zhu, Jianzhuang Liu, Wenjia Cai, Lei Yang, Shen Zhao, Chenfei Wu, Ling Chen, Xiaojun Chang, Yi Yang, Lei Xing, Xiaodan Liang
In MOTOR, we combine two kinds of basic medical knowledge, i.e., general and specific knowledge, in a complementary manner to boost the general pretraining process.
Face animation has achieved much progress in computer vision.
This paper aims at demystifying a single motion-blurred image with events and revealing temporally continuous scene dynamics encrypted behind motion blurs.
IDM integrates an implicit neural representation and a denoising diffusion model in a unified end-to-end framework, where the implicit neural representation is adopted in the decoding process to learn continuous-resolution representation.
Ranked #1 on Image Super-Resolution on CelebA-HQ 128x128
Super-Resolution from a single motion Blurred image (SRB) is a severely ill-posed problem due to the joint degradation of motion blurs and low spatial resolution.
Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position.
Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
Pre-training a vision-language model and then fine-tuning it on downstream tasks has become a popular paradigm.
However, compared to image-language pre-training, video-language pre-training (VLP) has lagged far behind due to the lack of large amounts of video-text pairs.
The parallel isomeric attention module is used as the video encoder, which consists of two parallel branches modeling the spatial-temporal information of videos from both patch and frame levels.
Extensive experiments on seven benchmark datasets verify that the proposed SmartAssign exploits an effective connection between rain and snow, and clearly improves the performance of both deraining and desnowing.
FC-Net is based on the observation that the visible parts of pedestrians are selective and decisive for detection, and is implemented as a self-paced feature learning framework with a self-activation (SA) module and a feature calibration (FC) module.
Then in the Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence embedding to localize the referred object, and finally projected back to aggregate the pixels for the target.
In FNeVR, we design a 3D Face Volume Rendering (FVR) module to enhance the facial details for image rendering.
To this end, we propose a novel Structure-Preserving Graph Representation Learning (SPGRL) method, to fully capture the structure information of graphs.
Inspired by our studies, we propose to remove rain by learning favorable deraining representations from other connected tasks.
Second, according to the similarity between incremental knowledge and base knowledge, we design an adaptive fusion of incremental knowledge, which helps the model allocate capacity to knowledge of varying difficulty.
Low-light video enhancement (LLVE) is an important yet challenging task with many applications such as photographing and autonomous driving.
Top-down methods dominate the field of 3D human pose and shape estimation, because they are decoupled from human detection and allow researchers to focus on the core problem.
Ranked #1 on Unsupervised 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)
Specifically, we maximize the mutual information (MI) of instances and their representations with a low-bias MI estimator to perform self-supervised pre-training.
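One standard family of MI estimators used for such self-supervised pre-training is InfoNCE, which lower-bounds MI via a contrastive classification loss: `I(X; Z) >= log(N) - L_InfoNCE`. The sketch below illustrates that bound on toy score matrices; it is a generic contrastive bound, not necessarily the specific low-bias estimator the paper proposes.

```python
import math

# Generic InfoNCE-style lower bound on mutual information (illustrative;
# the paper's particular low-bias estimator may differ). scores[i][j] is
# a similarity between instance i and representation j; the loss is the
# mean cross-entropy of identifying the matching pair on the diagonal.

def info_nce(scores):
    n = len(scores)
    loss = 0.0
    for i in range(n):
        log_norm = math.log(sum(math.exp(s) for s in scores[i]))
        loss += -(scores[i][i] - log_norm)
    return loss / n

def mi_lower_bound(scores):
    return math.log(len(scores)) - info_nce(scores)

# Well-aligned pairs score high on the diagonal -> bound approaches log(N).
aligned = [[5.0, 0.0, 0.0],
           [0.0, 5.0, 0.0],
           [0.0, 0.0, 5.0]]
# Uninformative scores -> bound collapses to zero.
random_scores = [[1.0, 1.0, 1.0]] * 3
```

Maximizing this bound with respect to the encoder pushes each instance's representation to be most similar to its own view, which is the mechanism behind MI-based pre-training.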
Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i. e., make instruction-asked actions sequentially in complex visual environments.
To tackle this problem, we propose a depth solving system that fully explores the visual clues from the subtasks in M3OD and generates multiple estimations for the depth of each target.
Color constancy aims to restore the constant colors of a scene under different illuminants.
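The simplest classical baseline for this task is the gray-world algorithm, sketched below for context; it is not the paper's method. It assumes the average scene reflectance is achromatic, so each color channel is rescaled to share a common mean.

```python
# Classical gray-world baseline for color constancy (illustrative; not
# the learned approach described above).

def gray_world(image):
    """image: list of (r, g, b) floats in [0, 1]. Returns balanced pixels."""
    n = len(image)
    means = [sum(px[c] for px in image) / n for c in range(3)]
    gray = sum(means) / 3.0
    return [tuple(px[c] * gray / means[c] for c in range(3)) for px in image]

# A scene under a reddish illuminant: the red channel is uniformly inflated.
tinted = [(0.8, 0.4, 0.4), (0.6, 0.3, 0.3), (0.4, 0.2, 0.2)]
balanced = gray_world(tinted)
```

After correction the three channel means coincide, i.e., the global cast introduced by the illuminant is removed; learned methods aim to estimate the illuminant more robustly than this global statistic.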
The key of referring expression comprehension lies in capturing the cross-modal visual-linguistic relevance.
Building upon RMI, we further propose a new search algorithm termed RMI-NAS, together with a theorem that guarantees the global optimality of the searched architecture.
For zero-shot image restoration, we design a novel model, termed SiamTrans, which is constructed by Siamese transformers, encoders, and decoders.
In this paper, we observe an interesting phenomenon of intra-class heterogeneity in real data and show that existing methods fail to retain this property in their synthetic images, which causes a limited performance increase.
In this paper, we propose an end-to-end learning framework for event-based motion deblurring in a self-supervised manner, where real-world events are exploited to alleviate the performance degradation caused by data inconsistency.
The frequency-guided upsampling module reconstructs fine details from multiple detail-rich, frequency-specific components.
To obtain a single model that works across multiple target domains, we propose to simultaneously learn a student model that is trained not only to imitate the output of each expert on the corresponding target domain, but also to pull the different experts close to each other with regularization on their weights.
Ranked #3 on Domain Adaptation on GTAV to Cityscapes+Mapillary
Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration.
Ranked #1 on Deblurring on RealBlur-R (trained on GoPro)
Channel pruning and tensor decomposition have received extensive attention in convolutional neural network compression.
Despite the substantial progress of active learning for image recognition, instance-level active learning methods tailored to object detection are still lacking.
Ranked #1 on Active Object Detection on PASCAL VOC 07+12
We prove that reviving the "dead weights" by ReCU can result in a smaller quantization error.
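The intuition can be checked on a toy weight vector. The sketch below is a simplified stand-in for ReCU: the actual method uses a quantile-based rectified clamp, whereas here a fixed threshold `tau` is assumed. Binarization maps each weight to `alpha * sign(w)` with `alpha = mean(|w|)`; clamping outliers ("dead weights") first shrinks the quantization error.

```python
# Simplified sketch of reviving dead weights before binarization.
# NOTE: the real ReCU clamp is quantile-based; tau here is a stand-in.

def quant_error(weights):
    """Squared error of binarizing w -> alpha * sign(w), alpha = mean|w|."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return sum((w - alpha * (1 if w >= 0 else -1)) ** 2 for w in weights)

def recu_clamp(weights, tau):
    """Rectify extreme weights back into [-tau, tau]."""
    return [max(-tau, min(tau, w)) for w in weights]

w = [3.0, -3.0, 0.5, -0.5, 0.4, -0.4]   # two outliers inflate alpha
err_raw = quant_error(w)                 # alpha = 1.3 -> error 8.68
err_revived = quant_error(recu_clamp(w, tau=1.0))
```

With the outliers clamped, the scaling factor alpha fits the remaining weights far better, so the quantization error drops by more than an order of magnitude on this example.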
To address the task of SSDA, a novel framework based on dual-level domain mixing is proposed.
Multi-source unsupervised domain adaptation~(MSDA) aims at adapting models trained on multiple labeled source domains to an unlabeled target domain.
Ranked #1 on Domain Adaptation on GTA5+Synscapes to Cityscapes
In this paper, we present a very simple yet effective method, named Neighbor2Neighbor, to train an image denoising model with only noisy images.
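The core trick in Neighbor2Neighbor is a neighbor subsampler: each 2x2 cell of a noisy image contributes one pixel to each of two half-resolution sub-images, and the two picks are always adjacent, so one sub-image can serve as the network input and the other as the noisy training target. The sketch below implements only that subsampler (training loop and regularizer omitted), as a best-effort reading of the idea rather than the reference implementation.

```python
import random

# Neighbor subsampling sketch: from one noisy image, build two half-size
# images whose corresponding pixels are always adjacent neighbors.

def neighbor_subsample(img, rng):
    """img: 2D list with even height/width. Returns two half-size images."""
    h, w = len(img), len(img[0])
    sub1 = [[0.0] * (w // 2) for _ in range(h // 2)]
    sub2 = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            cell = [(i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1)]
            a = rng.choice(cell)
            # a's neighbors inside the same 2x2 cell (Manhattan distance 1)
            nbrs = [p for p in cell if abs(p[0] - a[0]) + abs(p[1] - a[1]) == 1]
            b = rng.choice(nbrs)
            sub1[i // 2][j // 2] = img[a[0]][a[1]]
            sub2[i // 2][j // 2] = img[b[0]][b[1]]
    return sub1, sub2

# Encode position as value 10*row + col so adjacency is easy to verify.
noisy = [[float(10 * r + c) for c in range(4)] for r in range(4)]
s1, s2 = neighbor_subsample(noisy, random.Random(0))
```

Because the noise at adjacent pixels is (approximately) independent while the underlying signal is similar, regressing `s1` onto `s2` trains a denoiser without any clean targets.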
The experiments show that our method achieves new state-of-the-art results on the lane detection benchmarks.
Owing to their superior ability to model global dependencies, Transformers and their variants have become the primary choice for many vision-and-language tasks.
To avoid such problematic models in occluded person ReID, we propose the Occlusion-Aware Mask Network (OAMN).
In this paper, we propose a self-adaptive learning method for demoiréing a high-frequency image, with the help of an additional defocused moiré-free blur image.
In this paper, binarized neural architecture search (BNAS), with a search space of binarized convolutions, is introduced to produce extremely compressed models, reducing the huge computational cost on embedded devices for edge computing.
This paper presents a learning-based approach to synthesize the view from an arbitrary camera position given a sparse set of images.
Domain generalization (DG) serves as a promising solution to handle person Re-Identification (Re-ID), which trains the model using labels from the source domain alone, and then directly adopts the trained model to the target domain without model updating.
For reducing the solution space, we first model the adversarial perturbation optimization problem as a process of recovering frequency-sparse perturbations with compressed sensing, under the setting that random noise in the low-frequency space is more likely to be adversarial.
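The "frequency-sparse" part of this setting can be illustrated in isolation. The sketch below does not reproduce the compressed-sensing recovery; it merely synthesizes a 1D perturbation from only the first K orthonormal DCT-II basis vectors, so all of its high-frequency coefficients are exactly zero by construction.

```python
import math, random

# Illustration of a frequency-sparse perturbation (the paper's
# compressed-sensing recovery procedure is not reproduced here).

def dct_basis(k, n):
    """k-th orthonormal DCT-II basis vector of length n."""
    scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return [scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for t in range(n)]

def low_freq_perturbation(n, k_max, rng):
    """Random signal supported on the first k_max DCT coefficients only."""
    coeffs = [rng.uniform(-1, 1) for _ in range(k_max)]
    basis = [dct_basis(k, n) for k in range(k_max)]
    return [sum(coeffs[k] * basis[k][t] for k in range(k_max))
            for t in range(n)]

n = 16
delta = low_freq_perturbation(n, k_max=4, rng=random.Random(1))
# By orthogonality, the projection onto any high-frequency basis vector
# vanishes (up to floating-point error).
high = dct_basis(10, n)
leak = sum(d * h for d, h in zip(delta, high))
```

Restricting the perturbation to a handful of low-frequency coefficients is exactly what makes the search space small enough for sparse-recovery machinery to apply.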
In this paper, we propose a novel high-order residual network to learn the geometric features hierarchically from the LF for reconstruction.
Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors.
We introduce the first method for automatic image generation from scene-level freehand sketches.
Ranked #2 on Sketch-to-Image Translation on SketchyCOCO
Our approach, referred to as FilterSketch, encodes the second-order information of pre-trained weights, which enables the representation capacity of pruned networks to be recovered with a simple fine-tuning procedure.
In this paper, we propose a Multiple Instance Learning (MIL) approach that selects anchors and jointly optimizes the two modules of a CNN-based object detector.
Ranked #119 on Object Detection on COCO test-dev
A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models.
The BGA method is proposed to modify the binarization process of GBCNs to alleviate the local minima problem, which can significantly improve the performance of 1-bit DCNNs.
By introducing a target attention loss, the pedestrian features extracted from the foreground branch become less sensitive to backgrounds, which greatly reduces the negative impact of changing backgrounds on matching the same person across different camera views.
The CiFs can be easily incorporated into existing deep convolutional neural networks (DCNNs), which leads to new Circulant Binary Convolutional Networks (CBCNs).
The task of single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) image.
Binarized convolutional neural networks (BCNNs) are widely used to improve the memory and computation efficiency of deep convolutional neural networks (DCNNs) for applications based on mobile and AI chips.
Deep convolutional neural networks (DCNNs) have dominated the recent developments in computer vision by producing a series of record-breaking models.
Therefore, NAS can be transformed into a multinomial distribution learning problem, i.e., the distribution is optimized to have a high expectation of performance.
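A toy version of this idea fits in a few lines: maintain a categorical (multinomial) distribution over candidate operations and shift probability mass toward whichever sampled operation performs better. In the sketch below the per-operation "accuracies" are fixed stand-ins for the validation feedback a real search would measure, and the update rule is a deliberately simple winner-take-mass heuristic, not the paper's estimator.

```python
import random

# Toy sketch of NAS as multinomial distribution learning. ACC values are
# invented stand-ins for measured validation accuracy of each operation.

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]
ACC = {"conv3x3": 0.92, "conv5x5": 0.90, "maxpool": 0.80, "identity": 0.70}

def search(steps, lr, rng):
    probs = {op: 1.0 / len(OPS) for op in OPS}
    for _ in range(steps):
        a, b = rng.sample(OPS, 2)             # sample two distinct ops
        winner, loser = (a, b) if ACC[a] >= ACC[b] else (b, a)
        moved = min(lr, probs[loser])         # keep probabilities >= 0
        probs[winner] += moved
        probs[loser] -= moved
    return probs

probs = search(steps=500, lr=0.01, rng=random.Random(0))
best = max(probs, key=probs.get)
```

Since the strongest operation never loses a comparison, probability mass steadily flows toward it, and the distribution's mode identifies the best architecture choice without ever enumerating the space.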
The relationship between the input feature maps and 2D kernels is revealed in a theoretical framework, based on which a kernel sparsity and entropy (KSE) indicator is proposed to quantify feature map importance in a feature-agnostic manner to guide model compression.
The advancement of deep convolutional neural networks (DCNNs) has driven significant improvement in the accuracy of recognition systems for many computer vision tasks.
Specifically, the TARM is deployed in a residual learning module that employs a novel attention learning network to recalibrate the temporal attention of frames in a skeleton sequence.
Ranked #83 on Skeleton Based Action Recognition on NTU RGB+D
Compression artifacts reduction (CAR) is a challenging problem in the field of remote sensing.
Steerable properties dominate the design of traditional filters, e.g., Gabor filters, and endow features with the capability of handling spatial transformations.
In this paper, we propose a new approach to overcome the representation and matching problems in age-invariant face recognition.
We propose a simple and intuitive approximation to conventional spectral clustering methods.
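For reference, the standard spectral clustering pipeline that such approximations accelerate is: Gaussian affinity matrix, normalized Laplacian, then the eigenvector of the second-smallest eigenvalue (the Fiedler vector), whose sign splits two well-separated groups. The sketch below implements that baseline, not the proposed approximation.

```python
import numpy as np

# Baseline spectral bipartition (the conventional method, not the
# approximation proposed in the paper).

def spectral_bipartition(points, sigma=1.0):
    X = np.asarray(points, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian affinity
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)        # ascending eigenvalues
    fiedler = vecs[:, 1]                      # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Two tight groups, well separated along a line.
pts = [[0.0, 0.0], [0.5, 0.0], [4.0, 0.0], [4.5, 0.0]]
labels = spectral_bipartition(pts)
```

The cost of this pipeline is dominated by the eigendecomposition of an n x n Laplacian, which is exactly what approximation schemes target for large n.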
Reconstructing 3D objects from single line drawings is often desirable in computer vision and graphics applications.