Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio?
Thus, the optimum of the distillation loss does not necessarily lead to the optimal student classification scores for dense object detectors.
Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems under backdoor attacks remains largely unexplored.
Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present.
In this work, we propose to explicitly model heights in the BEV space, which needs no extra data like LiDAR and can fit arbitrary camera rigs and types compared to modeling depths.
In this work, we propose to solve the hard sample issue with a Memory-augmented Progressive Learning network (GaitMPL), including Dynamic Reweighting Progressive Learning module (DRPL) and Global Structure-Aligned Memory bank (GSAM).
The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches.
In this paper, we propose a simple yet effective transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Gait is one of the most promising biometrics that aims to identify pedestrians from their walking patterns.
Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude.
Towards this goal, MetaGait injects meta-knowledge, which could guide the model to perceive sample-specific properties, into the calibration network of the attention mechanism to improve the adaptiveness from the omni-scale, omni-dimension, and omni-process perspectives.
Different from universal object detection, referring expression comprehension (REC) aims to locate specific objects referred to by natural language expressions.
By conducting a complexity analysis, we prove that DDPG-based solutions achieve runtimes in the range of sub-milliseconds, meeting the strict latency requirements of C-V2N services.
We argue that there is a scope to improve the fusion performance with the help of the FusionBooster, a model specifically designed for the fusion task.
To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.
Ranked #1 on Layout-to-Image Generation on Visual Genome 128x128
Point cloud panoptic segmentation is a challenging task that seeks a holistic solution for both semantic and instance segmentation to predict groupings of coherent points.
Automatic prohibited item detection in security inspection X-ray images is necessary for transportation. The abundance and diversity of X-ray security images containing prohibited items, termed prohibited X-ray security images, are essential for training the detection model.
In this study, we propose an improved model called DeSTSeg, which integrates a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network into one framework.
Ranked #31 on Anomaly Detection on MVTec AD
To explore the role of the relation between edges, this paper proposes a novel Adaptive Edge-to-Edge Interaction Learning module, which aims to enhance the point-to-point relation through modelling the edge-to-edge interaction in the local region adaptively.
Existing methods mainly extract the text information from only one sentence to represent an image, and the text representation strongly affects the quality of the generated image.
Bird's eye view (BEV) representation is a new perception formulation for autonomous driving, which is based on spatial fusion.
In this paper, to address this problem, we propose a novel cost-efficient Dynamic Low-resolution Distillation (DLD) text spotting framework, which aims to infer images in different small but recognizable resolutions and achieve a better balance between accuracy and efficiency.
Denoising Diffusion Probabilistic Model (DDPM) is able to make flexible conditional image generation from prior noise to real data, by introducing an independent noise-aware classifier to provide conditional gradient guidance at each time step of denoising process.
Ranked #1 on Conditional Image Generation on ImageNet 128x128
With the help of the anchor-driven representation, we then reformulate the lane detection task as an ordinal classification problem to get the coordinates of lanes.
To transfer knowledge between discriminators, we design a multi-level discriminant knowledge distillation from the source discriminator to the target discriminator on both the real and fake samples.
Formulated as a conditional generation problem, face animation aims at synthesizing continuous face images from a single source image driven by a set of conditional face motion.
In this paper, we derive closed-form formulas of first-order approximation for down-and-out barrier and floating strike lookback put option prices under a stochastic volatility model, by using an asymptotic approach.
In this paper, inspired by self-training of semi-supervised learning, we propose a novel approach to solve the lack of annotated data from another angle, called medical image pixel rearrangement (MIPR for short).
As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires a large amount of multi-grained information of visual and linguistic modalities to realize accurate reasoning.
In practice, new images are usually made available in a consecutive manner, leading to a problem called Continual Semantic Segmentation (CSS).
Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years.
When training samples are scarce, the semantic embedding technique, i.e., describing class labels with attributes, provides a condition to generate visual features for unseen objects by transferring the knowledge from seen objects.
Considering the frontier advances of Transformer architecture in the computer vision field, in this paper, we present the first attempt at designing a Transformer-based damage assessment architecture (DamFormer).
Ranked #6 on Extracting Buildings In Remote Sensing Images on xBD
Recently, there has been growing attention on one-stage panoptic segmentation methods, which aim to segment instances and stuff jointly and efficiently within a fully convolutional pipeline.
The global influential factor of the reference to the citing paper is the product of the local influential factor and the total influential factor of the citing paper.
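The product rule stated above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the reference keys, and the numeric values are hypothetical, and how the local and total influential factors are themselves computed is not specified in the text.

```python
from typing import Dict


def global_influence(local_factor: float, citing_total: float) -> float:
    """Global factor of one reference = its local factor times the
    total influential factor of the citing paper (product rule above)."""
    return local_factor * citing_total


def global_influences(local_factors: Dict[str, float],
                      citing_total: float) -> Dict[str, float]:
    """Apply the product rule to every reference of a single citing paper."""
    return {ref: global_influence(lf, citing_total)
            for ref, lf in local_factors.items()}


# Hypothetical inputs: two references cited by one paper whose total
# influential factor is assumed to be 8.0.
local = {"ref_a": 0.5, "ref_b": 0.25}
scores = global_influences(local, citing_total=8.0)
```

Because the total factor of the citing paper is shared by all of its references, the ranking of references within one citing paper is decided by the local factors alone in this sketch.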
A DNN under attack will predict an attacker-desired target class whenever a test sample from any source class is embedded with a backdoor pattern, while still correctly classifying clean (attack-free) test samples.
As a challenging task, unsupervised person ReID aims to match query images to images of the same identity without requiring any labeled information.
We describe a gradient-based method to discover local error maximizers of a deep neural network (DNN) used for regression, assuming the availability of an "oracle" capable of providing real-valued supervision (a regression target) for samples.
The key issue of the direct recognition is to preserve identity information of secret images into container images and make container images look similar to cover images at the same time.
With the rapid development of social media, a tremendous number of videos with new classes are generated daily, which raises an urgent demand for video classification methods that can continuously update new classes while maintaining the knowledge of old videos with limited storage and computing resources.
In this paper, we propose a novel image processing scheme called class-based expansion learning for image classification, which aims at improving the supervision-stimulation frequency for the samples of the confusing classes.
These tokens or phrases may originate from primary fragmental textual pieces (e.g., segments) in the original text and are separated into different segments.
To diversify the extrinsic factors of gait, we build a complicated scene with a dense camera layout.
This paper presents a better TLS approach for automatically and dynamically determining the TLS timeline length.
Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs.
Inspired by the observation that audiences have different visual preferences on foreground and background objects, we for the first time propose to use saliency masks in the evaluation processes of the task of video frame interpolation.
In principle, the feature modeling scheme is carried out in a depth-sensitive attention module, which leads to the RGB feature enhancement as well as the background distraction reduction by capturing the depth geometry prior.
In this paper, we propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization.
Unsupervised domain adaptation (UDA) typically carries out knowledge transfer from a label-rich source domain to an unlabeled target domain by adversarial learning.
Photorealistic style transfer is a challenging task, which demands that the stylized image remain realistic.
The paths leading to future networks are pointing towards a data-driven paradigm to better cater to the explosive growth of mobile services as well as the increasing heterogeneity of mobile devices, many of which generate and consume large volumes and variety of data.
Networking and Internet Architecture
With the motivation of practical gait recognition applications, we propose to automatically create a large-scale synthetic gait dataset (called VersatileGait) by a game engine, which consists of around one million silhouette sequences of 11,000 subjects with fine-grained attributes in various complicated scenarios.
With the proof, we naturally generalize the compression of the channel attention mechanism in the frequency domain and propose our method with multi-spectral channel attention, termed as FcaNet.
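The excerpt does not spell out the proof it refers to. A minimal sketch, assuming it concerns the relation between global average pooling (GAP) and the lowest-frequency component of a 2D discrete cosine transform (DCT), is given below. The function names and the tiny feature map are hypothetical, and the unnormalized DCT-II basis is an assumption made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct2_component(x: Matrix, u: int, v: int) -> float:
    """Unnormalized 2D DCT-II coefficient of x at frequency (u, v)."""
    h, w = len(x), len(x[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            basis = (math.cos(math.pi * u * (i + 0.5) / h)
                     * math.cos(math.pi * v * (j + 0.5) / w))
            total += x[i][j] * basis
    return total


def global_average_pool(x: Matrix) -> float:
    """Mean over all spatial positions of a single channel."""
    h, w = len(x), len(x[0])
    return sum(sum(row) for row in x) / (h * w)


# Hypothetical 2x2 single-channel feature map.
feature_map = [[1.0, 2.0], [3.0, 4.0]]
lowest = dct2_component(feature_map, 0, 0)   # basis is all ones at (0, 0)
pooled = global_average_pool(feature_map)
height, width = len(feature_map), len(feature_map[0])
```

At frequency (0, 0) both cosine factors equal 1, so `lowest` equals `pooled * height * width`. Under this reading, using further frequency indices (u, v) is what "multi-spectral" would add beyond plain average pooling.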
To the best of our knowledge, our CADDet is the first work to introduce a dynamic routing mechanism in object detection.
Arbitrary-shaped text detection is a challenging task due to the complex geometric layouts of texts such as large aspect ratios, various scales, random rotations and curve shapes.
To cope with the forgetting problem, many CIL methods transfer the knowledge of old classes by preserving some exemplar samples into the size-constrained memory buffer.
As an important and challenging problem, multi-domain learning (MDL) typically seeks a set of effective lightweight domain-specific adapter modules plugged into a common domain-agnostic network.
In real-world applications, networks usually consist of billions of various types of nodes and edges with abundant attributes.
Human motion prediction, which aims at predicting future human skeletons given the past ones, is a typical sequence-to-sequence problem.
As a challenging problem, few-shot class-incremental learning (FSCIL) continually learns a sequence of tasks, confronting the dilemma between slow forgetting of old knowledge and fast adaptation to new knowledge.
In this paper, we propose a novel learning scheme called epoch-evolving Gaussian Process Guided Learning (GPGL), which aims at characterizing the correlation information between the batch-level distribution and the global data distribution.
We survey (I) the original GAN model and its modified classical versions, (II) a detailed analysis of various GAN applications in different domains, and (III) a detailed study of the various GAN training obstacles as well as training solutions.
In this paper, we see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight student, called a res-student.
In this paper, we see dynamic routing networks in a fresh light, formulating a routing method as a mapping from a sample space to a routing space.
In order to solve this problem, the research proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF).
Based on the feature fusion, our Context Feature Rectification (CFR) module learns the model's difference from a per-frame model to correct the warped features.
Different from many other attributes, facial expression can change in a continuous way, and therefore, a slight semantic change of input should also lead to the output fluctuation limited in a small scale.
Modern methods mainly regard lane detection as a problem of pixel-wise segmentation, which struggles with challenging scenarios and speed.
Ranked #40 on Lane Detection on CULane
Visual tracking is typically solved as a discriminative learning problem that usually requires high-quality samples for online model adaptation.
To satisfy the stringent requirements on computational resources in the field of real-time semantic segmentation, most approaches focus on the hand-crafted design of light-weight segmentation networks.
Panoptic segmentation aims to perform instance segmentation for foreground instances and semantic segmentation for background stuff simultaneously.
Real-time semantic video segmentation is a challenging task due to the strict requirements of inference speed.
Semantic segmentation tasks under weakly supervised conditions have been put forward to achieve a lightweight labeling process.
To solve this problem, we added the box regression module to the weakly supervised object detection network and proposed a proposal scoring network (PSNet) to supervise it.
Ranked #16 on Weakly Supervised Object Detection on PASCAL VOC 2007
Unlike previous works that use a simplified search space and stack a repeatable cell to form a network, we introduce a novel search mechanism with a new search space where a lightweight model can be effectively explored through the cell-level diversity and latency-oriented constraint.
While correlations between parts are ignored in the previous methods, to leverage the relations of different parts, we propose an innovative adaptive graph representation learning scheme for video person Re-ID, which enables the contextual interactions between relevant regional features.
Ranked #3 on Person Re-Identification on PRID2011
Owing to the effectiveness of its representation, skeleton-based human action recognition has received considerable research attention and has a wide range of real applications.
Video object segmentation aims at accurately segmenting the target object regions across consecutive frames.
We consider a content-caching system that is shared by a number of proxies.
Performance Networking and Internet Architecture
Segmentation of pancreas is important for medical image analysis, yet it faces great challenges of class imbalance, background distractions and non-rigid geometrical features.
To tackle this challenge, we present a novel pipeline comprised of an Observer Engine and a Physicist Engine by respectively imitating the actions of an observer and a physicist in the real world.
As a fundamental and challenging problem in computer vision, hand pose estimation aims to estimate the hand joint locations from depth images.
In this work, we explore the cross-scale similarity in crowd counting scenario, in which the regions of different scales often exhibit high visual similarity.
Then in the top-down step, the refined object regions are used as supervision to train the segmentation network and to predict object masks.
Localizing text in the wild is challenging in the situations of complicated geometric layout of the targets like random orientation and large aspect ratio.
A critical and challenging problem in reinforcement learning is how to learn the state-action value function from the experience replay buffer and simultaneously keep sample efficiency and faster convergence to a high quality solution.
A key problem in deep multi-attribute learning is to effectively discover the inter-attribute correlation structures.
In particular, we learn separate deep representations for semantic-components and color-texture distributions from two person images and then employ pyramid person matching network (PPMN) to obtain correspondence representations.
In this work, we present a deep convolutional pyramid person matching network (PPMN) with specially designed Pyramid Matching Module to address the problem of person re-identification.
Despite the recent progress in image dehazing, several problems remain largely unsolved such as robustness for varying scenes, the visual quality of reconstructed images, and effectiveness and flexibility for applications.
The interpolation, prediction, and feature analysis of fine-grained air quality are three important topics in the area of urban air computing.
For adaptable knowledge transfer, we devise a Semantic Correlation Regularization (SCR) approach to regularize the boosted model to be consistent with the inter-class semantic correlations.
As a result, a key issue in video saliency detection is how to effectively capture the intrinsic properties of atomic video structures as well as their associated contextual interactions along the spatial and temporal dimensions.
In this paper, we propose an end-to-end group-wise deep co-saliency detection approach to address the co-salient object discovery problem based on the fully convolutional network (FCN) with group input and group output.
Therefore, a key issue to solve in this area is how to effectively model the multi-scale correspondence structure properties in an adaptive end-to-end learning fashion.
In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras.
Ranked #102 on Person Re-Identification on Market-1501
As an important and challenging problem in computer vision, zero-shot learning (ZSL) aims at automatically recognizing the instances from unseen object classes without training data.
A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow.
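The identity-mapping idea in the sentence above can be illustrated with a minimal sketch, assuming each block computes y = x + F(x) on plain Python lists. The names `residual_block`, `stack_blocks`, and the toy branch functions are hypothetical and stand in for learned layers.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Branch = Callable[[Vector], Vector]


def residual_block(x: Vector, branch: Branch) -> Vector:
    """Return x + F(x): the identity path skips the residual branch F."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("residual branch must preserve the input size")
    return [xi + fi for xi, fi in zip(x, fx)]


def stack_blocks(x: Vector, branches: Sequence[Branch]) -> Vector:
    """Build a deeper mapping by applying residual blocks in sequence."""
    for branch in branches:
        x = residual_block(x, branch)
    return x


def zero_branch(v: Vector) -> Vector:
    """Toy branch that contributes nothing (assumed for illustration)."""
    return [0.0 for _ in v]


def half_branch(v: Vector) -> Vector:
    """Toy branch that returns half of its input (assumed for illustration)."""
    return [0.5 * t for t in v]


# With branches that output zeros, the stacked network is the identity,
# so the input passes through unchanged regardless of depth.
unchanged = stack_blocks([1.0, 2.0], [zero_branch, zero_branch, zero_branch])
shifted = residual_block([2.0, 4.0], half_branch)
```

The zero-branch case shows why the skip path helps information flow: even if every residual branch contributes nothing, the signal still reaches the output intact.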
The highly effective visual representation and deep context models ensure that our framework makes a deep semantic understanding of the scene and motion pattern, consequently improving the performance of the visual path prediction task.
In this paper, we propose a real-time 3D hand pose estimation algorithm using the randomized decision forest framework.
A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner.
Object identification results for an entire video sequence are achieved by systematically combining the tracking information and visual recognition at each frame.
As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework.
In this work, we model an image as a hypergraph that utilizes a set of hyperedges to capture the contextual properties of image pixels or regions.
In this paper, we propose a new global feature to capture the detailed geometrical distribution of interest points.