In this work, we propose a new cross-dataset 3D object detection method named Scale-aware and Range-aware Domain Adaptation Network (SRDAN).
Furthermore, to achieve variable bitrate decoding with one single decoder, we propose a bitrate adaptive module to project the representation from a base bitrate to the expected representation at a target bitrate for transmission.
In this work, we propose a feature-space video coding network (FVC) by performing all major operations (i. e., motion estimation, motion compression, motion compensation and residual compression) in the feature space.
Inspired by the back-tracing strategy in the conventional Hough voting methods, in this work, we introduce a new 3D object detection method, named as Back-tracing Representative Points Network (BRNet), which generatively back-traces the representative points from the vote centers and also revisits complementary seed points around these generated points, so as to better capture the fine local structural features surrounding the potential objects from the raw point clouds.
Ranked #2 on 3D Object Detection on ScanNetV2
On the other hand, we also design an effective distribution alignment method to reduce the distribution divergence between the virtual domain and the target domain by gradually improving the compactness of the target domain distribution through model learning.
2) Based on the DFA features, we introduce the integrity channel enhancement (ICE) component with the goal of enhancing feature channels that highlight the integral salient objects at the macro level, while suppressing the other distracting ones.
In this work, we propose a new feed-forward arbitrary style transfer method, referred to as StyleFormer, which can simultaneously fulfill fine-grained style diversity and semantic content coherency.
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence.
Visual grounding on 3D point clouds is an emerging vision and language task that benefits various applications in understanding the 3D visual world.
To develop a practical method for learning complex inception convolution based on the data, a simple but effective search algorithm, referred to as efficient dilation optimization (EDO), is developed.
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
We collected 100, 000 shoeprints of subjects ranging from 7 to 80 years old and used the data to develop a deep learning end-to-end model ShoeNet to analyze age-related patterns and predict age.
Registration of 3D LiDAR point clouds with optical images is critical in the combination of multi-source data.
This paper designs a technique route to generate high-quality panoramic image with depth information, which involves two critical research hotspots: fusion of LiDAR and image data and image stitching.
In this work, we propose a new framework called Resolution-adaptive Flow Coding (RaFC) to effectively compress the flow maps globally and locally, in which we use multi-resolution representations instead of single-resolution representations for both the input flow maps and the output motion features of the MV encoder.
We propose a multi-exit evacuation simulation based on Deep Reinforcement Learning (DRL), referred to as the MultiExit-DRL, which involves in a Deep Neural Network (DNN) framework to facilitate state-to-action mapping.
Therefore, the encoder is adaptive to different video contents and achieves better compression performance by reducing the domain gap between the training and testing datasets.
To this end, we propose a new strategy to suppress the influence of unimportant features (i. e., the features will be removed at the next pruning stage).
Our EDIC method can also be readily incorporated with the Deep Video Compression (DVC) framework to further improve the video compression performance.
The results of this study prove the possibility of multispectral-to-nighttime translation and further indicate that, with the additional social media data, the generated nighttime imagery can be very similar to the ground-truth imagery.
For example, given two image domains $X_1$ and $X_2$ with certain attributes, the intersection $X_1 \cap X_2$ denotes a new domain where images possess the attributes from both $X_1$ and $X_2$ domains.
Detecting inaccurate smart meters and targeting them for replacement can save significant resources.
Specifically, we first generate a larger set of region proposals by combining the latest region proposals from both streams, from which we can readily obtain a larger set of labelled training samples to help learn better action detection models.
Nevertheless, it is difficult for RNN based models to capture the information about long-range dependency among words in the sentences of questions and answers.
HAS adopts a hashing strategy to learn a binary matrix representation for each answer, which can dramatically reduce the memory cost for storing the matrix representations of answers.
Conventional video compression approaches use the predictive coding architecture and encode the corresponding motion information and residual information.
We then train view-specific action classifiers based on the view-specific representation for each view and a view classifier based on the shared representation at lower layers.
In this paper, we model the video artifact reduction task as a Kalman filtering procedure and restore decoded frames through a deep Kalman filtering network.
In this paper, we propose a new unsupervised domain adaptation approach called Collaborative and Adversarial Network (CAN) through domain-collaborative and domain-adversarial training of neural networks.
relevant) to the given event class, we formulate this task as a multi-instance learning (MIL) problem by taking each video as a bag and the video shots in each video as instances.
Results: Here, a very deep neural network, the deep inception-inside-inception networks (Deep3I), is proposed for protein secondary structure prediction and a software tool was implemented using this network.
Object segmentation in weakly labelled videos is an interesting yet challenging task, which aims at learning to perform category-specific video object segmentation by only using video-level tags.
Skeleton-based human action recognition has attracted a lot of research attention during the past few years.
Ranked #4 on One-Shot 3D Action Recognition on NTU RGB+D 120
This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition.
Two fundamental building blocks or generating functions (GFs) for invariants are discovered, which are dot product and vector product of point vectors in Euclidean space.
The proposed MldrNet combines deep representations of different levels, i. e. image semantics, image aesthetics, and low-level visual features to effectively classify the emotion types of different kinds of images, such as abstract paintings and web images.
Matching pedestrians across multiple camera views known as human re-identification (re-identification) is a challenging problem in visual surveillance.
To handle the noise and occlusion in 3D skeleton data, we introduce new gating mechanism within LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell.
Ranked #8 on Skeleton Based Action Recognition on SBU
Hence, these existing models don't put supervision (loss or similarity calculation) at every time step, which will lose some useful information.
Trace-norm regularization plays an important role in many areas such as machine learning and computer vision.
Can we obtain dimensionality-dependent generalization bounds for $k$-dimensional coding schemes that are tighter than dimensionality-independent bounds when data is in a finite-dimensional feature space?
Considering the recent works show the domain generalization capability can be enhanced by fusing multiple SVM classifiers, we build upon exemplar SVMs to learn a set of SVM classifiers by using one positive sample and all negative samples in the source domain each time.
In this paper, we develop a fast LRR solver called FaLRR, by reformulating LRR as a new optimization problem with regard to factorized data (which is obtained by skinny SVD of the original data matrix).
We present an object-based co-segmentation method that takes advantage of depth data and is able to correctly handle noisy images in which the common foreground object is missing.
In this work, we formulate a new weakly supervised domain generalization problem for the visual recognition task by using loosely labeled web images/videos as training data.
Nuclear-norm regularization plays a vital role in many learning tasks, such as low-rank matrix recovery (MR), and low-rank representation (LRR).
In this work, we propose a new framework for recognizing RGB images captured by the conventional cameras by leveraging a set of labeled RGB-D data, in which the depth features can be additionally extracted from the depth images.
We present a video co-segmentation method that uses category-independent object proposals as its basic element and can extract multiple foreground objects in a video set.
Our framework is motivated by the observation that samples from the same class repetitively appear in the collection of ambiguously labeled training images, while they are just ambiguously labeled in each image.
In this work, we propose to leverage a large number of loosely labeled web videos (e. g., from YouTube) and web images (e. g., from Google/Bing image search) for visual event recognition in consumer videos without requiring any labeled consumer videos.