Video compression is a central feature of the modern internet, powering technologies from social media to video conferencing.
Synthesizing images of a person in novel poses from a single image is a highly ambiguous task.
Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities.
Urban material recognition in remote sensing imagery is a highly relevant, yet extremely challenging problem due to the difficulty of obtaining human annotations, especially on low-resolution satellite images.
In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition.
We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird's-eye view) with different feature scales based on multi-scale feature pyramids.
The standard way of training video models entails sampling a single clip from a video at each iteration and optimizing the clip prediction with respect to the video-level label.
Lastly, we study different attention architectures in the discriminator, and propose a reference attention mechanism.
The recently proposed Lottery Ticket Hypothesis (LTH) states that deep neural networks trained on large datasets contain smaller subnetworks that achieve performance on par with the dense networks.
Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition.
Compared to classification networks, attention visualization for retrieval networks has hardly been studied.
The widely adopted sequential variant of Non-Maximum Suppression (or Greedy-NMS) is a crucial module for object-detection pipelines.
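For reference, a minimal sketch of the standard Greedy-NMS procedure mentioned above (not any particular paper's variant): boxes are processed in descending score order, and each kept box suppresses remaining boxes whose IoU with it exceeds a threshold.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Sequential (greedy) NMS: repeatedly keep the highest-scoring box
    and suppress all remaining boxes overlapping it above the threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns indices of kept boxes, in descending score order."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop high-overlap boxes
    return keep
```

Because each kept box can suppress later ones, the loop is inherently sequential, which is exactly what makes Greedy-NMS hard to parallelize.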
Generating a new layout or extending an existing layout requires understanding the relationships between these primitives.
The JPEG image compression algorithm is the most popular method of image compression because of its ability to achieve large compression ratios.
We present a systematic study of adversarial attacks on state-of-the-art object detection frameworks.
We address the problem of distance metric learning in visual similarity search, defined as learning an image embedding model which projects images into Euclidean space where semantically and visually similar images are closer and dissimilar images are farther from one another.
Then we train a generator to transform an input image along with a style-code to the output domain.
In this paper, we propose Spatio-TEmporal Progressive (STEP) action detector---a progressive learning framework for spatio-temporal action detection in videos.
We introduce an unsupervised formulation to estimate heteroscedastic uncertainty in retrieval systems.
We employ triplet loss as a feature embedding regularizer to boost classification performance.
We cast visual retrieval as a regression problem by posing triplet loss as a regression loss.
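The triplet loss referred to in the preceding lines has a standard formulation; the sketch below shows that textbook version on embedding vectors (an illustration, not the papers' exact regularizer or regression formulation).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss: pull the anchor toward the
    positive embedding and push it from the negative embedding until
    their squared distances differ by at least `margin`.

    Each argument is a 1-D embedding vector; returns a scalar loss."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```

When the negative is already farther than the positive by more than the margin, the hinge clamps the loss to zero and the triplet contributes no gradient.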
We introduce a general method of performing Residual Network inference and learning in the JPEG transform domain that allows the network to consume compressed images as input.
Our experiments show that (1) GANs carry distinct model fingerprints and leave stable fingerprints in their generated images, which support image attribution; (2) even minor differences in GAN training can result in different fingerprints, which enables fine-grained model authentication; (3) fingerprints persist across different image frequencies and patches and are not biased by GAN artifacts; (4) fingerprint finetuning is effective in immunizing against five types of adversarial image perturbations; and (5) comparisons also show our learned fingerprints consistently outperform several baselines in a variety of setups.
The effectiveness of the self-supervised pre-trained weights is validated on the action recognition task.
The classification system further classifies the generated candidates based on the opinions of multiple deep verification networks and a fusion network that utilizes a novel soft-rejection fusion method to adjust the confidence in the detection results.
In this paper, we propose the first dedicated end-to-end deep learning approach for motion boundary detection, which we term MoBoNet.
To accomplish this, we use a submodular set function to model the accuracy achievable on a new task when the features have been learned on a given subset of classes of the source dataset.
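Monotone submodular set functions like the one described above are typically maximized with a greedy routine that enjoys a (1 - 1/e) approximation guarantee. The sketch below illustrates that generic greedy selection on a toy coverage function (`class_features` and `coverage` are hypothetical stand-ins, not the paper's accuracy model).

```python
def greedy_submodular_select(candidates, f, k):
    """Greedily maximize a monotone submodular set function f by
    repeatedly adding the candidate with the largest marginal gain."""
    selected = set()
    for _ in range(k):
        base = f(selected)
        best, best_gain = None, 0.0
        for c in candidates:
            if c in selected:
                continue
            gain = f(selected | {c}) - base  # marginal gain of adding c
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate has positive marginal gain
            break
        selected.add(best)
    return selected

# Toy coverage function: the value of a subset of source classes is the
# number of distinct features they cover (coverage is submodular).
class_features = {
    "cat":  {"fur", "eyes", "tail"},
    "dog":  {"fur", "eyes", "tail", "bark"},
    "car":  {"wheels", "metal"},
    "bird": {"eyes", "wings"},
}
def coverage(S):
    return len(set().union(*(class_features[c] for c in S))) if S else 0
```

Diminishing returns are what make the greedy choice effective: once "dog" is picked, "cat" adds nothing, so the next pick covers genuinely new features.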
In this paper, we introduce the Face Magnifier Network (Face-MageNet), a face detector based on the Faster-RCNN framework which enables the flow of discriminative information of small scale faces to the classifier without any skip or residual connections.
Compared to the general semantic segmentation problem, portrait segmentation has a higher precision requirement in boundary areas.
While computers can now describe what is explicitly depicted in natural images, in this paper we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels.
Typical textual descriptions that accompany online videos are 'weak': i.e., they mention the main concepts in the video but not their corresponding spatio-temporal locations.
Deep Convolutional Neural Networks (CNNs) enforce supervised information only at the output layer, and hidden layers are trained by back-propagating the prediction error from the output layer without explicit supervision.
This paper presents a structured ordinal measure method for video-based face recognition that simultaneously learns ordinal filters and structured ordinal features.