We approach this problem through the real-data-free setting in which the model is trained only on 3D simulation data and applied out-of-the-box to a wide variety of real cameras.
Ranked #1 on Trajectory Forecasting on ActEV
Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space.
The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention.
Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
Learning to hallucinate additional examples has recently been shown as a promising direction to address few-shot learning tasks.
The experimental results show that the STAN model consistently improves on the state of the art in both action detection and action recognition tasks.
Furthermore, the classification of information in real-time systems requires training on out-of-domain data, as we do not have any data from a new emerging crisis.
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.
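The contrastive objective described above can be sketched as an InfoNCE-style loss: within a batch, each video is matched to its paired text on the diagonal of a similarity matrix, and all other pairs act as negatives. This is a minimal illustrative sketch of the general paradigm, not the specific model's implementation; the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Noise-contrastive (InfoNCE) loss: matched video/text pairs are
    pulled together; all other pairs in the batch are pushed apart."""
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); diagonal = positive pairs
    # Log-softmax over each row, then take the diagonal (positive) terms.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; shuffling the text side so positives no longer match drives it up sharply.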
We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker.
Facial image retrieval plays a significant role in forensic investigations where an untrained witness tries to identify a suspect from a massive pool of images.
This provides the first benchmark for quantitative evaluation of models to assess building damage using aerial videos.
In this paper, we investigate how to utilize visual content for disambiguation and promoting latent space alignment in unsupervised MMT.
We refer to our method as SimAug.
Ranked #2 on Trajectory Prediction on ActEV
An integral part of video analysis and surveillance is temporal activity detection, which aims to simultaneously recognize and localize activities in long untrimmed videos.
We obtain promising results both in bounding the shooter and in detecting the gun smoke.
The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals.
Ranked #1 on Multi-future Trajectory Prediction on ForkingPaths
With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations.
By minimizing the mutual information, each column is guided to learn features with different image scales.
Although the Maximum Excess over SubArrays (MESA) loss has previously been proposed to address the above issues by finding the rectangular subregion whose predicted density map differs most from the ground truth, it cannot be solved by gradient descent and thus can hardly be integrated into a deep learning framework.
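To make the MESA criterion concrete, it can be computed by brute force on a small density map: find the axis-aligned rectangle over which the predicted and ground-truth counts differ most. This is an illustrative sketch of the definition (with prefix sums for O(1) rectangle totals), not the cited paper's formulation; the function name is an assumption.

```python
import numpy as np

def mesa_loss(pred, gt):
    """Maximum Excess over SubArrays: the largest absolute difference in
    total count between prediction and ground truth over any axis-aligned
    rectangular subregion (brute force; fine for small maps)."""
    diff = pred - gt
    H, W = diff.shape
    # 2D prefix sums with a zero border for O(1) rectangle sums.
    S = np.zeros((H + 1, W + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(diff, axis=0), axis=1)
    best = 0.0
    for r1 in range(H):
        for r2 in range(r1 + 1, H + 1):
            for c1 in range(W):
                for c2 in range(c1 + 1, W + 1):
                    s = S[r2, c2] - S[r1, c2] - S[r2, c1] + S[r1, c1]
                    best = max(best, abs(s))
    return best
```

The exhaustive search over rectangles is exactly what makes the loss non-differentiable, which is the obstacle the sentence above describes.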
Ranked #4 on Crowd Counting on ShanghaiTech B
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.
The linguistic feature is learned from sentence contexts under visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant.
Among other uses, VERA enables the localization of a shooter from just a few videos that include the sound of gunshots.
The task of retrieving clips within videos based on a given natural language query requires cross-modal reasoning over multiple frames.
To facilitate training, the network is learned with an auxiliary task of predicting the future location in which the activity will happen.
Ranked #1 on Activity Prediction on ActEV
Therefore, we developed a model to predict and identify car crashes from surveillance cameras based on a 3D reconstruction of the road plane and prediction of trajectories.
To tackle this challenge, we present a novel pipeline comprised of an Observer Engine and a Physicist Engine by respectively imitating the actions of an observer and a physicist in the real world.
Our experiments indicate a considerable improvement in object detection accuracy: +8.51% for CM and +6.20% for ACM.
Moments capture a huge part of our lives.
In this work, we explore the cross-scale similarity in crowd counting scenario, in which the regions of different scales often exhibit high visual similarity.
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).
Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering.
Ranked #1 on Memex Question Answering on MemexQA
A key problem in deep multi-attribute learning is to effectively discover the inter-attribute correlation structures.
For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.
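The teacher-to-student transfer described above follows the standard distillation pattern: the mined topics serve as soft labels, and the student is trained with a cross-entropy loss against them. The sketch below illustrates that loss only, under assumed names and shapes; it is not the paper's actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_topic_probs):
    """Cross-entropy of the student's predicted topic distribution
    against the mined teacher topics used as soft labels."""
    p = softmax(student_logits)
    return -np.mean(np.sum(teacher_topic_probs * np.log(p + 1e-12), axis=1))
```

A student whose predicted distribution matches the teacher's mined topics incurs near-zero loss, while a confidently wrong student is penalized heavily.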
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help users recover their memory about events captured in the collection.
Learning video concept detectors automatically from big but noisy web data, with no additional manual annotations, is a novel but challenging area for the multimedia and machine learning communities.
The tracker is formulated as a quadratic optimization problem with L0 norm constraints, which we propose to solve with the solution path algorithm.
First, we propose a two-stream Stacked Convolutional Independent Subspace Analysis (ConvISA) architecture to show that unsupervised learning methods can significantly boost the performance of traditional local features extracted from data-independent models.
Self-paced learning (SPL) is a recently proposed learning regime, inspired by the learning process of humans and animals, that gradually incorporates samples into training, from easy to more complex ones.
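The easy-to-hard curriculum in SPL is commonly realized with a hard sample-weighting rule: a sample participates in training only when its current loss falls below an "age" parameter that grows over iterations. This is a minimal sketch of that standard rule, assuming the common hard-weighting variant rather than any specific paper's regularizer.

```python
import numpy as np

def spl_weights(losses, lam):
    """Self-paced hard weighting: include a sample (weight 1) only if its
    current loss is below the age parameter lam; as lam grows across
    training rounds, harder samples are gradually incorporated."""
    return (np.asarray(losses, dtype=float) < lam).astype(float)
```

In a training loop one alternates between updating model parameters on the currently selected samples and increasing `lam` so that more complex samples enter the training set.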
A device just like Harry Potter's Marauder's Map, which pinpoints the location of each person-of-interest at all times, provides invaluable information for analysis of surveillance videos.