We approach this problem in the real-data-free setting, in which the model is trained only on 3D simulation data and applied out of the box to a wide variety of real cameras.
Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring fewer vision-specific inductive biases.
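The ViT recipe is compact enough to sketch: split the image into fixed-size patches, linearly embed them, prepend a class token, add position embeddings, and run a standard Transformer encoder. Below is a minimal, illustrative PyTorch sketch; the dimensions and depth are toy values of my choosing, not any paper's configuration.

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Sketch of a ViT: patchify, embed, add class token + positions,
    encode with a standard Transformer, classify from the class token."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                      # number of patches
        # A strided conv is a common way to patchify + linearly embed in one step.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                            # x: (B, 3, H, W)
        t = self.patchify(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])     # logits from the class token
```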
Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT).
It has been shown that deep neural networks are prone to overfitting on biased training data.
Recent years have witnessed the rapid progress of generative adversarial networks (GANs).
To address the question, we develop a single-agent reinforced feature selection approach integrated with a restructured choice strategy.
In this image generation task, the inputs are a reference image and a natural-language instruction that describes the desired modifications to the reference image.
Image generation from a scene description is a cornerstone technique for controlled generation, which benefits applications such as content creation and image editing.
In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT).
We refer to our method as SimAug.
The first module predicts a graph with complete relations from a graph with user-specified relations.
The first contribution is a new dataset, created in a realistic 3D simulator, that is based on real-world trajectory data and then extrapolated by human annotators to achieve different latent goals.
Due to the lack of suitable datasets, previous research has only examined deep learning on controlled synthetic label noise, and real-world label noise has never been studied in a controlled setting.
Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence.
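The prune-count-rank recipe lends itself to a short sketch. The following is a minimal, illustrative version, not the paper's exact algorithm (which counts a full confident joint between given and latent labels); it assumes `pred_probs` are out-of-sample predicted probabilities and that every class appears in `labels`.

```python
import numpy as np

def find_label_issues(pred_probs, labels):
    """Sketch of confident-learning-style pruning.

    pred_probs: (n, k) out-of-sample predicted probabilities.
    labels:     (n,) observed (possibly noisy) integer labels.
    Returns a boolean issue mask and a ranking of examples.
    """
    n, k = pred_probs.shape
    # Count with probabilistic thresholds: per-class threshold is the mean
    # self-confidence of examples currently carrying that label.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    above = pred_probs >= thresholds                 # (n, k), broadcast per class
    # An example is confidently class j if p(j) clears j's threshold;
    # flag it when that confident class disagrees with its given label.
    confident_class = np.where(above, pred_probs, -np.inf).argmax(axis=1)
    issues = above.any(axis=1) & (confident_class != labels)
    # Rank by self-confidence so the least trusted labels are pruned first.
    rank = np.argsort(pred_probs[np.arange(n), labels])
    return issues, rank
```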
Multi-target Multi-camera Tracking (MTMCT) aims to extract the trajectories from videos captured by a set of cameras.
We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance.
In this paper, we empirically study this problem and introduce 1) a simple yet effective baseline that achieves promising performance; 2) an easier and more practical setting for EmbodiedQA in which an agent has a chance to adapt the trained model to a new environment before it actually answers users' questions.
In this work we show how one can learn transformations with no training examples by learning them on another domain and then transferring them to the target domain.
To facilitate training, the network is trained with an auxiliary task of predicting the future location in which the activity will happen.
Unsupervised Domain Adaptation (UDA) makes predictions for target-domain data while manual annotations are available only in the source domain.
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.
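One way to make this composed-query task concrete is to fuse an image embedding with a text embedding into a single query vector and retrieve by cosine similarity. The sketch below uses a simple concat-plus-MLP fusion with a residual connection to the image feature; this is one plausible design of my choosing, not necessarily the method the snippet refers to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQuery(nn.Module):
    """Sketch: compose image + modification-text embeddings into one query."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, img_emb, txt_emb):             # both: (B, dim)
        # Residual fusion: start from the image feature, let the text steer it.
        q = img_emb + self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        return F.normalize(q, dim=-1)

def retrieve(query, gallery, k=5):
    """gallery: (M, dim) L2-normalized target-image embeddings."""
    scores = query @ gallery.T                       # cosine similarity
    return scores.topk(k, dim=-1).indices            # indices of top-k matches
```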
In addition to a text answer, a few grounding photos are also given to justify the answer.
Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering.
Recent deep networks are capable of memorizing the entire training data even when the labels are completely random.
We propose a technique that tackles action detection in multimodal videos under a realistic and challenging condition in which only limited training data and partially observed modalities are available.
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help users recover their memory about events captured in the collection.
We report on the CMU Informedia Lab's system used in Google's YouTube-8M Video Understanding Challenge.
Multimedia event detection has been receiving increasing attention in recent years.
Learning video concept detectors automatically from big but noisy web data, with no additional manual annotations, is a novel but challenging area in the multimedia and machine learning communities.
The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search.
As an interesting and emerging topic, co-saliency detection aims at simultaneously extracting common salient objects in a group of images.
Self-paced learning (SPL) is a recently proposed learning regime, inspired by the learning process of humans and animals, that gradually incorporates easy and then progressively more complex samples into training.
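The easy-to-hard idea reduces to a short alternating loop: pick the samples whose current loss is below a pace parameter, update the model on just those, then raise the pace so harder samples are admitted. Below is a minimal sketch with the hard (binary) self-paced regularizer; `per_sample_loss` and `fit_weighted` are hypothetical helpers standing in for whatever model API is actually used.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard SPL regularizer: include a sample iff its loss is below lambda."""
    return (losses < lam).astype(float)

def self_paced_train(model, X, y, lam=0.1, growth=1.3, rounds=10):
    """Alternate between selecting easy samples and updating the model."""
    for _ in range(rounds):
        losses = model.per_sample_loss(X, y)        # hypothetical helper
        v = self_paced_weights(losses, lam)         # v_i = 1 for "easy" samples
        model.fit_weighted(X, y, sample_weight=v)   # hypothetical helper
        lam *= growth                               # admit harder samples next round
    return model
```

Growing `lam` multiplicatively is one common scheduling choice; any monotonically increasing pace schedule fits the same template.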