We validated our method on domain adaptation for hand segmentation between real and simulated images.
Unsupervised image captioning is a challenging task that aims to generate captions without the supervision of image-sentence pairs, using only images and sentences drawn from different sources together with object labels detected in the images.
Hence, we consider a new realistic setting called Noisy UniDA, in which classifiers are trained with noisy labeled data from the source domain and unlabeled data with an unknown class distribution from the target domain.
Visual grounding is provided as bounding boxes on the image sequences of recipes, with each bounding box linked to an element of the workflow.
To address this task, we have developed the patch-based density forecasting network (PDFN), which enables forecasting over a sequence of crowd density maps describing how crowded each location is in each video frame.
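As a rough illustration of the patch-based idea, the sketch below cuts past density maps into spatial patches and forecasts the next map with a predictor shared across patches; the patch size, horizon, and tiny linear predictor are illustrative assumptions, not PDFN's actual architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch of patch-based density forecasting: past crowd density
# maps are split into spatial patches, and one predictor shared across
# all patches forecasts the next map patch by patch.
T, H, W, P = 4, 32, 32, 8                  # past frames, map size, patch size
past = torch.rand(1, T, H, W)              # past density maps (illustrative)

patches = past.unfold(2, P, P).unfold(3, P, P)         # (1, T, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, T * P * P)

predictor = nn.Linear(T * P * P, P * P)    # shared across all patches
next_patches = predictor(patches).reshape(1, H // P, W // P, P, P)
```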
This work addresses the new problem of learning generative adversarial networks (GANs) from multiple data collections that are each i) owned separately by different clients and ii) drawn from a non-identical distribution that comprises different classes.
In this paper, we propose to leverage graph optimization and loop closure detection to overcome the limitations of unsupervised-learning-based monocular visual odometry.
This motivates us to propose a novel method for detector adaptation based on strong local alignment and weak global alignment.
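A minimal sketch of what strong local and weak global alignment could look like, assuming a least-squares loss on a per-location domain classifier (strong) and a focal-style loss on a per-image domain classifier (weak); all shapes and the focal parameter are illustrative assumptions.

```python
import torch

# Hedged sketch: strong alignment on local features, weak on global ones.
local_pred = torch.sigmoid(torch.randn(2, 1, 38, 38))  # per-location domain prob
global_logit = torch.randn(2)                           # per-image domain logit

d = 0  # domain label of this batch (0 = source, 1 = target)

# Strong local alignment: least-squares loss at every spatial location.
loss_local = ((local_pred - d) ** 2).mean()

# Weak global alignment: focal-style loss that down-weights images the
# domain classifier already separates easily.
p = torch.sigmoid(global_logit)
pt = p if d == 1 else 1 - p
gamma = 5.0
loss_global = -(pt.clamp_min(1e-6).log() * (1 - pt) ** gamma).mean()
```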
The other is a multitask learning approach that uses depth images as outputs.
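The multitask variant might be sketched as a shared encoder with an auxiliary head that regresses a depth image alongside the main output; the layer sizes and losses below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch: shared encoder, main segmentation head, auxiliary depth head.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
seg_head = nn.Conv2d(16, 2, 1)     # main task (e.g., hand segmentation)
depth_head = nn.Conv2d(16, 1, 1)   # auxiliary task: depth image as output

x = torch.randn(1, 3, 64, 64)
h = encoder(x)
loss = nn.functional.cross_entropy(seg_head(h),
                                   torch.zeros(1, 64, 64, dtype=torch.long)) \
     + nn.functional.l1_loss(depth_head(h), torch.rand(1, 1, 64, 64))
```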
The results demonstrate that CFT-GAN is able to successfully generate videos containing the action and appearances indicated in the captions.
Moreover, we regard easily understood sentences as those that humans comprehend correctly and quickly.
To remedy this, we propose a novel family of GANs called label-noise robust GANs (rGANs), which, by incorporating a noise transition model, can learn a clean label conditional generative distribution even when training labels are noisy.
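The core of the noise transition idea can be illustrated in a few lines: a transition matrix T maps the model's clean-label posterior to a noisy-label posterior, so training against noisy labels still leaves the underlying conditional distribution clean. The symmetric-noise T and shapes below are assumptions for illustration.

```python
import torch

# Hedged sketch: T[i, j] = p(noisy label = j | clean label = i),
# here a symmetric noise model with a fixed noise rate.
num_classes, noise_rate = 10, 0.2
T = torch.full((num_classes, num_classes), noise_rate / (num_classes - 1))
T.fill_diagonal_(1.0 - noise_rate)

# p_clean: the model's class posterior under clean labels (batch x classes).
p_clean = torch.softmax(torch.randn(4, num_classes), dim=1)

# Corrupt the clean posterior through T so it can be matched against the
# noisy-label data, while the model itself stays clean-label conditional.
p_noisy = p_clean @ T
```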
To overcome this limitation, we address a novel problem called class-distinct and class-mutual image generation, in which the goal is to construct a generator that can capture between-class relationships and generate an image selectively conditioned on the class specificity.
In this paper, we propose a method for generating questions about unknown objects in an image, as a means of obtaining information about classes that have not been learned.
Almost all of them are proposed for a closed-set scenario, where the source and the target domain completely share the classes of their samples.
The image description task has invariably been examined in a static manner, with qualitative presumptions held to be universally applicable regardless of the scope or target of the description.
To satisfy these requirements (A)-(C) simultaneously, we propose a novel video summarization method for multiple groups of videos.
To solve these problems, we introduce a new approach that attempts to align the distributions of the source and target domains by utilizing task-specific decision boundaries.
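A minimal sketch of boundary-aware alignment in this spirit: two task classifiers share a feature extractor, their disagreement on target samples measures how close those samples sit to decision boundaries, and this discrepancy is maximized over the classifiers and minimized over the features. Module sizes below are illustrative.

```python
import torch
import torch.nn as nn

# Hedged sketch: shared feature extractor, two task classifiers.
feat = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
clf1, clf2 = nn.Linear(32, 10), nn.Linear(32, 10)

x_tgt = torch.randn(8, 64)            # unlabeled target batch (illustrative)
f = feat(x_tgt)
p1 = torch.softmax(clf1(f), dim=1)
p2 = torch.softmax(clf2(f), dim=1)

# Classifier discrepancy on target samples (L1 between the two posteriors).
# Adversarial training: maximize w.r.t. clf1/clf2, minimize w.r.t. feat.
discrepancy = (p1 - p2).abs().mean()
discrepancy.backward()
```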
Second, we propose a mixing method that treats the images as waveforms, which leads to a further improvement in performance.
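A hedged sketch of waveform-style mixing: per-image means are removed (treating each image as a zero-mean waveform), the mixing ratio is adjusted by each image's energy, and the divisor keeps the mixture's expected energy constant. The coefficient formula follows the between-class learning recipe as we understand it and should be checked against the paper.

```python
import torch

def bc_mix(x1, x2, r):
    """Mix two images as if they were waveforms: remove per-image means,
    weight by an energy-adjusted ratio, and normalize the mixture's energy."""
    m1, m2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(), x2.std()
    p = 1.0 / (1.0 + (s1 / s2) * (1.0 - r) / r)
    return (p * (x1 - m1) + (1 - p) * (x2 - m2)) / (p**2 + (1 - p)**2) ** 0.5

x1, x2 = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
mixed = bc_mix(x1, x2, r=torch.rand(()).clamp(0.01, 0.99))
# The label is mixed with the same ratio r (soft target: r, 1 - r).
```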
FlowGAN generates optical flow, which contains only the edge and motion of the videos to be generated.
Using this renderer, we perform single-image 3D mesh reconstruction with silhouette image supervision, and our system outperforms the existing voxel-based approach.
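The silhouette supervision reduces to a differentiable mask-agreement loss; a minimal sketch using a soft IoU is below, where `rendered` stands in for the output of the differentiable renderer.

```python
import torch

def silhouette_iou_loss(rendered, target, eps=1e-6):
    """Negative soft IoU between a differentiably rendered silhouette and
    the ground-truth mask (both in [0, 1])."""
    inter = (rendered * target).sum()
    union = (rendered + target - rendered * target).sum()
    return 1.0 - inter / (union + eps)

# `rendered` would come from the differentiable renderer; random here.
rendered = torch.rand(64, 64, requires_grad=True)
target = (torch.rand(64, 64) > 0.5).float()
loss = silhouette_iou_loss(rendered, target)
loss.backward()  # gradients flow back to mesh vertices via the renderer
```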
However, a drawback of this approach is that the critic simply labels the generated features as in-domain or not, without considering the boundaries between classes.
Automatic melody generation for pop music has been a long-time aspiration for both AI researchers and musicians.
In this paper, we address the problem of spatio-temporal person retrieval from multiple videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) which encloses the person described by the query.
Deep-layered models trained on a large number of labeled samples boost the accuracy of many tasks.
To obtain common representations in such a situation, we propose making the distributions over different modalities similar in the learned representation space, i.e., learning modality-invariant representations.
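One simple way to make the modality distributions similar is a maximum mean discrepancy (MMD) penalty between embeddings; the RBF-kernel sketch below is a generic stand-in, not necessarily the paper's matching criterion, and all names and shapes are illustrative.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Simple (biased) MMD estimate with an RBF kernel between two
    batches of embeddings; minimizing it pulls the distributions together."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

img_emb = torch.randn(16, 32)     # image-modality embeddings (illustrative)
txt_emb = torch.randn(16, 32)     # text-modality embeddings (illustrative)
loss = mmd_rbf(img_emb, txt_emb)  # minimize to align the two modalities
```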
The Visual Question Answering (VQA) task has showcased a new stage of interaction between language and vision, two of the most pivotal components of artificial intelligence.
The visual question answering (VQA) task not only bridges the gap between images and language but also requires that specific image contents be understood as indicated by the linguistic context of the question in order to generate accurate answers.
In order to overcome the shortage of training samples, CoSMoS obtains a subspace in which (a) feature vectors associated with the same phrase are mapped close to one another, (b) a classifier is learned for each phrase, and (c) training samples are shared among co-occurring phrases.
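Requirement (a) alone can be sketched as a learned projection with a pull loss that draws same-phrase features together; everything below (shapes, the projection, the loss) is a generic illustration rather than CoSMoS's actual objective.

```python
import torch

# Hedged sketch: project features into a subspace and pull features tagged
# with the same phrase toward their phrase centroid.
W = torch.randn(128, 32, requires_grad=True)    # subspace projection
feats = torch.randn(64, 128)                    # region/image features
phrase_ids = torch.randint(0, 10, (64,))        # phrase annotation per feature

z = feats @ W
loss = torch.zeros(())
for p in phrase_ids.unique():
    zp = z[phrase_ids == p]
    loss = loss + ((zp - zp.mean(0)) ** 2).mean()  # same-phrase pull term
loss.backward()
```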
In this paper, we evaluate online learning algorithms for large-scale visual recognition using state-of-the-art features that are preselected and held fixed.
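Such an evaluation could be set up with a streaming learner over precomputed features, e.g. scikit-learn's SGDClassifier with partial_fit; the random data below is a placeholder for the fixed feature vectors and labels.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hedged sketch: online learning over precomputed, fixed features.
clf = SGDClassifier(loss="log_loss")
classes = np.arange(10)
for _ in range(100):                      # stream of mini-batches
    X = np.random.randn(32, 128)          # fixed feature vectors (placeholder)
    y = np.random.randint(0, 10, 32)
    clf.partial_fit(X, y, classes=classes)
```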