While Lu and Sa (2021) recently provided an optimal rate for non-convex stochastic decentralized optimization with weight matrices defined over linear graphs, the optimal rate for general weight matrices remains unclear.
In this paper, we propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
Ranked #16 on Semi-Supervised Video Object Segmentation on MOSE
To address this challenge, we propose Multi-View Consistent Generative Adversarial Networks (MVCGAN) for high-quality 3D-aware image synthesis with geometry constraints.
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance.
Ranked #9 on Video Retrieval on MSR-VTT-1kA (using extra training data)
Talking gesture generation is a practical yet challenging task that aims to synthesize gestures in line with speech.
Ranked #6 on Gesture Generation on TED Gesture Dataset
Experimental results on a variety of tasks and models demonstrate that decentralized (momentum) SGD over exponential graphs promises both fast and high-quality training.
Decentralized adaptive gradient methods, in which each node averages only with its neighbors, are critical to save communication and wall-clock training time in deep learning tasks.
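The neighbor-averaging idea behind these decentralized methods can be sketched with a synchronous gossip round over an exponential graph, where each node mixes with peers at hop distances 1, 2, 4, … (a minimal sketch; the helper names and uniform mixing weights are illustrative assumptions, not the papers' exact protocol):

```python
import numpy as np

def exponential_graph_neighbors(i, n):
    """Out-neighbors of node i at hop distances 1, 2, 4, ... (mod n).
    Illustrative helper for an exponential-graph topology."""
    hops, k = [], 1
    while k < n:
        hops.append((i + k) % n)
        k *= 2
    return hops

def gossip_step(params, n):
    """One synchronous gossip round: each node replaces its parameters
    with the uniform average of itself and its neighbors. Because every
    node has the same in- and out-degree, this mixing is doubly
    stochastic and preserves the global average."""
    new = []
    for i in range(n):
        group = [params[i]] + [params[j] for j in exponential_graph_neighbors(i, n)]
        new.append(np.mean(group, axis=0))
    return new

n = 8
rng = np.random.default_rng(0)
params = [rng.standard_normal(4) for _ in range(n)]
mean_before = np.mean(params, axis=0)
spread_before = np.std(np.stack(params), axis=0).max()
for _ in range(5):
    params = gossip_step(params, n)
spread_after = np.std(np.stack(params), axis=0).max()
```

After a few rounds the per-node parameters contract toward consensus while the global average stays fixed, which is why gossip-style averaging can replace a full all-reduce at much lower communication cost.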
Different from all of these approaches, we regard the selection of large and small gradients as exploitation and exploration of gradient information, respectively.
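One way to read this exploitation/exploration view is as a sparsification rule: deterministically keep the largest-magnitude gradient entries (exploitation) and randomly sample a few of the remaining small ones (exploration). The sketch below is an assumption about the mechanics, not the paper's exact algorithm; all names are illustrative:

```python
import numpy as np

def select_gradients(grad, k_large, k_small, rng):
    """Return a boolean mask keeping the k_large largest-magnitude
    entries (exploitation) plus k_small entries sampled uniformly
    from the remainder (exploration). Illustrative sketch."""
    order = np.argsort(np.abs(grad))[::-1]       # indices by |grad|, descending
    large_idx = order[:k_large]
    small_idx = rng.choice(order[k_large:], size=k_small, replace=False)
    mask = np.zeros_like(grad, dtype=bool)
    mask[large_idx] = True
    mask[small_idx] = True
    return mask

rng = np.random.default_rng(0)
g = rng.standard_normal(100)
mask = select_gradients(g, k_large=10, k_small=5, rng=rng)
```

Here the random component gives small-but-occasionally-useful coordinates a chance to be updated, rather than always discarding them as plain top-k compression would.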
Communication overhead hinders the scalability of large-scale distributed training.
Specifically, we aim to approximate the true joint distribution over the partial observation and latent variables, and thereby infer the unseen targets.
Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM promises both efficient and high-quality training.
To address this limitation, we propose a framework that Learns position and target Consistency for Memory-based video object segmentation, termed LCM.
Recent works have shown that convolutional networks have substantially improved the performance of multiple object tracking by simultaneously learning detection and appearance features.
First, we adopt a simple but effective decoupled learning strategy for representations and classifiers, in which only the classifiers are updated in each incremental session; this avoids knowledge forgetting in the representations.
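The decoupling described above can be sketched as a frozen feature extractor with a trainable linear classifier: gradients flow only into the classifier, so the representation is untouched in each incremental session (a minimal NumPy sketch under assumed shapes and a softmax cross-entropy loss, not the paper's training recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.standard_normal((8, 4))   # frozen representation (never updated)
W_cls = np.zeros((4, 3))               # classifier, updated each session

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def incremental_step(x, y_onehot, lr=0.1):
    """One update in an incremental session: only W_cls receives a
    gradient; W_feat stays fixed, so features learned in earlier
    sessions cannot be forgotten. Illustrative sketch."""
    global W_cls
    feats = x @ W_feat                         # frozen forward pass
    probs = softmax(feats @ W_cls)
    grad = feats.T @ (probs - y_onehot) / len(x)
    W_cls -= lr * grad                         # only the classifier moves

x = rng.standard_normal((16, 8))
y = np.eye(3)[rng.integers(0, 3, 16)]
feat_before = W_feat.copy()
incremental_step(x, y)
```

Freezing the backbone trades some plasticity for stability: new classes get new classifier weights, while old-class features are preserved by construction.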
Ranked #6 on Few-Shot Class-Incremental Learning on mini-Imagenet
A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
Over the past decade, extreme classification has become an essential topic in deep learning.
Nowadays, live-stream and short video shopping in E-commerce have grown exponentially.
For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods.
Research has demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process.
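The core of INT8 inference acceleration is mapping float tensors onto 8-bit integers with a shared scale; a minimal symmetric per-tensor quantize/dequantize round trip looks like this (a generic sketch of the standard scheme, not any specific paper's calibration method):

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization: a single scale maps the
    float range onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Recover an approximate float tensor from INT8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)
max_err = np.abs(w - w_hat).max()
```

With round-to-nearest, the reconstruction error is bounded by half a quantization step (`s / 2`), which is why INT8 matrix multiplies can replace FP32 ones with little accuracy loss on well-scaled tensors.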
However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy.
In this paper, we present a novel side-information-based large-scale visual recognition co-training (SICoT) system to deal with the long-tail problem by leveraging image-related side information.
Benefiting from the exploration of user click data, our networks encode richer supervision more effectively and better distinguish real-shot images in terms of category and feature.
In many real-world datasets, like WebVision, the performance of DNN-based classifiers is often limited by noisy labeled data.
Given a single depth image, our method first goes through the 3D volume branch to obtain a volumetric scene reconstruction, which serves as a guide for the next view-inpainting step that attempts to make up the missing information. The third step projects the volume under the same view as the input, concatenates the two to complete the current-view depth, and integrates all depths into the point cloud.
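The final integration step relies on back-projecting each completed depth map into 3-D with a pinhole camera model; a minimal sketch (the intrinsics `fx, fy, cx, cy` are assumed known, and the function name is illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W) into an (N, 3) point cloud
    using pinhole intrinsics; pixels with zero depth are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                           # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                       # keep valid depths only

depth = np.full((4, 4), 2.0)                        # toy 4x4 depth at 2 m
pts = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
```

Running this per completed view and concatenating the results gives the fused point cloud the pipeline describes.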