With the recent progress in large-scale vision and language representation learning, Vision Language Pre-training (VLP) models have achieved promising improvements on various multi-modal downstream tasks.
Finally, we propose a hybrid network that is jointly optimized to learn a more generic product representation.
With the prevalence of deep-learning-based embedding approaches, recommender systems have become a proven and indispensable tool in various information filtering applications.
Recommender systems are popular tools for information retrieval tasks on a large variety of web applications and personalized products.
In this work, we show that this paradigm might inherit the adversarial vulnerability of the centralized neural network, i.e., its performance deteriorates on adversarial examples once the model is deployed.
We present a visual localization framework for autonomous driving, based on novel deep attention-aware features, that achieves centimeter-level localization accuracy.
To this end, we propose the video shuffle, a parameter-free plug-in component that efficiently reallocates the inputs of 2D convolution so that its receptive field can be extended to the temporal dimension.
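The sentence above does not spell out how the inputs are reallocated, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a clip tensor of shape `(T, C, H, W)` whose channel count is divisible by the number of frames, and it swaps channel groups across frames so that a per-frame 2D convolution sees features from every time step. The function name `video_shuffle` and the grouping scheme are illustrative assumptions.

```python
import numpy as np


def video_shuffle(x: np.ndarray) -> np.ndarray:
    """Hypothetical temporal channel shuffle (an assumption, not the paper's code).

    x has shape (T, C, H, W). Channels are split into T groups, and group g of
    frame t is moved to frame g, group t. The operation has no parameters.
    """
    t, c, h, w = x.shape
    if c % t != 0:
        raise ValueError("channel count must be divisible by the number of frames")
    group_size = c // t
    # (T, C, H, W) -> (T, G, C/G, H, W), with G == T channel groups per frame
    x = x.reshape(t, t, group_size, h, w)
    # Swap the frame axis with the group axis
    x = x.transpose(1, 0, 2, 3, 4)
    # Merge groups back into the channel axis; the shape is unchanged
    return x.reshape(t, c, h, w)


# Two frames, four channels, 1x1 spatial size
clip = np.arange(8).reshape(2, 4, 1, 1)
shuffled = video_shuffle(clip)
```

Under these assumptions each output frame holds an equal share of channels from every input frame, and applying the shuffle a second time restores the original layout, so it can be wrapped around an ordinary 2D convolution.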
We present DeepICP, a novel end-to-end learning-based 3D point cloud registration framework that achieves registration accuracy comparable to prior state-of-the-art geometric methods.
The unprecedented demand for large amounts of data has catalyzed the trend of combining human insights with machine learning techniques, which facilitates the use of crowdsourcing to collect label information both effectively and efficiently.
With the increasing demand for large amounts of labeled data, crowdsourcing has been used in many large-scale data mining applications.
We describe an attentive encoder that combines tree-structured recursive neural networks and sequential recurrent neural networks for modelling sentence pairs.