It is hypothesized that one’s interest in a hashtag is related to what they said before (user history) and to the existing posts that contain the hashtag (hashtag contexts).
To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities.
Aiming to fundamentally improve the depth estimation quality for colonoscopy 3D reconstruction, in this work we have designed a set of training losses to deal with the special challenges of colonoscopy data.
Based on the frequency principle on GNNs, we present a novel and powerful GNN framework, Multi-Scale Frequency Enhanced Graph Neural Networks (MSF-GNNs), which considers multi-scale representations from wavelet decomposition.
In this work we focus on the lighting problem in colonoscopy videos.
To help with identifiability, we develop an advection-diffusion simulator that allows supervised pre-training of our model using the velocity and diffusion tensor fields.
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
To address this issue, we propose a ‘safety score’ as a primary metric for measuring the level of safety in AV computing system design.
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
Concretely, our approach uses two convolutional neural networks: (1) a gesture network that uses pre-defined motion information to detect the hand region; and (2) an appearance network that learns a person-specific model of the hand region based on the output of the gesture network.