At its core lies understanding the alignment between visual scenes in the video and the linguistic semantics of the question in order to yield the answer.
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
The laborious and time-consuming manual annotation has become a real bottleneck in various practical scenarios.
In particular, to solve the inherent ambiguity among four implicit variables, i.e., camera position, shape, texture, and illumination, we study existing works and introduce an explainable structural causal map (SCM) to build our model.
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
The comparisons of distribution differences between HQ and LQ images can help our model better assess the image quality.
Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting.
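The snippet-level contrastive objective can be illustrated with a generic InfoNCE loss; this is a minimal sketch of the standard formulation, not the paper's denoised cross-video variant, whose positive/negative mining differs.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE contrastive loss over video-snippet embeddings.

    anchor:    (D,)   embedding of one snippet
    positives: (P, D) snippets assumed to contain the same action
    negatives: (N, D) snippets assumed to contain different actions
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logits = positives @ anchor / temperature          # (P,)
    neg_logits = negatives @ anchor / temperature          # (N,)
    # Each positive is contrasted against all negatives; class 0 is the positive.
    logits = torch.cat(
        [pos_logits.unsqueeze(1),
         neg_logits.unsqueeze(0).expand(len(pos_logits), -1)], dim=1)
    labels = torch.zeros(len(pos_logits), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Minimizing this loss pulls same-action snippets together and pushes different-action snippets apart, which is the feature-discrimination effect the abstract describes.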
To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues.
Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.
As a by-product, a CapS dataset is constructed by augmenting the existing benchmark training set with additional image tags and captions.
Lastly, we compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization.
2 code implementations • 1 Nov 2021 • Guanghua Yu, Qinyao Chang, Wenyu Lv, Chang Xu, Cheng Cui, Wei Ji, Qingqing Dang, Kaipeng Deng, Guanzhong Wang, Yuning Du, Baohua Lai, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, Yanjun Ma
We investigate the applicability of the anchor-free strategy on lightweight object detection models.
Ranked #1 on Object Detection on MSCOCO
The core is to construct a latent content space for strategy optimization and disentangle the surface style from it.
We conclude with an outlook on how deep learning could shape the future of this new generation of light microscopy technology.
Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD).
To our knowledge, our work is the first in producing calibrated predictions under different expertise levels for medical image segmentation.
To fill the research gap, we propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
With the success of deep neural networks in object detection, both WSOD and WSOL have received unprecedented attention.
We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
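Task-level adaptation of this kind is often realized by predicting modulation parameters from a global descriptor of the degraded input. The sketch below is illustrative only: the names ConditionNet/BaseNet follow the text, but the layer sizes and the scale/shift modulation scheme are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalSR(nn.Module):
    """Sketch: a small condition branch predicts per-channel scale/shift
    that adapts the features of a basic SR branch."""

    def __init__(self, channels=16):
        super().__init__()
        # BaseNet (assumed): a tiny conv branch standing in for the SR network.
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.act = nn.ReLU()
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)
        # ConditionNet (assumed): pools the low-quality input into a global
        # descriptor and predicts modulation parameters for the base features.
        self.cond = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(3, 2 * channels),
        )

    def forward(self, lq):
        gamma, beta = self.cond(lq).chunk(2, dim=1)      # (B, C) each
        feat = self.head(lq)
        feat = feat * (1 + gamma[..., None, None]) + beta[..., None, None]
        return self.tail(self.act(feat))
```

The design choice here is feature-wise modulation: the degradation prior changes *how* the base network processes the input without retraining its weights per task.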
State-of-the-art NLVL methods almost all follow a one-stage paradigm and can typically be grouped into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; and 2) anchor-free approaches, which directly predict, for each video frame, the probability of being a boundary or intermediate frame inside the positive segment.
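The anchor-based candidate enumeration can be made concrete with a sliding-window sketch; the window sizes and stride ratio below are illustrative defaults, not values from any specific NLVL method.

```python
def sliding_window_anchors(num_frames, window_sizes=(16, 32, 64), stride_ratio=0.5):
    """Enumerate (start, end) segment candidates that an anchor-based
    NLVL model would score against the language query."""
    anchors = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        # Slide each window size over the video with 50% overlap by default.
        for start in range(0, max(num_frames - w, 0) + 1, stride):
            anchors.append((start, start + w))
    return anchors
```

An anchor-free model skips this enumeration entirely and instead emits per-frame boundary probabilities, trading a fixed candidate set for a direct dense prediction.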
Our bidirectional dynamic fusion strategy encourages the interaction of spatial and temporal information in a dynamic manner.
Ranked #13 on Video Polyp Segmentation on SUN-SEG-Easy
We notice that some real-world QA tasks are more complex and can neither be solved by end-to-end neural networks nor translated into any kind of formal representation.
It is common for people to create different types of charts to explore a multi-dimensional dataset (table).
The explicitly extracted edge information is combined with saliency features to place greater emphasis on salient regions and object boundaries.
Ranked #18 on RGB-D Salient Object Detection on NJU2K
no code implementations • 30 Apr 2020 • Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller
In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety.
As a fundamental and challenging problem in computer vision, hand pose estimation aims to estimate the hand joint locations from depth images.