no code implementations • 20 Sep 2023 • Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie
To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools and APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs.
1 code implementation • 13 Sep 2023 • Jiayu Lei, Lisong Dai, Haoyun Jiang, Chaoyi Wu, Xiaoman Zhang, Yao Zhang, Jiangchao Yao, Weidi Xie, Yanyong Zhang, Yuehua Li, Ya Zhang, Yanfeng Wang
Magnetic resonance imaging (MRI) has played a crucial role in brain disease diagnosis, for which a range of computer-aided artificial intelligence methods have been proposed.
no code implementations • 7 Sep 2023 • Hala Lamdouar, Weidi Xie, Andrew Zisserman
We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner.
1 code implementation • 16 Aug 2023 • Fangrui Zhu, Yiming Xie, Weidi Xie, Huaizu Jiang
Although we have witnessed significant progress in human-object interaction (HOI) detection with increasingly high mAP (mean Average Precision), a single mAP score is too concise to obtain an informative summary of a model's performance and to understand why one approach is better than another.
1 code implementation • 9 Aug 2023 • Qingyao Xu, Weibo Mao, Jingze Gong, Chenxin Xu, Siheng Chen, Weidi Xie, Ya Zhang, Yanfeng Wang
Multi-person motion prediction is a challenging problem due to the dependency of motion on both individual past movements and interactions with other people.
1 code implementation • 4 Aug 2023 • Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie
In this study, we aim to initiate the development of a Radiology Foundation Model, termed RadFM. We consider the construction of foundational models from the perspectives of dataset construction, model design, and thorough evaluation.
1 code implementation • 24 Jun 2023 • HaoNing Wu, Xiaoyun Zhang, Weidi Xie, Ya Zhang, Yanfeng Wang
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
1 code implementation • 13 Jun 2023 • Gyungin Shin, Weidi Xie, Samuel Albanie
In this paper, we propose to meet this challenge through the novel task of automatic table verification (AutoTV), in which the objective is to verify the accuracy of numerical data in tables by cross-referencing cited sources.
no code implementations • 12 Jun 2023 • Yikun Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expressive ability.
no code implementations • 8 Jun 2023 • Prannay Kaul, Weidi Xie, Andrew Zisserman
The goal of this paper is open-vocabulary object detection (OVOD) -- building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining.
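As a rough illustration of the open-vocabulary idea (a minimal sketch, not this paper's method), classifying detected regions can be reduced to cosine similarity between region features and text embeddings of user-supplied category names; `embed_text` here is a hypothetical stand-in for any joint vision-language text encoder:

```python
import numpy as np

def classify_regions(region_feats, class_names, embed_text):
    """Score detected regions against text embeddings of category names."""
    # Embed and L2-normalise both the category names and the region
    # features, then take cosine similarity via a dot product.
    text = np.stack([embed_text(name) for name in class_names])
    text = text / np.linalg.norm(text, axis=1, keepdims=True)
    feats = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    scores = feats @ text.T           # (num_regions, num_classes)
    return scores.argmax(axis=1)      # best-matching category per region
```

Because categories enter only through `embed_text`, extending the vocabulary at inference is just a matter of passing a longer `class_names` list.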
1 code implementation • 1 Jun 2023 • Chang Liu, HaoNing Wu, Yujie Zhong, Xiaoyun Zhang, Weidi Xie
Generative models have recently exhibited exceptional capabilities in various scenarios, for example, image generation based on text description.
no code implementations • 18 May 2023 • Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, Weidi Xie
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks.
2 code implementations • 17 May 2023 • Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial in efficiently interpreting medical images with vital clinic-relevant information.
Ranked #1 on Medical Visual Question Answering on VQA-RAD
1 code implementation • 27 Apr 2023 • Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie
Our contributions are threefold: (i) we systematically investigate the process of adapting a general-purpose foundation language model towards the medical domain; this involves data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive fine-tuning for alignment with domain-specific instructions; (ii) we contribute a large-scale, comprehensive dataset for instruction tuning.
1 code implementation • 27 Apr 2023 • Gyungin Shin, Samuel Albanie, Weidi Xie
Segmentation is a core computer vision competency, with applications spanning a broad range of scientifically and economically valuable domains.
1 code implementation • 4 Apr 2023 • Haochen Wang, Cilin Yan, Shuai Wang, XiaoLong Jiang, Xu Tang, Yao Hu, Weidi Xie, Efstratios Gavves
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos.
1 code implementation • CVPR 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
1 code implementation • CVPR 2023 • Yue Hu, Yifan Lu, Runsheng Xu, Weidi Xie, Siheng Chen, Yanfeng Wang
Camera-only 3D detection provides an economical solution with a simple configuration for localizing objects in 3D space compared to LiDAR-based detection systems.
no code implementations • 21 Mar 2023 • Chen Ju, Zeqian Li, Peisen Zhao, Ya Zhang, Xiaopeng Zhang, Qi Tian, Yanfeng Wang, Weidi Xie
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time.
1 code implementation • 13 Mar 2023 • Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
Foundation models trained on large-scale datasets have recently gained traction in CV and NLP.
Ranked #3 on Medical Visual Question Answering on PMC-VQA
no code implementations • 27 Feb 2023 • Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
While multi-modal foundation models pre-trained on large-scale data have been successful in natural language understanding and vision recognition, their use in medical domains is still limited due to the fine-grained nature of medical tasks and the high demand for domain knowledge.
no code implementations • 22 Feb 2023 • Chaoyi Wu, Xiaoman Zhang, Yanfeng Wang, Ya Zhang, Weidi Xie
In this paper, we consider the problem of disease diagnosis.
1 code implementation • CVPR 2023 • Keyan Chen, XiaoLong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, Weidi Xie
In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario.
Ranked #1 on Open Vocabulary Attribute Detection on OVAD benchmark (using extra training data)
1 code implementation • CVPR 2023 • Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie
The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities.
no code implementations • 12 Jan 2023 • Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie
The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model, in the form of a segmentation map, i.e., simultaneously generating images and segmentation masks for the corresponding visual entities described in the text prompt.
no code implementations • 5 Jan 2023 • Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie
In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice.
1 code implementation • 27 Oct 2022 • Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, Weidi Xie
When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
1 code implementation • 18 Oct 2022 • Guanqi Zhan, Weidi Xie, Andrew Zisserman
To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects.
Ranked #1 on Instance Segmentation on Separated COCO
1 code implementation • 13 Oct 2022 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
no code implementations • 10 Oct 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is an efficient training method for video tasks.
no code implementations • 7 Oct 2022 • Qinye Zhou, Ziyi Li, Weidi Xie, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang
Existing super-resolution models are often specialized for a single scale, fundamentally limiting their use in practical scenarios.
no code implementations • 1 Oct 2022 • Shuangrui Ding, Weidi Xie, Yabo Chen, Rui Qian, Xiaopeng Zhang, Hongkai Xiong, Qi Tian
In this paper, we consider the task of unsupervised object discovery in videos.
Ranked #3 on Unsupervised Object Segmentation on DAVIS 2016
1 code implementation • 22 Sep 2022 • Gyungin Shin, Weidi Xie, Samuel Albanie
Our method, termed NamedMask, begins by using CLIP to construct category-specific archives of images.
no code implementations • 12 Sep 2022 • Pak-Hei Yeung, Moska Aliasi, Monique Haak, the INTERGROWTH-21st Consortium, Weidi Xie, Ana I. L. Namburete
Two-dimensional (2D) freehand ultrasound is the mainstay in prenatal care and fetal growth monitoring.
1 code implementation • 29 Aug 2022 • Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting.
Ranked #2 on Object Counting on FSC147
no code implementations • 20 Aug 2022 • Wentao Liu, Chaofan Ma, Yuhuan Yang, Weidi Xie, Ya Zhang
The goal of this paper is to interactively refine the automatic segmentation of challenging structures that falls behind human performance, either due to the scarcity of available annotations or the difficult nature of the problem itself, for example, segmenting cancer or small organs.
no code implementations • 8 Aug 2022 • Yue Hu, Shaoheng Fang, Weidi Xie, Siheng Chen
To fill the gap, this work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space.
1 code implementation • 5 Jul 2022 • Junyu Xie, Weidi Xie, Andrew Zisserman
The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video.
Ranked #3 on Unsupervised Object Segmentation on FBMS-59
no code implementations • 26 Jun 2022 • Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang
We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos.
2 code implementations • 14 Jun 2022 • Gyungin Shin, Weidi Xie, Samuel Albanie
Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment.
1 code implementation • 14 Jun 2022 • Ziheng Zhao, Tianjiao Zhang, Weidi Xie, Yanfeng Wang, Ya Zhang
This paper considers the problem of undersampled MRI reconstruction.
1 code implementation • CVPR 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
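A minimal sketch of the alignability idea, assuming per-timestep video features and a sentence embedding already live in a shared space (the function names and the single-threshold rule are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def align_sentence(video_feats, sent_feat, threshold=0.5):
    """Decide if a sentence is alignable with a video, and where."""
    # Cosine similarity between the sentence and every timestep; the
    # sentence is declared alignable if any timestep scores above the
    # threshold, in which case the best-matching timestep is returned.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    s = sent_feat / np.linalg.norm(sent_feat)
    sims = v @ s
    best = int(sims.argmax())
    return bool(sims[best] >= threshold), best
```

In practice such thresholding matters because, for long videos, many narration sentences simply have no visual counterpart.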
2 code implementations • 30 Mar 2022 • Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, Lin Ma
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
1 code implementation • 23 Mar 2022 • Gyungin Shin, Samuel Albanie, Weidi Xie
In this paper, we tackle the challenging task of unsupervised salient object detection (SOD) by leveraging spectral clustering on self-supervised features.
Ranked #2 on Unsupervised Saliency Detection on DUTS
1 code implementation • CVPR 2022 • Prannay Kaul, Weidi Xie, Andrew Zisserman
The objective of this paper is few-shot object detection (FSOD) -- the task of expanding an object detector for a new category given only a few instances for training.
no code implementations • 8 Dec 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
1 code implementation • 8 Dec 2021 • Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for zero-shot generalisation.
Ranked #5 on Zero-Shot Action Detection on ActivityNet-1.3
no code implementations • CVPR 2022 • Charig Yang, Weidi Xie, Andrew Zisserman
In this paper, we present a framework for reading analog clocks in natural images or videos.
no code implementations • 24 Sep 2021 • Pak-Hei Yeung, Linde Hesse, Moska Aliasi, Monique Haak, the INTERGROWTH-21st Consortium, Weidi Xie, Ana I. L. Namburete
The objective of this work is to achieve sensorless reconstruction of a 3D volume from a set of 2D freehand ultrasound images with deep implicit representation.
no code implementations • 7 Sep 2021 • Xiaoman Zhang, Weidi Xie, Chaoqin Huang, Yanfeng Wang, Ya Zhang, Xin Chen, Qi Tian
In this paper, we target self-supervised representation learning for zero-shot tumor segmentation.
1 code implementation • 26 May 2021 • Pak-Hei Yeung, Ana I. L. Namburete, Weidi Xie
The objective of this work is to segment any arbitrary structures of interest (SOI) in 3D volumes by annotating only a single slice (i.e., semi-automatic 3D segmentation).
no code implementations • ICCV 2021 • Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, Weidi Xie
We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.
Ranked #7 on Unsupervised Object Segmentation on DAVIS 2016
1 code implementation • 13 Apr 2021 • Gyungin Shin, Weidi Xie, Samuel Albanie
A central challenge for the task of semantic segmentation is the prohibitive cost of obtaining dense pixel-level annotations to supervise model training.
1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
2 code implementations • 26 Mar 2021 • Ben Jaderberg, Lewis W. Anderson, Weidi Xie, Samuel Albanie, Martin Kiffner, Dieter Jaksch
The resurgence of self-supervised learning, whereby a deep learning model generates its own supervisory signal from the data, promises a scalable way to tackle the dramatically increasing size of real-world data sets without human annotation.
5 code implementations • 14 Feb 2021 • ZiRui Wang, Shangzhe Wu, Weidi Xie, Min Chen, Victor Adrian Prisacariu
Considering the problem of novel view synthesis (NVS) from only a set of 2D images, we simplify the training process of Neural Radiance Field (NeRF) on forward-facing scenes by removing the requirement of known or pre-computed camera parameters, including both intrinsics and 6DoF poses.
no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
no code implementations • 23 Nov 2020 • Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman
We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consecutive frames based on the background, which effectively emphasises the object boundary in the difference image, and a motion segmentation module with memory that discovers the moving objects, while maintaining the object permanence even when motion is absent at some point.
1 code implementation • NeurIPS 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is visual-only self-supervised video representation learning.
Ranked #12 on Self-Supervised Action Recognition on HMDB51 (finetuned)
1 code implementation • 16 Sep 2020 • Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein
We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur.
no code implementations • 1 Sep 2020 • Weidi Xie, Jeffrey Byrne, Andrew Zisserman
We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high-quality (unblurred, well-lit, more frontal) faces from a collection, e.g. for automatic enrolment or display.
1 code implementation • ECCV 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning from video, in particular for representations for action recognition.
2 code implementations • ECCV 2020 • Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods.
Ranked #4 on Vehicle Re-Identification on VehicleID Medium
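To see why AP resists gradient descent: it depends on the scores only through their ranks, so it is piecewise constant and its gradient is zero almost everywhere. A common remedy, sketched below in plain NumPy (an illustration of the general sigmoid-relaxation idea, not this paper's exact formulation), smooths each hard rank comparison:

```python
import numpy as np

def average_precision(scores, labels):
    # Exact AP: depends on scores only through the sort order, hence
    # piecewise constant with zero gradient almost everywhere.
    order = np.argsort(-scores)
    hits = np.cumsum(labels[order])
    precisions = hits / np.arange(1, len(labels) + 1)
    return float((precisions * labels[order]).sum() / labels.sum())

def smooth_ap(scores, labels, tau=0.01):
    # Relaxation: replace each hard comparison [s_j > s_i] with a sigmoid
    # of the score gap, yielding a differentiable surrogate for AP.
    diff = scores[None, :] - scores[:, None]      # diff[i, j] = s_j - s_i
    sg = 1.0 / (1.0 + np.exp(-diff / tau))        # soft "j outranks i"
    np.fill_diagonal(sg, 0.0)                     # ignore self-comparisons
    pos = labels.astype(bool)
    rank = 1.0 + sg.sum(axis=1)                   # smoothed overall rank
    pos_rank = 1.0 + sg[:, pos].sum(axis=1)       # smoothed rank among positives
    return float((pos_rank[pos] / rank[pos]).mean())
```

As `tau` shrinks, the surrogate approaches exact AP; in practice a moderate `tau` trades fidelity for smoother gradients.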
no code implementations • 22 Jun 2020 • Fangrui Zhu, Li Zhang, Yanwei Fu, Guodong Guo, Weidi Xie
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. one-shot visual object segmentation).
2 code implementations • 29 Apr 2020 • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques.
2 code implementations • CVPR 2020 • Zihang Lai, Erika Lu, Weidi Xie
Recent interest in self-supervised dense tracking has yielded rapid progress, but performance still remains far from supervised methods.
Ranked #4 on Unsupervised Video Object Segmentation on DAVIS 2017 (val) (using extra training data)
no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data.
1 code implementation • 10 Sep 2019 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
Ranked #32 on Self-Supervised Action Recognition on UCF101
no code implementations • 6 Sep 2019 • Dan Xu, Weidi Xie, Andrew Zisserman
In this paper we propose a geometry-aware model for video object detection.
no code implementations • 14 Aug 2019 • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
We propose AutoCorrect, a method to automatically learn object-annotation alignments from a dataset with annotations affected by geometric noise.
1 code implementation • 2 May 2019 • Zihang Lai, Weidi Xie
Fourth, in order to shed light on the potential of self-supervised learning for the task of video correspondence flow, we probe the upper bound by training on additional data, i.e. more diverse videos, further demonstrating significant improvements on video segmentation.
9 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals.
1 code implementation • 1 Nov 2018 • Erika Lu, Weidi Xie, Andrew Zisserman
The model achieves competitive performance on cell and crowd counting datasets, and surpasses the state-of-the-art on the car dataset using only three training images.
no code implementations • ECCV 2018 • Weidi Xie, Li Shen, Andrew Zisserman
Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair--this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models.
1 code implementation • 24 Jul 2018 • Weidi Xie, Andrew Zisserman
In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, illumination), and "content" quality (relative importance for discriminative classification).
Ranked #5 on Face Verification on IJB-C (TAR @ FAR=1e-2 metric)
no code implementations • 3 Nov 2017 • Davis M. Vigneault, Weidi Xie, Carolyn Y. Ho, David A. Bluemke, J. Alison Noble
Pixelwise segmentation of the left ventricular (LV) myocardium and the four cardiac chambers in 2-D steady state free precession (SSFP) cine sequences is an essential preprocessing step for a wide range of analyses.
18 code implementations • 23 Oct 2017 • Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, Andrew Zisserman
The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.
Ranked #1 on Face Verification on IJB-C (training dataset metric)
no code implementations • 17 Jul 2017 • Yipeng Hu, Eli Gibson, Li-Lin Lee, Weidi Xie, Dean C. Barratt, Tom Vercauteren, J. Alison Noble
Sonography synthesis has a wide range of applications, including medical procedure simulation, clinical training and multimodality image registration.
no code implementations • 12 Apr 2017 • Davis M. Vigneault, Weidi Xie, David A. Bluemke, J. Alison Noble
Feature tracking Cardiac Magnetic Resonance (CMR) has recently emerged as an area of interest for quantification of regional cardiac function from balanced, steady state free precession (SSFP) cine sequences.