Search Results for author: Stephen Gould

Found 85 papers, 30 papers with code

Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

no code implementations 25 May 2023 Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould

Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time.

Image Retrieval Re-Ranking +1

GoferBot: A Visual Guided Human-Robot Collaborative Assembly System

no code implementations 18 Apr 2023 Zheyu Zhuang, Yizhak Ben-Shabat, Jiahao Zhang, Stephen Gould, Robert Mahony

It is composed of a visual servoing module that reaches and grasps assembly parts in an unstructured multi-instance and dynamic environment, an action recognition module that performs human action prediction for implicit communication, and a visual handover module that uses the perceptual understanding of human behaviour to produce an intuitive and efficient collaborative assembly experience.

Action Recognition

Adaptive Cross Batch Normalization for Metric Learning

no code implementations 30 Mar 2023 Thalaiyasingam Ajanthan, Matt Ma, Anton Van Den Hengel, Stephen Gould

In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration as the learnable parameters are being updated.

Image Retrieval Metric Learning +1

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

no code implementations 29 Mar 2023 Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould

Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes.

Composed Image Retrieval (CoIR) Retrieval

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

1 code implementation CVPR 2023 Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodriguez, Stephen Gould

In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprising an enactment of the assembly actions in the real world.

Contrastive Learning Image Retrieval +2

Deep Declarative Dynamic Time Warping for End-to-End Learning of Alignment Paths

1 code implementation 19 Mar 2023 Ming Xu, Sourav Garg, Michael Milford, Stephen Gould

An interesting byproduct of this formulation is that DecDTW outputs the optimal warping path between two time series as opposed to a soft approximation, recoverable from Soft-DTW.

Dynamic Time Warping Information Retrieval +3

Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances

1 code implementation 10 Mar 2023 Yunzhong Hou, Stephen Gould, Liang Zheng

Multiview camera setups have proven useful in many computer vision applications for reducing ambiguities, mitigating occlusions, and increasing field-of-view coverage.

Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation

no code implementations 2 Feb 2023 Weijian Deng, Yumin Suh, Stephen Gould, Liang Zheng

This work aims to assess how well a model performs under distribution shifts without using labels.

Octree Guided Unoriented Surface Reconstruction

no code implementations CVPR 2023 Chamin Hewa Koneputugodage, Yizhak Ben-Shabat, Stephen Gould

We propose a two-step approach, OG-INR, where we (1) construct a discrete octree and label what is inside and outside (2) optimize for a continuous and high-fidelity shape using an INR that is initially guided by the octree's labelling.

Surface Reconstruction

Understanding and Improving the Role of Projection Head in Self-Supervised Learning

no code implementations 22 Dec 2022 Kartik Gupta, Thalaiyasingam Ajanthan, Anton Van Den Hengel, Stephen Gould

Most current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective and then discard the learned projection head after training.

Contrastive Learning Image Classification +1

NeRFEditor: Differentiable Style Decomposition for Full 3D Scene Editing

no code implementations 7 Dec 2022 Chunyi Sun, Yanbin Liu, Junlin Han, Stephen Gould

Specifically, we use a NeRF model to generate numerous image-angle pairs to train an adjustor, which can adjust the StyleGAN latent code to generate high-fidelity stylized images for any given angle.

Self-Supervised Learning

Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

no code implementations 17 Aug 2022 Yunzhong Hou, Stephen Gould, Liang Zheng

In this paper, we take the best of both worlds and propose multi-view correlation consistency (MVCC) learning: it considers rich pairwise relationships in self-correlation matrices and matches them across views to provide robust supervision.

Contrastive Learning Data Augmentation +1

Learning to Structure an Image with Few Colors and Beyond

no code implementations 17 Aug 2022 Yunzhong Hou, Liang Zheng, Stephen Gould

To this end, we propose a color quantization network, ColorCNN, which learns to structure an image in limited color spaces by minimizing the classification loss.

Image Compression Imitation Learning +1

On the Strong Correlation Between Model Invariance and Generalization

no code implementations 14 Jul 2022 Weijian Deng, Stephen Gould, Liang Zheng

Generalization and invariance are two essential properties of any machine learning model.

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

1 code implementation CVPR 2022 Yicong Hong, Zun Wang, Qi Wu, Stephen Gould

To bridge the discrete-to-continuous gap, we propose a predictor to generate a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments.

Imitation Learning Vision and Language Navigation

Exploiting Problem Structure in Deep Declarative Networks: Two Case Studies

no code implementations 24 Feb 2022 Stephen Gould, Dylan Campbell, Itzik Ben-Shabat, Chamin Hewa Koneputugodage, Zhiwei Xu

Deep declarative networks and other recent related works have shown how to differentiate the solution map of a (continuous) parametrized optimization problem, opening up the possibility of embedding mathematical optimization problems into end-to-end learnable models.

Vocal Bursts Valence Prediction

A Regularized Wasserstein Framework for Graph Kernels

1 code implementation 6 Oct 2021 Asiri Wijesinghe, Qing Wang, Stephen Gould

This framework provides a novel optimal transport distance metric, namely Regularized Wasserstein (RW) discrepancy, which can preserve both features and structure of graphs via Wasserstein distances on features and their local variations, local barycenters and global connectivity.

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

3 code implementations ICCV 2021 Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould

We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion.

Composed Image Retrieval (CoIR) Retrieval +1

DiGS: Divergence guided shape implicit neural representation for unoriented point clouds

1 code implementation 21 Jun 2021 Yizhak Ben-Shabat, Chamin Hewa Koneputugodage, Stephen Gould

In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input.

Representation Learning Surface Reconstruction

What Does Rotation Prediction Tell Us about Classifier Accuracy under Varying Testing Environments?

no code implementations 10 Jun 2021 Weijian Deng, Stephen Gould, Liang Zheng

In this work, we train semantic classification and rotation prediction in a multi-task way.

Semantics for Robotic Mapping, Perception and Interaction: A Survey

no code implementations 2 Jan 2021 Sourav Garg, Niko Sünderhauf, Feras Dayoub, Douglas Morrison, Akansel Cosgun, Gustavo Carneiro, Qi Wu, Tat-Jun Chin, Ian Reid, Stephen Gould, Peter Corke, Michael Milford

In robotics and related research fields, the study of understanding is often referred to as semantics, which dictates what the world "means" to a robot, and is strongly tied to the question of how to represent that meaning.

Autonomous Driving Navigate

Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking

no code implementations CVPR 2021 Fatemeh Saleh, Sadegh Aliakbarian, Hamid Rezatofighi, Mathieu Salzmann, Stephen Gould

Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge.

Multiple Object Tracking

Rethinking conditional GAN training: An approach using geometrically structured latent manifolds

1 code implementation NeurIPS 2021 Sameera Ramasinghe, Moshiur Farazi, Salman Khan, Nick Barnes, Stephen Gould

Conditional GANs (cGAN), in their rudimentary form, suffer from critical drawbacks such as the lack of diversity in generated outputs and distortion between the latent and output manifolds.

Image-to-Image Translation Translation

Language and Visual Entity Relationship Graph for Agent Navigation

1 code implementation NeurIPS 2020 Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould

From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment.

Dynamic Time Warping Navigate +2

DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video

1 code implementation 13 Oct 2020 Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, Stephen Gould

This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query.

Conditional Generative Modeling via Learning the Latent Space

no code implementations ICLR 2021 Sameera Ramasinghe, Kanchana Ranasinghe, Salman Khan, Nick Barnes, Stephen Gould

Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings.

Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization

2 code implementations ECCV 2020 Dylan Campbell, Liu Liu, Stephen Gould

We instead propose the first fully end-to-end trainable network for solving the blind PnP problem efficiently and globally, that is, without the need for pose priors.

Bidirectionally Self-Normalizing Neural Networks

no code implementations 22 Jun 2020 Yao Lu, Stephen Gould, Thalaiyasingam Ajanthan

The problem of vanishing and exploding gradients has been a long-standing obstacle that hinders the effective training of neural networks.

A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews

no code implementations WS 2020 Edison Marrese-Taylor, Cristian Rodriguez-Opazo, Jorge A. Balazs, Stephen Gould, Yutaka Matsuo

Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews.

Opinion Mining

ArTIST: Autoregressive Trajectory Inpainting and Scoring for Tracking

no code implementations 16 Apr 2020 Fatemeh Saleh, Sadegh Aliakbarian, Mathieu Salzmann, Stephen Gould

One of the core components in online multiple object tracking (MOT) frameworks is associating new detections with existing tracklets, typically done via a scoring function.

Human motion prediction motion prediction +1

Sub-Instruction Aware Vision-and-Language Navigation

1 code implementation EMNLP 2020 Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, Stephen Gould

Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.

Navigate Vision and Language Navigation

Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization

no code implementations 26 Feb 2020 Shihao Jiang, Dylan Campbell, Miaomiao Liu, Stephen Gould, Richard Hartley

We address the problem of joint optical flow and camera motion estimation in rigid scenes by incorporating geometric constraints into an unsupervised deep learning framework.

Motion Estimation Optical Flow Estimation

Contextually Plausible and Diverse 3D Human Motion Prediction

no code implementations ICCV 2021 Sadegh Aliakbarian, Fatemeh Sadat Saleh, Lars Petersson, Stephen Gould, Mathieu Salzmann

We tackle the task of diverse 3D human motion prediction, that is, forecasting multiple plausible future 3D poses given a sequence of observed 3D poses.

Human motion prediction Image Captioning +1

Spectral-GANs for High-Resolution 3D Point-cloud Generation

1 code implementation 4 Dec 2019 Sameera Ramasinghe, Salman Khan, Nick Barnes, Stephen Gould

Point-clouds are a popular choice for vision and graphics tasks due to their accurate shape description and direct acquisition from range-scanners.

Point Cloud Generation Vocal Bursts Intensity Prediction

Representation Learning on Unit Ball with 3D Roto-Translational Equivariance

no code implementations 30 Nov 2019 Sameera Ramasinghe, Salman Khan, Nick Barnes, Stephen Gould

In this work, we propose a novel "volumetric convolution" operation that can effectively model and convolve arbitrary functions in $\mathbb{B}^3$.

3D Object Recognition Representation Learning

Deep Declarative Networks: A New Hope

1 code implementation 11 Sep 2019 Stephen Gould, Richard Hartley, Dylan Campbell

We show how these declarative processing nodes can be implemented in the popular PyTorch deep learning software library allowing declarative and imperative nodes to co-exist within the same network.

Point Cloud Classification

Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention

1 code implementation 20 Aug 2019 Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Fatemeh Sadat Saleh, Hongdong Li, Stephen Gould

Given an untrimmed video and a sentence as the query, the goal is to determine the start and end of the relevant visual moment in the video that corresponds to the query sentence.

Learning Variations in Human Motion via Mix-and-Match Perturbation

no code implementations 2 Aug 2019 Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, Stephen Gould, Amirhossein Habibian

In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account.

Human motion prediction motion prediction

A Signal Propagation Perspective for Pruning Neural Networks at Initialization

1 code implementation ICLR 2020 Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, Philip H. S. Torr

Alternatively, a recent approach shows that pruning can be done at initialization prior to training, based on a saliency criterion called connection sensitivity.

Image Classification Network Pruning

The Alignment of the Spheres: Globally-Optimal Spherical Mixture Alignment for Camera Pose Estimation

no code implementations CVPR 2019 Dylan Campbell, Lars Petersson, Laurent Kneip, Hongdong Li, Stephen Gould

Determining the position and orientation of a calibrated camera from a single image with respect to a 3D model is an essential task for many applications.

Pose Estimation

Partially-Supervised Image Captioning

no code implementations NeurIPS 2018 Peter Anderson, Stephen Gould, Mark Johnson

To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets.

Image Captioning object-detection +1

Non-Linear Temporal Subspace Representations for Activity Recognition

no code implementations CVPR 2018 Anoop Cherian, Suvrit Sra, Stephen Gould, Richard Hartley

As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of data onto which capture their temporal order.

Action Recognition Riemannian optimization +2

Video Representation Learning Using Discriminative Pooling

no code implementations CVPR 2018 Jue Wang, Anoop Cherian, Fatih Porikli, Stephen Gould

In an attempt to tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action.

Action Recognition In Videos Multiple Instance Learning +2

Neural Algebra of Classifiers

no code implementations 26 Jan 2018 Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, Stephen Gould

In this paper, we build on the compositionality principle and develop an "algebra" to compose classifiers for complex visual concepts.

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

8 code implementations CVPR 2018 Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton Van Den Hengel

This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering.

Translation Vision and Language Navigation +2

Human Action Forecasting by Learning Task Grammars

no code implementations 19 Sep 2017 Tengda Han, Jue Wang, Anoop Cherian, Stephen Gould

For effective human-robot interaction, it is important that a robotic assistant can forecast the next action a human will consider in a given task.

Action Recognition Temporal Action Localization

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

63 code implementations CVPR 2018 Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

Image Captioning Visual Question Answering

Human Pose Forecasting via Deep Markov Models

no code implementations 24 Jul 2017 Sam Toyer, Anoop Cherian, Tengda Han, Stephen Gould

Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving.

Autonomous Driving Human Pose Forecasting

Incorporating Network Built-in Priors in Weakly-supervised Semantic Segmentation

no code implementations 6 Jun 2017 Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian, Mathieu Salzmann, Lars Petersson, Jose M. Alvarez, Stephen Gould

We then show how to obtain multi-class masks by the fusion of foreground/background ones with information extracted from a weakly-supervised localization network.

Object Recognition TAG +2

Discriminatively Learned Hierarchical Rank Pooling Networks

1 code implementation 30 May 2017 Basura Fernando, Stephen Gould

First, we present "discriminative rank pooling" in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem.

Activity Recognition Bilevel Optimization +1

Second-order Temporal Pooling for Action Recognition

no code implementations 23 Apr 2017 Anoop Cherian, Stephen Gould

We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space.

Action Recognition Temporal Action Localization

Generalized Rank Pooling for Activity Recognition

no code implementations CVPR 2017 Anoop Cherian, Basura Fernando, Mehrtash Harandi, Stephen Gould

Most popular deep models for action recognition split video sequences into short sub-sequences consisting of a few frames; frame-based features are then pooled for recognizing the activity.

Action Recognition Riemannian optimization +1

Action Representation Using Classifier Decision Boundaries

no code implementations 6 Apr 2017 Jue Wang, Anoop Cherian, Fatih Porikli, Stephen Gould

Applying multiple instance learning in an SVM setup, we use the parameters of this separating hyperplane as a descriptor for the video.

Action Recognition Multiple Instance Learning +1

Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

no code implementations 19 Jan 2017 Anoop Cherian, Piotr Koniusz, Stephen Gould

The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier.

Fine-grained Action Recognition Object Recognition +1

Unsupervised Human Action Detection by Action Matching

no code implementations 2 Dec 2016 Basura Fernando, Sareh Shirazi, Stephen Gould

On the MPII Cooking dataset we detect action segments with a precision of 21.6% and recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments.

Action Detection Activity Recognition

Self-Supervised Video Representation Learning With Odd-One-Out Networks

no code implementations CVPR 2017 Basura Fernando, Hakan Bilen, Efstratios Gavves, Stephen Gould

On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods.

Action Classification General Classification +5

SPICE: Semantic Propositional Image Caption Evaluation

11 code implementations 29 Jul 2016 Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould

There is considerable interest in the task of automatically generating image captions.

Image Captioning

Dynamic Image Networks for Action Recognition

1 code implementation CVPR 2016 Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, Stephen Gould

We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used.

Action Recognition Temporal Action Localization

Hierarchical Higher-Order Regression Forest Fields: An Application to 3D Indoor Scene Labelling

no code implementations ICCV 2015 Trung T. Pham, Ian Reid, Yasir Latif, Stephen Gould

Specifically, we relax the labelling problem to a regression, and generalize the higher-order associative $P^n$ Potts model to a new family of arbitrary higher-order models based on regression forests.

regression Semantic Segmentation

Deep CNN Ensemble with Data Augmentation for Object Detection

no code implementations 24 Jun 2015 Jian Guo, Stephen Gould

We report on the methods used in our recent DeepEnsembleCoco submission to the PASCAL VOC 2012 challenge, which achieves state-of-the-art performance on the object detection task.

Data Augmentation object-detection +1

An Exemplar-based CRF for Multi-instance Object Segmentation

no code implementations CVPR 2014 Xuming He, Stephen Gould

We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding.

Instance Segmentation Scene Understanding +1

Region-based Segmentation and Object Detection

no code implementations NeurIPS 2009 Stephen Gould, Tianshi Gao, Daphne Koller

Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other.

General Classification Image Segmentation +3

Cascaded Classification Models: Combining Models for Holistic Scene Understanding

no code implementations NeurIPS 2008 Geremy Heitz, Stephen Gould, Ashutosh Saxena, Daphne Koller

We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3D scene reconstruction.

3D Reconstruction 3D Scene Reconstruction +7

Learning Bounded Treewidth Bayesian Networks

no code implementations NeurIPS 2008 Gal Elidan, Stephen Gould

In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound.
