Stochastic neural networks with discrete random variables are an important class of models for their expressiveness and interpretability.
Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning.
Conventional Computed Tomography (CT) methods require large numbers of noise-free projections for accurate density reconstructions, limiting their applicability to the more complex class of Cone Beam Geometry CT (CBCT) reconstruction.
Detecting and quantifying the presence of symmetries in datasets is useful for model selection, generative modeling, and data analysis, amongst others.
Identifying the causal variables of an environment and how to intervene on them is of core value in applications such as robotics and embodied AI.
Dynamical systems with complex behaviours, e.g., immune system cells interacting with a pathogen, are commonly modelled by splitting the behaviour into different regimes, or modes, each with simpler dynamics, and then learning the switching behaviour from one mode to another.
Click-based interactive segmentation aims to generate target masks via human clicking, which facilitates efficient pixel-level annotation and image editing.
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos.
Neural ordinary differential equations (NODEs) have proven useful for learning non-linear dynamics of arbitrary trajectories.
Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to consider the length, resolution, and dimensionality of the input data.
However, the appearance variations between objects from the same category could be extremely large, leading to unreliable feature matching and query mask prediction.
Ranked #39 on Few-Shot Semantic Segmentation on PASCAL-5i (1-Shot)
To solve the graph cuts, our solution relies on an efficient, scalable, and differentiable quadratic programming approximation.
Our results suggest that reservoir computing is a promising candidate framework for the continual learning of dynamical systems.
To address this issue, we propose iCITRIS, a causal representation learning method that allows for instantaneous effects in intervened temporal sequences when intervention targets can be observed, e.g., as actions of an agent.
The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework.
The effectiveness of our method for both near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control.
Ranked #2 on Anomaly Detection on One-class CIFAR-10 (using extra training data)
Specifically, in DPCN, a dynamic convolution module (DCM) is first proposed to generate dynamic kernels from the support foreground; information interaction is then achieved by convolution operations over query features using these kernels.
Ranked #31 on Few-Shot Semantic Segmentation on PASCAL-5i (1-Shot)
To tackle this issue, we propose a Neighbor Transformer Network, or NFormer, which explicitly models interactions across all input images, thus suppressing outlier features and leading to more robust representations overall.
In this work, we address two key limitations of such representations: their failure to capture fine local 3D geometric details, and their inability to learn from and generalize to shapes with unseen 3D transformations.
By extensive experiments on a wide range of architectures, including the most efficient ones, we demonstrate that delta distillation sets a new state of the art in terms of accuracy vs. efficiency trade-off for semantic segmentation and object detection in videos.
Ranked #2 on Video Semantic Segmentation on Cityscapes val
We consider multi-agent reinforcement learning (MARL) for cooperative communication and coordination tasks.
Understanding the latent causal factors of a dynamical system from visual observations is considered a crucial step towards agents reasoning in complex environments.
Modelling interactions is critical in learning complex dynamical systems, namely systems of interacting objects with highly non-linear and time-dependent behaviour.
Stability regularization is a method to make the output of continuous functions of Gaussian random variables close to discrete, that is, binary or categorical, without the need for significant manual tuning.
We present WeakSTIL, an interpretable two-stage weak label deep learning pipeline for scoring the percentage of stromal tumor infiltrating lymphocytes (sTIL%) in H&E-stained whole-slide images (WSIs) of breast cancer tissue.
Learning the structure of a causal graphical model using both observational and interventional data is a fundamental problem in many scientific fields.
For MSI prediction in a tumor-annotated and color-normalized subset of TCGA-CRC (n=360 patients), contrastive self-supervised learning improves the tile supervision baseline from 0.77 to 0.87 AUROC, on par with our proposed DeepSMILE method.
Federated learning (FL) has emerged as the predominant approach for collaborative training of neural network models across multiple users, without the need to gather the data at a central location.
In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives.
In experiments, we demonstrate the improved sample efficiency of GP BO using FM kernels (BO-FM). On synthetic problems and hyperparameter optimization problems, BO-FM outperforms competitors consistently.
Variational autoencoders with deep hierarchies of stochastic layers have been known to suffer from the problem of posterior collapse, where the top layers fall back to the prior and become independent of input.
We further show that this change in orientation can be used to impose an additional motion constraint in Siamese tracking by restricting the change in orientation between two consecutive frames.
Classically, visual object tracking involves following a target object throughout a given video, and it provides the motion trajectory of the object.
Our experiments show that SSC leads to an important increase in interaction recognition performance, while using far fewer parameters.
In this paper, we define data augmentation between point clouds as a shortest path linear interpolation.
Ranked #3 on 3D Point Cloud Data Augmentation on ModelNet40
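The interpolation between point clouds mentioned above can be illustrated with a minimal sketch. It assumes that "shortest path" means a one-to-one matching of points that minimizes the total Euclidean distance, followed by a linear blend of each matched pair; the snippet does not spell this out, so that reading is an assumption. The function names `optimal_assignment` and `interpolate_clouds` and the mixing weight `lam` are hypothetical. The brute-force search over permutations is only practical for a handful of points; a real implementation would use a dedicated assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
import math


def optimal_assignment(a, b):
    """Return perm such that a[i] is matched to b[perm[i]] with minimal total distance.

    Brute force over all permutations: illustration only, feasible for tiny clouds.
    """
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(a))):
        cost = sum(math.dist(a[i], b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


def interpolate_clouds(a, b, lam):
    """Blend two equally sized point clouds along the matched pairs.

    lam = 0 returns cloud a; lam = 1 returns cloud b (reordered by the matching).
    """
    perm = optimal_assignment(a, b)
    return [
        tuple((1 - lam) * x + lam * y for x, y in zip(a[i], b[j]))
        for i, j in enumerate(perm)
    ]


# Two toy clouds with two 3D points each (made-up values).
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
mixed = interpolate_clouds(cloud_a, cloud_b, 0.5)
print(mixed)
```

In this toy case each point of `cloud_a` is matched to the point of `cloud_b` directly above it, so the halfway cloud lies midway between the two along the y-axis.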
Specifically, we present structured dropout to mimic the change in latent codes under occlusion.
We study the three properties of PIC and demonstrate its effectiveness in recognizing the long-range activities of Charades, Breakfast, and MultiThumos.
Tracking and segmentation of biological cells in video sequences is a challenging problem, especially due to the similarity of the cells and high levels of inherent noise.
The Critic network is environmentally aware to prune trajectories that are in collision or are in violation of the environment.
Learning suitable latent representations for observed, high-dimensional data is an important research topic underlying many recent advances in machine learning.
We propose to model the effective receptive field of 2D convolution based on the scale and locality from the 3D neighborhood.
Adversarial training has recently been employed for realizing structured semantic segmentation, in which the aim is to preserve higher-level scene structural consistencies in dense predictions.
In response to this, Scellier & Bengio (2017) proposed Equilibrium Propagation, a method for gradient-based training of neural networks which uses only local learning rules and, crucially, does not rely on neurons having a mechanism for back-propagating an error gradient.
We observe that many continuous output problems in computer vision are naturally contained in closed geometrical manifolds, like the Euler angles in viewpoint estimation or the normals in surface normal estimation.
On this combinatorial graph, we propose an ARD diffusion kernel with which the GP is able to model high-order interactions between variables leading to better performance.
This paper focuses on the temporal aspect for recognizing human activities in videos, an important visual cue that has long been undervalued.
Ranked #27 on Action Classification on Charades
Neural network quantization has become an important research area due to its great impact on deployment of large models on resource constrained devices.
To demonstrate the effectiveness of our proposed framework, we modify associative domain adaptation to work well on source and target data batches with unequal class distributions.
A major challenge in Bayesian Optimization is the boundary issue (Swersky, 2017) where an algorithm spends too many evaluations near the boundary of its search space.
We introduce the OxUvA dataset and benchmark for evaluating single-object tracking algorithms.
In this work we propose a blackbox intervention method for visual dialog models, with the aim of assessing the contribution of individual linguistic or visual components.
This paper strives to track a target object in a video.
Ranked #17 on Referring Expression Segmentation on J-HMDB
We present a variant on backpropagation for neural networks in which computation scales with the rate of change of the data, not the rate at which we process the data.
This is a powerful idea because it allows any video to be converted to an image, so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos.
On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods.
Ranked #46 on Self-Supervised Action Recognition on UCF101
We present a new architecture for end-to-end sequence learning of actions in video, which we call VideoLSTM.
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used.
Ranked #60 on Action Recognition on HMDB-51
In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos.
Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated.
We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation.
We present a supervised learning-to-rank algorithm that effectively orders images by exploiting the structure in image sequences.
Undoing the image formation process and therefore decomposing appearance into its intrinsic properties is a challenging task due to the under-constrained nature of this inverse problem.
How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data?
We postulate that a function capable of ordering the frames of a video temporally (based on the appearance) captures well the evolution of the appearance within the video.
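One simple way to illustrate a function that orders frames temporally is a linear scoring function fitted so that later frames receive higher scores; its parameter vector then summarizes how appearance evolves over the video. The sketch below is an assumption-laden stand-in: the snippet does not specify the model or the objective, so the linear form, the least-squares fit against the frame index, and the name `fit_ordering_function` are all hypothetical choices, and the per-frame features are made up.

```python
import numpy as np


def fit_ordering_function(frames):
    """Fit w so that frames @ w approximates the frame index 1..T (least squares).

    frames: array of shape (T, D), one feature vector per frame, in temporal order.
    Returns w of shape (D,), usable as a fixed-length descriptor of the video.
    """
    frames = np.asarray(frames, dtype=float)
    order = np.arange(1, frames.shape[0] + 1, dtype=float)
    w, *_ = np.linalg.lstsq(frames, order, rcond=None)
    return w


# Toy "video": 5 frames whose 3-D features drift steadily in one direction.
base = np.array([1.0, 0.0, 2.0])
drift = np.array([0.5, 1.0, 0.0])
frames = np.stack([base + t * drift for t in range(5)])

w = fit_ordering_function(frames)
scores = frames @ w  # one score per frame; later frames should score higher
print(w, scores)
```

Because `w` has the same dimensionality as a single frame's features regardless of video length, it could be passed to any standard classifier; whether a least-squares fit is an adequate substitute for a proper ranking objective is left open here.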
In this paper we aim for zero-shot classification, that is, visual recognition of an unseen class by using knowledge transfer from known classes.