Adversarial attacks on structured prediction models face several challenges, such as the difficulty of perturbing discrete words, the need to preserve sentence quality, and the sensitivity of structured outputs to small perturbations.
Our goal is to seamlessly bridge visual scene graphs and linguistic dependency trees.
In this paper, to model the uncertainty of visual saliency, we study saliency prediction from the perspective of generative models: we learn a conditional probability distribution over saliency maps given an input image and treat saliency prediction as sampling from the learned distribution.
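To make this concrete, here is a hedged symbolic sketch of the setup (the notation is assumed for illustration, not taken from the paper): a saliency map S is a sample from the learned conditional distribution given an image I, and drawing several samples exposes the predictive uncertainty.

```latex
S^{(k)} \sim p_\theta(S \mid I), \quad k = 1, \dots, K,
\qquad
\hat{S} = \frac{1}{K} \sum_{k=1}^{K} S^{(k)}
```

The sample mean \(\hat{S}\) serves as a point prediction, while the spread of the \(S^{(k)}\) quantifies the uncertainty of the prediction.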
Exploiting the internal statistics of a single natural image has long been recognized as a significant research paradigm, in which the goal is to learn the distribution of patches within the image without relying on external training data.
By aggregating different beliefs and true world states, our model essentially forms "five minds" during the interactions between two agents.
This paper studies the unsupervised cross-domain translation problem by proposing a generative framework in which the probability distribution of each domain is represented by a generative cooperative network consisting of an energy-based model and a latent variable model.
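As a hedged sketch of what such a cooperative pair looks like in symbols (standard EBM and latent-variable-model notation, not necessarily the paper's exact formulation): each domain is modeled jointly by an energy-based density and a generator that maps latent Gaussian noise to data.

```latex
p_\theta(x) = \frac{1}{Z(\theta)}\, \exp\!\big(f_\theta(x)\big),
\qquad
x = g_\alpha(z) + \epsilon, \quad z \sim \mathcal{N}(0, I), \;\; \epsilon \sim \mathcal{N}(0, \sigma^2 I)
```

In cooperative training, the latent variable model typically proposes initial samples that MCMC guided by \(f_\theta\) then refines, and both models are updated from the refined samples.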
In this paper, we propose to learn a variational auto-encoder (VAE) to initialize finite-step MCMC, such as Langevin dynamics derived from the energy function, for efficient amortized sampling from the EBM.
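A minimal PyTorch sketch of this idea, assuming a scalar-output energy network `energy_net` and a VAE decoder `decode` with latent dimension `latent_dim`; all names, step counts, and step sizes here are illustrative assumptions, not the paper's settings.

```python
import torch

def amortized_langevin(energy_net, decode, latent_dim, n, steps=20, delta=0.01):
    # Initialize the chain from the VAE decoder instead of pure noise,
    # so only a few MCMC steps are needed (amortized sampling).
    z = torch.randn(n, latent_dim)
    x = decode(z).detach()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_net(x).sum(), x)[0]
        # Langevin update: descend the energy gradient, then inject noise.
        x = (x - 0.5 * delta**2 * grad + delta * torch.randn_like(x)).detach()
    return x
```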
3D data, which contains rich geometric information about objects and scenes, is valuable for understanding the 3D physical world.
Aiming to understand how human (false-)belief, a core socio-cognitive ability, affects human interactions with robots, this paper proposes a graphical model that unifies the representation of object states, robot knowledge, and human (false-)beliefs.
We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network.
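To illustrate the permutation invariance, here is a hypothetical PyTorch sketch in the spirit of that description: a per-point MLP followed by symmetric max-pooling over points, so that reordering the points cannot change the energy. Layer widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PointSetEnergy(nn.Module):
    """Energy function over an unordered point set of shape (B, N, 3)."""
    def __init__(self, hidden=128):
        super().__init__()
        # Shared per-point encoder: applied identically to every point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Head mapping the pooled set feature to a scalar energy.
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (B, N, 3)
        h = self.point_mlp(x)              # (B, N, hidden), point-wise
        g, _ = h.max(dim=1)                # symmetric pooling over points
        return self.head(g).squeeze(-1)    # (B,) scalar energy per set
```

Because both the shared MLP (applied point-wise) and the max-pool are symmetric in the point order, this network assigns the same energy to any permutation of its input.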
To represent motion explicitly, it is natural to base the model on the motions, i.e., the displacement fields, of the pixels.
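For instance, given a per-pixel displacement field, a frame can be warped by sampling each output pixel from its displaced source location; the following PyTorch sketch (with assumed shapes and bilinear sampling) shows one common way to do this.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp a frame (B, C, H, W) by a per-pixel displacement field (B, 2, H, W)."""
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                      # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample, then sample.
    coords_x = 2 * coords[:, 0] / (W - 1) - 1
    coords_y = 2 * coords[:, 1] / (H - 1) - 1
    grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```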
This paper studies the problem of learning the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different domains, e.g., the output is a photo image and the input is a sketch image.
The non-linear transformation of this transition model can be parametrized by a feedforward neural network.
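As a concrete sketch (with assumed dimensions, and a residual form that is one common choice rather than necessarily this paper's), such a feedforward parametrization might look like:

```python
import torch.nn as nn

class TransitionModel(nn.Module):
    """Feedforward parametrization of a non-linear state transition."""
    def __init__(self, state_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state):
        # Residual form: predict the change to the current state.
        return state + self.net(state)
```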
This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns.
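A minimal sketch of what such a volumetric energy function could look like in PyTorch, assuming voxel grids of shape (B, 1, D, H, W); the channel widths and kernel sizes are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

class VolumetricEnergy(nn.Module):
    """Deep convolutional energy function over voxelized 3D shapes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, voxels):                # voxels: (B, 1, D, H, W)
        return self.net(voxels).squeeze(-1)   # scalar energy per shape
```

Given such an energy function, shapes can be synthesized by running MCMC such as Langevin dynamics on the voxel grid, along the lines of the sampling sketch above.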