We empirically show that using (natural) gradient descent on the smooth manifold approximation instead of the singular space allows us to avoid the attractor behavior and therefore improve the convergence speed in learning.
In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces.
We propose a methodology to approximate conditional distributions in the elliptope of correlation matrices based on conditional generative adversarial networks.
Since the Jeffreys divergence between Gaussian mixture models is not available in closed-form, various techniques with pros and cons have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence.
Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average.
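For exponential-family endpoints the geometric path stays inside the family, with linearly interpolated natural parameters. A minimal sketch for two univariate Gaussians (helper names are ours, for illustration only):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def geometric_path_gaussian(mu0, s0, mu1, s1, t):
    """Normalized geometric average p0^(1-t) * p1^t of two Gaussians.

    Precisions and precision-weighted means interpolate linearly,
    so the normalized geometric average is again Gaussian.
    """
    prec = (1 - t) / s0**2 + t / s1**2
    mu_t = ((1 - t) * mu0 / s0**2 + t * mu1 / s1**2) / prec
    return mu_t, 1 / math.sqrt(prec)

# Check on a grid: normalize p0^(1-t) p1^t numerically and compare pointwise.
mu0, s0, mu1, s1, t = -1.0, 1.0, 2.0, 0.5, 0.3
xs = [-10 + 20 * i / 20000 for i in range(20001)]
unnorm = [gauss_pdf(x, mu0, s0) ** (1 - t) * gauss_pdf(x, mu1, s1) ** t for x in xs]
z = sum(unnorm) * (xs[1] - xs[0])          # Riemann-sum normalizer
mu_t, s_t = geometric_path_gaussian(mu0, s0, mu1, s1, t)
err = max(abs(u / z - gauss_pdf(x, mu_t, s_t)) for x, u in zip(xs, unnorm))
```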
We generalize the Jensen-Shannon divergence by considering a variational definition with respect to a generic mean, thereby extending the notion of Sibson's information radius.
Quantization, Information Theory
Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations.
We prove that the $f$-divergences between univariate Cauchy distributions are all symmetric, and can be expressed as strictly increasing scalar functions of the symmetric chi-squared divergence.
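A quick numerical check of this symmetry claim, using the closed-form Kullback-Leibler expression between univariate Cauchy densities known from the literature (function names here are illustrative):

```python
import math

def cauchy_pdf(x, loc, scale):
    return scale / (math.pi * (scale ** 2 + (x - loc) ** 2))

def kl_cauchy_numeric(l1, s1, l2, s2, n=100_000):
    """KL(p||q) by the midpoint rule after substituting x = tan(theta)."""
    h = math.pi / n
    total = 0.0
    for i in range(n):
        theta = -math.pi / 2 + (i + 0.5) * h
        x = math.tan(theta)
        p, q = cauchy_pdf(x, l1, s1), cauchy_pdf(x, l2, s2)
        total += p * math.log(p / q) * (1 + x * x) * h  # dx = (1 + tan^2) dtheta
    return total

def kl_cauchy_closed(l1, s1, l2, s2):
    # Closed form from the literature; note it is symmetric under swapping
    # (l1, s1) <-> (l2, s2), matching the symmetry statement.
    return math.log(((s1 + s2) ** 2 + (l1 - l2) ** 2) / (4 * s1 * s2))
```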
Information Theory, Statistics Theory
We study information projections with respect to statistical $f$-divergences between any two location-scale families.
Information Theory
The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling.
Annealed importance sampling (AIS) is the gold standard for estimating partition functions or marginal likelihoods, corresponding to importance sampling over a path of distributions between a tractable base and an unnormalized target.
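A minimal AIS sketch under simplifying assumptions (univariate N(0,1) base, geometric path, random-walk Metropolis transitions; all names and settings are illustrative, not any paper's implementation):

```python
import math
import random

def ais_estimate_z(target_logf, n_chains=2000, n_temps=30, mh_steps=5, step=1.0, seed=0):
    """Minimal AIS sketch: base N(0,1), geometric path to an unnormalized target.

    Returns an estimate of Z_target (the base normalizer sqrt(2*pi) is folded in).
    """
    rng = random.Random(seed)
    base_logf = lambda x: -0.5 * x * x                  # unnormalized N(0,1)
    betas = [t / n_temps for t in range(n_temps + 1)]
    weights = []
    for _ in range(n_chains):
        x = rng.gauss(0.0, 1.0)                         # exact draw from the base
        logw = 0.0
        for b_prev, b in zip(betas, betas[1:]):
            anneal = lambda y, b=b: (1 - b) * base_logf(y) + b * target_logf(y)
            # AIS weight increment: f_b(x) / f_{b_prev}(x) at the current state.
            logw += anneal(x) - ((1 - b_prev) * base_logf(x) + b_prev * target_logf(x))
            for _ in range(mh_steps):                   # random-walk Metropolis at beta b
                prop = x + rng.gauss(0.0, step)
                if math.log(rng.random()) < anneal(prop) - anneal(x):
                    x = prop
        weights.append(math.exp(logw))
    return math.sqrt(2 * math.pi) * sum(weights) / n_chains

# Unnormalized N(3,1) target: true Z = sqrt(2*pi)
z_hat = ais_estimate_z(lambda x: -0.5 * (x - 3.0) ** 2)
```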
We prove that the Voronoi diagrams of the Fisher-Rao distance, the chi-squared divergence, and the Kullback-Leibler divergences all coincide with a hyperbolic Voronoi diagram on the corresponding Cauchy location-scale parameters, and that the dual Cauchy hyperbolic Delaunay complexes are Fisher orthogonal to the Cauchy hyperbolic Voronoi diagrams.
It is well-known that the Bhattacharyya, Hellinger, Kullback-Leibler, $\alpha$-divergences, and Jeffreys' divergences between densities belonging to the same exponential family have generic closed-form formulas relying on the strictly convex and real-analytic cumulant function characterizing the exponential family.
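For instance, for univariate Gaussians the Kullback-Leibler divergence equals a Bregman divergence on the cumulant function of the natural parameters; a small sketch verifying this against the standard closed form (helper names are ours):

```python
import math

def F(theta1, theta2):
    """Cumulant (log-partition) function of the univariate Gaussian in natural
    parameters theta = (mu/sigma^2, -1/(2 sigma^2)), with theta2 < 0."""
    return -theta1 ** 2 / (4 * theta2) + 0.5 * math.log(-math.pi / theta2)

def grad_F(theta1, theta2):
    # Expectation parameters (E[x], E[x^2]) = (mu, mu^2 + sigma^2)
    mu = -theta1 / (2 * theta2)
    return mu, mu * mu - 1 / (2 * theta2)

def kl_via_bregman(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) = B_F(theta_q : theta_p)."""
    tp = (mu1 / s1**2, -1 / (2 * s1**2))
    tq = (mu2 / s2**2, -1 / (2 * s2**2))
    g1, g2 = grad_F(*tp)
    return F(*tq) - F(*tp) - (tq[0] - tp[0]) * g1 - (tq[1] - tp[1]) * g2

def kl_gauss_closed(mu1, s1, mu2, s2):
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
```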
Distances between probability distributions that take into account the geometry of their sample space, like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances, have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports.
This letter introduces an abstract learning problem called the "set embedding": The objective is to map sets into probability distributions so as to lose as little information as possible.
The dualistic structure of statistical manifolds in information geometry yields eight types of geodesic triangles passing through three given points, the triangle vertices.
We then define the strictly quasiconvex Bregman divergences as the limit case of scaled and skewed quasiconvex Jensen divergences, and report a simple closed-form formula which shows that these divergences are only pseudo-divergences at countably many inflection points of the generators.
The Jensen-Shannon divergence is a renowned bounded symmetrization of the unbounded Kullback-Leibler divergence, measuring the total Kullback-Leibler divergence to the average mixture distribution.
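A minimal illustration of these properties (symmetry, boundedness by log 2, with the bound attained on disjoint supports) for discrete distributions:

```python
import math

def kl(p, q):
    """Discrete Kullback-Leibler divergence (0 log 0 terms skipped)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.1, 0.9]
```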
We consider both finite and infinite power chi expansions of $f$-divergences derived from Taylor's expansions of smooth generators, and elaborate on cases where these expansions yield closed-form formulas, bounded approximations, or analytic divergence series expressions of $f$-divergences.
The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces.
We experimentally evaluate our new family of distances by quantifying the upper bounds of several jointly convex distances between statistical mixtures, and by proposing a novel efficient method to learn Gaussian mixture models (GMMs) by simplifying kernel density estimators with respect to our distance.
Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings.
We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error.
In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences.
The total variation distance is a core statistical distance between probability measures that satisfies the metric axioms, with values always in $[0, 1]$.
We propose a new generic type of stochastic neurons, called $q$-neurons, that considers activation functions based on Jackson's $q$-derivatives with stochastic parameters $q$.
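Jackson's q-derivative underlying these activations is D_q f(x) = (f(qx) - f(x)) / ((q - 1) x), which recovers the ordinary derivative as q tends to 1. A tiny sketch (names are illustrative):

```python
def jackson_q_derivative(f, x, q):
    """Jackson's q-derivative D_q f(x) = (f(qx) - f(x)) / ((q - 1) x),
    defined for x != 0 and q != 1; tends to f'(x) as q -> 1."""
    return (f(q * x) - f(x)) / ((q - 1) * x)

# For f(x) = x^2 the q-derivative is (q + 1) x exactly, tending to 2x as q -> 1.
d = jackson_q_derivative(lambda x: x * x, 2.0, 1.5)   # (1.5 + 1) * 2 = 5.0
```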
When equipping a statistical manifold with the KL divergence, the induced manifold structure is dually flat, and the KL divergence between distributions amounts to an equivalent Bregman divergence on their corresponding parameters.
We introduce a novel family of distances, called the chord gap divergences, that generalizes the Jensen divergences (also called the Burbea-Rao distances), and study its properties.
We demonstrate its efficiency on the task of generating melodies satisfying positional constraints in the style of the soprano parts of the J. S. Bach chorales.
These musical sequences belong to a given corpus (or style) and it is obvious that a good distance on musical sequences should take this information into account; being able to define a distance ex nihilo which could be applicable to all music styles seems implausible.
Information Retrieval, Sound
The information geometry induced by the Bregman generator set to the Shannon negentropy on this space yields a dually flat space called the mixture family manifold.
VAEs (Variational AutoEncoders) have proved to be powerful in the context of density modeling and have been used in a variety of contexts for creative purposes.
In Valiant's model of evolution, a class of representations is evolvable iff a polynomial-time process of random mutations guided by selection converges with high probability to a representation as $\epsilon$-close as desired to the optimal one, for any required $\epsilon>0$.
In the Hilbert simplex geometry, the distance is the non-separable Hilbert's metric distance which satisfies the property of information monotonicity with distance level set functions described by polytope boundaries.
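Hilbert's metric on the open probability simplex admits the simple expression d(p, q) = log(max_i p_i/q_i) - log(min_i p_i/q_i); a short sketch checking the metric behavior on sample points (helper name is ours):

```python
import math

def hilbert_simplex_distance(p, q):
    """Hilbert's metric distance on the open probability simplex:
    d(p, q) = log(max_i p_i/q_i) - log(min_i p_i/q_i)."""
    ratios = [pi / qi for pi, qi in zip(p, q)]
    return math.log(max(ratios)) - math.log(min(ratios))

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
r = [0.1, 0.6, 0.3]
```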
We review the state of the art of clustering financial time series and the study of their correlations alongside other interaction networks.
Comparative convexity is a generalization of convexity relying on abstract notions of means.
We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences.
We present a series of closed-form maximum entropy upper bounds for the differential entropy of a continuous univariate random variable and study the properties of that series.
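One member of such a series is the classical Gaussian maximum-entropy bound h(X) <= 0.5 log(2 pi e Var[X]); a small sketch checking it on an exponential distribution (names are illustrative, not the paper's code):

```python
import math

def gaussian_maxent_bound(variance):
    """Among all densities with a given variance, the Gaussian maximizes
    differential entropy, so h(X) <= 0.5 * log(2 * pi * e * Var[X])."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# Exponential(rate lam): Var = 1/lam^2 and h = 1 - log(lam).
lam = 1.0
bound = gaussian_maxent_bound(1 / lam ** 2)
true_entropy = 1 - math.log(lam)
```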
We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset.
We consider the supervised classification problem of machine learning in Cayley-Klein projective geometries: We show how to learn a curved Mahalanobis metric distance corresponding to either the hyperbolic geometry or the elliptic geometry using the Large Margin Nearest Neighbor (LMNN) framework.
We also present the first application of optimal transport to the problem of ecological inference, that is, the reconstruction of joint distributions from their marginals, a problem of large interest in the social sciences.
Information-theoretic measures such as the entropy, cross-entropy, and the Kullback-Leibler divergence between two mixture models are core primitives in many signal processing tasks.
This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables.
Matrix data sets are common nowadays, as in biomedical imaging, where the Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) modality produces data sets of 3D symmetric positive-definite matrices anchored at voxel positions, capturing the anisotropic diffusion properties of water molecules in biological tissues.
State-of-the-art methods via subspace clustering seek to solve the problem in two steps: First, an affinity matrix is built from data, with appearance features or motion patterns.
Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations.
We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels and a label-free term, and can further be expressed as sums of the loss.
For either the specific frameworks considered here or for the differential privacy setting, there are few to no prior results on the direct application of k-means++ and its approximation bounds; state-of-the-art contenders appear to be significantly more complex and/or display less favorable (approximation) properties.
This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas.
We present a generic dynamic programming method to compute the optimal clustering of $n$ scalar elements into $k$ pairwise disjoint intervals.
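A minimal sketch of such a dynamic program for the sum-of-squared-deviations cost (an illustrative implementation, not the paper's; it returns only the optimal total cost):

```python
def optimal_1d_clustering(xs, k):
    """Dynamic program for the optimal partition of scalars into k
    intervals minimizing the total within-cluster sum of squared deviations."""
    xs = sorted(xs)
    n = len(xs)
    # Prefix sums give O(1) interval costs.
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s[i + 1] = s[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):  # sum of squared deviations of xs[i..j], inclusive
        m = j - i + 1
        mean = (s[j + 1] - s[i]) / m
        return (s2[j + 1] - s2[i]) - m * mean * mean

    INF = float("inf")
    D = [[INF] * n for _ in range(k + 1)]   # D[c][j]: best cost for xs[0..j], c intervals
    for j in range(n):
        D[1][j] = cost(0, j)
    for c in range(2, k + 1):
        for j in range(c - 1, n):
            D[c][j] = min(D[c - 1][i - 1] + cost(i, j) for i in range(c - 1, j + 1))
    return D[k][n - 1]
```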
When no cost is incurred for correct classification and unit cost is charged for misclassification, Bayes' test reduces to the maximum a posteriori decision rule, and the Bayes risk simplifies to Bayes' error, the probability of error.
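A small numerical sketch: for two equiprobable Gaussian classes, the MAP rule's probability of error is obtained by integrating the pointwise minimum of the weighted class densities (helper names are ours):

```python
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_error_two_gaussians(mu1, mu2, sigma, prior1=0.5, lo=-20.0, hi=20.0, n=100_000):
    """Bayes' error of the MAP rule for two 1D Gaussian classes, by
    integrating min(prior1 * p1(x), prior2 * p2(x)) with the midpoint rule."""
    h = (hi - lo) / n
    return sum(
        min(prior1 * norm_pdf(lo + (i + 0.5) * h, mu1, sigma),
            (1 - prior1) * norm_pdf(lo + (i + 0.5) * h, mu2, sigma)) * h
        for i in range(n)
    )

# Equal priors, unit variance, means 0 and 2: Bayes' error = Phi(-1)
pe = bayes_error_two_gaussians(0.0, 2.0, 1.0)
```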
Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm.
Bartlett et al. (2006) proved that a basic condition for convex surrogates, classification calibration, ties together the minimization of the surrogate and classification risks, and left the algorithmic questions about the minimization of these surrogates as an important open problem.