A number of universally consistent dependence measures have recently been proposed for testing independence, such as distance correlation, kernel correlation, and multiscale graph correlation.
One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and metric choice, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amounts of data.
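To make the cost concrete, the following is a minimal sketch of a permutation test built on the biased sample distance covariance with Euclidean distances. The function names (`centered_distances`, `dcorr_sq`, `permutation_pvalue`) and the defaults (`n_perm=200`, `seed=0`) are illustrative assumptions, not taken from any particular library. Each permutation re-evaluates an O(n^2) statistic, which is why the procedure becomes expensive as the sample size grows.

```python
import numpy as np


def centered_distances(x):
    """Pairwise Euclidean distance matrix of x, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    row = d.mean(axis=1, keepdims=True)
    col = d.mean(axis=0, keepdims=True)
    return d - row - col + d.mean()


def dcorr_sq(x, y):
    """Biased sample estimate of the squared distance correlation."""
    a = centered_distances(x)
    b = centered_distances(y)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else float((a * b).mean() / denom)


def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Estimate the null by permuting the sample order of y.

    Permuting y is the same as permuting rows and columns of its centered
    distance matrix, so the matrices are computed once. The normalizing
    term is permutation invariant, so comparing distance covariances
    gives the same p-value as comparing distance correlations.
    """
    rng = np.random.default_rng(seed)
    a = centered_distances(x)
    b = centered_distances(y)
    observed = (a * b).mean()
    n = a.shape[0]
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        if (a * b[np.ix_(p, p)]).mean() >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = x ** 2 + 0.1 * rng.normal(size=100)  # nonlinear dependence
    print(dcorr_sq(x, y), permutation_pvalue(x, y))
```

The smallest p-value this sketch can report is 1 / (n_perm + 1), so resolving small p-values requires many permutations, which compounds the cost.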
The $k$-sample testing problem tests whether or not $k$ groups of data points are sampled from the same distribution.
The proposed nonparametric procedure is valid and consistent, building upon prior work by characterizing the geometry of the relationship, estimating the time lag at which dependence is maximized, avoiding the need for multiple testing, and exhibiting superior power in high-dimensional, low sample size, nonlinear settings.
We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing.
Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty.
The sparse representation classifier (SRC) is shown to work well for image recognition problems that satisfy a subspace assumption.
It has been demonstrated that these proximity matrices can be thought of as kernels, connecting the decision forest literature to the extensive kernel machine literature.
Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community.
Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age.
Understanding the relationships between different properties of data, such as whether a connectome or genome has information about disease status, is becoming increasingly important in modern biological datasets.
2 code implementations • 10 Jun 2015 • Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein
Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency.
The results are demonstrated via simulations and real data experiments, where the new algorithm achieves comparable numerical performance while running significantly faster.
Then the shortest-path distance within each modality is calculated from the joint neighborhood graph, followed by embedding into and matching in a common low-dimensional Euclidean space.
For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of and critically depends on the model dimension.
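As a rough illustration of adjacency spectral embedding, the sketch below samples a two-block stochastic blockmodel and embeds its vertices using the top `d` eigenpairs of the adjacency matrix. The helper names (`sample_sbm`, `adjacency_spectral_embedding`), the block probabilities, and the use of NumPy alone are assumptions made for the example. Note that the embedding dimension `d` must be supplied by the caller, which is exactly the dependence on model dimension described above.

```python
import numpy as np


def sample_sbm(block_sizes, block_probs, seed=0):
    """Sample an undirected, loop-free graph from a stochastic blockmodel."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # Edge probability for every vertex pair, looked up by block membership.
    probs = np.asarray(block_probs, dtype=float)[labels][:, labels]
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    adj = (upper | upper.T).astype(float)
    return adj, labels


def adjacency_spectral_embedding(adj, d):
    """Embed vertices with the d largest-magnitude eigenpairs of adj."""
    vals, vecs = np.linalg.eigh(adj)
    top = np.argsort(np.abs(vals))[::-1][:d]
    # Scale each eigenvector by the square root of its eigenvalue magnitude.
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


if __name__ == "__main__":
    adj, labels = sample_sbm([100, 100], [[0.5, 0.1], [0.1, 0.5]])
    x_hat = adjacency_spectral_embedding(adj, d=2)
    print(x_hat.shape)
```

Any standard classifier can then be trained on the rows of the embedding; the choice of classifier is left open here.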
For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected datasets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets.
Suppose that two large, multi-dimensional data sets are each noisy measurements of the same underlying random process, and principal component analysis is performed separately on the data sets to reduce their dimensionality.