Explaining Categorical Feature Interactions Using Graph Covariance and LLMs

no code implementations • 24 Jan 2025 • Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe

This graph covariance quantifies temporal changes in dependence structures within categorical data and is established as a consistent dependence measure under the Bernoulli distribution.


Principal Graph Encoder Embedding and Principal Community Detection

no code implementations • 24 Jan 2025 • Cencheng Shen, Yuexiao Dong, Carey E. Priebe, Jonathan Larson, Ha Trinh, Youngser Park

We prove that the population principal graph encoder embedding preserves the conditional density of the vertex labels and that the population community score successfully distinguishes the principal communities.

Community Detection

Efficient Graph Encoder Embedding for Large Sparse Graphs in Python

no code implementations • 6 Jun 2024 • Xihan Qin, Cencheng Shen

Graphs are a ubiquitous representation of data in various research fields, and graph embedding is a prevalent machine learning technique for capturing key features and generating fixed-size attributes.

Graph Embedding

Fast and Scalable Multi-Kernel Encoder Classifier

no code implementations • 4 Jun 2024 • Cencheng Shen

This paper introduces a new kernel-based classifier by viewing kernel matrices as generalized graphs and leveraging recent progress in graph embedding techniques.

Graph Embedding

Encoder Embedding for General Graph and Node Classification

no code implementations • 24 May 2024 • Cencheng Shen

Graph encoder embedding, a recent technique for graph data, offers speed and scalability in producing vertex-level representations from binary graphs.

Classification, Node Classification

Refined Graph Encoder Embedding via Self-Training and Latent Community Recovery

no code implementations • 21 May 2024 • Cencheng Shen, Jonathan Larson, Ha Trinh, Carey E. Priebe

We provide the theoretical rationale for the refinement procedure, demonstrating how and why our proposed method can effectively identify useful hidden communities via stochastic block models, and how the refinement method leads to improved vertex embedding and better decision boundaries for subsequent vertex classification.

Edge-Parallel Graph Encoder Embedding

1 code implementation • 6 Feb 2024 • Ariel Lubonja, Cencheng Shen, Carey Priebe, Randal Burns

New algorithms for embedding graphs have reduced the asymptotic complexity of finding low-dimensional representations.

Discovering Communication Pattern Shifts in Large-Scale Labeled Networks using Encoder Embedding and Vertex Dynamics

1 code implementation • 3 May 2023 • Cencheng Shen, Jonathan Larson, Ha Trinh, Xihan Qin, Youngser Park, Carey E. Priebe

Analyzing large-scale time-series network data, such as social media and email communications, poses a significant challenge in understanding social dynamics, detecting anomalies, and predicting trends.

Time Series

Synergistic Graph Fusion via Encoder Embedding

1 code implementation • 31 Mar 2023 • Cencheng Shen, Carey E. Priebe, Jonathan Larson, Ha Trinh

In this paper, we introduce a method called graph fusion embedding, designed for multi-graph embedding with shared vertex sets.

Classification, Graph Embedding, +1

Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection

1 code implementation • 18 Jan 2023 • Cencheng Shen, Youngser Park, Carey E. Priebe

In this paper, we introduce a novel and computationally efficient method for vertex embedding, community detection, and community size determination.

Community Detection

One-Hot Graph Encoder Embedding

3 code implementations • 27 Sep 2021 • Cencheng Shen, Qizhe Wang, Carey E. Priebe

In this paper, we propose a lightning-fast graph embedding method called one-hot graph encoder embedding.

Clustering, Graph Embedding, +1
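
To make the idea of an encoder-style embedding concrete, here is a minimal sketch in plain Python. It assumes one common formulation: the embedding is the adjacency matrix multiplied by a one-hot label matrix whose columns are scaled by class size. That formula, the function name `one_hot_encoder_embedding`, and its parameters are illustrative assumptions, not details taken from this listing.

```python
def one_hot_encoder_embedding(adjacency, labels, num_classes):
    """Embed each vertex as its class-size-normalized edge weight into each class.

    Assumed formulation (not stated in the listing): Z = A @ W, where
    W[j][k] = 1 / n_k if vertex j has label k, and 0 otherwise.
    """
    n = len(adjacency)

    # Count how many vertices carry each label.
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1

    # Accumulate normalized edge weights per class in a single pass over edges.
    embedding = [[0.0] * num_classes for _ in range(n)]
    for i in range(n):
        for j in range(n):
            weight = adjacency[i][j]
            if weight:
                k = labels[j]
                embedding[i][k] += weight / counts[k]
    return embedding


# Two disconnected triangles, one per class.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = [0, 0, 0, 1, 1, 1]
embedding = one_hot_encoder_embedding(adjacency, labels, num_classes=2)
```

Under this assumption, each vertex gets one coordinate per class, and the cost is a single pass over the adjacency entries, with no matrix factorization.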

High-Dimensional Independence Testing via Maximum and Average Distance Correlations

no code implementations • 4 Jan 2020 • Cencheng Shen, Yuexiao Dong

This paper introduces and investigates the utilization of maximum and average distance correlations for multivariate independence testing.

Vocal Bursts Intensity Prediction

The Chi-Square Test of Distance Correlation

1 code implementation • 27 Dec 2019 • Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and metric choice, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amounts of data.
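
The permutation procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the standard biased sample distance correlation for one-dimensional data with absolute-difference distances, and the function names, the default of 200 permutations, and the fixed seed are all hypothetical choices.

```python
import math
import random


def _centered_distances(values):
    """Return the double-centered pairwise distance matrix of a 1-D sample."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so row means equal column means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Biased sample distance correlation of two equal-length 1-D samples."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov_sq = sum(a[i][j] * b[i][j] for i, j in pairs) / n**2
    dvar_x_sq = sum(a[i][j] ** 2 for i, j in pairs) / n**2
    dvar_y_sq = sum(b[i][j] ** 2 for i, j in pairs) / n**2
    denom = math.sqrt(dvar_x_sq * dvar_y_sq)
    return math.sqrt(max(dcov_sq, 0.0) / denom) if denom > 0 else 0.0


def permutation_pvalue(x, y, n_permutations=200, seed=0):
    """Estimate the null by shuffling y; every shuffle costs O(n^2) work."""
    rng = random.Random(seed)
    observed = distance_correlation(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)
        if distance_correlation(x, y_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


x = [i / 10 for i in range(30)]
y = [2 * v + 1 for v in x]
statistic, p_value = permutation_pvalue(x, y)
```

Each permutation recomputes a full pairwise-distance statistic, so the total cost grows as the number of permutations times n squared. That repeated quadratic work is the expense the sentence above refers to.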


Universally Consistent K-Sample Tests via Dependence Measures

no code implementations • 20 Oct 2019 • Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, Joshua T. Vogelstein

The K-sample testing problem involves determining whether K groups of data points are each drawn from the same distribution.

Two-sample testing

Independence Testing for Temporal Data

no code implementations • 18 Aug 2019 • Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T. Vogelstein

While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in an invalid test.

Time Series, Time Series Analysis, +1

hyppo: A Multivariate Hypothesis Testing Python Package

4 code implementations • 3 Jul 2019 • Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing.

Two-sample testing

Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities

1 code implementation • 30 Jun 2019 • Ronan Perry, Ronak Mehta, Richard Guo, Eva Yezerets, Jesús Arroyo, Mike Powell, Hayden Helm, Cencheng Shen, Joshua T. Vogelstein

Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty.

Sparse Representation Classification via Screening for Graphs

no code implementations • 4 Jun 2019 • Cencheng Shen, Li Chen, Yuexiao Dong, Carey Priebe

The sparse representation classifier (SRC) is shown to work well for image recognition problems that satisfy a subspace assumption.

Classification, Classification Consistency, +1

The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

no code implementations • 14 Jun 2018 • Cencheng Shen, Joshua T. Vogelstein

Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community.

Two-sample testing

From Distance Correlation to Multiscale Graph Correlation

1 code implementation • 26 Oct 2017 • Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age.

Scientific Discovery

Discovering and Deciphering Relationships Across Disparate Data Modalities

4 code implementations • 16 Sep 2016 • Joshua T. Vogelstein, Eric Bridgeford, Qing Wang, Carey E. Priebe, Mauro Maggioni, Cencheng Shen

Understanding the relationships between different properties of data, such as whether a connectome or genome has information about disease status, is becoming increasingly important in modern biological datasets.

Computational Efficiency

Sparse Projection Oblique Randomer Forests

2 code implementations • 10 Jun 2015 • Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency.

Computational Efficiency

Sparse Representation Classification Beyond L1 Minimization and the Subspace Assumption

no code implementations • 4 Feb 2015 • Cencheng Shen, Li Chen, Yuexiao Dong, Carey E. Priebe

The results are demonstrated via simulations and real data experiments, where the new algorithm achieves comparable numerical performance and is significantly faster.

Classification, Classification Consistency, +1

Manifold Matching using Shortest-Path Distance and Joint Neighborhood Selection

1 code implementation • 12 Dec 2014 • Cencheng Shen, Joshua T. Vogelstein, Carey E. Priebe

Then the shortest-path distance within each modality is calculated from the joint neighborhood graph, followed by embedding into and matching in a common low-dimensional Euclidean space.

Robust Vertex Classification

no code implementations • 23 Nov 2013 • Li Chen, Cencheng Shen, Joshua Vogelstein, Carey Priebe

For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of and critically depends on the model dimension.

Classification, General Classification, +1

Generalized Canonical Correlation Analysis for Classification

no code implementations • 30 Apr 2013 • Cencheng Shen, Ming Sun, Minh Tang, Carey E. Priebe

For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected datasets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets.

Classification, General Classification

On the Incommensurability Phenomenon

no code implementations • 9 Jan 2013 • Donniell E. Fishkind, Cencheng Shen, Youngser Park, Carey E. Priebe

Suppose that two large, multi-dimensional data sets are each noisy measurements of the same underlying random process, and principal component analysis is performed separately on the data sets to reduce their dimensionality.

