Data Summarization

33 papers with code • 0 benchmarks • 2 datasets

Data summarization is a central problem in machine learning: the goal is to compute a small summary that represents a much larger data set.

Source: How to Solve Fair k-Center in Massive Data Models

Most implemented papers

Fair k-Center Clustering for Data Summarization

matthklein/fair_k_center_clustering 24 Jan 2019

In data summarization we want to choose $k$ prototypes in order to summarize a data set.
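To make the idea of choosing $k$ prototypes concrete, here is a minimal sketch of the classic greedy farthest-first heuristic for the unconstrained k-center problem. It is not the fair variant studied in the paper and is not taken from the linked repository. The function name, the toy points, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math

def greedy_k_center(points, k, first=0):
    """Farthest-first traversal: return indices of k prototypes and the covering radius.

    Illustrative helper (hypothetical name); `points` is a list of coordinate tuples.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [first]
    # dist_to_center[i] = distance from point i to its nearest chosen center
    dist_to_center = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # The next prototype is the point farthest from all current prototypes.
        nxt = max(range(len(points)), key=dist_to_center.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            dist_to_center[i] = min(dist_to_center[i], math.dist(p, points[nxt]))
    return centers, max(dist_to_center)

# Toy data: two tight pairs and one isolated point (assumed for the example).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers, radius = greedy_k_center(points, k=3)
print(centers, radius)
```

On this toy input the heuristic places one prototype in each of the three groups, so every point ends up close to some prototype. A fairness-aware method would additionally constrain how many prototypes come from each demographic group, which this sketch does not attempt.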

apricot: Submodular selection for data summarization in Python

jmschrei/apricot 8 Jun 2019

This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets.
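As a rough illustration of submodular selection, the sketch below greedily maximizes a facility-location function in plain Python. It does not use apricot's API. The function name, the similarity `exp(-distance)`, and the toy points are assumptions chosen for the example.

```python
import math

def facility_location_greedy(points, k):
    """Greedily pick k items maximizing sum_i max_{j in S} sim(i, j).

    Illustrative helper (hypothetical name), not part of any library.
    """
    n = len(points)
    # Assumed similarity: exp(-Euclidean distance), so nearby points score near 1.
    sim = [[math.exp(-math.dist(a, b)) for b in points] for a in points]
    selected = []
    best = [0.0] * n  # best[i] = highest similarity of point i to any selected item
    gains = []
    for _ in range(min(k, n)):
        def gain(j):
            # Marginal gain of adding j: total improvement in coverage over all points.
            return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
        candidates = [j for j in range(n) if j not in selected]
        j_star = max(candidates, key=gain)
        gains.append(gain(j_star))
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]
    return selected, gains

# Toy data: two tight pairs and one isolated point (assumed for the example).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
selected, gains = facility_location_greedy(points, k=3)
print(selected, gains)
```

The recorded marginal gains never increase from one step to the next, which is the diminishing-returns property that defines submodularity. This naive version recomputes every gain at each step; practical libraries use faster strategies such as lazy evaluation.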

Fast and Accurate Least-Mean-Squares Solvers

ibramjub/Fast-and-Accurate-Least-Mean-Squares-Solvers NeurIPS 2019

Least-mean squares (LMS) solvers such as Linear / Ridge / Lasso-Regression, SVD and Elastic-Net not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as decision trees and matrix factorizations.

Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?

easeml/datascope CVPR 2021

Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning, and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaptation.

Streaming Submodular Maximization under a $k$-Set System Constraint

ehsankazemi/streamingkextendible 9 Feb 2020

In this paper, we propose a novel framework that converts streaming algorithms for monotone submodular maximization into streaming algorithms for non-monotone submodular maximization.

CO-Optimal Transport

PythonOT/COOT NeurIPS 2020

Optimal transport (OT) is a powerful geometric and probabilistic tool for finding correspondences and measuring similarity between two distributions.

Deuteros 2.0: Peptide-level significance testing of data from hydrogen deuterium exchange mass spectrometry

andymlau/Deuteros_2.0 17 May 2020

There are currently very few software packages available that offer quick and informative comparison of HDX-MS datasets, and even fewer that offer statistical analysis and advanced visualization.

Understanding collections of related datasets using dependent MMD coresets

sinead/dmmd 24 Jun 2020

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets.

$β$-Cores: Robust Large-Scale Bayesian Data Summarization in the Presence of Outliers

dionman/beta-cores 31 Aug 2020

Modern machine learning applications should be able to address the intrinsic challenges arising over inference on massive real-world datasets, including scalability and robustness to outliers.

Fair and Representative Subset Selection from Data Streams

FraFabbri/fair-subset-datastream 9 Oct 2020

We study the problem of extracting a small subset of representative items from a large data stream.