Data Valuation

29 papers with code • 0 benchmarks • 0 datasets

Data valuation in machine learning seeks to determine the worth of data, or of whole datasets, for downstream tasks. Some methods are task-agnostic and value datasets as a whole, mostly to support decision making in data markets; these typically rely on distributional distances between samples. More often, methods measure how individual points affect the performance of a specific machine learning model: they assign to each element of a training set a scalar that reflects its contribution to the final performance of a model trained on that set. Some notions of value depend on a specific model of interest, while others are model-agnostic.
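One of the simplest ways to assign such a per-point scalar is leave-one-out valuation: the value of a point is the drop in validation performance when that point is removed from the training set. The following is a minimal sketch under stated assumptions, not a reference implementation. The 1-D `(feature, label)` data format, the 1-nearest-neighbor utility, and the names `knn1_accuracy` and `leave_one_out_values` are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy format: a point is (1-D feature, class label).
Point = Tuple[float, int]


def knn1_accuracy(train: Sequence[Point], valid: Sequence[Point]) -> float:
    """Utility: validation accuracy of a 1-nearest-neighbor classifier.

    An empty training set is assigned utility 0.0 (an assumption).
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        # Predict with the label of the closest training point.
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += int(nearest[1] == y)
    return correct / len(valid)


def leave_one_out_values(
    train: Sequence[Point],
    valid: Sequence[Point],
    utility: Callable[[Sequence[Point], Sequence[Point]], float] = knn1_accuracy,
) -> List[float]:
    """Value of point i = utility(all points) - utility(all points except i)."""
    train = list(train)
    full = utility(train, valid)
    return [
        full - utility(train[:i] + train[i + 1:], valid)
        for i in range(len(train))
    ]


# Toy data: the third training point (0.4, label 1) sits in class-0 territory.
train = [(0.0, 0), (1.0, 1), (0.4, 1)]
valid = [(0.1, 0), (0.3, 0), (0.9, 1)]
values = leave_one_out_values(train, valid)
```

In this toy example the third point receives a negative value, because removing it improves validation accuracy. Leave-one-out only looks at the full training set, which is one reason the game-theoretic methods listed below average marginal contributions over many subsets instead.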

Notions of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the influence function. Only recently, however, have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature. These are often based on concepts from cooperative game theory, but also on generalization estimates for neural networks and on optimal transport theory, among others.



Most implemented papers

Data Shapley: Equitable Valuation of Data for Machine Learning

amiratag/DataShapley 5 Apr 2019

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions.
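For orientation, the Shapley value of a training point is its marginal contribution to a utility (for example, validation accuracy) averaged over all orderings of the training set. The sketch below estimates this by sampling random permutations. It is a minimal illustration under stated assumptions, not the paper's implementation: it omits the truncation heuristic of the paper's TMC-Shapley variant, and the function name `permutation_shapley` and the `utility` callable taking a list of indices are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def permutation_shapley(
    n: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of Shapley values for points 0..n-1.

    `utility` (hypothetical interface) maps a list of training indices to a
    score, e.g. the validation accuracy of a model trained on those points.
    """
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix: List[int] = []
        prev = utility(prefix)  # utility of the empty set
        for i in indices:
            prefix.append(i)
            cur = utility(prefix)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / n_permutations for v in values]


# Toy additive utility: each point contributes a fixed weight, so the
# Shapley value of each point equals its weight.
weights = [1.0, 2.0, 3.0]
estimates = permutation_shapley(3, lambda s: sum(weights[i] for i in s), 10)
```

Within each sampled permutation the marginal contributions telescope, so the estimates always sum to the utility of the full set minus the utility of the empty set. Each permutation costs `n` utility evaluations, which in practice means `n` model trainings.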

Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

AI-secure/KNN-PVLDB 22 Aug 2019

The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement on computational complexity!
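The exact computation for unweighted $K$NN classification can be sketched as a sort followed by a single backward pass per test point. The code below is a minimal pure-Python sketch of that recursion as commonly stated for this paper, assuming the utility of a subset is the fraction of its (at most $K$) nearest neighbors whose label matches the test label. The 1-D features, the absolute-difference distance, and the name `knn_shapley` are hypothetical choices for illustration.

```python
from typing import List, Sequence


def knn_shapley(
    train_x: Sequence[float],
    train_y: Sequence[int],
    test_x: Sequence[float],
    test_y: Sequence[int],
    k: int,
) -> List[float]:
    """Exact Shapley values for an unweighted KNN classifier (sketch).

    Values are averaged over the test points. Features are 1-D and the
    distance is the absolute difference (assumptions for this toy version).
    """
    n = len(train_x)
    values = [0.0] * n
    for xt, yt in zip(test_x, test_y):
        # Sort training indices by distance to the test point: O(n log n).
        order = sorted(range(n), key=lambda i: abs(train_x[i] - xt))
        match = [1.0 if train_y[i] == yt else 0.0 for i in order]
        s = [0.0] * n
        # Farthest point first, then recurse towards the nearest one.
        s[n - 1] = match[n - 1] / n
        for pos in range(n - 2, -1, -1):
            rank = pos + 1  # 1-based rank of this point in the sorted order
            s[pos] = s[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        for pos, i in enumerate(order):
            values[i] += s[pos] / len(test_x)
    return values


# Two training points, one test point at 0.1 with label 1, K = 1.
helpful = knn_shapley([0.0, 1.0], [1, 0], [0.1], [1], k=1)
harmful = knn_shapley([0.0, 1.0], [0, 1], [0.1], [1], k=1)
```

In the first call the nearest neighbor has the correct label and receives the whole utility. In the second call the nearest neighbor is wrong, so it receives a negative value while the farther, correctly labeled point receives a positive one.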

Data Valuation using Reinforcement Learning

google-research/google-research ICML 2020

To adaptively learn data values jointly with the target task predictor model, we propose a meta learning framework which we name Data Valuation using Reinforcement Learning (DVRL).

Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning

ykwon0407/beta_shapley 26 Oct 2021

Data Shapley has recently been proposed as a principled framework to quantify the contribution of an individual datum in machine learning.

The Shapley Value in Machine Learning

benedekrozemberczki/shapley 11 Feb 2022

Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning.

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Jiachen-T-Wang/data-banzhaf 30 May 2022

To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion.

CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification

stephanieschoch/cs-shapley 13 Nov 2022

Our theoretical analysis shows the proposed value function is (essentially) the unique function that satisfies two desirable properties for evaluating data values in classification.

Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

ykwon0407/dataoob 16 Apr 2023

As a result, Shapley-based data valuation has been recognized as infeasible to apply to large datasets.

Towards Efficient Data Valuation Based on the Shapley Value

aai-institute/pyDVL 27 Feb 2019

In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in cooperative game theory.

Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning

aai-institute/pyDVL 13 Jul 2021

The Shapley value (SV) and Least core (LC) are classic methods in cooperative game theory for cost/profit sharing problems.