Data Valuation
47 papers with code • 0 benchmarks • 0 datasets
Data valuation in machine learning seeks to determine the worth of individual data points, or of whole datasets, for downstream tasks. Some methods are task-agnostic and consider datasets as a whole, typically to support decision making in data markets; these rely on distributional distances between samples. More often, methods measure how individual points affect the performance of a specific machine learning model: they assign a scalar to each element of a training set reflecting its contribution to the final performance of a model trained on it. Some notions of value depend on a specific model of interest, while others are model-agnostic.
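The simplest instance of the per-point view is leave-one-out valuation: a point's value is the drop in performance when it is removed from the training set. A minimal sketch follows; the nearest-centroid classifier is a hypothetical stand-in for whatever model is of interest, and `utility` here is just test accuracy.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a toy nearest-centroid classifier (stand-in for any model)."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return (preds == y_te).mean()

def leave_one_out_values(X_tr, y_tr, X_te, y_te):
    """Value of point i = utility(full set) - utility(set without i)."""
    full = nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te)
    mask = np.ones(len(X_tr), dtype=bool)
    values = np.empty(len(X_tr))
    for i in range(len(X_tr)):
        mask[i] = False  # drop point i, retrain, and measure the utility change
        values[i] = full - nearest_centroid_accuracy(X_tr[mask], y_tr[mask], X_te, y_te)
        mask[i] = True
    return values
```

Leave-one-out is cheap relative to Shapley-based values (one retraining per point) but ignores interactions between points, which is precisely what the game-theoretic methods below address.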
Concepts of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the notion of the influence function. Only recently, however, have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature, often based on concepts from cooperative game theory, but also on generalization estimates for neural networks or on optimal transport theory, among others.
Libraries
Use these libraries to find Data Valuation models and implementations.
Most implemented papers
Data Shapley: Equitable Valuation of Data for Machine Learning
As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions.
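Data Shapley defines a point's value as its average marginal contribution over all subsets of the rest of the data, and the paper estimates it by sampling permutations (its TMC-Shapley algorithm additionally truncates a permutation scan once performance saturates). The sketch below is a minimal permutation-sampling estimator in that spirit; the `utility` callable, which maps a set of training indices to model performance, is assumed to be supplied by the user.

```python
import numpy as np

def monte_carlo_shapley(utility, n_points, n_perms=200, tol=1e-3, rng=None):
    """Permutation-sampling estimate of Data Shapley values.

    utility(indices) -> performance of a model trained on those training
    indices (user-supplied). Within each random permutation, each point's
    marginal contribution is the utility gain from adding it; scanning stops
    (marginals treated as 0) once utility is within `tol` of the full-set
    utility, mirroring the truncation idea in TMC-Shapley.
    """
    rng = np.random.default_rng(rng)
    full = utility(np.arange(n_points))
    values = np.zeros(n_points)
    for _ in range(n_perms):
        perm = rng.permutation(n_points)
        prev = utility(np.array([], dtype=int))
        for j, i in enumerate(perm):
            if abs(full - prev) < tol:
                marginal = 0.0  # truncate: remaining marginals are ~0
            else:
                cur = utility(perm[: j + 1])
                marginal = cur - prev
                prev = cur
            values[i] += marginal
    return values / n_perms
```

For an additive utility the estimate is exact after a single permutation; for real model training, hundreds of permutations (and the truncation) are what keep the cost manageable.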
Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms
The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement on computational complexity!
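The $O(N\log N)$ result rests on a closed-form recursion: sort the training points by distance to a test point, compute the value of the farthest point directly, then sweep inward. A sketch for a single test point follows, based on the recursion reported by Jia et al.; extending to a test set just averages over test points.

```python
import numpy as np

def knn_shapley_single(X_tr, y_tr, x_test, y_test, K):
    """Exact Shapley values of training points for an unweighted KNN
    classifier and a single test point. Sorting dominates the cost
    (O(N log N)); the recursion itself is a linear sweep.
    """
    N = len(X_tr)
    order = np.argsort(np.linalg.norm(X_tr - x_test, axis=1))  # nearest first
    match = (y_tr[order] == y_test).astype(float)  # 1 if label agrees with test label
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N  # farthest point's value in closed form
    for i in range(N - 2, -1, -1):
        # value differs from the next-farther point only via the label match
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    values = np.empty(N)
    values[order] = s  # undo the distance sort
    return values
```

By Shapley efficiency, the values sum to the KNN utility of the full set (the fraction of the K nearest neighbors whose label matches the test label), which gives a quick sanity check.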
The Shapley Value in Machine Learning
Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning.
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution
Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and are intractable for large datasets.
Data Valuation using Reinforcement Learning
To adaptively learn data values jointly with the target task predictor model, we propose a meta learning framework which we name Data Valuation using Reinforcement Learning (DVRL).
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning
Data Shapley has recently been proposed as a principled framework to quantify the contribution of each individual datum in machine learning.
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion.
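The Banzhaf value, which Data Banzhaf builds on, replaces the Shapley value's permutation weighting with a uniform average: a point's value is its expected marginal contribution to a uniformly random subset of the remaining points (each other point included independently with probability 1/2). A minimal Monte Carlo sketch, again assuming a user-supplied `utility` callable over training indices:

```python
import numpy as np

def monte_carlo_banzhaf(utility, n_points, n_samples=500, rng=None):
    """Monte Carlo estimate of Banzhaf data values: the expected marginal
    contribution of each point when every other point is included in the
    training subset independently with probability 1/2.
    """
    rng = np.random.default_rng(rng)
    values = np.zeros(n_points)
    for _ in range(n_samples):
        include = rng.random(n_points) < 0.5  # uniformly random subset
        for i in range(n_points):
            rest = include.copy()
            rest[i] = False                   # subset without point i
            with_i = rest.copy()
            with_i[i] = True                  # same subset plus point i
            values[i] += utility(np.flatnonzero(with_i)) - utility(np.flatnonzero(rest))
    return values / n_samples
```

The uniform weighting is what underlies the robustness argument: every subset contributes equally, so small perturbations of the utility (e.g. from stochastic training) are averaged out rather than amplified by a few heavily weighted coalitions.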
CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification
Our theoretical analysis shows the proposed value function is (essentially) the unique function that satisfies two desirable properties for evaluating data values in classification.
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
Marginal-contribution-based data values require retraining a large number of models; as a result, they have been recognized as infeasible to apply to large datasets.
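Data-OOB sidesteps retraining by reusing a bagging ensemble: each point's value is the average out-of-bag correctness over the weak learners whose bootstrap samples did not contain it. A minimal sketch of that idea follows; the 1-NN predictor is a hypothetical stand-in for the paper's weak learners.

```python
import numpy as np

def one_nn_predict(X_tr, y_tr, X):
    """Tiny 1-NN predictor used as a stand-in weak learner."""
    d = np.linalg.norm(X[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[d.argmin(axis=1)]

def data_oob_values(X, y, n_estimators=50, rng=None):
    """Data-OOB-style values: for each point, the average correctness of
    bootstrap weak learners whose bags did not contain that point.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    correct = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_estimators):
        bag = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), bag)  # points left out of this bag
        if len(oob) == 0:
            continue
        preds = one_nn_predict(X[bag], y[bag], X[oob])
        correct[oob] += (preds == y[oob])
        counts[oob] += 1
    # average OOB correctness; points never out-of-bag get value 0
    return np.divide(correct, counts, out=np.zeros(n), where=counts > 0)
```

Because the ensemble is trained once, the per-point cost is negligible compared with Shapley-style retraining, and mislabeled points stand out with low out-of-bag correctness.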
OpenDataVal: a Unified Benchmark for Data Valuation
Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset.