Don't encrypt the data; just approximate the model \ Towards Secure Transaction and Fair Pricing of Training Data

ICLR 2018 · Xinlei Xu ·

As machine learning becomes ubiquitous, deployed systems need to be as accu- rate as they can. As a result, machine learning service providers have a surging need for useful, additional training data that benefits training, without giving up all the details about the trained program. At the same time, data owners would like to trade their data for its value, without having to first give away the data itself be- fore receiving compensation. It is difficult for data providers and model providers to agree on a fair price without first revealing the data or the trained model to the other side. Escrow systems only complicate this further, adding an additional layer of trust required of both parties. Currently, data owners and model owners don’t have a fair pricing system that eliminates the need to trust a third party and training the model on the data, which 1) takes a long time to complete, 2) does not guarantee that useful data is paid valuably and that useless data isn’t, without trusting in the third party with both the model and the data. Existing improve- ments to secure the transaction focus heavily on encrypting or approximating the data, such as training on encrypted data, and variants of federated learning. As powerful as the methods appear to be, we show them to be impractical in our use case with real world assumptions for preserving privacy for the data owners when facing black-box models. Thus, a fair pricing scheme that does not rely on secure data encryption and obfuscation is needed before the exchange of data. This pa- per proposes a novel method for fair pricing using data-model efficacy techniques such as influence functions, model extraction, and model compression methods, thus enabling secure data transactions. We successfully show that without running the data through the model, one can approximate the value of the data; that is, if the data turns out redundant, the pricing is minimal, and if the data leads to proper improvement, its value is properly assessed, without placing strong assumptions on the nature of the model. Future work will be focused on establishing a system with stronger transactional security against adversarial attacks that will reveal details about the model or the data to the other party.

PDF Abstract