7 Feb 2024

A new method brings privacy-preserving synthetic data closer to its real-world analog, improving the predictive value of models trained on it.

A revolution in how businesses handle customer data could be around the corner, and it’s based entirely on made-up information.

Banks, health care providers, and other organizations in highly regulated industries are sitting on piles of spreadsheet data that could be mined for insights with AI, if only it were easier and safer to access. Sharing data, even internally, comes with a high risk of leaking or exposing sensitive information. And the risks have only increased with the passage of new data-privacy laws in many countries.

Synthetic data has become an essential alternative. This is data that’s been generated algorithmically to mimic the statistical distribution of real data, without revealing information that could be used to reconstruct the original sample. Synthetic data lets companies build predictive models, and quickly and safely test new ideas before going to the effort of validating them on real data.
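
As a rough illustration of the idea (a toy example, not IBM's generator), the sketch below fits a multivariate Gaussian to a few hypothetical numeric columns and samples new rows from it. The synthetic table preserves the columns' means and correlations without reproducing any individual record; real tabular generators, and the privacy machinery discussed next, are considerably more sophisticated.

```python
# Illustrative only: a toy synthetic-data generator, not IBM's method.
# It fits a multivariate Gaussian to a few numeric columns and samples new
# rows, so the synthetic table mimics the columns' means and correlations
# without reproducing any individual record. Column names and values are
# hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A tiny stand-in for a "real" customer table.
real = pd.DataFrame({
    "income":  rng.normal(55_000, 12_000, 1_000),
    "balance": rng.normal(8_000, 3_000, 1_000),
    "age":     rng.normal(40, 10, 1_000),
})

# Fit the joint mean and covariance, then sample synthetic rows from them.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1_000),
    columns=real.columns,
)

# The two correlation matrices are close, but no synthetic row is a real one.
print(real.corr().round(2))
print(synthetic.corr().round(2))
```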

The standard security guarantee for synthetic data is something called differential privacy. It’s a mathematical framework that limits how much any single record can influence the generated data, guaranteeing that the synthetic data can’t be traced back to the individuals behind it and can be analyzed without revealing personal or sensitive information.
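
For readers who want the formal statement, the standard textbook definition of epsilon-differential privacy is given below. This is the generic definition, not anything specific to the IBM work.

```latex
% Generic epsilon-differential-privacy definition (textbook form):
% M is the randomized generator, D and D' are any two datasets that differ
% in a single person's record, and S is any set of possible outputs.
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
\]
% A smaller epsilon forces the generator to behave almost identically with or
% without any one record, which is what prevents tracing outputs back to
% individuals.
```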

But there’s a trade-off. Exactly capturing the statistical properties of the original sample is virtually impossible. Synthetic data with privacy guarantees is always an approximation, which means that predictions made by models trained on it can also be skewed.

Much of the customer data that enterprises collect is in spreadsheet form: words and values organized into rows and columns. “The biggest problem we’re trying to solve is how to recreate highly structured, relational datasets with privacy guarantees,” said Akash Srivastava, synthetic data lead at IBM Research. “Most machine-learning models treat data points as independent, but tabular data is full of relationships.”

The more relationships that are embedded, the greater the chance that someone’s identity might be revealed — even after personal information has been disguised as synthetic data. Businesses typically get around this by adding more statistical noise to their synthetic data to guard against privacy breaches. But the noisier the data gets, the less predictive it becomes.
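
That trade-off is easy to see with the simplest differentially private building block, the Laplace mechanism. The sketch below is a generic textbook construction, not the method described in this article, and the income figures are made up: the noise added to a statistic scales with 1/epsilon, so tightening the privacy budget directly degrades accuracy.

```python
# Illustrative only: the textbook Laplace mechanism, showing why stronger
# privacy (smaller epsilon) means noisier, less useful statistics.
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(55_000, 12_000, 10_000)   # hypothetical "real" values
true_mean = incomes.mean()

def private_mean(values, epsilon, lo=0.0, hi=200_000.0):
    """Release a differentially private mean by adding Laplace noise scaled
    to the query's sensitivity divided by epsilon."""
    clipped = np.clip(values, lo, hi)
    sensitivity = (hi - lo) / len(clipped)     # one record can move the mean this much
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

# Tighter privacy budgets (smaller epsilon) produce larger errors.
for eps in (10.0, 1.0, 0.1, 0.01):
    est = private_mean(incomes, eps)
    print(f"epsilon={eps:>5}: error = {abs(est - true_mean):,.0f}")
```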

Skewed predictions can be especially problematic for groups underrepresented in the original data. A model trained on misleading synthetic data, for example, could recommend rejecting minority loan applicants who would have been considered qualified if real data had been used.

IBM researchers have proposed a solution: a technique that lets businesses essentially clean up the synthetic data they’ve already made so that it performs better on the target task. In practice, that means greater accuracy on jobs like predicting whether a loan will be repaid.

The solution, presented at NeurIPS 2023, brings together an idea from the 1970s called information projection and a standard optimization method known as a compositional proximal gradient algorithm.
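
Below is a minimal sketch of how those two pieces can fit together, based on the description above rather than the paper itself: keep the synthetic records fixed, but reweight them so that selected statistics line up with target values derived from the real data. Information projection yields exponential-form weights, and the short dual-ascent loop stands in for the compositional proximal gradient algorithm; all feature and target values are hypothetical.

```python
# A hedged sketch of information-projection-style post-processing, not the
# paper's implementation: reweight fixed synthetic records so that selected
# statistics match target values. The I-projection solution gives weights of
# the form exp(lambda dot f(x)); a plain dual-ascent loop on lambda stands in
# for the compositional proximal gradient algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic records (already privacy-protected upstream); one numeric feature.
synthetic = rng.normal(0.3, 1.0, size=2_000)

# Per-record statistics we want the weighted synthetic data to match.
def features(x):
    return np.stack([x, x**2], axis=1)        # first and second moments

F = features(synthetic)
target = np.array([0.0, 1.0])                 # hypothetical target moments

lam = np.zeros(2)
for _ in range(2_000):
    w = np.exp(F @ lam)
    w /= w.sum()                              # I-projection (max-entropy) weights
    gap = w @ F - target                      # how far weighted stats are off
    lam -= 0.1 * gap                          # dual-ascent step

print("unweighted stats:", F.mean(axis=0).round(3))
print("reweighted stats:", (w @ F).round(3))  # close to the targets
```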

“The mishandling of sensitive data can expose companies to huge liabilities,” said the study’s lead author Hao Wang, an IBM researcher at the MIT-IBM Watson AI Lab. “We want to make it easy for data curators to share their data without worrying about privacy breaches.”