In today’s world, upholding the right to privacy is challenging, to say the least. Data protection and privacy legislation now exists in roughly 70% of countries, yet the voracious appetite of machine learning algorithms has made personal data an indispensable part of doing business.
There have been many high-profile data breaches, including the watershed Cambridge Analytica scandal, which came to light in 2018, in which users’ data was harvested without consent. Between 2017 and 2019 there were some 4,395 reported breaches, exposing billions of records.
One way that countries have tried to ensure privacy is to limit the collection and processing of personal data, even as the need for this data only grows. The US Congress passed sector-specific privacy legislation as early as 1970 with the Fair Credit Reporting Act (FCRA) for financial data, followed by the Privacy Act of 1974 protecting “personally identifiable information” (PII) and, later, HIPAA (1996) for health data. Europe famously passed the GDPR, which imposes stricter regulations and requires a lawful basis, such as explicit consent, before personal data can be collected. These laws also cover images, especially those containing biometric identifiers.
But the data keeps building and the privacy requirements have a hard time keeping up. Companies have resorted to anonymization techniques such as k-anonymization, which remove or generalize PII in a dataset. K-anonymity guarantees that each person in a dataset is indistinguishable from at least k − 1 other people with respect to their identifying attributes. These anonymity approaches have flaws. One is that anonymized datasets can be re-identified: by linking them with publicly available data, an attacker can work backwards to the actual people in the dataset. Re-identification attacks have happened in practice, including studies in 2004, 2008 and 2013 that re-identified participants in publicly released genomic research datasets.
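To make the k-anonymity idea concrete, here is a minimal sketch that computes the k of a toy dataset; the records, field names, and generalization scheme (age brackets, masked ZIP codes) are hypothetical illustrations, not a real anonymization pipeline.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    # Group records by their quasi-identifier values; the dataset's
    # k-anonymity level is the size of the smallest such group.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical, already-generalized records: age bracket and masked ZIP
# are quasi-identifiers; diagnosis is the sensitive attribute.
people = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "068**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "068**", "diagnosis": "asthma"},
]

print(k_anonymity(people, ["age", "zip"]))  # -> 2
```

Here each person blends into a group of two, so the dataset is 2-anonymous; an attacker who knows someone's age bracket and ZIP prefix still cannot tell which of the two matching records is theirs.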
Another problem is that the utility of a dataset drops the more its information is redacted. Distinctive characteristics are lost, which raises the question: how can we share useful information while preserving privacy?
The answer, in short, is synthetic data. Synthetic data holds the promise of being close enough to original data to be useful, yet different enough from any unique individual to preserve privacy. Synthetic data can be created in multiple ways, from comprehensive 3D modeling to frameworks such as generative adversarial networks (GANs). GANs have become increasingly sophisticated, offering finer control over image synthesis and greater flexibility of generation. If you are creating a face, for example, you can control the identity of the person, the lighting and the background.
Synthetic data can alleviate privacy concerns because one can verify that generated data points are sufficiently different from the original, real-world data points. Because synthetic data is not directly linked to any individual, using synthetic data points minimizes the risk of leaking PII.
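One simple form such a check can take is a nearest-neighbor distance test: reject any synthetic point that sits too close to a real record. This is a sketch only; the feature vectors and the distance threshold below are hypothetical, and real pipelines use domain-specific metrics and thresholds.

```python
import math

def min_distance_to_real(synthetic_point, real_points):
    # Euclidean distance from one synthetic point to its nearest real record.
    return min(math.dist(synthetic_point, r) for r in real_points)

real = [(0.0, 0.0), (1.0, 1.0)]   # toy real records as feature vectors
synth = (0.5, 0.6)                # one generated point
threshold = 0.3                   # hypothetical minimum-distance requirement

d = min_distance_to_real(synth, real)
print(d > threshold)  # True: this point is not a near-copy of any real record
```

A synthetic point failing this test would suggest the generator has memorized a training example, which is exactly the leakage the check is meant to catch.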
A growing body of research attests to the effectiveness of synthetic data in addressing privacy concerns. For instance, when GANs are used to generate patient records, studies have found limited risk of both identity and attribute disclosure. This is especially true when noise is added to the training data points.
Differential privacy – a rigorous mathematical definition of privacy – has changed what it means to be private. GANs that generate synthetic images can be trained with differential privacy guarantees, which provably bound how much any single individual’s data can influence the generated images, limiting what sensitive personal information they can reveal.
Synthetic data has benefits beyond privacy. Because synthetic data comes labeled by construction, there is no need for slow, painstaking manual labeling; large, labeled datasets can be generated with minimal lead time.
Real data can be inaccurate, incomplete and biased. The greater control one has over synthetic data generation yields higher-quality data all around. Real data often lacks diversity and underrepresents edge cases; synthetic data can supplement these gaps in real-world datasets.
Synthetic data is the best bet for solving privacy concerns around data and PII. This, along with its other benefits, is one of the reasons synthetic data is becoming pervasive in industries where privacy is a priority.