Here’s how machine learning can violate your privacy

Machine learning has pushed the boundaries in several fields, including personalized medicine, self-driving cars and customized advertising. However, research has shown that these systems memorize aspects of the data they were trained on in order to learn patterns, which raises privacy concerns.

In statistics and machine learning, the goal is to learn from past data to make new predictions or inferences about future data. To achieve this goal, the statistician or machine learning expert selects a model to capture the suspected patterns in the data. A model applies a simplifying structure to the data, making it possible to learn patterns and make predictions.

Complex machine learning models have some inherent advantages and disadvantages. On the plus side, they can learn much more complex patterns and work with richer data sets for tasks such as image recognition and predicting how a specific person will respond to treatment.

However, they also run the risk of overfitting the data. This means that they make accurate predictions on the data they were trained on, but they also learn additional aspects of that data that are not directly related to the task at hand. The result is a model that does not generalize well, meaning it performs poorly on new data that is of the same type as, but not identical to, the training data.
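To make the idea concrete, here is a minimal sketch (not drawn from the article, using NumPy and synthetic data) of what overfitting looks like in practice: a high-degree polynomial matches a handful of noisy training points almost perfectly, yet its error on fresh data from the same process is worse than that of a simple line.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: points scattered around a straight line."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + 1 + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(10)    # small training set
x_test, y_test = make_data(100)     # fresh data of the same type

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                       # fit the model
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)   # error on training data
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)      # error on new data
    print(f"degree {degree}: train error {train_err:.4f}, test error {test_err:.4f}")
```

The high-degree fit will typically show a much lower training error but a worse test error, which is the overfitting pattern described above.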

While there are techniques to address the predictive error associated with overfitting, the fact that a model can learn so much from its training data also raises privacy concerns.

How machine learning algorithms draw conclusions

Each model has a certain number of parameters. A parameter is an element of a model that can be changed. Each parameter has a value, or setting, that the model derives from the training data. Parameters can be thought of as knobs that can be turned to affect the performance of the algorithm. While a straight-line pattern has only two knobs, the slope and the intercept, machine learning models have a great many parameters. The GPT-3 language model, for example, has 175 billion.
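As a concrete illustration (a minimal sketch with synthetic data, not an example from the article), fitting a straight line means learning just those two knobs:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)   # data scattered around a true line

# The model's only two parameters ("knobs"): the slope and the intercept.
slope, intercept = np.polyfit(x, y, 1)
print(f"learned slope = {slope:.2f}, learned intercept = {intercept:.2f}")
```

Modern machine learning models work the same way in principle, just with billions of knobs instead of two.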

To choose the parameters, machine learning methods use training data, with the goal of minimizing the predictive error on that data. For example, if the goal is to predict whether a person would respond well to a particular medical treatment based on their medical history, the machine learning model would make predictions on data where the model’s developers already know whether a person responded well or poorly. The model is rewarded for correct predictions and penalized for incorrect ones, which leads the algorithm to adjust its parameters (that is, turn some of the “knobs”) and try again.
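Here is a hedged sketch of that training loop, using logistic regression on synthetic stand-in data (the medical-history features and response labels are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                                # 200 patients, 5 synthetic features
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ true_w + rng.normal(size=200) > 0).astype(float)    # 1 = responded well, 0 = responded poorly

w = np.zeros(5)          # the model's parameters, initially untuned
learning_rate = 0.1
for step in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))       # predicted probability of a good response
    grad = X.T @ (p - y) / len(y)        # gradient of the cross-entropy loss (the "penalty")
    w -= learning_rate * grad            # turn the knobs to reduce the error, then try again

print("learned parameters:", np.round(w, 2))
```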

To avoid overfitting the training data, machine learning models are also checked against a validation dataset. The validation dataset is a separate dataset that is not used in the training process. By monitoring the performance of the machine learning model on this validation dataset, developers can ensure that the model can generalize what it has learned beyond the training data, thus avoiding overfitting.
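A minimal sketch of that check, again with synthetic data: the validation set plays no role in fitting the parameters and is used only to see which candidate model generalizes.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

x_train, y_train = x[:40], y[:40]   # used to fit the parameters
x_val, y_val = x[40:], y[40:]       # held out; used only to evaluate

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: validation error {val_err:.3f}")
```

A model that looks excellent on the training data but poor on the validation data is overfitting, and developers would reject or adjust it.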

While this process succeeds in ensuring good performance of the machine learning model, it does not directly prevent the machine learning model from remembering information in the training data.

Privacy concerns

Due to the large number of parameters in machine learning models, there is a chance that a machine learning method will memorize some of the data it was trained on. In fact, this is a widespread phenomenon, and users can extract the memorized data from the machine learning model by using queries tailored to retrieve it.
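One simple way researchers probe this kind of memorization is a loss-threshold “membership inference” check; the sketch below only illustrates that idea and is not the specific extraction attacks the research refers to. If the model fits a candidate record suspiciously well, that record was likely part of the training data.

```python
import numpy as np

def cross_entropy(predicted_prob, true_label):
    """Loss of a binary classifier on a single record."""
    p = np.clip(predicted_prob, 1e-9, 1 - 1e-9)
    return -(true_label * np.log(p) + (1 - true_label) * np.log(1 - p))

def likely_in_training_set(predicted_prob, true_label, threshold=0.05):
    """Flag a record as a probable training member if the model's loss on it is unusually low."""
    return cross_entropy(predicted_prob, true_label) < threshold

# Hypothetical query: the model assigns probability 0.999 to this record's true label.
print(likely_in_training_set(predicted_prob=0.999, true_label=1))   # True -> probably memorized
```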

If the training data contains sensitive information, such as medical or genomic data, the privacy of the people whose data was used to train the model could be at risk. Recent research has shown that it is actually necessary for machine learning models to remember aspects of the training data to achieve optimal performance when solving certain problems. This indicates that there may be a fundamental trade-off between the performance of a machine learning method and privacy.

Machine learning models also make it possible to predict sensitive information from seemingly non-sensitive data. For example, Target was able to predict which customers were likely pregnant by analyzing the purchasing behavior of customers who had signed up for the Target baby registry. Once the model was trained on this dataset, the company could send pregnancy-related advertisements to customers it suspected were pregnant because they had purchased items such as supplements or unscented lotions.

Is privacy protection even possible?

While there have been many proposed methods to reduce memorization in machine learning methods, most have been largely ineffective. Currently, the most promising solution to this problem is to guarantee a mathematical limit on the privacy risk.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model not change much when an individual’s data in the training dataset is changed. Differential privacy methods achieve this guarantee by introducing additional randomness into the learning algorithms that ‘obscures’ the contribution of any given individual. Once a method is protected with differential privacy, no attack can violate that privacy guarantee.
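To show what “introducing additional randomness” can look like, here is a minimal sketch of the classic Laplace mechanism applied to a simple counting query; it illustrates the calibrated-noise idea rather than the training-time algorithms used to make whole machine learning models differentially private.

```python
import numpy as np

def dp_count(records, epsilon):
    """Release a count with Laplace noise; any single person changes the true count by at most 1."""
    true_count = len(records)
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Synthetic placeholder data: IDs of patients who responded well to a treatment.
patients_who_responded_well = list(range(120))
print(dp_count(patients_who_responded_well, epsilon=1.0))   # close to 120, but randomized
```

Smaller values of epsilon mean more noise and stronger privacy, which is the source of the performance trade-off discussed below.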

However, even if a machine learning model is trained with differential privacy, that does not prevent it from making sensitive inferences, as in the Target example. To prevent those privacy violations, all data sent to the organization must be protected before it leaves the user’s device. This approach is called local differential privacy, and Apple and Google have implemented it.
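A hedged sketch of the local approach, using classic randomized response: each person randomizes their own answer before it ever leaves their device, so the organization never receives the raw value. (Apple’s and Google’s production mechanisms are more elaborate, but the flavor is the same.)

```python
import numpy as np

rng = np.random.default_rng(4)
P_TRUTH = 0.75   # probability of reporting the true answer

def randomized_response(true_answer: bool) -> bool:
    """Report the true answer with probability P_TRUTH; otherwise report a coin flip."""
    if rng.random() < P_TRUTH:
        return true_answer
    return bool(rng.integers(2))

# Each user perturbs their own data locally before sending it.
true_answers = rng.random(10_000) < 0.3                      # 30% of users truly answer "yes"
reports = np.array([randomized_response(a) for a in true_answers])

# The aggregator can still estimate the population rate without trusting any single report:
# observed = P_TRUTH * true_rate + (1 - P_TRUTH) * 0.5
estimated_rate = (reports.mean() - (1 - P_TRUTH) * 0.5) / P_TRUTH
print(f"estimated share answering 'yes': {estimated_rate:.3f}")
```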

Because differential privacy limits how much the machine learning model can rely on one individual’s data, it prevents memorization. Unfortunately, it also limits the performance of the machine learning methods. Because of this trade-off, there have been criticisms of the usefulness of differential privacy, as it often results in a significant performance degradation.

Moving forward

Because of the tension between inferential learning and privacy concerns, there is ultimately a social question of which is more important in which contexts. When data does not contain sensitive information, it is easy to recommend using the most powerful machine learning methods available.

However, when working with sensitive data, it is important to weigh the consequences of privacy leaks, and it may be necessary to sacrifice some machine learning performance to protect the privacy of the people whose data trained the model.

This article is republished from The Conversation, an independent nonprofit organization providing facts and analysis to help you understand our complex world.

It was written by: Jordan Awan, Purdue University.

Jordan Awan receives funding from the National Science Foundation and the National Institutes of Health. He also serves as a privacy consultant for the federal nonprofit MITRE.
