Our DQ0 tool, which allows researchers to access patient data in accordance with DSAGMs, relies on differential privacy for data protection. What exactly lies behind this advanced approach to protecting data?
A data holder promises the persons whose data it processes that they will be affected only slightly, or not at all, by the processing of their data in an analysis. This holds regardless of what other data is available from studies, other data sets, or third-party sources of information.
At its best, differential privacy makes data broadly available with maximum data protection, without requiring complex anonymization or masking operations on the data beforehand.
Differential privacy is about learning general properties of a data set as a whole while revealing nothing about individual data points (individuals). The following example illustrates the challenge of data protection in statistical analysis and the solution that differential privacy offers:
A and B both work as analysts at a consulting firm. Both have access to a secure database that contains personal data, including the income of the persons it covers. Several weeks apart, A and B independently publish analyses of salary developments in certain regions.
Some of these regions appear in both analyses. Within each analysis, the salary data is presented as an average over all salaries of the persons in the respective region. The analyses also contain population totals per region.
An external observer C now gains access to both analyses, whether by chance or by intention. C notices that for some regions the population counts differ only slightly between the two analyses.
Through further research, C learns that some people moved to or from these regions in the period between the two analyses. For some of these regions, C can identify persons who are included in one analysis but not the other. By comparing the published salary figures, C can then determine the salary of individual persons. C thus obtains sensitive information that is worthy of protection, even though neither analysis, considered on its own, appears to reveal it.
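This kind of "differencing attack" can be sketched in a few lines. All names and figures below are illustrative, not taken from the article: the attacker sees only the published (average, count) pairs from the two analyses, yet can reconstruct one person's salary exactly.

```python
# Hypothetical sketch of observer C's differencing attack: exact averages
# plus population counts from two analyses expose one person's salary.
salaries_t1 = [52_000, 61_000, 47_500, 58_000]  # region at the time of A's analysis
salaries_t2 = salaries_t1[:-1]                   # one person has since moved away (B's analysis)

avg_t1, n_t1 = sum(salaries_t1) / len(salaries_t1), len(salaries_t1)
avg_t2, n_t2 = sum(salaries_t2) / len(salaries_t2), len(salaries_t2)

# C never sees the raw salaries, only the two published (average, count) pairs,
# yet the difference of the reconstructed sums reveals the missing individual:
recovered = avg_t1 * n_t1 - avg_t2 * n_t2
print(recovered)  # 58000.0 — the salary of the person who left
```

The attack works because the published statistics are exact: two exact sums that differ by one person differ by exactly that person's value.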
Nothing changes
Differential privacy promises exactly this: that adding or removing individual data points changes the analysis result not at all, or only very little. In this example, that can be achieved by adding random noise to the average salaries and population totals. An exact comparison of the two analyses then becomes impossible, and yet each analysis retains its significance.
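A common way to add such noise is the Laplace mechanism: each released statistic gets noise calibrated to how much one person can change it. The sketch below is a minimal illustration of that idea, not DQ0's actual implementation; the salary cap, epsilon values, and function names are assumptions made for the example.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) random variable
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_region_stats(salaries, epsilon=1.0, salary_cap=200_000):
    """Release a noisy population count and noisy average salary for one region.

    Illustrative sensitivity assumptions: adding or removing one person
    changes the count by at most 1 and the (clipped) salary sum by at most
    `salary_cap`, so each release gets Laplace noise scaled accordingly.
    """
    n = len(salaries)
    total = sum(min(s, salary_cap) for s in salaries)  # clip to bound sensitivity
    eps_each = epsilon / 2  # split the privacy budget between the two releases
    noisy_n = n + laplace_noise(1 / eps_each)
    noisy_total = total + laplace_noise(salary_cap / eps_each)
    return round(noisy_n), noisy_total / max(noisy_n, 1)

count, avg = private_region_stats([50_000] * 100, epsilon=1.0)
print(count, avg)  # close to (100, 50000), but perturbed on every run
```

Because the released count and average are noisy, subtracting two publications no longer isolates a single person's exact salary, while region-level trends remain visible.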
Differential privacy therefore operates on the data query, not on the data itself. If multiple analyses are carried out on data from the same group, then as long as each individual analysis satisfies differential privacy, all of the published information taken together also remains differentially private. As long as the responses to data queries keep the differential privacy promise, for example by adding noise to the aggregated results as described above, the responses stay safe even when combined — the guarantees of the individual queries compose, with their privacy budgets adding up.
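This composition property is often enforced with a privacy "accountant" that tracks the total budget across queries. The toy class below is a sketch of basic sequential composition (where the epsilons of individual queries simply add up); it is an illustration, not DQ0's actual API.

```python
class PrivacyAccountant:
    """Toy accountant for basic sequential composition of DP queries.

    Each differentially private query spends part of a total epsilon
    budget; once the budget is exhausted, further queries are refused.
    """
    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> bool:
        if epsilon > self.remaining:
            return False  # budget exhausted: refuse the query
        self.remaining -= epsilon
        return True

acct = PrivacyAccountant(total_epsilon=1.0)
print(acct.spend(0.4))  # True  — first analysis allowed
print(acct.spend(0.4))  # True  — second analysis allowed
print(acct.spend(0.4))  # False — would exceed the total budget
```

Refusing queries past the budget is what makes the combined guarantee hold: no matter how an observer combines the published answers, the total privacy loss stays bounded by the overall epsilon.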