Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)

We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
Datasets containing “micro-data,” that is, information about specific individuals, are increasingly becoming public—both in response to “open government” laws, and to support data mining research. Some datasets include legally protected information such as health histories; others contain individual preferences, purchases, and transactions, which many people may view as private or sensitive.
Privacy risks of publishing micro-data are well-known. Even if identifying information such as names, addresses, and Social Security numbers has been removed, the adversary can use contextual and background knowledge, as well as cross-correlation with publicly available databases, to re-identify individual data records. Famous re-identification attacks include de-anonymization of a Massachusetts hospital discharge database by joining it with a public voter database [22], de-anonymization of individual DNA sequences [19], and privacy breaches caused by (ostensibly anonymized) AOL search data [12].
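To make the idea of "joining" an anonymized release with a public database concrete, here is a minimal Python sketch of an exact-match linkage attack. It is only an illustration under stated assumptions: every record is invented, the helper names (`key`, `link`) are hypothetical, and the choice of ZIP code, birth date, and sex as the shared fields is an assumption not spelled out in the text above. It is also not the robust statistical technique the paper describes, which tolerates noisy and partly wrong background knowledge; this is the simpler join-style attack.

```python
# Toy linkage (join) attack. All records are invented for illustration.

# "Anonymized" release: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "birth": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth": "1962-01-15", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth": "1962-01-15", "sex": "M", "diagnosis": "flu"},
]

# Public auxiliary data: names present, same quasi-identifiers.
public = [
    {"name": "Alice Example", "zip": "02138", "birth": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth": "1962-01-15", "sex": "M"},
]

# Assumed set of fields shared by both datasets (hypothetical choice).
QUASI_IDS = ("zip", "birth", "sex")


def key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDS)


def link(anonymized_records, public_records):
    """Map each name to an anonymized record when the join key is unique."""
    index = {}
    for rec in anonymized_records:
        index.setdefault(key(rec), []).append(rec)

    matches = {}
    for person in public_records:
        candidates = index.get(key(person), [])
        # Only an unambiguous match counts as a re-identification.
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(anonymized, public)
```

In this toy data, one person's quasi-identifiers match a single anonymized record, so that record is re-identified; the other person matches two records and stays ambiguous. Removing names alone does not help when the remaining fields are unique enough to act as a fingerprint.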
{Worth reading even if you don’t understand math: the researchers explain a great deal about de-identification and why it does not protect privacy. Narayanan and Shmatikov write, ‘The privacy question is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?” The answer to the latter question is, undoubtedly, yes. As shown by our experiments with cross-correlating non-anonymous records from the Internet Movie Database with anonymized Netflix records (see below), it is possible to learn sensitive non-public information about a person’s political or even sexual preferences. We assert that even if the vast majority of Netflix subscribers did not care about the privacy of their movie ratings (which is not obvious by any means), our analysis would still indicate serious privacy issues with the Netflix Prize dataset.’ ~Dr. Deborah Peel, Patient Privacy Rights}
