- Boosting the Accuracy of Differentially-Private Queries Through Consistency
Michael Hay, Vibhor Rastogi, Gerome Miklau, Dan Suciu
U. of Massachusetts Amherst Technical Report UM-CS-2009-013
arXiv
X Y
Abstract Recent differentially private query mechanisms offer strong privacy guarantees by adding noise to the query answer. For a single counting query, the technique is simple, accurate, and provides optimal utility. However, analysts typically wish to ask multiple queries. In this case, the optimal strategy is not apparent, and alternative query strategies can involve difficult trade-offs in accuracy, and may produce inconsistent answers.
In this work we show that it is possible to significantly improve accuracy for a general class of histogram queries. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is both private and consistent, but in addition, it is often much more accurate.
We apply our techniques to real datasets and show they can be used for estimating the degree sequence of a graph with extreme precision, and for computing a histogram that can support arbitrary range queries accurately.
@techreport{hay2009boosting,
Author = {Michael Hay and Vibhor Rastogi and Gerome Miklau and Dan Suciu},
Title = {Boosting the Accuracy of Differentially-Private Queries Through Consistency},
Institution = {University of Massachusetts Amherst},
Number = {UM-CS-2009-013},
Year = {2009}
}
-
Relationship Privacy: Output Perturbation for Queries with Joins
Vibhor Rastogi, Michael Hay, Gerome Miklau, Dan Suciu
PODS 2009
pdf
full
X Y
Abstract We study privacy-preserving query answering over data containing relationships. A social network is a prime example of such data, where the nodes represent individuals and edges represent relationships. Nearly all interesting queries over social networks involve joins, and for such queries, existing output perturbation algorithms severely distort query answers. We propose an algorithm that significantly improves utility over competing techniques, typically reducing the error bound from polynomial in the number of nodes to poly-logarithmic. The algorithm is, to the best of our knowledge, the first to answer such queries with acceptable accuracy, even for worst-case inputs.
The improved utility is achieved by relaxing the privacy condition. Instead of ensuring strict differential privacy, we guarantee a weaker (but still quite practical) condition based on adversarial privacy. To explain precisely the nature of our relaxation in privacy, we provide a new result that characterizes the relationship between ǫ-indistinguishability (a variant of the differential privacy definition) and adversarial privacy, which is of independent interest: an algorithm is \epsilon-indistinguishable iff it is private for a particular class of adversaries (defined precisely herein). Our perturbation algorithm guarantees privacy against adversaries in this class whose prior distribution is numerically bounded.
@inproceedings{rastogi2009relationship-privacy,
Author = {Vibhor Rastogi and Michael Hay and Gerome Miklau and Dan Suciu},
Title = {Relationship Privacy: Output Perturbation for Queries with Joins},
Booktitle = {PODS},
Year = {2009}
}
- Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis
VLDB 2008
pdf X Y
Abstract We identify privacy risks associated with releasing network data sets and provide an algorithm that mitigates those risks. A network consists of entities connected by links representing relations such as friendship, communication, or shared activity. Maintaining privacy when publishing networked data is uniquely challenging because an individual's network context can be used to identify them even if other identifying information is removed. In this paper, we quantify the privacy risks associated with three classes of attacks on the privacy of individuals in networks, based on the knowledge used by the adversary. We show that the risks of these attacks vary greatly based on network structure and size. We propose a novel approach to anonymizing network data that models aggregate network structure and then allows samples to be drawn from that model. The approach guarantees anonymity for network entities while preserving the ability to estimate a wide variety of network measures with relatively little bias.
@inproceedings{hay2008resisting,
Author = {Michael Hay and Gerome Miklau and David Jensen and Don Towsely and Philipp Weis},
Title = {Resisting Structural Re-identification in Anonymized Social Networks},
Booktitle = {VLDB},
Year = {2008}
}
- Anonymizing social networks
Michael Hay, Gerome Miklau, David Jensen, Philipp Weis, and Siddharth Srivastava
University of Massachusetts Amherst Technical Report 2007
pdf X Y
Abstract Advances in technology have made it possible to collect data about individuals and the connections between them, such as email correspondence and friendships. Agencies and researchers who have collected such social network data often have a compelling interest in allowing others to analyze the data. However, in many cases the data describes relationships that are private (e.g., email correspondence) and sharing the data in full can result in unacceptable disclosures. In this paper, we present a framework for assessing the privacy risk of sharing anonymized network data. This includes a model of adversary knowledge, for which we consider several variants and make connections to known graph theoretical results. On several real-world social networks, we show that simple anonymization techniques are inadequate, resulting in substantial breaches of privacy for even modestly informed adversaries. We propose a novel anonymization technique based on perturbing the network and demonstrate empirically that it leads to substantial reduction of the privacy threat. We also analyze the effect that anonymizing the network has on the utility of the data for social network analysis.
@techreport{hay2007anonymizing,
Author = {Michael Hay and Gerome Miklau and David Jensen and Philipp Weis and Siddharth Srivastava},
Title = {Anonymizing Social Networks},
Institution = {University of Massachusetts Amherst},
Number = {07-19},
Year = {2007}
}
-
An integrated, conditional model of information extraction and coreference with application to citation matching
Ben Wellner, Andrew McCallum, Fuchun Peng and Michael Hay
Conference on Uncertainty in Artificial Intelligence (UAI) 2004
pdf X Y
Abstract Although information extraction and coreference resolution appear together in many applications, most current systems perform them as independent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.
@inproceedings{wellner2004an-integrated,
Author = {Ben Wellner and Andrew McCallum and Fuchun Peng and Michael Hay},
Title = {An integrated, conditional model of information extraction and coreference with application to citation matching},
Booktitle = {Proceedings of the 20th conference on Uncertainty in Artificial Intelligence},
Pages = {593--601},
Year = {2004}
}
-
Exploiting relational structure to understand publication patterns in high-energy physics
Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen
SIGKDD Explorations 2003 (Winning entry in the KDD Cup)
pdf X Y
Abstract We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus our analyses on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics that could not have been achieved without examining each area in detail.
@article{mcgovern2003exploiting,
Author = {A. McGovern and L. Friedland and M. Hay and B. Gallagher and A. Fast and J. Neville and D. Jensen},
Journal = {SIGKDD Explorations},
Title = {Exploiting relational structure to understand publication patterns in high-energy physics},
Year = {2003}
}
-
Learning relational probability trees
Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2003
pdf X Y
Abstract Classification trees are widely used in the machine learning and data mining communities for modeling propositional data. Recent work has extended this basic paradigm to probability estimation trees. Traditional tree learning algorithms assume that instances in the training data are homogeneous and independently distributed. Relational probability trees (RPTs) extend standard probability estimation trees to a relational setting in which data instances are heterogeneous and interdependent. Our algorithm for learning the structure and parameters of an RPT searches over a space of relational features that use aggregation functions (e.g. AVERAGE, MODE, COUNT) to dynamically propositionalize relational data and create binary splits within the RPT. Previous work has identified a number of statistical biases due to characteristics of relational data such as auto-correlation and degree disparity. The RPT algorithm uses a novel form of randomization test to adjust for these biases. On a variety of relational learning tasks, RPTs built using randomization tests are significantly smaller than other models and achieve equivalent, or better, performance.
@inproceedings{neville2003learning,
Author = {Jennifer Neville and David Jensen and Lisa Friedland and Michael Hay},
Title = {Learning relational probability trees},
Booktitle = {Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining},
Doi = {http://doi.acm.org/10.1145/956750.956830},
Pages = {625--630},
Publisher = {ACM Press},
Year = {2003}
}
-
Avoiding bias when aggregating relational data with degree disparity
David Jensen, Jennifer Neville, and Michael Hay
International Conference on Machine Learning (ICML) 2003
pdf X Y
Abstract A common characteristic of relational data sets -- degree disparity -- can lead relational learning algorithms to discover misleading correlations. Degree disparity occurs when the frequency of a relation is correlated with the values of the target
variable. In such cases, aggregation functions used by many relational learning algorithms will result in misleading correlations and added complexity in models. We examine this problem through a combination of simulations and experiments. We show how two novel hypothesis testing procedures can adjust for the effects of using aggregation functions in the presence of degree disparity.
@inproceedings{jensen2003avoiding,
Author = {David Jensen and Jennifer Neville and Michael Hay},
Title = {Avoiding Bias when Aggregating Relational Data with Degree Disparity},
Booktitle = {Proceedings of the Twentieth International Conference on Machine Learning (ICML)},
Pages = {274--281},
Year = {2003}
}