- Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis
VLDB 2008
pdf X Y
Abstract We identify privacy risks associated with releasing network data sets and provide an algorithm that mitigates those risks. A network consists of entities connected by links representing relations such as friendship, communication, or shared activity. Maintaining privacy when publishing networked data is uniquely challenging because an individual's network context can be used to identify them even if other identifying information is removed. In this paper, we quantify the privacy risks associated with three classes of attacks on the privacy of individuals in networks, based on the knowledge used by the adversary. We show that the risks of these attacks vary greatly based on network structure and size. We propose a novel approach to anonymizing network data that models aggregate network structure and then allows samples to be drawn from that model. The approach guarantees anonymity for network entities while preserving the ability to estimate a wide variety of network measures with relatively little bias.
@inproceedings{hay2008resisting,
Author = {Michael Hay and Gerome Miklau and David Jensen and Don Towsely and Philipp Weis},
Title = {An integrated, conditional model of information extraction and coreference with application to citation matching},
Booktitle = {VLDB},
Year = {2008}
}
- Anonymizing social networks
Michael Hay, Gerome Miklau, David Jensen, Philipp Weis, and Siddharth Srivastava
University of Massachusetts Amherst Technical Report 2007
pdf X Y
Abstract Advances in technology have made it possible to collect data about individuals and the connections between them, such as email correspondence and friendships. Agencies and researchers who have collected such social network data often have a compelling interest in allowing others to analyze the data. However, in many cases the data describes relationships that are private (e.g., email correspondence) and sharing the data in full can result in unacceptable disclosures. In this paper, we present a framework for assessing the privacy risk of sharing anonymized network data. This includes a model of adversary knowledge, for which we consider several variants and make connections to known graph theoretical results. On several real-world social networks, we show that simple anonymization techniques are inadequate, resulting in substantial breaches of privacy for even modestly informed adversaries. We propose a novel anonymization technique based on perturbing the network and demonstrate empirically that it leads to substantial reduction of the privacy threat. We also analyze the effect that anonymizing the network has on the utility of the data for social network analysis.
@techreport{hay2007anonymizing,
Author = {Michael Hay and Gerome Miklau and David Jensen and Philipp Weis and Siddharth Srivastava},
Title = {Anonymizing Social Networks},
Institution = {University of Massachusetts Amherst},
Number = {07-19},
Year = {2007}
}
- Understanding the effects of search constraints on structure learning
Michael Hay, Andrew Fast, and David Jensen
University of Massachusetts Amherst Technical Report 2007
pdf X Y
Abstract Recently, Tsamardinos et al. [2006] presented an algorithm for Bayesian network structure learning that outperforms many state-of-the-art algorithms in terms of efficiency, structure similarity and likelihood. The Max-Min Hill Climbing algorithm is a hybrid of constraint-based and search-and-score techniques, using greedy hill climbing to search a constrained space of possible network structures. The constraints correspond to assertions of conditional independence that must hold in the network from which the data were sampled. One would expect that constraining the space would make search both faster and more accurate, focusing search on the "right" part of the space. The published results indicate, however, that the resulting structures are less accurate when search is constrained. We reproduce these results and explain why they occur. At small samples, the statistical test of conditional independence has low power, which causes the algorithm to exclude edges between dependent variables. Also, the constraints make search relatively harder, leading to errors in edge orientation. In an unconstrained space, search can "repair" these errors by adding in extra edges. We conclude by proposing and evaluating an improved algorithm.
@techreport{hay2007understanding,
Author = {Michael Hay and Andrew Fast and David Jensen},
Institution = {University of Massachusetts Amherst},
Title = {Understanding the effects of search constraints on structure learning},
Number = {07-21},
Year = {2007}
}
-
An integrated, conditional model of information extraction and coreference with application to citation matching
Ben Wellner, Andrew McCallum, Fuchun Peng and Michael Hay
Conference on Uncertainty in Artificial Intelligence (UAI) 2004
pdf X Y
Abstract Although information extraction and coreference resolution appear together in many applications, most current systems perform them as independent steps. This paper describes an approach to integrated inference for extraction and coreference based on conditionally-trained undirected graphical models. We discuss the advantages of conditional probability training, and of a coreference model structure based on graph partitioning. On a data set of research paper citations, we show significant reduction in error by using extraction uncertainty to improve coreference citation matching accuracy, and using coreference to improve the accuracy of the extracted fields.
@inproceedings{wellner2004an-integrated,
Author = {Ben Wellner and Andrew McCallum and Fuchun Peng and Michael Hay},
Title = {An integrated, conditional model of information extraction and coreference with application to citation matching},
Booktitle = {Proceedings of the 20th conference on Uncertainty in Artificial Intelligence},
Pages = {593--601},
Year = {2004}
}
-
Exploiting relational structure to understand publication patterns in high-energy physics
Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen
SIGKDD Explorations 2003 (Winning entry in the KDD Cup)
pdf X Y
Abstract We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus our analyses on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics that could not have been achieved without examining each area in detail.
@article{mcgovern2003exploiting,
Author = {A. McGovern and L. Friedland and M. Hay and B. Gallagher and A. Fast and J. Neville and D. Jensen},
Journal = {SIGKDD Explorations},
Title = {Exploiting relational structure to understand publication patterns in high-energy physics},
Year = {2003}
}
-
Learning relational probability trees
Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2003
pdf X Y
Abstract Classification trees are widely used in the machine learning and data mining communities for modeling propositional data. Recent work has extended this basic paradigm to probability estimation trees. Traditional tree learning algorithms assume that instances in the training data are homogeneous and independently distributed. Relational probability trees (RPTs) extend standard probability estimation trees to a relational setting in which data instances are heterogeneous and interdependent. Our algorithm for learning the structure and parameters of an RPT searches over a space of relational features that use aggregation functions (e.g. AVERAGE, MODE, COUNT) to dynamically propositionalize relational data and create binary splits within the RPT. Previous work has identified a number of statistical biases due to characteristics of relational data such as auto-correlation and degree disparity. The RPT algorithm uses a novel form of randomization test to adjust for these biases. On a variety of relational learning tasks, RPTs built using randomization tests are significantly smaller than other models and achieve equivalent, or better, performance.
@inproceedings{neville2003learning,
Author = {Jennifer Neville and David Jensen and Lisa Friedland and Michael Hay},
Title = {Learning relational probability trees},
Booktitle = {Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining},
Doi = {http://doi.acm.org/10.1145/956750.956830},
Pages = {625--630},
Publisher = {ACM Press},
Year = {2003}
}
-
Avoiding bias when aggregating relational data with degree disparity
David Jensen, Jennifer Neville, and Michael Hay
International Conference on Machine Learning (ICML) 2003
pdf X Y
Abstract A common characteristic of relational data sets -- degree disparity -- can lead relational learning algorithms to discover misleading correlations. Degree disparity occurs when the frequency of a relation is correlated with the values of the target
variable. In such cases, aggregation functions used by many relational learning algorithms will result in misleading correlations and added complexity in models. We examine this problem through a combination of simulations and experiments. We show how two novel hypothesis testing procedures can adjust for the effects of using aggregation functions in the presence of degree disparity.
@inproceedings{jensen2003avoiding,
Author = {David Jensen and Jennifer Neville and Michael Hay},
Title = {Avoiding Bias when Aggregating Relational Data with Degree Disparity},
Booktitle = {Proceedings of the Twentieth International Conference on Machine Learning (ICML)},
Pages = {274--281},
Year = {2003}
}