
Human-Machine Collaboration for Knowledge Base Construction (Andrew McCallum’s research highlighted)

Sun January 15, 2012

Wikipedia’s impact has been revolutionary.  The collaboratively edited encyclopedia has transformed the way many people learn, browse new interests, share knowledge and make decisions.  Its information is mainly represented in natural language text.  However, in many domains, from disaster recovery to biomedicine, more structured information is useful because it better supports pattern analysis and decision-making.

Such structured information must usually be gathered and assembled from disparate sources.  Sometimes this task is performed by humans, but it can be accomplished at much greater scale and speed by information extraction (IE), which automatically populates a database with relevant subsequences of text drawn from sources such as web pages, Twitter™ messages and research articles.  Both humans and automated IE sometimes make mistakes, but they have complementary strengths.

Professor Andrew McCallum sees rising interest in structured knowledge bases with Wikipedia-style breadth and collaborative derivation.  His research aims to bring about a similar revolution in the creation of such structured knowledge bases by robustly integrating machine-provided information extraction from large text collections with human-provided edits from a diverse population of contributors.

For the past decade he has been working on information extraction and data mining.  “Although information extraction and data mining appear together in many applications, their interface in most current systems would better be described as serial juxtaposition than as tight integration,” says McCallum. “Information extraction is usually not aware of the emerging patterns and regularities in the database.  Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties.  The result is that the accuracy of both suffers,” he adds.

McCallum’s research has focused on probabilistic models that perform joint inference across multiple components of an information processing pipeline in order to avoid the brittle accumulation of errors.  “The need for joint inference appears not only in extraction and data mining, but also in natural language processing, computer vision, robotics and anywhere we must reason at multiple layers of abstraction,” McCallum declares.  “I believe that joint inference is one of the most fundamental issues in artificial intelligence.”

His lab has built up a series of successes showing the benefits of joint inference.  In 2005 his group won a KDD competition in entity resolution with new research on a conditional random field that jointly accounts for multiple pairwise compatibilities.  In 2008 his group obtained new state-of-the-art results on benchmark tasks in ontology- and schema-matching using a joint model trained by their new Metropolis-Hastings-embedded parameter estimation method, SampleRank.  In 2011 Postdoctoral Fellow Sebastian Riedel and McCallum won multiple tracks of the BioNLP competition in collaboration with researchers from Stanford who provided features from their parser; the key to this success was the use of dual decomposition for joint inference among multiple constraints on the protein interaction events extracted.
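To make the idea of jointly accounting for pairwise compatibilities concrete, here is a minimal sketch in Python.  It is not the group's conditional random field: it is a brute-force stand-in that scores every possible clustering of a few mentions, and the mention names and compatibility scores are invented for illustration.  With these scores, independent pairwise decisions would merge m1 with m2 and m2 with m3 while keeping m1 and m3 apart, which is inconsistent; scoring whole clusterings forces a consistent answer.

```python
from itertools import combinations


def partitions(items):
    """Yield every way to split `items` into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing cluster in turn...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        # ...or start a new cluster with it.
        yield [[first]] + smaller


def score(clustering, compat):
    """Sum the pairwise compatibilities of mentions placed in the same cluster."""
    total = 0.0
    for cluster in clustering:
        for a, b in combinations(sorted(cluster), 2):
            total += compat[(a, b)]
    return total


def best_clustering(mentions, compat):
    """Exhaustive search; only feasible for a handful of mentions."""
    return max(partitions(list(mentions)), key=lambda c: score(c, compat))


# Hypothetical scores: positive means "probably the same entity".
compat = {("m1", "m2"): 2.0, ("m2", "m3"): 1.0, ("m1", "m3"): -4.0}
best = best_clustering(["m1", "m2", "m3"], compat)
print(sorted(sorted(c) for c in best))
```

Exhaustive enumeration grows far too quickly to use on real data, which is one reason approximate inference methods such as sampling are used in practice.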

“These results demonstrate joint inference being performed among relatively small sets of variables, but I think the biggest gains will come from joint inference across document boundaries or across entire databases,” says McCallum.  “Reasoning about data at this scale quickly involves more statistical random variables than can fit in machine memory, however.”  This problem has led to McCallum’s more recent interest in probabilistic databases.  “We would like to use database technology not just for storing and querying the results of an IE system, but also for performing IE joint inference itself, managing the many random variables and intermediate results of IE,” he adds.  In a 2010 VLDB paper, graduate student Michael Wick and McCallum describe just such an approach in which raw textual evidence is presented to the database, and IE inference is performed ‘inside the database’ using Markov-chain Monte Carlo.  “We have taken to calling this an Epistemological Database, indicating that the database doesn’t observe the truth; it must infer the truth from available evidence,” McCallum explains.

‘Truth-discovering’ inference continues to run in these database systems as new evidence arrives.  New evidence can correctly cause the database to change its mind about its previous conclusions.  Since the probabilistic database maintains IE’s intermediate results, the system is able to re-visit targeted portions of inference without re-running it from scratch.  This is an extremely practical feature since in IE it is common for new documents to arrive in a stream over years of operation.  Evidence for the database may include not only additional documents but also new structured records and additional partial databases to be integrated.
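The incremental behavior described above can be sketched with a toy Metropolis-Hastings sampler.  This is an illustration under simplifying assumptions, not the system from the VLDB paper: the `EvidenceStore` class, its method names and the evidence weights are all hypothetical, and the binary variables are treated as independent, whereas a real model would couple them through shared factors.  The point is only that the sampler's state persists, so when new evidence arrives inference resumes on the affected variables instead of starting over.

```python
import math
import random


class EvidenceStore:
    """Toy stand-in for a database that keeps variables, evidence and sampler state."""

    def __init__(self, seed=0):
        self.weight = {}  # variable -> summed log-odds evidence that it is true
        self.state = {}   # variable -> current sampled value (0 or 1)
        self.rng = random.Random(seed)

    def add_evidence(self, var, log_odds):
        """Record new evidence; existing sampler state is left untouched."""
        self.weight[var] = self.weight.get(var, 0.0) + log_odds
        self.state.setdefault(var, 0)

    def sample(self, variables, steps):
        """Run Metropolis-Hastings over `variables` only; return marginal estimates."""
        variables = list(variables)
        counts = {v: 0 for v in variables}
        for _ in range(steps):
            v = self.rng.choice(variables)
            # Model score is sum(weight[v] * state[v]); flipping v changes it by delta.
            delta = self.weight[v] * (1 - 2 * self.state[v])
            # Symmetric proposal, so accept with probability min(1, exp(delta)).
            if delta >= 0 or self.rng.random() < math.exp(delta):
                self.state[v] = 1 - self.state[v]
            for u in variables:
                counts[u] += self.state[u]
        return {v: counts[v] / steps for v in variables}


store = EvidenceStore()
store.add_evidence("a", 5.0)    # early documents support fact "a"
store.add_evidence("b", -5.0)   # and argue against fact "b"
before = store.sample(["a", "b"], 4000)

store.add_evidence("a", -10.0)  # a later document contradicts "a"
after = store.sample(["a"], 2000)  # revisit only the affected variable
print(before, after)
```

In this sketch the second call touches only variable "a"; the sampled value of "b" stays in the store from the earlier run.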

This design also leads directly to a compelling way of handling human edits.  Some traditional approaches use the edits to directly modify the database’s notion of ‘the truth.’  But this is deeply unsatisfying for several reasons: sometimes human edits will be wrong; sometimes humans disagree; and sometimes a correct human edit should be overwritten by IE after the passage of time has made the old edit no longer valid.  “A better approach is to model human edits as additional ‘mini-documents’ (for example ‘User X said Y is true on April 2’) to be treated as evidence and reasoned about.  In this framework we can perform probabilistic reasoning not only about the IE process, and which human edits to incorporate, but also simultaneously about reliability and reputation of the human editors themselves,” says McCallum.
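A small sketch may help show what treating edits as evidence could look like.  It assumes, purely for illustration, that editors act independently and that each has a single reliability number (the probability that an assertion they make is correct); the `Edit` record, the function names and the numbers are hypothetical and are not taken from McCallum's system.  The first function weighs conflicting edits to estimate whether a fact is true, and the second uses those beliefs to revise an editor's reliability.

```python
from dataclasses import dataclass
from math import exp, log


@dataclass(frozen=True)
class Edit:
    """A 'mini-document': editor X asserted that a fact is true (or false)."""
    editor: str
    fact: str
    asserts_true: bool


def posterior_true(fact, edits, reliability, prior=0.5):
    """Combine edits about `fact` as independent evidence, in log-odds space."""
    log_odds = log(prior / (1 - prior))
    for e in edits:
        if e.fact != fact:
            continue
        r = reliability[e.editor]
        step = log(r / (1 - r))  # a more reliable editor moves the belief further
        log_odds += step if e.asserts_true else -step
    return 1 / (1 + exp(-log_odds))


def updated_reliability(editor, edits, beliefs, prior_right=1.0, prior_wrong=1.0):
    """Re-estimate reliability from how well the editor's edits match current beliefs."""
    right, wrong = prior_right, prior_wrong  # pseudo-counts acting as a weak prior
    for e in edits:
        if e.editor != editor:
            continue
        p = beliefs[e.fact]
        agree = p if e.asserts_true else 1 - p
        right += agree
        wrong += 1 - agree
    return right / (right + wrong)


edits = [Edit("alice", "f1", True), Edit("bob", "f1", False)]
reliability = {"alice": 0.9, "bob": 0.6}
belief = posterior_true("f1", edits, reliability)
print(round(belief, 3))
print(round(updated_reliability("bob", edits, {"f1": belief}), 3))
```

Here the two editors disagree, and the belief leans toward the more reliable one rather than simply taking the latest edit.  Alternating the two functions would be one simple way to reason about facts and editors together; a full system would infer both jointly with the extraction evidence.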

McCallum is currently applying these ideas to the construction of knowledge bases of the scientific research literature, with the aim of improving scientific collaboration.  “As science becomes more interdisciplinary and complex, it becomes all the more necessary to have tools to manage the intellectual landscape of scientific ideas and collaborations,” says McCallum.  “By building systems that gather and integrate information about scientists and their work, we can provide tools that will help scientists find collaborators, understand the relations of their work to neighboring scientific fields, translate vocabulary among fields, summarize trends, understand emergence of ideas, know which papers to read, find new scientific sub-areas for fruitful investigation, and identify good candidates for hiring students, postdocs and faculty.  So many of these decisions are currently based on myopic local views and serendipity.  We hope to accelerate the rate of scientific progress by providing better tools.”

Preliminary results of this work can be found at  His work on probabilistic programming that supports this research can be found at

Professor McCallum received his Ph.D. from the University of Rochester in 1995.  He was a postdoc at Carnegie Mellon University, and later Vice President of Research and Development at a 170-person start-up company focusing on information extraction from the web.  Since 2002 he has directed the Information Extraction and Synthesis Laboratory at the University of Massachusetts Amherst.  In 2009 he was named an AAAI Fellow.  He has authored over 200 papers in multiple areas of artificial intelligence; his work has received over 22,000 citations.