Andrew McCallum


Research and Projects

The goal of my current research is to dramatically increase our ability to mine actionable knowledge from unstructured text. I am especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature & community.

Toward this end my group and I develop and employ various methods in statistical machine learning, natural language processing and information retrieval. We tend toward probabilistic approaches, graphical models, and Bayesian methods. Recently we have been working with various conditionally-trained undirected graphical models (conditional random fields)---for finite-state sequence segmentation and tagging, coreference analysis, relational classification, and other problems.

Unified Information Extraction and Data Mining

Although information extraction and data mining appear together in many applications, their interface in most current deployments would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.

We have been researching relational probabilistic models that unify extraction and mining, so that by sharing common inference procedures, they can each overcome the weaknesses of the other. For example, data mining run on a partially-filled database can find patterns that provide "top-down" accuracy-improving constraints to information extraction. Information extraction can provide a much richer set of "bottom-up" hypotheses to data mining if the mining is able to handle additional uncertainty information from extraction.

Intelligent Understanding of our Email World

As part of the CALO project, we are extracting information about people and other entities appearing in email streams.

Conditional Probability Models for Sequences and other Relational Data

After having some success using hidden Markov models for information extraction, we found ourselves frustrated by their lack of ability to incorporate many arbitrary, overlapping features of the input sequence, such as capitalization, lexicon memberships, spelling features, and conjunctions of such features in a large window of past and future observations. The same difficulties with non-independent features exist in many generatively-trained models historically used in NLP. We have begun work with conditionally-trained probability models that address these problems. Maximum entropy Markov models are locally-normalized conditional sequence models. Finite-state Conditional Random Fields (CRFs) are globally-normalized models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous to conditionally-trained Dynamic Bayesian Networks (DBNs).
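The difference between local and global normalization can be made concrete with a toy example. The following is a minimal sketch, not code from our systems: the label set, the `score` function, and its feature weights are all hypothetical, chosen only to show overlapping features (capitalization, a tiny lexicon, a label bigram) feeding both kinds of model. The MEMM-style function normalizes once per position; the CRF-style function normalizes once per sequence, using the forward recursion to compute the partition function.

```python
import math
from itertools import product

LABELS = ["O", "NAME"]  # hypothetical two-label tagging task


def score(prev_label, label, token):
    """Hypothetical unnormalized log-score built from overlapping features."""
    s = 0.0
    if token[0].isupper() and label == "NAME":
        s += 2.0  # capitalization feature
    if token.lower() in {"the", "of"} and label == "O":
        s += 1.5  # lexicon-membership feature
    if prev_label == "NAME" and label == "NAME":
        s += 1.0  # label-bigram feature
    return s


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def memm_log_prob(tokens, labels, start="O"):
    """Locally normalized: a separate softmax over labels at every position."""
    total, prev = 0.0, start
    for tok, lab in zip(tokens, labels):
        log_z = logsumexp([score(prev, y, tok) for y in LABELS])
        total += score(prev, lab, tok) - log_z
        prev = lab
    return total


def crf_log_prob(tokens, labels, start="O"):
    """Globally normalized: one partition function over all label sequences."""
    # Unnormalized score of the requested label path.
    path, prev = 0.0, start
    for tok, lab in zip(tokens, labels):
        path += score(prev, lab, tok)
        prev = lab
    # Forward recursion: alpha[y] is the log-sum of scores of all prefixes ending in y.
    alpha = {y: score(start, y, tokens[0]) for y in LABELS}
    for tok in tokens[1:]:
        alpha = {
            y: logsumexp([alpha[p] + score(p, y, tok) for p in LABELS])
            for y in LABELS
        }
    return path - logsumexp(list(alpha.values()))


def total_probability(log_prob_fn, tokens):
    """Sum the model's probability over every possible label sequence."""
    return sum(
        math.exp(log_prob_fn(tokens, seq))
        for seq in product(LABELS, repeat=len(tokens))
    )
```

Both functions define proper distributions over label sequences, so `total_probability` should come out at 1 for either one. With the same scores they generally assign different probabilities to the same labeling, because the per-position normalization in the MEMM-style model cannot trade off evidence across positions the way the single global normalizer can.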

WhizBang Labs

From 2000 through 2002 I was Vice President of Research and Development at WhizBang Labs, a start-up company focusing on information extraction from the Web. We developed sophisticated machine learning extraction systems for numerous application domains---among them FlipDog.com, a database of job openings extracted directly from company Web sites (now owned by Monster.com), corporate information for Dun & Bradstreet and Lexis Nexis, and course syllabi for the U.S. Department of Labor.

Cora Research Paper Search Engine

I was the leader of the project at JustResearch that created Cora, a domain-specific search engine over computer science research papers. It currently contains over 50,000 PostScript papers. You can read more about our research on Cora in our IRJ journal paper or a paper presented at the AAAI'99 Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan Chang and Jason Reed.

WebKB

In 1996 and 1997 I was part of Tom Mitchell's WebKB project and the CMU Text Learning group.

Reinforcement Learning

In what now seems like a lifetime ago, I was interested in reinforcement learning---especially with hidden state and factored representations. My thesis uses memory-based learning and a robust statistical test on reward in order to learn a structured policy representation that makes perceptual and memory distinctions only where needed for the task at hand. It can also be understood as a method of Value Function Approximation. The model learned is an order-n partially observable Markov decision process. It handles noisy observation, action and reward.

It is related to Ron, Singer and Tishby's Probabilistic Suffix Trees, Leslie Kaelbling's G-algorithm and Andrew Moore's Parti-game. It is distinguished from similar-era work by Michael Littman, Craig Boutilier and others in that it learns both a model and a policy, and is quite practical with infinite-horizon tasks and large state and observation spaces. Follow-on or comparison work has been done by Anders Jonsson, Andy Barto, Will Uther, Natalia Hernandez, Leslie Kaelbling, and Sridhar Mahadevan.

The algorithm, called U-Tree, was demonstrated solving a highway driving task using simulated eye-movements and deictic representations. The simulated environment has about 21,000 states, 2,500 observations, noise and much hidden state. After about 2 1/2 hours of simulated experience, U-Tree learns a task-specific model of the environment that has only 143 states. Its learned behavior included lane changes to avoid slow vehicles in front, and checking the rear-view mirror to avoid faster vehicles from behind.
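The idea of using a statistical test on reward to decide where a distinction is needed can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the choice of a two-sample Kolmogorov-Smirnov statistic, the function names, and the fixed `threshold` are all assumptions made for this example. The sketch compares the return samples that a candidate distinction would separate, and keeps the distinction only if the two empirical distributions differ enough.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def should_split(returns_a, returns_b, threshold=0.5):
    """Keep a candidate distinction only if the return distributions differ.

    `threshold` is a hypothetical fixed cutoff; a real test would derive the
    cutoff from the sample sizes and a chosen significance level.
    """
    return ks_statistic(returns_a, returns_b) > threshold
```

For example, `should_split([0, 0, 0], [1, 1, 1])` keeps the distinction because the two sets of returns never overlap, while two identical samples give a statistic of zero and the distinction is dropped.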