Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. David Mimno and Andrew McCallum. To appear in UAI, 2008 (selected for plenary presentation). PDF

Text documents are usually accompanied by metadata, such as the authors, the publication venue, the date, and any references. Work in topic modeling that has taken such information into account, such as Author-Topic, Citation-Topic, and Topic-over-Time models, has generally focused on constructing specific models that are suited only for one particular type of metadata. This paper presents a simple, unified model for learning topics from documents given arbitrary non-textual features, which can be discrete, categorical, or continuous.
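To illustrate how arbitrary features might condition a topic model, here is a minimal sketch. It assumes a log-linear link between a document's feature vector and the parameters of its Dirichlet prior over topics; the abstract does not spell out the link, and the names `document_topic_prior`, `features`, and `weights` are hypothetical.

```python
import math

def document_topic_prior(features, weights):
    """Return one Dirichlet parameter per topic for a single document.

    features: list of floats describing the document's metadata
              (e.g. indicator values for authors, a scaled year).
    weights:  one list of floats per topic, each the same length as features.

    Assumption: each parameter is exp(dot(features, topic_weights)), so the
    parameters are always positive and each feature scales them multiplicatively.
    """
    return [
        math.exp(sum(w * x for w, x in zip(topic_weights, features)))
        for topic_weights in weights
    ]
```

Under this assumption, a feature with a large positive weight for a topic makes that topic more probable a priori in every document that carries the feature, whatever kind of metadata the feature encodes.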

Modeling Career Path Trajectories. David Mimno and Andrew McCallum. University of Massachusetts, Amherst Technical Report #2007-69, 2007. PDF

Descriptions of previous work experience in resumes are a valuable source of information about the structure of the job market and the economy. There is, however, a high degree of variability in these documents. Job titles are a particular problem, as they are often either overly sparse or overly general: 85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless. We use a hierarchical hidden state model to discover clusters of words that correspond to distinct skills, clusters of skills that correspond to jobs, and transition patterns between jobs.

Community-based Link Prediction with Text. David Mimno, Hanna Wallach, and Andrew McCallum. Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.

Expertise Modeling for Matching Papers with Reviewers. David Mimno and Andrew McCallum. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA. PDF Data

Science depends on peer review, but matching papers with reviewers is a challenging and time-consuming task. We compare several automatic methods for measuring the similarity between a submitted abstract and papers previously written by reviewers. These include a novel topic model that automatically divides an author's papers into topically coherent "personas".
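As a point of reference for what "measuring the similarity" can mean, here is a minimal sketch of a bag-of-words cosine baseline. It is not the persona topic model from the paper; the function names and the whitespace tokenizer are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercased whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_reviewers(abstract, reviewer_papers):
    """Return (reviewer, score) pairs, best match first.

    reviewer_papers maps a reviewer name to a list of that reviewer's paper
    texts. All of a reviewer's papers are pooled into a single vector.
    """
    query = bag_of_words(abstract)
    scores = []
    for reviewer, papers in reviewer_papers.items():
        profile = Counter()
        for paper in papers:
            profile.update(bag_of_words(paper))
        scores.append((reviewer, cosine(query, profile)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Pooling every paper into one vector blurs reviewers who work in several areas, which is the weakness that splitting an author's papers into separate personas is meant to address.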

Probabilistic Representations for Integrating Unreliable Data Sources. David Mimno, Andrew McCallum and Gerome Miklau. IIWeb workshop at AAAI 2007, Vancouver, BC, Canada. PDF

Mixtures of Hierarchical Topics with Pachinko Allocation. David Mimno, Wei Li and Andrew McCallum. International Conference on Machine Learning (ICML) 2007, Corvallis, OR. PDF

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.

Mining a digital library for influential authors. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

Most digital libraries let you search for documents, but we often want to search for people as well. We extract and disambiguate author names from online research papers, weight papers using PageRank on the citation graph, and expand queries using a topic model. We evaluate the system by comparing people returned for the query "information retrieval" to recipients of major awards in IR.
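The paper-weighting step can be sketched as standard PageRank by power iteration over the citation graph. This is a minimal version under assumed settings: the damping factor of 0.85, the fixed iteration count, and the uniform handling of papers that cite nothing are illustrative choices, not details taken from the paper.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Compute PageRank scores for papers in a citation graph.

    citations maps each paper to the list of papers it cites. Papers that
    appear only as citation targets are added to the graph automatically.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        # Every paper receives the same baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                # Split this paper's rank evenly among the papers it cites.
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper with no outgoing citations spreads its rank uniformly.
                share = damping * rank[paper] / n
                for target in papers:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"a": ["c"], "b": ["c"], "c": []})` gives paper `c` the highest score, since both other papers cite it. The scores always sum to 1, so they can be used directly as paper weights.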

Organizing the OCA: Learning faceted subjects from a library of digital books. David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada. PDF

The Open Content Alliance is one of several large-scale digitization projects currently producing huge numbers of digital books. Statistical topic models are a natural choice for organizing and describing such large text corpora, but scalability becomes a problem when we are dealing with multi-billion word corpora. This paper presents a new method for topic modeling, DCM-LDA. In this model, we train an independent topic model for every book, using pages as "documents". We then gather the topics discovered, cluster them, and then fit a Dirichlet prior for each topic cluster. Finally, we retrain the individual book topic models using these new shared topics.

Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries. Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, David Sculley, and Gabriel Weaver. European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain. PDF

Several groups are currently embarking on large scale digitization projects, but are they producing anything more than lots of raw text? This paper argues that such an investment in digitization will be more valuable if accompanied by a parallel investment in highly structured resources such as dictionaries. Several examples, including some I worked on while at Perseus, illustrate this effect.

Bibliometric Impact Measures Leveraging Topic Analysis. Gideon Mann, David Mimno and Andrew McCallum. Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC. PDF Powerpoint

When evaluating the impact of research papers, it's important to compare similar papers: a massively influential paper in Mathematics may be as well cited as a middling paper in Molecular Biology. We present a system that combines automatic citation analysis on spidered research papers with a new automatic topic model that is aware of multi-word terms. This system is capable of finding fine-grained sub-fields while scaling to the exponential increase in open-access publishing. We evaluate papers from the Rexa digital library using both traditional bibliometric statistics (substituting topics for journals) and several new metrics.
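The idea of comparing a paper only against similar papers can be illustrated with a topic-normalized citation count. This is a minimal sketch of one such statistic, assuming each paper has a single topic label; it is not one of the metrics defined in the paper, and the function names are hypothetical.

```python
from collections import defaultdict

def topic_means(papers):
    """Mean citation count per topic.

    papers is a list of (paper_id, topic, citation_count) tuples.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _, topic, cites in papers:
        totals[topic] += cites
        counts[topic] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}

def relative_impact(papers):
    """Each paper's citation count divided by the mean for its topic."""
    means = topic_means(papers)
    result = {}
    for paper_id, topic, cites in papers:
        mean = means[topic]
        # A topic whose papers are all uncited has no meaningful ratio.
        result[paper_id] = cites / mean if mean > 0 else 0.0
    return result
```

With two mathematics papers cited 10 and 2 times and two biology papers cited 10 and 30 times, the two papers with 10 citations score about 1.67 and 0.5 respectively, so the same raw count reads as above average in one topic and below average in the other.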

Hierarchical Catalog Records: Implementing a FRBR Catalog. David Mimno, Alison Jones and Gregory Crane. D-Lib Magazine, October 2005. HTML

Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts. David Mimno, Alison Jones and Gregory Crane. Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO. PDF.

Services for a Customizable Authority Linking Environment. Mark Patton and David Mimno. Demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.

Towards a Cultural Heritage Digital Library. Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney, Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A. Smith, and Christopher York. Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.