Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
David Mimno and Andrew McCallum.
To appear in UAI, 2008 (selected for plenary presentation)
PDF
Text documents are usually accompanied by metadata, such as the authors,
the publication venue, the date, and any references. Work in topic modeling
that has taken such information into account, such as Author-Topic,
Citation-Topic, and Topic-over-Time models, has generally focused on
constructing specific models that are suited only for one particular type
of metadata. This paper presents a simple, unified model for learning
topics from documents given arbitrary non-textual features, which can be
discrete, categorical, or continuous.
Modeling Career Path Trajectories
David Mimno and Andrew McCallum.
University of Massachusetts, Amherst Technical Report #2007-69, 2007.
PDF
Descriptions of previous work experience in resumes are a valuable source of
information about the structure of the job market and the economy. There is,
however, a high degree of variability in these documents.
Job titles are a particular problem, as they are often either overly sparse
or overly general:
85% of job titles in our corpus occur only once, while the most common titles, such as "Consultant", are so broad as to be virtually meaningless.
We use a hierarchical hidden state model to discover
clusters of words that correspond to distinct skills, clusters of skills
that correspond to jobs, and transition patterns between jobs.
Community-based Link Prediction with Text
David Mimno, Hanna Wallach, and Andrew McCallum.
Statistical Network Modeling Workshop, NIPS, 2007, Whistler, BC.
Expertise Modeling for Matching Papers with Reviewers
David Mimno and Andrew McCallum.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2007, San Jose, CA.
PDF
Data
Science depends on peer review, but matching papers with reviewers is a
challenging and time-consuming task. We compare several automatic methods for
measuring the similarity between a submitted abstract and papers previously
written by reviewers. These include a novel topic model that automatically
divides an author's papers into topically coherent "personas".
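As an illustration of what "measuring the similarity between a submitted abstract and papers previously written by reviewers" can look like in its simplest form, here is a minimal sketch of a bag-of-words cosine-similarity baseline. It is not the persona topic model from the paper; the function names, the tokenizer, and the sample reviewer data are all hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphabetic tokens (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(abstract, reviewer_papers):
    """Score each reviewer by the similarity of the abstract to all of their papers pooled together."""
    query = term_counts(abstract)
    scores = {}
    for reviewer, papers in reviewer_papers.items():
        profile = term_counts(" ".join(papers))
        scores[reviewer] = cosine_similarity(query, profile)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: reviewer name -> texts of previously written papers.
reviewer_papers = {
    "reviewer_1": ["topic models of documents", "hierarchical topic models"],
    "reviewer_2": ["database query optimization"],
}
ranking = rank_reviewers("topic models for text", reviewer_papers)
```

Pooling all of a reviewer's papers into a single profile is the simplification that a persona-style model avoids: a reviewer who writes on two unrelated subjects gets a blurred profile under this baseline.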
Probabilistic Representations for Integrating Unreliable Data Sources
David Mimno, Andrew McCallum and Gerome Miklau.
IIWeb workshop at AAAI 2007, Vancouver, BC, Canada.
PDF
Mixtures of Hierarchical Topics with Pachinko Allocation.
David Mimno, Wei Li and Andrew McCallum.
International Conference on Machine Learning (ICML) 2007, Corvallis, OR.
PDF
The four-level pachinko allocation model
(PAM) (Li & McCallum, 2006) represents
correlations among topics using a DAG structure. It does not, however, represent a
nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more
specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model
can be seen as combining the advantages of
hLDA's topical hierarchy representation with
PAM's ability to mix multiple leaves of the
topic hierarchy. Experimental results show
improvements in likelihood of held-out documents, as well as mutual information between
automatically-discovered topics and human-generated categories such as journals.
Mining a digital library for influential authors.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
Most digital libraries let you search for documents, but we often want to
search for people as well. We extract and disambiguate author names from
online research papers, weight papers using PageRank on the citation graph,
and expand queries using a topic model. We evaluate the system by comparing
people returned for the query "information retrieval" to recipients of
major awards in IR.
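The PageRank weighting step can be sketched as a power iteration over a citation graph. This is a generic textbook version under assumed settings (damping factor 0.85, 50 iterations, uniform redistribution from papers that cite nothing), not the system described in the paper; the graph below is hypothetical toy data.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    citations maps each paper to the list of papers it cites,
    so rank flows from the citing paper to the cited paper.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper receives the same teleportation mass.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper with no outgoing citations spreads its rank uniformly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B cite C, and C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(citations)
```

In this toy graph the uncited papers A and B end up with the lowest weight, while C and D accumulate rank from the papers that cite them, directly or indirectly.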
Organizing the OCA: Learning faceted subjects from a library of digital books.
David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2007, Vancouver, BC, Canada.
PDF
The Open Content Alliance is one of several large-scale digitization projects
currently producing huge numbers of digital books. Statistical topic models
are a natural choice for organizing and describing such large text corpora,
but scalability becomes a problem when we are dealing with multi-billion
word corpora. This paper presents a new method for topic modeling, DCM-LDA.
In this model, we train an independent topic model for every book, using
pages as "documents". We then gather the topics discovered, cluster them,
and then fit a Dirichlet prior for each topic cluster. Finally, we retrain
the individual book topic models using these new shared topics.
Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries.
Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones, David Mimno, Adrian Packel, David Sculley, and Gabriel Weaver.
European Conference on Digital Libraries (ECDL) 2006, Alicante, Spain.
PDF
Several groups are currently embarking on large scale digitization projects,
but are they producing anything more than lots of raw text? This paper argues
that such an investment in digitization will be more valuable if accompanied
by a parallel investment in highly structured resources such as dictionaries.
Several examples, including some I worked on while at Perseus, illustrate
this effect.
Bibliometric Impact Measures Leveraging Topic Analysis.
Gideon Mann, David Mimno and Andrew McCallum.
Joint Conference on Digital Libraries (JCDL) 2006, Chapel Hill, NC.
PDF
Powerpoint
When evaluating the impact of research papers, it's important to compare
similar papers: a massively influential paper in Mathematics may be as
well cited as a middling paper in Molecular Biology. We present a system
that combines automatic citation analysis on spidered research papers
with a new automatic topic model that is aware of multi-word terms. This
system is capable of finding fine-grained sub-fields while scaling to the
exponential increase in open-access publishing. We evaluate papers from the
Rexa digital library using both
traditional bibliometric statistics (substituting topics for journals) and
several new metrics.
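One simple way to read "substituting topics for journals" is to compare each paper's citation count with the average for its topic. The sketch below does exactly that; it is an assumed illustration rather than one of the metrics defined in the paper, and the paper identifiers, topic labels, and citation counts are hypothetical.

```python
from collections import defaultdict


def topic_normalized_citations(papers):
    """papers is a list of (paper_id, topic, citation_count) tuples.

    Returns paper_id -> citation count divided by the mean count for that topic,
    so 1.0 means "cited as often as the average paper on the same topic".
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _, topic, cites in papers:
        totals[topic] += cites
        counts[topic] += 1
    mean = {topic: totals[topic] / counts[topic] for topic in totals}
    return {
        paper_id: (cites / mean[topic] if mean[topic] > 0 else 0.0)
        for paper_id, topic, cites in papers
    }


# Hypothetical data: the same raw count means different things in different topics.
papers = [
    ("math-1", "mathematics", 30),
    ("math-2", "mathematics", 10),
    ("bio-1", "molecular biology", 30),
    ("bio-2", "molecular biology", 90),
]
scores = topic_normalized_citations(papers)
```

Here "math-1" and "bio-1" both have 30 citations, but the first is above its topic average and the second is below it, which is the kind of comparison the abstract argues for.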
Hierarchical Catalog Records: Implementing a FRBR Catalog.
David Mimno, Alison Jones and Gregory Crane.
D-Lib Magazine, October 2005. HTML
Finding a Catalog: Generating Analytical Catalog Records from Well-structured Digital Texts.
David Mimno, Alison Jones and Gregory Crane.
Joint Conference on Digital Libraries (JCDL) 2005, Denver, CO.
PDF
Services for a Customizable Authority Linking Environment.
Mark Patton and David Mimno.
Demonstration at Joint Conference on Digital Libraries (JCDL) 2004, Tucson, AZ.
Towards a Cultural Heritage Digital Library.
Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney,
Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A.
Smith, and Christopher York.
Joint Conference on Digital Libraries (JCDL) 2003, Houston, TX.