How do you
win the Netflix Prize or a KDD Cup, or succeed in any other supervised learning task? The
state-of-the-art approach is to build an ensemble of most powerful
classification methods, train each of them on your data, and then construct an
aggregated classifier by carefully combining and weighting the ensemble
participants.
One would
imagine that an analogous scheme would work in clustering: build an ensemble of powerful clustering methods, apply
each of them to your data, and then somehow average the results. Well, such a
scheme was invented long ago, under the name of Consensus
Clustering, or Cluster
Ensembles. The only problem is that the consensus clustering does not
achieve very impressive results after all. At least, it is not as unbeatable as
classifier ensembles in supervised learning. Why is that so?
The problem
is that different clustering methods, while being completely unsupervised,
construct very different models (i.e.
clusterings) of the data. Let us say that our data is a collection of geometric
shapes (e.g., circles, triangles, rectangles etc.), each of a different color.
One clustering algorithm may cluster our data instances according to their
shape, another may cluster them according to their color. Averaging those
results would be hopeless.
In practice,
it is hard to find a clustering method that would cluster data according to any
specific criterion (i.e. shape or color), unless it is explicitly incorporated
into the optimization objective. A typical clustering method would construct a
model that provides some signal both
about the instances' shapes and about their colors. Averaging such models might
not be hopeless, but still not extremely advantageous, as the original models
are so dramatically different from each other.
We propose to
apply a special type of optimization process (called Deliberation) to the original models, which would bring them "closer"
to each other, without compromising their quality. Once the models are
substantially close to each other, averaging them makes much more sense. Consider
the space of all possible clusterings of a dataset (for simplicity, we deal
only with hard clusterings, i.e. data
partitionings). Two clusterings are neighbors in this (discrete) space if one
clustering can be transformed into the other by moving one data instance from a cluster to another cluster (i.e. the
clusterings are identical with the exception of one data instance). By a series
of such moves, any clustering can be transformed
into any clustering, such that the
task of bringing clusterings closer to each other can be trivially solved.
However, we do not want to sacrifice the quality of the original clusterings.
Since they were built by trustable clustering algorithms, we can assume that
they are pure enough - we are interested in preserving (and further improving)
this purity.
Our idea is
to bring the clusterings closer to each other by applying to them the same
optimization procedure as used in one of the trustable clustering algorithms,
while conditioning all clusterings on
each other. We can illustrate this process on a specific type of clustering
algorithms: the co-clustering
algorithms. The goal of co-clustering is to build a clustering of data
instances X while simultaneously building a clustering of
their features Y. A typical
co-clustering algorithm alternates between optimizing X and Y. First, it fixes
a clustering of Y and optimizes
clusters of X while conditioning them
on the (fixed) clustering of Y. Next,
it fixes the clustering of X, and
optimizes clusters of Y while conditioning
them on the fixed clustering of X -
and so forth in a round-robin fashion.
Our
deliberation process starts with m
clusterings of X and m clusterings of Y. First, it fixes all the Y
clusterings and applies a clustering optimization procedure to each X
clustering while conditioning it on all
the (fixed) Y clusterings. Next, it
fixes the (new) X clusterings and applies
a clustering optimization procedure to each Y
clustering while conditioning it on all the fixed X clusterings - and so on. This way, all the X clusterings get closer to each other, and so do all the Y clusterings - we say that the clustering
stability is improved. Also, since the
deliberation process employs a practically unchanged version of the clustering
optimization procedure, we can expect that the clustering quality would not
decrease (and even improve). Currently, the deliberation process works only on
co-clustering methods, however, it can be generalized to any other types of
clustering methods.
The rationale
behind the deliberation process comes from observing well established decision-making
processes in the human society. For example, in jury courts, each juror first attends
the hearings and mentally builds his or her model of the case (which is
analogous to a clustering algorithm building its local model of the data).
Next, the jurors retire for deliberation, where they bring their local models
closer to each other by discussing their views and opinions. Only after the
deliberation is over, the jurors vote (which corresponds to averaging the
clustering results in a consensus clustering scheme). This clear analogy reveals
the lack of "deliberation" in the (standard) consensus clustering. Our approach
fills this gap.
Consensus
clustering with the deliberation phase works really well in practice!!! We
applied this method to a variety of datasets, one of which is the popular benchmark
20 Newsgroups (20NG)
dataset. When a clustering algorithm of decent quality is applied to the
(entire) 20NG dataset, the resulting clustering accuracy is usually less than
0.5. A very good clustering algorithm (such as CLUTO) can achieve around
0.6 clustering accuracy on 20NG. In 2004 I obtained
0.7 clustering accuracy, and since then, for almost 5 years, I have not been
able to improve this result even by 0.01. Consensus clustering with
deliberation achieves 0.75 - 0.78 clustering accuracy (depending on specific
parameter settings). This is an impressive result for a completely unsupervised
method on a relatively large dataset of 19,997 instances.
|
|
R.
Bekkerman, M.
Scholz, and K. Viswanathan.
Improving
Clustering Stability with Combinatorial MRFs. In Proceedings of KDD 2009 bibtex |