Pattern Recognition
Course Notes
Ó 1998, 2001 Paul F. Stetson
Dept. of Information Technology
Technical University of Denmark
Chapter 4
Classical
Unsupervised PR: Clustering and Vector Quantization
In this section we begin dealing with various
classification methods that use distance instead of correlation as a criterion for
grouping. Here we also take our first
look at unsupervised methods, which do not require labeled training
data. Instead, these techniques look
for natural groupings in the data.
In PR applications, unsupervised techniques have
typically been referred to as clustering.
Related techniques for data compression are referred to as vector
quantization.
I. Clustering
A cluster
is a grouping of multidimensional data, and clustering
refers to a large variety of techniques, mostly iterative, that have been developed
to assign data to clusters based on the distribution in an unlabeled training
set.
Clustering may be viewed as locating the peaks of a
probability distribution. It may also
be approached from the standpoint of finding valleys that separate the peaks. We are more interested in mountain-climbing
here, though, for then we can base our clusters on a model or distance
metric. This lends itself well to
implementation by the feedforward network architectures we deal with here.
All techniques need some measure of the goodness of
the clustering. This may be a measure
of the dispersion in the groups. If we
whiten the data, our total variance ST
= SW + SB (defined in Bishop, Sec.
3.6.3) will be normalized to I. Then
we can focus only on the within-class covariance SW: minimizing that will simultaneously maximize the
separation between classes, SB.
Two scalar measures for the dispersion are the trace
and the determinant of SW. The trace is defined as the sum of the elements
along the main diagonal, and it is equal to the sum of the eigenvalues.
![]()
Therefore, the trace of SW is the sum of the variances along the principal
components of SW. Similarly, since the determinant is the
product of the eigenvalues, it is proportional to the square of the volume in
that cluster. Because both can be
expressed in terms of the eigenvalues, they do not vary when the coordinates
are rotated or shifted.
Many techniques are simply given a fixed number of
clusters to use, but some adapt the number of clusters to the data
distribution. For example, the number
of clusters created may be determined by setting a threshold for dispersion.

Fig. 1 - Dispersion
threshold. Data are indicated by dots, cluster
centers by crosses,
and dispersion range by circles.
Merging methods take the approach of creating many
small clusters, the nearest of which are then fused to create larger clusters. Splitting
methods start with the data all in one cluster that is successively split finer
and finer.

(a)

(b)
Fig. 2 - Merging (a) and
splitting (b) clusters.
A data point is always assigned to the “nearest”
cluster, but we need some distance criterion to make this decision. The obvious one is the Euclidean metric, the
squared norm of the difference between the data point and the cluster
prototype. A city-block metric might be
more appropriate with discrete data or if computation time is more critical
than accuracy. If the data have
disparate eigenvalues, a given distance along a low-variance axis represents a
greater aberration than the same distance along a high-variance axis. In that case, whiten the data, or equivalently,
compensate for the different variances by using the Mahalanobis distance
[Bishop, p. 35].
Nearest-means
clustering
We often identify a cluster by a single vector at or
near its center. This vector, often
referred to as a prototype or template, serves as an example of that
category. Nearest-means clustering uses
the mean of each cluster as the prototype vector. It tries to minimize dispersion by assigning new training data to
the cluster with the nearest mean.
A common example is k-means clustering [Bishop, Sec. 5.9.3], which comes in a variety
of flavors. One version randomly picks k members of the training set as the
initial cluster centers. For the
remainder of the training set, each member is associated with the nearest
prototype, then the prototypes are updated to reflect the new cluster
assignments. The training set is
processed repeatedly until there is no change in the cluster membership or
until the dispersion reaches some minimum.
This technique is not guaranteed to converge to a global minimum. It is sensitive to the initial values and
can miss some clusters while assigning several prototypes to others.
Note that the technique described by Bishop
initializes the clusters using random subsets of the entire training set. These vectors will tend to be closer to the
mean of the data than randomly selected individual data points. Since there is less variance in the initial
cluster assignments, there is more likelihood of missing "stray"
clusters. For good convergence, it is
generally important that the initial values give good coverage of the range of
inputs. However, there is no single
superior method for initialization, and if possible, one might try different
approaches to see which works best for that application.
In using nearest-means techniques with a Euclidean
or Mahalanobis distance metric, we are implying a Gaussian model for the
data. We classify the data by its
distance to the prototype, which is proportional to the log of a Gaussian
centered at the cluster mean. Therefore,
this corresponds to fitting a summation of equal-covariance Gaussians to the
probability density. This is like the
Gaussian parametric modeling of class-specific pdf's for Bayesian estimation,
except that we do not weight the clusters with prior probabilities.
This gives us some insight into the formation of
cluster boundaries. If we are using a
Euclidean distance metric, the boundary between two groups will be the
perpendicular bisector of the difference vector. The Mahalanobis metric uses covariance information to rotate the
boundary. Considering a more complex
method in which each cluster has a different covariance matrix associated with
it, we can get piecewise-quadratic cluster boundaries.
II. Vector
Quantization
In general, quantization is the process of
converting a continuous-valued signal into a discrete one. Typically it refers to the scalar case, but
a set of related signals can be quantized as a vector to take advantage of the
correlations and other redundancies in the data. This is useful for both compression and representation of
data.
Vector quantization (VQ) is basically just
nearest-means clustering applied to signal compression. In this context, it is also referred to as
pattern-matching quantization, because new data are matched to the prototype
vectors in a search for the nearest pattern.
Since nearest-means clustering categorizes all points in the input
space, we can draw a tessellation (division into tiles) of the input space
using the perpendicular bisectors of the difference vectors.

Fig. 3 - Tessellation of
input space. Cluster Splitting is
illustrated here.
The data is "abbreviated" by substituting
the prototypes for the data. Actually,
the prototypes are labeled, and the label itself is used to represent the data
for maximum compression. (This is sort
of like abbreviating "February" with "Feb.", which can then
be represented by the numeral "2".)
The labels are stored or transmitted along with a "codebook"
to translate those labels. Of course
the decoding can not recover the original data, only the prototype vectors used
to approximate it.
In scalar quantization, we typically sample the data
at regular intervals. If we know
something in advance about the amplitude distribution of the signal, though, we
can adjust those levels to be closer in regions that occur more
frequently. This reduces the expected
quantization error.

Fig. 4 - Non-uniform
sampling levels. If the pdf of the
input level is known, we can get lower quantization error by sampling closer at
levels of high probability.
In the vector case, we can imagine histogramming the
data and using the indexes of the histogram cells to represent the data. This would be an ideal vector quantization
for uniformly distributed data. If
there are peaks in the distribution, we might want to have smaller cells at
those peaks to reduce quantization error.
If there is correlation in the data, we can take
advantage of this to minimize the distortion obtained at a given sampling
resolution. Note that we are sampling
the probability distribution on a rectangular grid. We would like to fit the data into that rectangular grid as
neatly as possible, to avoid sampling the tails of the distribution. This can
be done using a principal coordinate transform. In the new coordinate system, we can set our quantization levels
according to the variances along those axes, giving the optimum density of
samples in the peak.

Fig. 5 - Eigenvector transformation makes sampling correlated data more
economical.
If the data are non-Gaussian, we may find that
decorrelation of the data does not help us decrease error much. To sample the distribution more finely at
its peak, we can use clustering techniques like k-means to locate points for sampling. Minimizing the trace of SW
is the same as minimizing the expected squared-error of quantization. Consider this a kind of histogramming with
non-rectangular cells that fit the distribution.
If we want to get the best approximation for our data, we need to quantize it as finely as possible, which means generating as many prototypes as possible. Rather than compare each data vector to all the vectors in a very large codebook, we can organize the clusters so that we can efficiently find the nearest cluster to an input vector. This uses a splitting approach: we can perform 2-means clustering repeatedly to subdivide the data until the clusters reach a specified dispersion. If we save the results from each split, we have a binary tree of clusters. At each level in the tree, we compare the input to the two branch prototypes and choose the branch that matches the input best. Then the comparison is repeated until we reach a "leaf" node in the tree. There are a variety of techniques for tree-structured vector quantization; in PR, this is called hierarchical clustering.
On the other hand, we might also try a parallel
implementation of the codebook. Then
all the codebook vectors would be simultaneously available, and there would be
no need to search a tree. This of
course is the neural approach, which we will look at later.
In closing, I should mention that there is a fine
distinction between the objectives of clustering and VQ. Clustering might use a single prototype to
represent a high-probability region, whereas VQ would want several. Furthermore, in clustering the emphasis has
often been on how to group things in a set of data, without any thought about
how that will perform on new data. In VQ, the generalization ability of the
codebook has always been most important.
References
K. Fukunaga, Introduction
to Statistical Pattern Recognition, Academic Press, San Diego, 1990.
J. Makhoul, S. Roucos, and H. Gish, "Vector
Quantization in Speech Coding", Proceedings
of the IEEE, v. 73, no. 11, pp.1551-1588, November 1985.
R. Gray, "Vector Quantization", IEEE ASSP Magazine, v. 1, pp. 4-29, Apr.
1984.
Exercises
1. Cluster Dispersion Criteria
a) If total covariance matrix
is
normalized to I through whitening, use
and
to show that maximizing the trace of
the between-class covariance matrix
is the same as minimizing the trace of
the within-class covariance matrix
.
b)
For
, show
that maximizing
is the
same as minimizing
.