Pattern Recognition Course Notes

Ó 1998, 2001 Paul F. Stetson

Dept. of Information Technology

Technical University of Denmark

Chapter 4

Classical Unsupervised PR: Clustering and Vector Quantization

In this section we begin dealing with various classification methods that use distance instead of correlation as a criterion for grouping.  Here we also take our first look at unsupervised methods, which do not require labeled training data.  Instead, these techniques look for natural groupings in the data. 

In PR applications, unsupervised techniques have typically been referred to as clustering.  Related techniques for data compression are referred to as vector quantization.

I.  Clustering

A cluster is a grouping of multidimensional data, and clustering refers to a large variety of techniques, mostly iterative, that have been developed to assign data to clusters based on the distribution in an unlabeled training set.

Clustering may be viewed as locating the peaks of a probability distribution.  It may also be approached from the standpoint of finding valleys that separate the peaks.  We are more interested in mountain-climbing here, though, for then we can base our clusters on a model or distance metric.  This lends itself well to implementation by the feedforward network architectures we deal with here. 

All techniques need some measure of the goodness of the clustering.  This may be a measure of the dispersion in the groups.  If we whiten the data, our total variance ST = SW + SB (defined in Bishop, Sec. 3.6.3) will be normalized to I. Then we can focus only on the within-class covariance SW: minimizing that will simultaneously maximize the separation between classes, SB.

Two scalar measures for the dispersion are the trace and the determinant of SW.  The trace is defined as the sum of the elements along the main diagonal, and it is equal to the sum of the eigenvalues.

Therefore, the trace of SW is the sum of the variances along the principal components of SW.  Similarly, since the determinant is the product of the eigenvalues, it is proportional to the square of the volume in that cluster.  Because both can be expressed in terms of the eigenvalues, they do not vary when the coordinates are rotated or shifted.


Many techniques are simply given a fixed number of clusters to use, but some adapt the number of clusters to the data distribution.  For example, the number of clusters created may be determined by setting a threshold for dispersion.

Fig. 1 - Dispersion threshold.  Data are indicated by dots, cluster centers by crosses,
and dispersion range by circles.

Merging methods take the approach of creating many small clusters, the nearest of which are then fused to create larger clusters. Splitting methods start with the data all in one cluster that is successively split finer and finer.

(a)

 

(b)

Fig. 2 - Merging (a) and splitting (b) clusters.

 

A data point is always assigned to the “nearest” cluster, but we need some distance criterion to make this decision.  The obvious one is the Euclidean metric, the squared norm of the difference between the data point and the cluster prototype.  A city-block metric might be more appropriate with discrete data or if computation time is more critical than accuracy.  If the data have disparate eigenvalues, a given distance along a low-variance axis represents a greater aberration than the same distance along a high-variance axis.  In that case, whiten the data, or equivalently, compensate for the different variances by using the Mahalanobis distance [Bishop, p. 35].

Nearest-means clustering

We often identify a cluster by a single vector at or near its center.  This vector, often referred to as a prototype or template, serves as an example of that category.  Nearest-means clustering uses the mean of each cluster as the prototype vector.  It tries to minimize dispersion by assigning new training data to the cluster with the nearest mean. 

A common example is k-means clustering [Bishop, Sec. 5.9.3], which comes in a variety of flavors.  One version randomly picks k members of the training set as the initial cluster centers.  For the remainder of the training set, each member is associated with the nearest prototype, then the prototypes are updated to reflect the new cluster assignments.  The training set is processed repeatedly until there is no change in the cluster membership or until the dispersion reaches some minimum.  This technique is not guaranteed to converge to a global minimum.  It is sensitive to the initial values and can miss some clusters while assigning several prototypes to others.

Note that the technique described by Bishop initializes the clusters using random subsets of the entire training set.  These vectors will tend to be closer to the mean of the data than randomly selected individual data points.  Since there is less variance in the initial cluster assignments, there is more likelihood of missing "stray" clusters.  For good convergence, it is generally important that the initial values give good coverage of the range of inputs.  However, there is no single superior method for initialization, and if possible, one might try different approaches to see which works best for that application.

In using nearest-means techniques with a Euclidean or Mahalanobis distance metric, we are implying a Gaussian model for the data.  We classify the data by its distance to the prototype, which is proportional to the log of a Gaussian centered at the cluster mean.  Therefore, this corresponds to fitting a summation of equal-covariance Gaussians to the probability density.  This is like the Gaussian parametric modeling of class-specific pdf's for Bayesian estimation, except that we do not weight the clusters with prior probabilities.

This gives us some insight into the formation of cluster boundaries.  If we are using a Euclidean distance metric, the boundary between two groups will be the perpendicular bisector of the difference vector.  The Mahalanobis metric uses covariance information to rotate the boundary.  Considering a more complex method in which each cluster has a different covariance matrix associated with it, we can get piecewise-quadratic cluster boundaries. 

II. Vector Quantization

In general, quantization is the process of converting a continuous-valued signal into a discrete one.  Typically it refers to the scalar case, but a set of related signals can be quantized as a vector to take advantage of the correlations and other redundancies in the data.  This is useful for both compression and representation of data. 

Vector quantization (VQ) is basically just nearest-means clustering applied to signal compression.  In this context, it is also referred to as pattern-matching quantization, because new data are matched to the prototype vectors in a search for the nearest pattern.  Since nearest-means clustering categorizes all points in the input space, we can draw a tessellation (division into tiles) of the input space using the perpendicular bisectors of the difference vectors.

Fig. 3 - Tessellation of input space.  Cluster Splitting is illustrated here.

The data is "abbreviated" by substituting the prototypes for the data.  Actually, the prototypes are labeled, and the label itself is used to represent the data for maximum compression.  (This is sort of like abbreviating "February" with "Feb.", which can then be represented by the numeral "2".)  The labels are stored or transmitted along with a "codebook" to translate those labels.  Of course the decoding can not recover the original data, only the prototype vectors used to approximate it.

In scalar quantization, we typically sample the data at regular intervals.  If we know something in advance about the amplitude distribution of the signal, though, we can adjust those levels to be closer in regions that occur more frequently.  This reduces the expected quantization error.

Fig. 4 - Non-uniform sampling levels.  If the pdf of the input level is known, we can get lower quantization error by sampling closer at levels of high probability.

In the vector case, we can imagine histogramming the data and using the indexes of the histogram cells to represent the data.  This would be an ideal vector quantization for uniformly distributed data.  If there are peaks in the distribution, we might want to have smaller cells at those peaks to reduce quantization error.

If there is correlation in the data, we can take advantage of this to minimize the distortion obtained at a given sampling resolution.  Note that we are sampling the probability distribution on a rectangular grid.  We would like to fit the data into that rectangular grid as neatly as possible, to avoid sampling the tails of the distribution. This can be done using a principal coordinate transform.  In the new coordinate system, we can set our quantization levels according to the variances along those axes, giving the optimum density of samples in the peak.


Fig. 5 - Eigenvector transformation makes sampling correlated data more economical.

 

If the data are non-Gaussian, we may find that decorrelation of the data does not help us decrease error much.  To sample the distribution more finely at its peak, we can use clustering techniques like k-means to locate points for sampling.  Minimizing the trace of SW is the same as minimizing the expected squared-error of quantization.  Consider this a kind of histogramming with non-rectangular cells that fit the distribution. 

If we want to get the best approximation for our data, we need to quantize it as finely as possible, which means generating as many prototypes as possible.  Rather than compare each data vector to all the vectors in a very large codebook, we can organize the clusters so that we can efficiently find the nearest cluster to an input vector.  This uses a splitting approach: we can perform 2-means clustering repeatedly to subdivide the data until the clusters reach a specified dispersion.  If we save the results from each split, we have a binary tree of clusters.  At each level in the tree, we compare the input to the two branch prototypes and choose the branch that matches the input best.  Then the comparison is repeated until we reach a "leaf" node in the tree.  There are a variety of techniques for tree-structured vector quantization; in PR, this is called hierarchical clustering.

On the other hand, we might also try a parallel implementation of the codebook.  Then all the codebook vectors would be simultaneously available, and there would be no need to search a tree.  This of course is the neural approach, which we will look at later.

In closing, I should mention that there is a fine distinction between the objectives of clustering and VQ.  Clustering might use a single prototype to represent a high-probability region, whereas VQ would want several.  Furthermore, in clustering the emphasis has often been on how to group things in a set of data, without any thought about how that will perform on new data. In VQ, the generalization ability of the codebook has always been most important. 

References

K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 1990.

J. Makhoul, S. Roucos, and H. Gish, "Vector Quantization in Speech Coding", Proceedings of the IEEE, v. 73, no. 11, pp.1551-1588, November 1985.

R. Gray, "Vector Quantization", IEEE ASSP Magazine, v. 1, pp. 4-29, Apr. 1984.

Exercises

1. Cluster Dispersion Criteria

a) If total covariance matrix is normalized to I through whitening, use  and  to show that maximizing the trace of the between-class covariance matrix  is the same as minimizing the trace of the within-class covariance matrix .

 

b) For , show that maximizing is the same as minimizing .

to Chap. 5 - Distance-Based Pattern Recognition