This page provides a simplified, incomplete explanation of how text clustering works. It is intended for non-scientists. Here is more technical information.
What is text clustering? The main idea is to find which documents have many words in common, and place the documents with the most words in common into the same groups. We can illustrate how this works by using a small collection of imaginary documents.
Pretend we have a collection of documents, labeled A through N, that contain (among others) the words shown in the table:
| Doc | Planet | Galaxy | Nova | Film | Role | Hollywood | Diet | Habitat | Fur |
|---|---|---|---|---|---|---|---|---|---|
| A | 2 | 1 | 1 | ||||||
| B | 1 | 1 | 3 | ||||||
| C | 2 | 1 | |||||||
| D | 1 | 2 | 1 | ||||||
| E | 1 | 1 | 1 | ||||||
| F | 3 | 1 | |||||||
| G | 1 | 5 | 2 | ||||||
| H | 2 | 1 | 1 | ||||||
| I | 2 | 3 | 2 | ||||||
| J | 1 | 1 | 3 | ||||||
| K | 4 | 1 | |||||||
| L | 1 | 1 | 1 | ||||||
| M | 2 | 1 | 1 | 1 | 2 | ||||
| N | 2 | 1 | 1 | 1 | 1 |
The numbers indicate how often each document contains each word. You can see by eyeballing the table that the documents are probably about one of three main themes: astronomy, movie stars, and animals. (In fact, documents with all of these themes come up when you query on the word star in an encyclopedia. This is related to a real example).
However, the last two documents are a bit strange: Document M contains words having to do with astronomy and movie stars (because the article might discuss film used to take photos of a planet, or the role of volcanic activity in forming craters on the surface), and Document N contains words having to do both with movie stars and animals (perhaps an interview with a movie star who diets and is active in an anti-fur campaign).
A clustering program tries to find the groups in the data. The one used by Scatter/Gather decides in advance that some number of clusters, say three, is desired. The program first chooses three documents that seem representative of the middle of each of the three clusters, say in this case, documents A, F, and J. These will act as the candidate centers of the clusters. The program then goes through and compares all the documents to A, F, and J. Each documents is assigned to the cluster represented by whichever of A, F, and J it is most similar to. Similarity is based on how many words the documents have in common, and how strongly they are weighted. The topical terms of the clusters are chosen from words that represent the center of the cluster (although excluding words that are in the centers of any of the other clusters).
What this explanation didn't mention is how A, F and J were chosen. This can be done in many different ways. One way is to try all possible combinations of three documents, and see which ends up with the best overall clustering. (The best clustering is one in which the average difference of the documents to their cluster centers smallest.) Trying all combinations of three documents as the candidate centers can be very time-consuming, however, especially if there are many documents to cluster. Another possibility is to choose a small sample of documents, and find the centers just for those documents, and assume they represent the collection well.
Another strategy for choosing the representative documents for the cluster centers is to use a slower clustering algorithm only on a small sample of the documents. One of these is called average agglomerative clustering. It works by first comparing every pair of documents, and finding the pair of documents which are most similar to each other. This pair is placed into a new cluster, and now that cluster is represented by the average of the two documents (by the average weights of all the words in all the documents in the cluster). Then the process repeats. All the remaining documents are compared against each other and the clusters found so far. Again, the document or cluster that is found to be most similar to another document or cluster is merged with it. This process repeats until some target number of clusters is found.
Getting back to our example, as you might expect, the program has no problem placing documents A-D in one cluster, documents E-H in a second, and documents I-L in a third. The tricky question is: where do M and N go? They can easily well end up in either the second or third cluster. There are techniques we can use to deal with these cases, for example, allowing some documents to be placed in more than one cluster at the same time. But this examples shows why, in the words of Kaufmann and Rousseeuw, ``Cluster analysis is the art of finding groups in data.''
For more information, there are many, many books on statistics and clustering. Here is one suggestion.
Leonard Kaufmann and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., NY, 1990.
The first Scatter/Gather paper explains the algorithms used in detail.
A good survey of the use of clustering in information access is:
Peter Willett, Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.