What is Scatter/Gather?
Grouping Documents using Text Clustering

When a user enters a query into a search engine, the system often brings back many different pages. What should be done with all of these search results? How can we help the user understand what the system brought back and how it is related to the query? One solution we have explored is to show the user how the terms in the query are related to the words in the document. This solution is embodied in the TileBars interface.

Another strategy is to organize the documents into meaningful groups.

There are many different ways we can show how the documents in a set are related to one another. One way is to group together all documents written by the same author, or all documents written in the same year, or published by the same publisher. We can group according to subject matter as well. Libraries organize some of their information this way, using classification systems like the Dewey Decimal system. For example, in the library, books about European history are placed in one location, and books on computer science in another.

This kind of system is useful, but only goes so far. One problem is that if there are, say, 2000 books on European history, they need to be subdivided into more fine-grained categories in order to be manageable for browsing. However, the topic of European history can be subdivided in many ways: by nation, by economic changes, by political movements, by cultural characteristics, and so on. And of course, all of these topics are inter-related.

It is necessary in the library to place books under one subject heading most of the time, because books are physical objects and can appear in only one place at a time. The computer gives us the extra power of allowing us to group books and other documents in many different ways.

Another problem with assigning documents to single categories within a hierarchy (as seen in, for example, Yahoo) is that most documents discuss several different topics simultaneously. Although real-life objects can often be assigned one place in a taxonomy (a truck is a kind of vehicle), textual documents are not so simply classified. Text consists of abstract discussions of ideas and their inter-relationships. It is a rare document that is only about trucks; instead, a document might discuss recreational vehicles, or the manufacturing of recreational vehicles, or for that matter the trends in manufacturing of American recreational vehicles in Mexico before vs. after the NAFTA agreement. The tendency in building taxonomic hierarchies is to create ever-more-specific categories to handle cases like this. A better solution is to describe documents by a set of categories, as well as attributes (such as source, date, genre, and author), and provide good interfaces for manipulating these labels.
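As a rough illustration of describing a document by a set of categories plus attributes, rather than by one slot in a hierarchy, here is a minimal Python sketch. The `Document` class, the `select` helper, and all of the sample titles, categories, and attribute values are hypothetical and invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document labeled with several categories and free-form attributes."""
    title: str
    categories: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


def select(docs, categories=(), **attributes):
    """Return documents that carry every given category and attribute value."""
    wanted = set(categories)
    return [
        d for d in docs
        if wanted <= d.categories
        and all(d.attributes.get(k) == v for k, v in attributes.items())
    ]


# Hypothetical sample data, for illustration only.
docs = [
    Document(
        "RV manufacturing in Mexico after NAFTA",
        {"recreational vehicles", "manufacturing", "trade policy"},
        {"genre": "report"},
    ),
    Document("Choosing a pickup truck", {"trucks"}, {"genre": "guide"}),
]

# The same document is reachable through any of its categories.
by_topic = select(docs, ["manufacturing"])
by_policy = select(docs, ["trade policy"], genre="report")
```

Because a document holds a set of labels, it can be found under "manufacturing" and under "trade policy" without anyone having to decide which single heading it belongs to.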

As an alternative, the Scatter/Gather interface uses text clustering as a way to group documents according to the overall similarities in their content. Scatter/Gather is so named because it allows the user to scatter documents into clusters, or groups, then gather a subset of these groups and re-scatter them to form new groups.
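The scatter, gather, re-scatter cycle can be sketched in a few lines of Python. This page does not say which clustering algorithm is used, so the sketch below substitutes a simple k-means-style procedure over word counts with cosine similarity; the function names `scatter` and `gather`, the seeding rule (the first k documents), and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def scatter(docs, k, rounds=5):
    """Split docs into at most k clusters by content similarity."""
    vecs = [vectorize(d) for d in docs]
    centroids = [Counter(v) for v in vecs[:k]]  # assumed seeding: first k docs
    assign = []
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        assign = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vecs
        ]
        # Recompute each centroid as the summed counts of its members.
        for c in range(len(centroids)):
            members = [vecs[i] for i, a in enumerate(assign) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total
    clusters = [[] for _ in centroids]
    for doc, a in zip(docs, assign):
        clusters[a].append(doc)
    return [c for c in clusters if c]


def gather(clusters, chosen):
    """Merge the chosen clusters back into one list of documents."""
    return [doc for i in chosen for doc in clusters[i]]


docs = [
    "star planet orbit telescope",
    "film actor star movie",
    "planet orbit moon telescope",
    "movie actor film director",
]
clusters = scatter(docs, k=2)           # scatter the whole collection
selected = gather(clusters, [0])        # the user picks the first cluster
subclusters = scatter(selected, k=2)    # re-scatter only what was gathered
```

The point of the sketch is the loop, not the clustering method: any procedure that groups documents by content similarity could stand in for `scatter`.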

Each cluster in Scatter/Gather is represented by a list of topical terms, that is, a list of words that attempt to give the user the gist of what the documents in the cluster are about. The user can also look at the titles of the documents in each group. The documents in the cluster can have other representations as well, such as summaries or TileBars.
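One plausible way to pick topical terms for a cluster is to favor words that occur often inside the cluster but in few documents of the collection overall. The Python sketch below uses a smoothed TF-IDF-style score for this; the scoring formula, the `topical_terms` name, and the sample texts are assumptions, and the page does not state how the actual system chooses its terms.

```python
import math
from collections import Counter


def topical_terms(cluster, collection, n=3):
    """Return up to n words frequent in cluster but rare across collection."""
    # Document frequency: in how many collection documents each word appears.
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(doc.lower().split()))
    # Term frequency: total occurrences of each word within the cluster.
    counts = Counter()
    for doc in cluster:
        counts.update(doc.lower().split())
    total = len(collection)
    scores = {
        w: c * math.log((1 + total) / (1 + doc_freq[w]))
        for w, c in counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


collection = [
    "star planet orbit telescope",
    "film actor star movie",
    "planet orbit moon telescope",
    "movie actor film director",
]
cluster = [collection[0], collection[2]]
terms = topical_terms(cluster, collection)
```

Here "star" scores low for the astronomy cluster because it appears only once there and also turns up in a film document, which is the kind of ambiguity a topical-term list is meant to smooth over.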

If a cluster still has too many documents, the user can re-cluster the documents in the cluster; that is, re-group that subset of documents into still smaller groups. This re-grouping process tends to change the kinds of themes of the clusters, because the documents in a subcollection discuss a different set of topics than all the documents in the larger collection.
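A small worked example suggests why the themes shift on re-clustering. Under the assumption that terms are weighted by something like inverse document frequency (the page does not say what weighting is used), a word that separates one group from the rest of the full collection can become useless inside that group, where every document contains it. The `idf` helper and the sample texts below are hypothetical.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency of term within docs."""
    df = sum(term in doc.lower().split() for doc in docs)
    return math.log((1 + len(docs)) / (1 + df))


collection = [
    "planet orbit telescope",
    "planet moon crater",
    "film actor director",
    "film movie studio",
]
astronomy = collection[:2]  # the subcollection a user might gather

# "planet" helps tell astronomy from film in the full collection...
planet_full = idf("planet", collection)
# ...but appears in every astronomy document, so it carries no weight there.
planet_sub = idf("planet", astronomy)
# "moon" still distinguishes documents within the subcollection.
moon_sub = idf("moon", astronomy)
```

Within the gathered subset, the distinctions that matter are the ones among its own documents, so new clusters form around different words than before.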

The best way to understand how Scatter/Gather works is to look at a couple of examples.



Xerox PARC
2/13/97