Topic 87: Criminal Actions Against Officers of Failed Financial Institutions
We formulated a query containing the terms
and instructed the system to retrieve the 500 top-ranked documents according to a SMART-like weighting. Out of these 500 retrieved documents, only 21 had been judged relevant to the query by the TREC judges (some may not have been judged at all, but for the purposes of this example, those with no judgement are simply considered to be not relevant). These documents were not ranked especially highly by the similarity search measure: none of the documents judged relevant appeared in the top 10, only one appeared in the top 20, and only four appeared in the top 40.
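The ranking figures above can be restated as precision at a few rank cutoffs. The following is a minimal sketch that uses only the counts quoted in the paragraph; the names `relevant_at_cutoff` and `precision_at` are illustrative and not part of the Scatter/Gather system.

```python
# Counts taken from the text: number of relevant documents found
# within each rank cutoff of the similarity-ranked list.
relevant_at_cutoff = {10: 0, 20: 1, 40: 4, 500: 21}


def precision_at(cutoff: int, relevant_found: int) -> float:
    """Fraction of the top `cutoff` documents that were judged relevant."""
    return relevant_found / cutoff


precisions = {k: precision_at(k, r) for k, r in relevant_at_cutoff.items()}

for k, p in precisions.items():
    print(f"precision@{k}: {p:.3f}")
```

Across the whole retrieved set, only about 4% of the documents are relevant, which is the baseline any useful cluster should improve on.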
A tool that can guide the user towards the relevant subgroups would indeed be useful. Here we show that the Scatter/Gather tool can be effective in this way. The system is instructed to gather the 500 documents into five clusters; below are shown the resulting sizes and topical terms:
Cluster 3 stands out for the purposes of the query in that it contains terms pertaining to fraud, investigation, lawyers, and courts. Note that in a general corpus these terms might not be descriptive for this query since the user would assume the documents were about legal issues in general. However, since we know the system has retrieved documents that also pertain to financial institutions, we can assume that the legal terms occur in the context of financial documents.
The topical terms for Cluster 4 are less compelling: only one term corresponds to a crime, and together they clearly indicate documents discussing the scandal involving the leader of the Philippines in the late 1980s. The topical terms of Cluster 0 are very general (and the cluster holds only four documents, which can be scanned quickly). It appears that Cluster 0 contains very general documents that do not fit particularly well into any of the other clusters, whereas Cluster 4 contains documents that relate to a very specific allegation of fraud.
Cluster 1 is also compelling in that its summary contains many financial terms; however, it is less promising than Cluster 3 because it seems more related to assets and risk assessment than to criminal charges and failed banks. Finally, Cluster 2 seems most strongly related to rules and regulations, rather than indictments and fraud. Note again that this cluster, if taken out of context, might seem to refer to government regulations in general; however, since it, like Cluster 3, was generated as the result of a query on financial terms, it most likely contains documents discussing rules and regulations on financial matters. A rescattering of the cluster confirmed this suspicion.
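The scatter step discussed above, grouping retrieved documents and summarizing each group by its topical terms, can be illustrated with a small sketch. This is not the authors' implementation: Scatter/Gather uses the Buckshot and Fractionation algorithms described in the papers cited at the end, whereas the sketch below substitutes a plain k-means-style reassignment over unit-length term-frequency vectors, and takes the highest-weight centroid terms as a stand-in for the topical terms. The four toy documents, the seed choice, and all function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse unit-length vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}


def scatter(docs, seeds, iterations=5):
    """Assign each document to its most similar centroid, then update centroids."""
    vectors = [vectorize(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for i, v in enumerate(vectors):
            best = max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            groups[best].append(i)
        # An empty group keeps its previous centroid.
        centroids = [centroid([vectors[i] for i in g]) if g else centroids[c]
                     for c, g in enumerate(groups)]
    return groups, centroids


def topical_terms(cent, n=3):
    """Highest-weight centroid terms; ties are broken alphabetically."""
    ranked = sorted(cent.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


# Hypothetical toy documents, for illustration only.
docs = [
    "bank fraud indictment court fraud",
    "fraud trial court jury indictment",
    "bank assets loans risk assets",
    "loans risk assets portfolio bank",
]
groups, centroids = scatter(docs, seeds=[0, 2])
for g, c in zip(groups, centroids):
    print(len(g), "documents:", topical_terms(c))
```

Even on this toy input, the legal and the financial documents separate, and each group's top terms give the kind of summary a user reads when deciding which cluster to rescatter.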
Based on this assessment, Cluster 3 looks most promising. If the user rescatters it, five new clusters are produced:
The user might choose the first, third and fourth clusters since their topical terms all seem to pertain to the topic of interest. Clusters 3 and 4 are especially compelling since they contain terms pertaining both to finance and to criminal proceedings. Cluster 3 has more terms about conviction, but Cluster 4 has more terms pertaining to failure and to the kinds of financial institutions that the user may know to have failed, namely S&Ls and thrifts. As it turns out, the clusters' contents reflect these observations: Cluster 3 contains mainly articles about indictments pertaining to financial fraud involving securities and stocks, but not failed banks. The user can view the contents of a cluster in ranked order, according to the score generated by the similarity search, or can view the documents according to some other search tool. Based on the topical terms, the most promising-looking clusters are 1, 3, and 4. It turns out that Cluster 1 has one relevant document out of four.
The third cluster, Cluster 3, has one relevant document out of 28. Highly ranked documents in this group include indictments for other crimes, some of which are financial in nature.
Cluster 4 has eleven relevant documents out of 29.
The second cluster also contains two relevant documents. Most other documents in this cluster discuss indictments for money laundering along with one article involving Noriega and another on a teen scandal in San Francisco. The last cluster has no relevant documents although it has several that discuss the BCCI.
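To see how strongly the rescattering concentrates the relevant documents, the per-cluster counts reported above can be compared with the baseline of 21 relevant documents among the 500 retrieved. The sketch below uses only figures stated in the text; sizes that the text does not give are left as `None` rather than guessed, and the variable names are illustrative.

```python
# (label, relevant documents, cluster size) as reported in the text;
# sizes the text does not state are left as None.
subclusters = [
    ("Cluster 1", 1, 4),
    ("second cluster", 2, None),
    ("Cluster 3", 1, 28),
    ("Cluster 4", 11, 29),
    ("last cluster", 0, None),
]

# Fraction of relevant documents in the full retrieved set.
baseline = 21 / 500

total_relevant = sum(rel for _, rel, _ in subclusters)

for label, rel, size in subclusters:
    if size is None:
        print(f"{label}: {rel} relevant (size not stated)")
    else:
        print(f"{label}: {rel}/{size} = {rel / size:.3f} (baseline {baseline:.3f})")

print("relevant documents in the rescattered cluster:", total_relevant)
```

Cluster 4 is the clear standout: eleven of its 29 documents are relevant, a far higher proportion than in the retrieved set as a whole.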
In total, the top-level Cluster 3, the one rescattered above, contains 15 of the 21 relevant documents. The remaining six relevant documents are found exclusively in Cluster 2. When this cluster, which contains 187 documents, is scattered:
all six relevant documents appear in one cluster, Cluster 3, of size 88:
For more details, see the following papers:
Marti A. Hearst, David R. Karger, and Jan O. Pedersen. Scatter/Gather as a Tool for the Navigation of Retrieval Results. Proceedings of the 1995 AAAI Fall Symposium on Knowledge Navigation.
D. Cutting, D. Karger, J. Pedersen, and J.W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of SIGIR'92.
D. Cutting, D. Karger, and J. Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections. Proceedings of SIGIR'93.
Marti Hearst / hearst@parc.xerox.com Last modified: Wed Sept 6 1995