Virtually all current IR technology is
concerned with finding documents according to topic. But when people
search for documents, they are usually interested not just in topic
but in genre and form -- they want to see editorials about the
supercollider, nontechnical articles on networking, or whatever. And
the need to address these concerns increases with large heterogeneous
collections like the corporate intranet, the corporate extranet and
the World Wide Web. Northrop (named after the critic Northrop
Frye, whose photo here is from the Harry
Palmer Gallery) is a genre categorizer that lets users narrow
searches to particular genres, such as editorials, financial reports
or scientific writing, or group search results by
genre. Northrop turns the hodgepodge of the usual hit list into a
well-structured result format.
Northrop works with a textbase of 4,373 records that were downloaded from the Dialog database service with the permission of Knight-Ridder Information Services, Inc. These records were selected to provide a spread across a reasonable number of data sources and record genres. To guard against experimenter bias in assigning genre labels to records, we downloaded only records that carry some concrete and objective indication of genre. We also restricted our purview to a single subject, AIDS and HIV, so that most of the variation between records could be attributed to genre rather than subject matter. With these criteria, we identified ten well-defined and well-populated genres, for which we downloaded the most recent records available:
Northrop identifies the genre of each record by examining approximately fifty textual cues. All of these cues are relatively easy to compute; examples include counts of various punctuation marks, the number of different word types, and the variation in word length. For each candidate genre, Northrop applies a logistic regression formula over these cues and assigns the record to the genre with the highest score. Rather than recomputing each record's genre at search time, Northrop computes the genre guess once and stores it as part of the database index.
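The scoring scheme described above can be illustrated with a minimal Python sketch. It is not Northrop's implementation: the five cues, the genre names, the weights in MODELS and the function names are all hypothetical, chosen only to show how cheap surface cues can feed one logistic formula per genre, with the highest-scoring genre stored once per record.

```python
import math
import re
import statistics


def surface_cues(text):
    """Compute a few cheap surface cues (illustrative; not Northrop's actual cue set)."""
    words = re.findall(r"[A-Za-z']+", text)
    lengths = [len(w) for w in words]
    n = max(len(words), 1)  # avoid division by zero on empty text
    return {
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "commas_per_word": text.count(",") / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
        "word_length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
    }


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score_genres(cues, models):
    """Apply one logistic formula per genre: logistic(bias + sum(weight * cue))."""
    scores = {}
    for genre, (bias, weights) in models.items():
        z = bias + sum(w * cues[name] for name, w in weights.items())
        scores[genre] = logistic(z)
    return scores


def classify(text, models):
    """Assign the text to the genre with the highest score."""
    scores = score_genres(surface_cues(text), models)
    return max(scores, key=scores.get)


def build_genre_index(records, models):
    """Compute each record's genre once, so searches can look it up instead of recomputing."""
    return {record_id: classify(text, models) for record_id, text in records.items()}


# Hypothetical hand-picked weights; a real system would fit these from labeled data.
MODELS = {
    "editorial": (-1.0, {"question_marks": 0.8, "exclamation_marks": 0.6}),
    "scientific": (-1.0, {"commas_per_word": 4.0, "word_length_stdev": 0.5}),
}
```

Calling build_genre_index on a dictionary of record IDs and texts returns a lookup table from record ID to genre, which is the kind of precomputed value that could be kept alongside the search index.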
To verify that Northrop relies on generalizable facts about genres, we trained it (using a Generalized Linear Model implementation from S-PLUS) on 80% of the corpus and then evaluated it on the remaining 20% of the records. In this test, Northrop correctly categorized 92% of the previously unseen records. This performance is approximately the same as that obtained in this demo, which uses the entire database.
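The held-out evaluation described above follows a common pattern, sketched here in Python. This is an assumption-laden illustration, not the original S-PLUS procedure: the function names, the random shuffle and the fixed seed are all hypothetical, and the source does not say how the 80% and 20% portions were selected.

```python
import random


def holdout_split(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and held-out portions.

    The shuffle and seed are assumptions of this sketch, not a documented
    detail of the original experiment.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, labeled_records):
    """Fraction of (text, genre) pairs for which predict(text) returns the labeled genre."""
    if not labeled_records:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for text, genre in labeled_records if predict(text) == genre)
    return correct / len(labeled_records)
```

A model would be fitted on the first portion only, and accuracy would then be computed on the second portion, so that the reported figure reflects records the model has not seen.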
Further information on the research methodology, as applied to an earlier database, is available in a paper presented at ACL/EACL 1997.