Northrop: Automatic categorization of documents by genre

Demo.

What is Northrop?

Virtually all current information retrieval (IR) technology is concerned with finding documents by topic. But when people search for documents, they are usually interested not just in topic but also in genre and form: they want to see editorials about the supercollider, nontechnical articles on networking, and so on. The need to address these concerns grows with large heterogeneous collections like the corporate intranet, the corporate extranet, and the World Wide Web. Northrop (named after the critic Northrop Frye, whose photo here is from the Harry Palmer Gallery) is a genre categorizer that lets users narrow searches to particular genres, such as editorials, financial reports, or scientific writing, or group search results by genre. Northrop transforms the hodgepodge of the usual hit list into a well-structured result format.

What genres does Northrop identify?

Northrop works with a textbase of 4,373 records that were downloaded from the Dialog database service with the permission of Knight-Ridder Information Services, Inc. These records were selected to provide a spread across a reasonable number of data sources and record genres. To guard against experimenter bias in assigning genre labels to records, we downloaded only records that carry some concrete and objective indication of genre. We also restricted our purview to a single subject, AIDS and HIV, so that most of the variation between records would clearly be due to genre, not subject matter. With these criteria, we identified ten well-defined and well-populated genres, for which we downloaded the most recent records available:

Book review.
371 book reviews from Periodical Abstracts, ultimately from some three dozen different periodicals.
Calendar.
The most recent 500 records from the Federal News Service Daybook. These are notices about upcoming events.
Company profiles.
300 records from IMS.
Drug data.
500 reports from Adis R&D Insight and Drug Data Report.
Financials.
157 records from EdgarPlus databases, spread across four subtypes: prospectuses, registration statements, 10-K filings, and 10-Q filings.
Market research.
307 records from three different databases: Datamonitor, Freedonia, and ICC International Business Research.
Medical research.
147 articles from the databases of Lancet and the New England Journal of Medicine.
News.
874 records pulled from six newspaper databases (London Times, Toronto Star, Montréal Gazette, New York Times, San Francisco Chronicle, and San Jose Mercury News), five newswires (Africa News, Agence France Presse English Wire, PR NewsWire, UPI News, Xinhua News), and fifteen weeklies represented in the Periodical Abstracts database (American Journal of Nursing, British Medical Journal, Chronicle of Higher Education, Economist, FDA Consumer, Forbes, Jet, Lancet, Maclean's, Science, Science News, Time, USA Today Magazine, and World Press Review).
Opinion.
794 records representing editorials, columns, and letters to the editor, from eight journals represented in the Periodical Abstracts database (AIDS Alert, British Medical Journal, Journal of the American Medical Association, Lancet, Nation, National Review, New Republic, Science) and from four newspaper databases (Montréal Gazette, San Francisco Chronicle, San Jose Mercury News, Toronto Star).
Patents.
423 patent applications, from European Patents Fulltext and U.S. Patents Fulltext.
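
As a quick check on the corpus composition, the per-genre counts listed above can be tallied in a few lines. This is only an illustrative sketch: the names GENRE_COUNTS and genre_shares are ours, not part of Northrop.

```python
# Record counts per genre, copied from the list above.
GENRE_COUNTS = {
    "Book review": 371,
    "Calendar": 500,
    "Company profiles": 300,
    "Drug data": 500,
    "Financials": 157,
    "Market research": 307,
    "Medical research": 147,
    "News": 874,
    "Opinion": 794,
    "Patents": 423,
}


def genre_shares(counts):
    """Return each genre's fraction of the whole textbase."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


total = sum(GENRE_COUNTS.values())
shares = genre_shares(GENRE_COUNTS)
largest = max(GENRE_COUNTS, key=GENRE_COUNTS.get)

for genre, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{genre:18s} {GENRE_COUNTS[genre]:5d}  {share:6.1%}")
print(f"{'Total':18s} {total:5d}")
```

The counts sum to the 4,373 records mentioned above. News is the largest genre at roughly a fifth of the textbase, so a categorizer that always guessed News would be right only about 20% of the time on the full collection.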

How does Northrop work?

Northrop identifies the genre of records by examining approximately fifty different textual cues. All of these cues are relatively easy to compute: examples are the counts of various punctuation marks, the number of different word types, and the variation in word length. For each potential genre, Northrop applies a logistic regression formula over these cues, and assigns the record to the genre that gets the highest score. The genre is not recomputed for each record during each search; instead, the genre guess is computed once and stored as part of the database index.
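
The general idea (cheap surface cues, one logistic score per genre, pick the highest) can be sketched as follows. This is a minimal illustration under our own assumptions, not Northrop's code: the five cues are stand-ins for the roughly fifty actually used, and the function names and the weights in TOY_MODELS are invented purely so the example runs.

```python
import math
import re
import statistics


def extract_cues(text):
    """Compute a few cheap surface cues (illustrative stand-ins only)."""
    words = re.findall(r"[A-Za-z']+", text)
    lengths = [len(w) for w in words]
    n = max(len(words), 1)  # avoid division by zero on empty text
    return {
        "commas_per_word": text.count(",") / n,
        "questions_per_word": text.count("?") / n,
        "colons_per_word": text.count(":") / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
        "word_length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
    }


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_genres(cues, models):
    """models maps genre -> (intercept, {cue_name: weight})."""
    scores = {}
    for genre, (intercept, weights) in models.items():
        z = intercept + sum(
            weights.get(name, 0.0) * value for name, value in cues.items()
        )
        scores[genre] = logistic(z)
    return scores


def classify(text, models):
    """Assign the text to the genre with the highest score."""
    scores = score_genres(extract_cues(text), models)
    return max(scores, key=scores.get)


# Hypothetical weights, chosen by hand; real weights would be fitted to data.
TOY_MODELS = {
    "Opinion": (-1.0, {"questions_per_word": 40.0}),
    "Calendar": (-1.0, {"colons_per_word": 40.0}),
}

print(classify("Why should we care? Who pays? What now?", TOY_MODELS))
print(classify("10:00 Press briefing: Room 12. 14:30 Hearing: Senate.", TOY_MODELS))
```

Because classification depends only on the record's text and the fitted weights, the result of classify can be computed once per record at indexing time and stored alongside it, so a search only has to filter or group by the stored label.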

To verify that Northrop is using generalizable facts about genres, it was trained (using a Generalized Linear Model implementation from SPlus) on 80% of the corpus, then evaluated on the remaining 20% of the records. In this test, Northrop correctly categorized 92% of those previously unseen records. This performance is approximately the same as that obtained in this demo, which uses the entire database.
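
The held-out evaluation described above follows a standard pattern, sketched here. How the original 80/20 split was drawn is not stated, so the seeded random shuffle is an assumption, and split_holdout and accuracy are hypothetical helper names.

```python
import random


def split_holdout(records, train_fraction=0.8, seed=0):
    """Shuffle records reproducibly and split them into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(classify, labeled_records):
    """Fraction of (text, genre) pairs for which classify(text) == genre."""
    if not labeled_records:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for text, genre in labeled_records if classify(text) == genre)
    return correct / len(labeled_records)


# Toy data: ten labeled records and a trivial classifier, for illustration only.
records = [(f"record {i}", "News" if i % 2 else "Opinion") for i in range(10)]
train, test = split_holdout(records)
always_news = lambda text: "News"
print(len(train), len(test), accuracy(always_news, test))
```

The important property is that the weights are fitted using only the training portion, so accuracy on the test portion measures how well the cues generalize to unseen records.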

Further information on the research methodology, as applied to an earlier database, is available in a paper presented at ACL/EACL 1997.


Brett Kessler (web page author), Geoff Nunberg and Hinrich Schütze / ISTL QCA / Sep. 25, 1997