Articles that describe and inform -- such as science magazine articles and environmental impact reports -- can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:
TextTiling is a method for partitioning full-length text documents into coherent multi-paragraph units, like those seen above, that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well.
The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.
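The vocabulary-shift idea can be illustrated with a minimal sketch in Python. This is not the distributed C code and not a faithful reproduction of the published algorithm: it compares whole paragraphs rather than fixed-size token sequences, uses no stop list or smoothing, and the names (`gap_scores`, `depth_scores`, `find_boundaries`), the `block_size` default, and the cutoff of mean minus half a standard deviation are all assumptions made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens (no stop list)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(paragraphs, block_size=2):
    """Similarity of the blocks of paragraphs on either side of each gap."""
    bags = [Counter(tokenize(p)) for p in paragraphs]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """How far each gap's similarity sits below the nearest peaks on both sides."""
    depths = []
    for i, score in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - score) + (scores[right] - score))
    return depths


def find_boundaries(paragraphs, block_size=2):
    """Return indices of paragraphs that start a new segment.

    The cutoff (mean - stdev / 2 of the depth scores) is an assumed choice.
    """
    depths = depth_scores(gap_scores(paragraphs, block_size))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    stdev = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - stdev / 2
    # Gap i lies between paragraph i and paragraph i + 1.
    return [i + 1 for i, depth in enumerate(depths) if depth > cutoff]
```

Calling `find_boundaries` on a list of paragraph strings returns the zero-based indices of the paragraphs that open a new tile. A deep valley in the similarity curve marks a point where the vocabulary on the two sides has little in common, which is where a subtopic boundary is proposed.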
Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), pp. 33-64, March 1997.
Marti A. Hearst and Christian Plaunt, Subtopic Structuring for Full-Length Document Access. Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993.
Marti A. Hearst, TextTiling: A Quantitative Approach to Discourse Segmentation, Technical Report UCB:S2K-93-24, 1993.
BARZILAY, REGINA, & MICHAEL ELHADAD. 1997. Using lexical chains for text summarization. In Proceedings of the ACL Intelligent Scalable Text Summarization Workshop (ISTS'97).
HASNAH, AHMAD. 1996. Full text processing and retrieval: Weight ranking, text structuring, and passage retrieval for Arabic documents. Illinois Institute of Technology dissertation.
KARLGREN, JUSSI. 1996. Stylistic variation in an information retrieval experiment. In Proceedings of the NeMLaP-2 Conference, Ankara, Turkey. (See http://www.sics.se/~jussi/Papers.)
MANDALA, RILA, TOKUNAGA TAKENOBU, & TANAKA HOZUMI. 1999. Combining Multiple Evidence from Different Types of Thesaurus for Query Expansion. In Proceedings of the 22nd Annual ACM Conference on Information Retrieval (SIGIR 99), Berkeley, CA.
MANI, INDERJEET, DAVID HOUSE, MARK MAYBURY, & MORGAN GREEN. 1997. Towards content-based browsing of broadcast news video. In Intelligent multimedia information retrieval, ed. by Mark Maybury. AAAI/MIT Press.
MITTAL, VIBHU, MARK KANTROWITZ, JADE GOLDSTEIN, & JAIME CARBONELL. 1999. Selecting text spans for document summaries: Heuristics and metrics. In Proceedings of 16th Annual Conference on Artificial Intelligence (AAAI 99), Orlando, FL.
NOMOTO, TADASHI, & YOSHIHIKO NITTA. 1994. A grammatico-statistical approach to discourse partitioning. In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING), 1145-1150, Kyoto, Japan.
RICHMOND, KORIN, ANDREW SMITH, & EINAT AMITAY. 1997. Detecting subject boundaries within text: A language independent statistical approach. In Proceedings of Second Conference on Empirical Methods in Natural Language Processing, 47-54. ACL.
VAN DER EIJK, PIM. 1994. Comparative Discourse Analysis of Parallel Texts. cmp-lg/9407022.
As an alternative, I have made available two versions of the code, written in C. They are both gzip'd tar files. Please do not email me with problems with compiling.