About TextTiling

TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.

Articles that describe and inform -- such as science magazine articles and environmental impact reports -- can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:

TextTiling is a method for partitioning full-length text documents into coherent multi-paragraph units, like those seen above, that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well.

The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.
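
The vocabulary-shift idea described above can be illustrated with a small sketch. This is not the published TextTiling implementation, only a simplified approximation under stated assumptions: it compares whole paragraphs rather than the comparison units used in the papers, it does no stopword removal, stemming, or smoothing, and the function names (tokenize, gap_scores, find_boundaries), the window parameter, and the cutoff rule are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens only (an assumption; no stopword list or stemming).
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(paragraphs, window=2):
    # Score each gap between paragraphs by the lexical similarity of the
    # `window` paragraphs before it and the `window` paragraphs after it.
    bags = [Counter(tokenize(p)) for p in paragraphs]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def find_boundaries(paragraphs, window=2):
    # Return the indices of paragraphs that start a new segment.
    scores = gap_scores(paragraphs, window)
    depths = []
    for i, s in enumerate(scores):
        # Climb to the nearest similarity peak on each side of the gap.
        l = i
        while l > 0 and scores[l - 1] >= scores[l]:
            l -= 1
        r = i
        while r < len(scores) - 1 and scores[r + 1] >= scores[r]:
            r += 1
        # Depth of the valley at this gap relative to both peaks.
        depths.append((scores[l] - s) + (scores[r] - s))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    sd = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - sd / 2  # illustrative threshold, an assumption of this sketch
    return [i + 1 for i, d in enumerate(depths) if d > 0 and d > cutoff]


paragraphs = [
    "venus probe radar maps venus surface",
    "the probe radar maps show venus craters",
    "volcano lava flows cover plains",
    "lava plains volcano eruptions flows",
]
print(find_boundaries(paragraphs))  # prints [2]
```

In this toy input the first two paragraphs share one vocabulary and the last two share another, so the similarity score drops to zero at the gap between them and the sketch reports a segment starting at paragraph index 2. On real documents the choice of comparison unit, window size, and cutoff would all need tuning.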


Papers by Marti Hearst about TextTiling:


Papers by other authors that use TextTiling:

The algorithm has been applied by other authors to other languages, including French and German (van der Eijk), Arabic (Hasnah), and Japanese (Nogoto et al.). It has been used for several applications including genre detection (Karlgren) and summarization (Mittal et al.).


Source Code for TextTiling