Content Analysis and Statistical Properties of Text

Today

Content Analysis

Techniques for Content Analysis

Text Processing

Document Processing Steps

Stemming and Morphological Analysis

Automated Methods

Errors Generated by Porter Stemmer (Krovetz 93)

Statistical Properties of Text

A More Standard Collection

Plotting Word Frequency by Rank

Rank   Freq   Term
1        37      system
2        32      knowledg
3        24      base
4        20      problem
5        18      abstract
6        15      model
7        15      languag
8        15      implem
9        13      reason
10       13      inform
11       11      expert
12       11      analysi
13       10      rule
14       10      program
15       10      oper
16       10      evalu
17       10      comput
18       10      case
19       9       gener
20       9       form

43       6       approach
44       5       work
45       5       variabl
46       5       theori
47       5       specif
48       5       softwar
49       5       requir
50       5       potenti
51       5       method
52       5       mean
53       5       inher
54       5       data
55       5       commit
56       5       applic
57       4       tool
58       4       technolog
59       4       techniqu

Zipf Distribution

Zipf Distribution (linear and log scale)
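
Zipf's law says that a term's frequency is roughly inversely proportional to its rank, so rank times frequency should stay roughly constant and a log-log plot of frequency against rank should be close to a straight line with negative slope. The sketch below checks both on a few rows copied from the rank/frequency table earlier in this outline. The function names are illustrative, and a sample this small should not be expected to fit the law closely.

```python
import math

# (rank, frequency) pairs copied from the rank/frequency table in this outline.
RANK_FREQ = [
    (1, 37), (2, 32), (3, 24), (4, 20), (5, 18),
    (10, 13), (20, 9), (43, 6), (59, 4),
]


def rank_times_freq(pairs):
    """Return rank * frequency per pair; Zipf predicts a roughly constant value."""
    return [rank * freq for rank, freq in pairs]


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    print(rank_times_freq(RANK_FREQ))
    print(loglog_slope(RANK_FREQ))
```

A slope near -1 would match the classic form of the law; how far this table departs from that is something to read off the printed value rather than assume.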

What Kinds of Data Exhibit a Zipf Distribution?

Related Distributions/“Laws”

Very Frequent Word Stems (Cha-Cha Web Index)

Words That Occur Few Times (Cha-Cha Web Index)

Consequences of Zipf

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

Statistical Independence vs. Statistical Dependence

Statistical Independence

Statistical Independence and Dependence

Lexical Associations

Statistical Independence

Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
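
Two words x and y are statistically independent when P(x, y) = P(x) P(y). One common way to score a lexical association is to measure how far the observed co-occurrence departs from that product, using the log ratio log2(P(x, y) / (P(x) P(y))). The sketch below estimates each probability as a count divided by the corpus size. The function name `pmi` and its parameters are illustrative assumptions, and the counts in the usage lines are hypothetical, not figures from the AP corpus.

```python
import math


def pmi(count_xy, count_x, count_y, n):
    """log2 of P(x, y) / (P(x) * P(y)), with each probability estimated as count / n."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # Hypothetical counts in a corpus of 1,000 word positions.
    print(pmi(10, 100, 100, 1000))  # co-occurrence matches the independence prediction
    print(pmi(40, 100, 100, 1000))  # co-occurrence is four times the prediction
```

A score near zero suggests the pair behaves as if independent; a clearly positive score suggests the words co-occur more often than chance would predict.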

Document Vectors

Document Vectors
One location for each word.
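
"One location for each word" can be sketched as a fixed vocabulary list in which each document becomes a list of term counts, with position i holding the count of vocabulary term i. This is a minimal illustration that assumes whitespace tokenization and raw counts; the function names and sample documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of distinct whitespace-separated terms across all documents."""
    return sorted({term for doc in docs for term in doc.split()})


def to_vector(doc, vocabulary):
    """Term counts for one document, one position per vocabulary term."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    # Hypothetical documents built from stems that appear in the table above.
    docs = ["expert system rule", "knowledg base system system"]
    vocab = build_vocabulary(docs)
    print(vocab)
    for doc in docs:
        print(to_vector(doc, vocab))
```

Because every document shares the same vocabulary order, the resulting vectors can be compared position by position or plotted as points in the same space.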

We Can Plot the Vectors

Documents in 3D Space

Content Analysis Summary

Inverted Index

Inverted Indexes

How Are Inverted Files Created

How Inverted Files Are Created
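
The individual steps on these slides are not preserved in this outline. A common construction, assumed here, is to emit one (term, document ID) record per token, sort the records, and merge duplicates into per-document counts. The sketch below follows that assumption; `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings.

    `docs` maps a document ID to its text; tokens are whitespace-separated.
    """
    # Step 1: emit one (term, doc_id) record per token occurrence.
    records = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.split()
    ]
    # Step 2: sort so that identical terms and documents become adjacent.
    records.sort()
    # Step 3: merge duplicate records into per-document counts.
    index = defaultdict(dict)
    for term, doc_id in records:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


if __name__ == "__main__":
    # Hypothetical two-document collection.
    docs = {1: "system base system", 2: "base rule"}
    index = build_inverted_index(docs)
    print(index["system"])
    print(index.get("base", []))
```

Looking a term up is then a single dictionary access, which is what makes the structure attractive for retrieval compared with scanning every document.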

Inverted Indexes

How Inverted Files Are Used

Next Time