Content Analysis and Statistical Properties of Text
Today
Content Analysis
Techniques for Content Analysis
Text Processing
Document Processing Steps
Stemming and Morphological Analysis
Automated Methods
Errors Generated by Porter Stemmer (Krovetz 93)
Statistical Properties of Text
A More Standard Collection
Plotting Word Frequency by Rank
Rank  Freq  Term
   1    37  system
   2    32  knowledg
   3    24  base
   4    20  problem
   5    18  abstract
   6    15  model
   7    15  languag
   8    15  implem
   9    13  reason
  10    13  inform
  11    11  expert
  12    11  analysi
  13    10  rule
  14    10  program
  15    10  oper
  16    10  evalu
  17    10  comput
  18    10  case
  19     9  gener
  20     9  form
Rank  Freq  Term
  43     6  approach
  44     5  work
  45     5  variabl
  46     5  theori
  47     5  specif
  48     5  softwar
  49     5  requir
  50     5  potenti
  51     5  method
  52     5  mean
  53     5  inher
  54     5  data
  55     5  commit
  56     5  applic
  57     4  tool
  58     4  technolog
  59     4  techniqu
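A rank/frequency table like the one above can be produced by counting stemmed terms and sorting by descending count. The following is a minimal sketch, not the method used to build the table: the function name rank_terms and the sample tokens are hypothetical, and the tokens are assumed to be stemmed already.

```python
from collections import Counter

def rank_terms(tokens):
    """Return (rank, freq, term) rows, most frequent term first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, freq, term)
            for rank, (term, freq) in enumerate(ordered, start=1)]

# Hypothetical, already-stemmed tokens (not taken from the collection above).
tokens = "system knowledg base system knowledg system".split()

for rank, freq, term in rank_terms(tokens):
    print(f"{rank:>4}  {freq:>4}  {term}")
```

Ties are broken alphabetically here only to make the output deterministic; the table above does not say how its ties were ordered.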
Zipf Distribution
Zipf Distribution (linear and log scale)
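Zipf's law says that a term's frequency is roughly inversely proportional to its rank, so a plot of log frequency against log rank should look close to a straight line with negative slope (about -1 in the idealized case). As a rough check, the sketch below fits a least-squares line to the 20 top-ranked terms from the rank/frequency table above. The function name loglog_slope is an assumption for illustration, and a sample this small should not be expected to match the ideal slope closely.

```python
import math

# Frequencies for ranks 1-20, copied from the rank/frequency table above.
freqs = [37, 32, 24, 20, 18, 15, 15, 15, 13, 13,
         11, 11, 10, 10, 10, 10, 10, 10, 9, 9]

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

slope = loglog_slope(freqs)
print(f"log-log slope over ranks 1-20: {slope:.2f}")
```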
What Kinds of Data Exhibit a Zipf Distribution?
Related Distributions/“Laws”
Very frequent word stems (Cha-Cha Web Index)
Words that occur few times (Cha-Cha Web Index)
Consequences of Zipf
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
Statistical Independence vs. Statistical Dependence
Statistical Independence
Statistical Independence and Dependence
Lexical Associations
Statistical Independence
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
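Two words are statistically independent when P(x, y) = P(x) P(y). Lexical association is commonly scored by how far the observed co-occurrence departs from that baseline, using the mutual-information measure log2(P(x, y) / (P(x) P(y))) with each probability estimated as a count divided by the corpus size N. The sketch below is illustrative only: N is the corpus size quoted above, while the word counts and the function name association_ratio are hypothetical, not figures from the AP corpus.

```python
import math

def association_ratio(f_xy, f_x, f_y, n):
    """log2(P(x, y) / (P(x) * P(y))), with each P estimated as count / n."""
    # Algebraically the same as log2((f_xy / n) / ((f_x / n) * (f_y / n))).
    return math.log2((f_xy * n) / (f_x * f_y))

N = 15_000_000  # corpus size quoted above

# Hypothetical counts, chosen only to illustrate the arithmetic.
f_doctor, f_other = 3_000, 5_000

# Under independence we would expect f_doctor * f_other / N = 1 co-occurrence,
# so a joint count of 1 gives a score of zero.
print(association_ratio(1, f_doctor, f_other, N))

# Many more co-occurrences than that baseline give a positive score.
print(association_ratio(100, f_doctor, f_other, N))
```

A score near zero suggests the pair co-occurs about as often as chance predicts; a large positive score suggests a genuine association.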
Document Vectors
One location for each word.
We Can Plot the Vectors
Documents in 3D Space
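"One location for each word" means that every distinct term in the collection gets a fixed position, and each document becomes a list of counts at those positions. With a three-term vocabulary, each document is a point in 3D space. A minimal sketch, assuming whitespace tokenization and raw term counts; the sample documents and the name build_vectors are made up for illustration.

```python
def build_vectors(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    # One fixed location per distinct word, in alphabetical order.
    vocab = sorted({term for doc in docs for term in doc.split()})
    position = {term: i for i, term in enumerate(vocab)}

    vectors = []
    for doc in docs:
        vector = [0] * len(vocab)
        for term in doc.split():
            vector[position[term]] += 1
        vectors.append(vector)
    return vocab, vectors

# Hypothetical three-term collection, so each vector is a point in 3D.
docs = ["nova galaxy nova", "galaxy heat", "heat heat nova"]
vocab, vectors = build_vectors(docs)

print(vocab)
for vector in vectors:
    print(vector)
```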
Content Analysis Summary
Inverted Index
Inverted Indexes
How Are Inverted Files Created?
How Inverted Files are Created
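One common way to build an inverted file is to parse each document into (term, document ID) pairs, sort the pairs, and then merge identical pairs into postings that record how often the term occurs in each document. The sketch below follows that outline under simplifying assumptions (lowercasing, whitespace tokenization, everything held in memory); the function name build_inverted_index and the two sample documents are hypothetical.

```python
from itertools import groupby

def build_inverted_index(docs):
    """Build {term: [(doc_id, term_frequency), ...]} from {doc_id: text}."""
    # Step 1: parse every document into (term, doc_id) pairs.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]

    # Step 2: sort by term, then by document ID.
    pairs.sort()

    # Step 3: merge identical pairs, keeping a within-document count.
    index = {}
    for (term, doc_id), group in groupby(pairs):
        index.setdefault(term, []).append((doc_id, len(list(group))))
    return index

# Hypothetical two-document collection.
docs = {1: "now is the time", 2: "it was a dark time time"}
index = build_inverted_index(docs)

for term, postings in index.items():
    print(term, postings)
```

Sorting before merging keeps each postings list ordered by document ID, which makes later list intersection straightforward.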
Inverted indexes
How Inverted Files are Used
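At query time, an inverted file lets the system look up each query term's postings directly instead of scanning every document. For a Boolean AND query, the answer is the intersection of the postings lists. A minimal sketch, assuming postings that hold document IDs only; the index contents and the name docs_containing_all are hypothetical.

```python
def docs_containing_all(index, query_terms):
    """Return sorted IDs of documents that contain every query term."""
    postings = [set(index.get(term, ())) for term in query_terms]
    if not postings:
        return []
    # Boolean AND: keep only the IDs present in every postings list.
    return sorted(set.intersection(*postings))

# Hypothetical index mapping each term to the documents that contain it.
index = {"dark": [2], "now": [1], "time": [1, 2]}

print(docs_containing_all(index, ["dark", "time"]))
print(docs_containing_all(index, ["time"]))
```

A term that is missing from the index contributes an empty postings list, so any AND query that includes it returns no documents.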
Next Time