20. Document Analysis [2]

DE + IA (IS 243) - 5 April 2006

Copyright 2006 Robert J. Glushko

Plan for Today's Class


The Document Type Spectrum


Systematic Variation in Document Types Across the Spectrum


Extracting Presentation Rules


But Sometimes Rules Can't Be Extracted


Correlations or Conventions with Presentation Information


The Simplest Information Component Model


Harvesting Components


What Metadata to Record About Candidate Components


Analyzing "Possible Values"


Code Sets


Seek Semantic Clarity and Precision


Relationships Between Text and Non-text [1]


Relationships Between Text and Non-text [2]


Dictionaries, Encyclopedias, and Reference Books


Engineering Compendium – Typical Entry


Encyclopedia Entry


Oxford English Dictionary – Typical Entry


Procedures, Policies, Laws, and Regulations


Code of Federal Regulations


Catalogs


Industrial Parts


Software "Man Page"


Home Blueprint


Recipe


Transaction Documents


Tax Form


Consolidating the Harvest


Readings for 10 April


Assignment 8 - Document Analysis