Copyright 2006 Robert J. Glushko
Project Presentation #1
Tour of the Document Type Spectrum to learn about content, structure, and presentation analysis
Harvesting Components
Consolidating Components

Instances more heterogeneous on narrative end
Types are "broader" and more descriptive, less prescriptive on narrative end
The set of content types within a document type is much greater on the transactional end because the leaves aren't "just text"
More "metadata" augmentation of documents on narrative end
Presentational information more likely to be correlated with content and structure on narrative end
Presentation affects structure and content by applying transformation rules to them
To understand the structure and content we must identify and record what the rules of the transformation were
Explicit transform rules can be encoded in templates, stylesheets or source code
No access to source formats or source code
Rules may be inaccessible in source formats ("override" formatting in word processors instead of style tags)
Rules don't exist or are inconsistently followed (author has "fontitis" with "ransom note" presentation style)
Color, pitch, other perceptual dimensions can be correlated with semantic distinctions
Type size is usually correlated with the structural hierarchy
Content types can have characteristic layouts or text attributes
Adjacency can suggest a semantic relationship, like that between figure and caption
Presentation order is sometimes semantically significant
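The type-size heuristic above can be sketched in a few lines. This is only an illustration, assuming the (type size, text) pairs have already been extracted from a formatted document; the function name infer_levels and the sample data are hypothetical, and a document whose author has "fontitis" will defeat it.

```python
def infer_levels(runs):
    """Assign a hierarchy level to each text run from its type size.

    Assumption: larger type means a higher position in the structural
    hierarchy, so the smallest size ends up at the deepest level.
    """
    # Distinct sizes, largest first: the largest size becomes level 1.
    sizes = sorted({size for size, _ in runs}, reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(sizes)}
    return [(level_of[size], text) for size, text in runs]


# Hypothetical (type size in points, text) pairs from a formatted document.
runs = [
    (24, "Maintenance Manual"),
    (18, "Engine"),
    (11, "Check the oil level weekly."),
    (18, "Brakes"),
    (11, "Inspect pads every 10,000 km."),
]

levels = infer_levels(runs)
for level, text in levels:
    print("  " * (level - 1) + text)
```

The recovered levels are only candidates; they still need to be checked against the content and the other presentation cues listed above.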
The simplest or minimal information component model is a glossary – a list of the words used to describe or name the "things of significance" and what they mean
This simple data model is augmented as attributes or characteristics of the significant things are identified and recorded
The model is further developed as relationships or associations or links between the "significant things" are identified and recorded
As we identify candidate content components, we need to record the properties (or attributes or behaviors) of each one that let us understand it and distinguish it from the others
A practical way to do this: for each document or information source being analyzed, create a table or spreadsheet listing each candidate component and the metadata useful in understanding and distinguishing it from other ones
See the examples in the appendices to Modeling Sylvia http://groups.sims.berkeley.edu/sylvia/docs/reports/appendices/
What attributes about each type of content might we record in our analysis?
Names/synonyms/homonyms (what it is called)
Definition (what it "means")
Identifiers
Cardinality/Optionality (occurrence rules)
Restricted values, code sets, defaults
Data Type (text, numbers, date, video)
Relationships/Associations (participation in structures)
Origin (Is this new information, or from some other source? Who maintains it?)
Access (who is allowed to view/change/copy/etc. it)
Permanence (is it static or dynamic? how often does it change?)
Business processes in which it participates
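One possible way to keep such a table is as a list of records that can be written out as a spreadsheet. The sketch below is an assumption, not the format used in Modeling Sylvia: the class name CandidateComponent, the subset of attributes chosen as columns, and the sample rows are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, field, fields


@dataclass
class CandidateComponent:
    """One row of the harvesting table (hypothetical column choice)."""
    name: str                      # what it is called in the source
    definition: str                # what it "means"
    source: str                    # document or information source analyzed
    synonyms: list = field(default_factory=list)
    data_type: str = "text"        # text, number, date, video, ...
    cardinality: str = "1"         # occurrence rule, e.g. "0..1", "1..*"
    allowed_values: str = ""       # restricted values, code set, default
    origin: str = ""               # who creates or maintains it


def to_csv(components):
    """Render the harvested components as CSV text, one row per component."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(CandidateComponent)]
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for component in components:
        row = asdict(component)
        row["synonyms"] = "; ".join(row["synonyms"])
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical rows harvested from a campus event listing.
components = [
    CandidateComponent(
        name="Start Time",
        definition="Time at which the event begins",
        source="Department web calendar",
        synonyms=["Begins", "From"],
        data_type="time",
    ),
    CandidateComponent(
        name="Sponsor",
        definition="Organization responsible for the event",
        source="Department web calendar",
        cardinality="1..*",
    ),
]

table = to_csv(components)
print(table)
```

Keeping one table per source makes the later consolidation step easier, because every candidate still carries a record of where it came from.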
It is critical to capture any rules governing the possible values for a component
Sometimes possible values are conventional, fixed, and span the entire semantic range for some domain (days of week, AM/PM)
Determine who can control the value sets (internal [Manufacturer part #s] vs external [Bar codes])
Patterns like regular expressions are often useful but not sufficient for validation
And if the set of possible values isn't well motivated, fix it in your component design
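A small example of why a pattern is useful but not sufficient: a regular expression can check the shape of a date, but not whether that date exists. The sketch below assumes dates are written as YYYY-MM-DD; the function names are hypothetical.

```python
import re
from datetime import date

# Shape check only: four digits, two digits, two digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_pattern(value):
    """True if the value has the YYYY-MM-DD shape."""
    return DATE_PATTERN.fullmatch(value) is not None


def is_real_date(value):
    """True only if the value has the right shape AND names a real day."""
    if not matches_pattern(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


for candidate in ["2006-02-28", "2006-02-30", "Feb 28"]:
    print(candidate, matches_pattern(candidate), is_real_date(candidate))
```

The second value passes the pattern but fails the semantic check, which is the gap that rules about possible values have to close.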
Code sets are constrained sets of values that are often completely arbitrary
The ISO code sets for countries (3166), currencies (4217), quantities and units of measure (31) are the bedrock ones that you should generally defer to without question
Most organizations have internal code sets or business rules that implicitly define them
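As a rough illustration of deferring to an external code set, the sketch below maps informal internal values onto ISO 4217 currency codes. Only a four-code excerpt of ISO 4217 is included, and the internal vocabulary and the function name normalize_currency are hypothetical.

```python
# Small excerpt of the ISO 4217 currency code set (the real list is much longer).
ISO_4217_EXCERPT = {"USD", "EUR", "JPY", "GBP"}

# Hypothetical internal vocabulary that implicitly defines a code set.
INTERNAL_TO_ISO = {
    "dollars": "USD",
    "us$": "USD",
    "euro": "EUR",
    "yen": "JPY",
}


def normalize_currency(value):
    """Return the ISO 4217 code for a value, or None if it cannot be mapped."""
    candidate = value.strip()
    if candidate.upper() in ISO_4217_EXCERPT:
        return candidate.upper()
    return INTERNAL_TO_ISO.get(candidate.lower())


for raw in [" usd ", "Euro", "doubloons"]:
    print(repr(raw), "->", normalize_currency(raw))
```

Values that come back as None are the ones worth investigating: they may reveal an undocumented internal code or a business rule that was never written down.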
It seems obvious that we need "good names" and "good definitions" for the components we identify and design but what does that mean?
Remember the Farish article titled "What's in a Name?" (http://www.vertaasis.com/articles/whats_in_a_name.htm) that recommends three "levels" of models (or names) that line up nicely with our three stages of analysis, design, and encoding
Business names – a format that lets the requirement or semantics be easily readable and verifiable by a business person (not a modeling or XML expert). This should use familiar words and be completely technology-independent
Logical names – a format optimized for the expression of the design or model. They must be expressive enough to reflect the relationships between model components. Logical names might follow precise rules to ensure that they can be reliably stored and located in a data dictionary ("qualified names" specialize general terms to convey the context of use)
Physical names – the format required by the implementation technology for the model
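To make the three levels concrete, the sketch below derives two different physical names from one logical name. The naming conventions shown (UpperCamelCase for an XML element, snake_case for a database column) are assumptions chosen for illustration, not rules from the Farish article, and the example component is hypothetical.

```python
def xml_element_name(logical_terms):
    """Physical name under an assumed XML convention: UpperCamelCase."""
    return "".join(term.capitalize() for term in logical_terms)


def sql_column_name(logical_terms):
    """Physical name under an assumed database convention: snake_case."""
    return "_".join(term.lower() for term in logical_terms)


# Business name: readable by a business person, technology-independent.
business_name = "When the event starts"

# Logical name: a qualified name, here the general term "Time" specialized
# by "Start" and by the context "Event".
logical_name = ("Event", "Start", "Time")

print(business_name)
print(". ".join(logical_name))
print(xml_element_name(logical_name))
print(sql_column_name(logical_name))
```

The same logical name yields different physical names depending on the implementation technology, which is why the logical level is the one worth storing in the data dictionary.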
Another useful dimension for thinking about content considers the relationship in documents between the text and non-text information that they contain
Text-dominated – most of the content is conveyed by text components, with non-text components unnecessary or in an incidental role (examples: legal documents, accounting information, invoice)
Text-framework – the document reflects the organization defined by the text components, but non-text components provide content enhancements (examples: encyclopedia, maintenance manual, product catalog, purchase order)
Non-text dominated or text-enhanced – most of the content is conveyed by non-text components, which provide the framework for the text; text components carry metadata or annotate or explain intrinsically non-textual content (examples: photos, video, engineering drawing, atlas, art book)
The relationship between text and non-text information can vary at all points on the document type spectrum
Narrative document type can be philosophy (all text) or anatomy (lots of non-text)
Transactional document type can be invoice (all text) or RFQ (lots of non-text)
Usually very carefully designed, with regular structure that is exploited in information access and navigation features to enhance usability
Often have rich repertoire of content component types (pictures, maps, charts, formulas, tables)
Mixed content in paragraphs or other text blocks will contain numerous content types



Usually mostly text, created and used by people
Information that is often extremely important to companies and highly-paid professionals because the cost of finding (or not finding) information can be high
Often has high "intrinsic hypertext" character with many explicit and implicit links between content components
Often follow structural conventions and standards with regular numbering and naming schemes
Versioning and configuration requirements can pose problems
Making this type of content computable or executable is a huge R&D area (XML standards like XACML, policy engines and wizards, expert systems)

Many different types
Some are extracted from an ERP system or product database
Often contain a mixture of structured and unstructured content
Often a challenge to match the user's vocabulary and ontology for a product domain




Printed or electronic forms
Data-intensive, designed to capture and present small information components
Inputs and outputs of business processes and often created and consumed by computers
Few and somewhat arbitrary presentational characteristics
Strongly datatyped with field length, range and value, other restrictions

We can begin our consolidation with the candidate components from any of the information sources, but we recommend using the one you believe is the most authoritative or that yielded the most components
The goal is to combine components that are synonyms (different names for the same meaning) and to distinguish any homonyms (same names for different meanings)
Modeling Sylvia and Section 12.2 of DE work through this in detail
It is desirable for a set of components to enable one and only one way to describe something because duplication or redundancy implies choices that could lead to inconsistent models and non-interoperable schemas
Some guidelines or questions to help minimize duplicate components, especially when they are proposed from documents / sources / applications from different authors or organizations:
Are the differences between the proposed components substantive or stylistic? (writing or encoding style)
Are the differences "real" but "unimportant" to users or applications? (spurious precision)
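The synonym and homonym checks can be partly mechanized once an analyst has written a normalized meaning for each harvested component. The sketch below assumes that step has been done by hand; the function name, the tuple layout, and the sample calendar data are hypothetical.

```python
from collections import defaultdict


def find_synonyms_and_homonyms(harvested):
    """Group harvested components by meaning and by name.

    harvested: iterable of (source, name, meaning) tuples, where the
    meaning has already been normalized by the analyst.
    """
    names_by_meaning = defaultdict(set)
    meanings_by_name = defaultdict(set)
    for _source, name, meaning in harvested:
        key = name.strip().lower()
        names_by_meaning[meaning].add(key)
        meanings_by_name[key].add(meaning)
    # Synonyms: one meaning carried by several names (candidates to merge).
    synonyms = {m: sorted(n) for m, n in names_by_meaning.items() if len(n) > 1}
    # Homonyms: one name carrying several meanings (candidates to split).
    homonyms = {n: sorted(m) for n, m in meanings_by_name.items() if len(m) > 1}
    return synonyms, homonyms


# Hypothetical components harvested from three campus calendars.
harvested = [
    ("registrar calendar", "Start", "time the event begins"),
    ("library calendar", "Begins", "time the event begins"),
    ("athletics calendar", "Start", "first day of the season"),
    ("registrar calendar", "Location", "place the event is held"),
]

synonyms, homonyms = find_synonyms_and_homonyms(harvested)
print("Merge:", synonyms)
print("Distinguish:", homonyms)
```

The code only flags candidates. Deciding whether a difference is substantive, stylistic, or spurious precision remains a judgment for the analyst.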
Review DE 12 Again!
(first part of) Model-driven application design for a campus calendar network
Your last individual assignment -- due in two weeks
Analyze five different information sources
Sample blank prescription forms
Software for writing prescriptions
An article about "prescription writer software" that you can treat as an interview
An article about how prescriptions are handled by pharmacies
Harvest components from all the sources
Create a consolidated table of content components
Identify and scope "co-occurring" or "related" processes and documents