Copyright 2006 Robert J. Glushko
Analyzing tables
Consolidating the component harvests
Designing components for reuse
Organizing components for reuse
Structural aggregates
Conceptual aggregates

Most tables follow regular matrix or structural patterns in which the organization and presentation of information is consistent with or reinforces the relationships between the content of the cells or regions defined by the matrix
A transactional document type is often little more than a table with some additional information about the parties to the transaction
Many hybrid document types intersperse semi-structured components with structured sets of components in tabular format
When should an embedded table be best understood as a "mini-document type" of its own?
The nature of these relationships is often explicitly represented in the headings for rows, columns, or other structural elements
When the relationships are not explicit, they can often be determined by analyzing the datatypes and content of the cells or the manner in which the content varies from cell to cell
<Courses>
  <Course>
    <Title>Document Engineering</Title>
    <Instructor>Glushko</Instructor>
    <Schedule>MW 2:00-3:30</Schedule>
    <Location>South Hall 202</Location>
  </Course>
  <Course>
    <Title>XML Technologies</Title>
    <Instructor>Milowski</Instructor>
    <Schedule>TTh 2:00-3:30</Schedule>
    <Location>South Hall 110</Location>
  </Course>
  <Course>
    <Title>Professional Skills</Title>
    <Instructor>Downes</Instructor>
    <Schedule>Th 12:00-2:00</Schedule>
    <Location>South Hall 205</Location>
  </Course>
</Courses>
<table>
  <tr>
    <td>Document Engineering</td>
    <td>Glushko</td>
    <td>MW 2:00-3:30</td>
    <td>South Hall 202</td>
  </tr>
  <tr>
    <td>XML Technologies</td>
    <td>Milowski</td>
    <td>TTh 2:00-3:30</td>
    <td>South Hall 110</td>
  </tr>
  <tr>
    <td>Secrets of Consulting</td>
    <td>Downes</td>
    <td>Th 12:00-2:00</td>
    <td>South Hall 205</td>
  </tr>
</table>
And in addition to the problem that most tables aren't represented in ways that capture their "tablehood" essence, it has been estimated that 95% of the information marked up as <TABLE> on the web is not really a table
What characteristics distinguish genuine tables from non-genuine ones?
And some small percentage of things that are tables according to our definition defy content encoding because they combine content, structure, and presentation in ways that are often impossible to untangle or that are highly idiosyncratic but conventional
Deconstructing tables into their content types is an instance of the more general goal of transforming presentation to content
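As a minimal sketch of this presentation-to-content transformation, the Python below rewrites rows of the presentational course table as content markup. The column-to-element mapping (COLUMN_NAMES) and the function name table_to_courses are assumptions made for illustration: a table without headings does not state what its columns mean, so the analyst has to supply that mapping.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from column position to content component name.
# The presentational table itself carries no such labels.
COLUMN_NAMES = ["Title", "Instructor", "Schedule", "Location"]


def table_to_courses(table_markup: str) -> ET.Element:
    """Convert <table>/<tr>/<td> presentation markup into <Courses> content markup."""
    table = ET.fromstring(table_markup.strip())
    courses = ET.Element("Courses")
    for row in table.findall("tr"):
        course = ET.SubElement(courses, "Course")
        # Pair each cell with the component name assumed for its column.
        for name, cell in zip(COLUMN_NAMES, row.findall("td")):
            ET.SubElement(course, name).text = (cell.text or "").strip()
    return courses


presentation = """
<table>
  <tr><td>Document Engineering</td><td>Glushko</td><td>MW 2:00-3:30</td><td>South Hall 202</td></tr>
  <tr><td>XML Technologies</td><td>Milowski</td><td>TTh 2:00-3:30</td><td>South Hall 110</td></tr>
</table>
"""

courses = table_to_courses(presentation)
print(ET.tostring(courses, encoding="unicode"))
```

The sketch only works because this table is regular; tables that mix content, structure, and presentation idiosyncratically would need case-by-case analysis instead of a fixed column mapping.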
Other presentation components and conventions that carry semantic information should be made explicit as content components
The mere existence or non-existence of values within the cells of a table can have semantic significance.
Color coding: Red text or box around text -> warning
Adjacency: figure and caption -> illustration aggregate
We can begin our consolidation with the candidate components from any of the information sources, but we recommend using the one you believe is the most authoritative or that yielded the most components
The goal is to combine components that are synonyms (different names for the same meaning) and to distinguish any homonyms (same names for different meanings)
Modeling Sylvia and Section 12.2 of DE work through this in detail
It is desirable for a set of components to enable one and only one way to describe something because duplication or redundancy implies choices that could lead to inconsistent models and non-interoperable schemas
The component names we harvest from information sources might not be consistent; resolving synonyms and homonyms is almost always necessary
How rigorous we need to be in naming (and re-naming) components depends on the size of the inventory and the scope of the project
At a minimum we should create a component name dictionary (terms and definitions)
We might need more explicit rules for naming components and aggregates to express the relationships between components
Precise rules can ensure that components can be reliably stored and located in a data dictionary
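The synonym and homonym checks described above can be sketched against a small component name dictionary. The harvested entries below are hypothetical, and matching definitions by exact string is a simplifying assumption; in practice an analyst decides whether two definitions mean the same thing.

```python
from collections import defaultdict

# Hypothetical harvested candidates: (information source, name, definition).
harvest = [
    ("course catalog", "Instructor", "person who teaches a course"),
    ("schedule of classes", "Teacher", "person who teaches a course"),
    ("course catalog", "Title", "name of a course"),
    ("staff directory", "Title", "job rank of a person"),
]


def find_synonyms(candidates):
    """Return definitions that are attached to more than one name."""
    names_by_definition = defaultdict(set)
    for _source, name, definition in candidates:
        names_by_definition[definition].add(name)
    return {d: sorted(n) for d, n in names_by_definition.items() if len(n) > 1}


def find_homonyms(candidates):
    """Return names that are attached to more than one definition."""
    definitions_by_name = defaultdict(set)
    for _source, name, definition in candidates:
        definitions_by_name[name].add(definition)
    return {n: sorted(d) for n, d in definitions_by_name.items() if len(d) > 1}


synonyms = find_synonyms(harvest)  # candidates to combine under one name
homonyms = find_homonyms(harvest)  # candidates to distinguish with new names
print(synonyms)
print(homonyms)
```

Synonyms are merged under one agreed name, while each homonym needs a distinct name (for example by qualifying it) before it goes into the dictionary.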

We have now reached the point where we have captured the business rules and content components of the domain / document inventory in which we're working
We have separated the Presentational, Structural and Content Components
We have developed a conceptual model of our consolidated and essential "atomic" content components -- semantic equivalence classes
Now we have to ensure that we can reuse these components when we assemble document models from them
Document artifacts differ a great deal in how they combine content, structure, and presentation components
Some combinations are idiosyncratic and ad hoc or represent compromises between incompatible requirements that make structures less than optimal
If we are completely constrained by the artifacts as they exist in our component model, we will preserve both their good and bad aspects – which may be influenced by factors which are not part of our new requirements
So our analysis models of components and aggregates may need to be revised to allow alternative ways of satisfying our requirements that relax the (implicit) constraint to preserve the original artifacts
The component model may present many attractive options for re-design and reuse of our content components
Design means changing our model, not simply improving the way we view it. This is when we actually get to apply our insights about reuse and patterns
During design we can devise more consistent component names, remove repeating or reoccurring content and structure, increase reuse of standard patterns or components, replace implicit components with explicit ones, and otherwise create a more abstract, concise, and context-free representation of the essential characteristics
Once design begins, we cannot guarantee that we will be able to recreate the original artifacts from the model
Unless, of course, we have a requirement to be able to do so
Designing a factory that makes "build-to-order" computers:
You might start with some collection of computers and take them apart to see what pieces are needed to assemble them (ANALYSIS)
Because you want to be able to make these items with reasonable quality but at less cost and at greater efficiency, you redesign the computers to use standard components (DESIGN FOR REUSE)
You organize the components and the assembly lines to make it easy to locate components when you get an order (ORGANIZE FOR REUSE)
Designing a conceptual model of some domain:
You might start with a set of hand-crafted applications with printed or online data entry forms and take them apart to see what pieces of information each of them needs (ANALYSIS)
Because you want the complete "enterprise model" for the domain to be able to represent any application or form with reasonable quality but at less cost and at greater efficiency, you redesign the pieces of information from analysis to be more standard and context-free (DESIGN FOR REUSE)
You organize the components to make it easy to locate the components when you build the specific contextualized model for an application or form (ORGANIZE FOR REUSE)
ISO 11179 covers "Specification and Standardization of Data Elements"
It proposes a three-part naming scheme for each unique data element that nicely fits the methodology of identifying object classes, their properties, and their data types
The name consists of:
The object class to which the data element belongs
The property term reflecting a property or distinguishing characteristic of the object class
A representation term characterizing the data element’s valid values
ISO 11179 datatypes/representation terms are: Amount, Code, DateTime, Graphic, Identifier, Indicator, Measure, Numeric, Picture, Quantity, Text
Example 11179 names: Event.Title.Text, Car.Color.Code
ISO 11179 is mandated in the Federal Enterprise Architecture and is used in many standards activities for XML vocabularies (including UBL)
It would be overkill for small projects, but establishes a rigorous naming benchmark
Consistent grammar for qualified names is a good compromise for many projects
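A minimal sketch of the three-part naming scheme follows. The class name DataElementName and its parse helper are assumptions for illustration, not part of ISO 11179; the set of representation terms is copied from the list above.

```python
from dataclasses import dataclass

# Representation terms as listed in the text above.
REPRESENTATION_TERMS = {
    "Amount", "Code", "DateTime", "Graphic", "Identifier", "Indicator",
    "Measure", "Numeric", "Picture", "Quantity", "Text",
}


@dataclass(frozen=True)
class DataElementName:
    """Hypothetical holder for an ObjectClass.Property.Representation name."""
    object_class: str
    property_term: str
    representation_term: str

    def __post_init__(self):
        # Reject names whose third part is not a listed representation term.
        if self.representation_term not in REPRESENTATION_TERMS:
            raise ValueError(f"unknown representation term: {self.representation_term}")

    def __str__(self):
        return f"{self.object_class}.{self.property_term}.{self.representation_term}"

    @classmethod
    def parse(cls, dotted: str) -> "DataElementName":
        """Split a dotted name into its three parts and validate it."""
        parts = dotted.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected three dot-separated parts: {dotted!r}")
        return cls(*parts)


event_title = DataElementName("Event", "Title", "Text")
car_color = DataElementName.parse("Car.Color.Code")
print(event_title, car_color)
```

Even a project that does not adopt the full standard can use a check like this to keep qualified names grammatically consistent.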
You might find yourself doing a partial enumeration in a domain where a standard set of enumerated values already exists
It would be better to use the standard codes
But standard codes can sometimes cause problems
You can add a context qualifier to specialize a component rather than defining a completely new one (this reuses the "base" component type)
In the Engineering Compendium both Figures and Tables have Captions, and the Caption is similar enough in both to allow it to be re-used by both.
This suggests components for FigureCaption, TableCaption
You can also remove or "factor out" context to define a more general or abstract component that can be used more broadly
"Delivery Date" and "Ship Date" suggest a "Date" component
Your set of components has to find a balance between precision and generality (or flexibility)
You want a set of components that can be reused across related document types in some context (or group of related contexts)
Contexts fit into a continuum -- At one end is an extremely specific context in which component definitions are suitable for a very narrow class of documents (example: a component model that describes the existing course catalog for a single department or school)
At the other end is an extremely loose or general context, suitable for a very broad class of documents (example: a component model that describes a catalog as a set of items of any type)
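Moving along this continuum amounts to adding or removing context qualifiers. The sketch below shows both directions using the Caption and Date examples from the text; the Component class and the qualify and factor_out helpers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    """Hypothetical component: a reusable base plus an optional context qualifier."""
    base: str
    qualifier: Optional[str] = None

    @property
    def name(self) -> str:
        return f"{self.qualifier or ''}{self.base}"


def qualify(component: Component, qualifier: str) -> Component:
    """Specialize a component for a context while reusing its base."""
    return Component(component.base, qualifier)


def factor_out(qualified_names: List[str], base: str) -> List[Component]:
    """Remove context from names that share a base, exposing the general component."""
    components = []
    for name in qualified_names:
        compact = name.replace(" ", "")
        if not compact.endswith(base):
            raise ValueError(f"{name!r} does not end with base {base!r}")
        qualifier = compact[: -len(base)] or None
        components.append(Component(base, qualifier))
    return components


# More specific: Caption qualified for two contexts.
caption = Component("Caption")
figure_caption = qualify(caption, "Figure")
table_caption = qualify(caption, "Table")

# More general: two dates reduced to one shared "Date" base.
dates = factor_out(["Delivery Date", "Ship Date"], "Date")
print(figure_caption.name, table_caption.name, [d.name for d in dates])
```

Keeping the base and the qualifier as separate fields is one way to make the reuse explicit: every specialized component still points at the general one it came from.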
Components – the units of content
Any piece of information that has a unique label or identifier is a candidate component
Any piece of information that is self-contained and comprehensible on its own is a candidate component
A component is a logical unit, with no presentation implied; it may be organized structurally
These definitions are very helpful for finding (aggregate) components in some types of documents but less so in others
It depends on the presence of, and relationships with, the structural and presentational information
Atomic components that hold individual pieces of information
Especially in transactional documents, where atomic components have a natural representation as primitive data types ("string," "Boolean," "date") or as datatypes that are derived from these by restriction
Document components that assemble smaller components into the set of information needed to carry out a self-contained purposeful activity
Especially in transactional contexts, where documents have a natural correspondence to some unit of work that initiates, records, or responds to a clearly-defined event
Aggregate components are composed of atomic ones and are reused in the assembly of document components
They are easier to identify in transactional contexts because they are often the key information that flows from one document to another
"Address" or "Person" are obvious examples of aggregates composed of smaller ones
Two key questions:
How do we select and group atomic components into aggregates?
How many aggregates should we create?
Structural Aggregates -- sets of components defined by parent-child or containment relationships
Conceptual Aggregates -- sets of components that "go together" because of logical dependency
Aggregates are more elusive on the narrative end of the DTS because there are limits to the rigor with which components can be grouped
"Mixed content" models arise when there are few or weak constraints on where atomic components can appear
Presentation often masks the atomic components in potential aggregates
Structures are often based more on conventions for organization and presentation than on semantic relationships
But there will still generally be components that "go together" to form reusable structures
And "going together" means different things for each set of components
Aggregates can be created in two "bottom-up" ways that focus on the atomic components:
The first is by rebuilding or making explicit the structures that we took apart in document analysis
The second is by creating structure in "blobs" of poorly structured information written in an overly narrative style (with mixed content at best)
A more modular style for the information will increase its regularity and reusability; it will eliminate content that has little value to users and reinforce its use as "boilerplate" or via links



Different aggregates might have the same components
"Contract" and "Shipment" might both have "Start Date," "End Date" and "Duration"
The repeated components can be extracted and created as a reusable aggregate
In this example we might call the common pattern the "Period"
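The extraction of a repeated pattern can be sketched as a set operation. The "Party" and "Carrier" components are assumptions added so that each aggregate has something left over; the name "Period" is chosen by the analyst, not discovered by the code.

```python
from itertools import combinations

# Aggregates as sets of component names ("Party" and "Carrier" are assumed extras).
aggregates = {
    "Contract": {"Start Date", "End Date", "Duration", "Party"},
    "Shipment": {"Start Date", "End Date", "Duration", "Carrier"},
}


def shared_components(aggregates, minimum=2):
    """Find component sets that repeat across each pair of aggregates."""
    shared = {}
    for a, b in combinations(sorted(aggregates), 2):
        common = aggregates[a] & aggregates[b]
        if len(common) >= minimum:
            shared[(a, b)] = common
    return shared


def extract(aggregates, common, new_name):
    """Replace the repeated components with one reusable aggregate."""
    result = {new_name: set(common)}
    for name, parts in aggregates.items():
        if common <= parts:
            result[name] = (parts - common) | {new_name}
        else:
            result[name] = set(parts)
    return result


shared = shared_components(aggregates)
refactored = extract(aggregates, shared[("Contract", "Shipment")], "Period")
print(refactored)
```

Overlap in names is only a hint; whether the repeated components really mean the same thing in both aggregates still has to be confirmed before they are merged.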
Many of the patterns that you might identify as repetitions would have also been identified in a previous analysis of your domain or context
You should determine if their analysis yielded components for you to reuse
Pay particular attention to "standards" if they come from credible sources
Allison Bloodworth discussed her analysis of iCAL ("Internet Calendaring and Scheduling Object Core Specification") and SKICal ("Structured Knowledge Initiative Calendar") in the report on the event modeling
But don't accept someone else's analysis and models if you don't understand them
And NEVER assume that a component model is appropriate solely on the basis of its name because...
UBL has invested three years in creating conceptual component models so that people can confidently reuse the XML schemas based on them
No class meeting -- lecture will be recorded
Read Sections 3.2.4-end of "Model-driven Application Design for a Campus Calendar Network" (Bloodworth and Glushko).
Database Normalization (Gilfillan)