Copyright 2006 Robert J. Glushko
A Rigorous Approach to Identifying Aggregate Components
The heuristic and informal approach we just followed was called "document analysis" in the SGML/publishing/content management tradition
But these techniques are shaped by the skill and biases of the document analyst; they don't yield uniform results and don't scale as XML becomes the modeling foundation for transactions and distributed applications / complex business processes
If there is to be a discipline of Document Engineering, we need a more teachable and predictable modeling approach that is consistent with classical document analysis
The more transactional our design situation is, the more we can (and need to) apply additional rules or requirements that apply to "data-intensive" components
Data Integrity -- the information must be correct
Referential Integrity -- repeated data components (in an "information supply chain" or set of documents) must agree in values
Identification Integrity -- the identity of data should be unambiguous
These three principles embody the concept of Essentiality -- the model contains only the essential information with no duplication
Relational Theory is the basis for identifying essential components
Mother of all data modeling approaches (Codd, 1970s; Date, 1980s)
Relational theory gets its name from the fact that it defines relations, two-dimensional matrix views of data, which we know better as tables
The focus of relational theory is ...
Keys (essential data identification)
- uniquely referencing structures
- prevents identification integrity problems
Functional Dependency (essential data aggregates)
- aggregating data into logical groups or sets (or relations, tables, entities, object classes)
- prevents referential integrity problems
Normalization -- formal techniques for identifying and defining functional dependencies; yields a set of progressively more rigorous "normal forms"
A Key uniquely identifies an instance in a set of content components
Given a key value we could identify one, and only one, instance of a set of content components
- examples: Student ID, Username, CourseID, Building number
Candidate keys
- both Student ID and Username may be keys to the same thing, Student
Primary keys
- not all Usernames are Students, so Student ID is the better (Primary) Key
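The uniqueness test behind candidate keys can be sketched in a few lines of Python. The student records, field names, and the helper `is_candidate_key` are all hypothetical and introduced only for illustration. Note that uniqueness in a sample can only rule a key out; whether a component really is a key depends on what the data means.

```python
# Hypothetical student records; names and values are illustrative only.
students = [
    {"student_id": "S001", "username": "alee", "name": "Ann Lee"},
    {"student_id": "S002", "username": "bkim", "name": "Bo Kim"},
    {"student_id": "S003", "username": "cdiaz", "name": "Ann Lee"},
]


def is_candidate_key(rows, columns):
    """Return True if the columns take a distinct value combination in every row."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False  # two instances share this value, so it cannot identify one
        seen.add(value)
    return True


# Student ID and Username both pass the test; Name does not (two "Ann Lee" rows).
for columns in (["student_id"], ["username"], ["name"]):
    print(columns, is_candidate_key(students, columns))
```

Choosing the Primary Key among the candidates that pass is still a modeling decision, as the Student ID versus Username example shows.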
'Recognizing functional dependency is an essential part of understanding the meaning or semantics of the data' Chris Date 1981
Functional dependency means that if the values of a set of components change when another component's value changes, then the former components are dependent on the latter: each value of the latter determines exactly one value for each of the former.
For each Person we identify, there is a different Address and DateOfBirth, which means that their values are functionally dependent on Person
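A minimal sketch of how a functional dependency could be checked against sample data, assuming rows are held as Python dictionaries. The helper `holds`, the field names, and the sample values are hypothetical. A passing check only means the sample does not contradict the dependency.

```python
def holds(rows, determinant, dependent):
    """Check whether determinant -> dependent holds in the sample rows.

    The dependency fails if one determinant value is paired with two
    different dependent values.
    """
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        if mapping.setdefault(key, value) != value:
            return False
    return True


# Hypothetical people; P2 and P3 happen to share an address.
people = [
    {"person": "P1", "address": "12 Elm St", "date_of_birth": "1980-02-01"},
    {"person": "P2", "address": "9 Oak Ave", "date_of_birth": "1975-07-19"},
    {"person": "P1", "address": "12 Elm St", "date_of_birth": "1980-02-01"},
    {"person": "P3", "address": "9 Oak Ave", "date_of_birth": "1990-01-01"},
]

# Address and DateOfBirth depend on Person, but Person does not depend on Address.
print(holds(people, ["person"], ["address", "date_of_birth"]))
print(holds(people, ["address"], ["person"]))
```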
Normalization is a formal technique for identifying and defining functional dependencies
The result is a set of models that describe the network of relationships within and between groups of logical components in optimal ways that minimize redundancy (and prevent the loss or corruption of information when instances are added or deleted)
Six steps: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
1NF= First Normal Form, etc. (BCNF = Boyce/Codd Normal Form)
Often 3NF is considered adequate, but full normalization gives greater understanding of semantics
The consolidated list of unique candidate components is equivalent to 1NF in relational theory
Make all of the components discrete: each takes only a single value in its set
Remove repeating sets of data into their own (new) set
Identify components that are keys
For example, an Order may contain components for item descriptions, prices and quantities.
Because there can be many repetitions of these components, we need to introduce a component like LineItem as an aggregate data set that contains them
Now the model for Order is that it contains one or more unique LineItems (discrete with no repetition)
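The Order example can be sketched as follows, assuming an un-normalized Order that stores its descriptions, prices, and quantities as parallel lists. The field names, the values, and the `line_number` component are hypothetical additions for illustration.

```python
# Hypothetical un-normalized Order: three repeating components held as lists.
order = {
    "order_id": "O-100",
    "descriptions": ["Widget", "Gadget"],
    "prices": [2.50, 10.00],
    "quantities": [4, 1],
}


def to_line_items(order):
    """Move the repeating components into a separate LineItem set.

    Each LineItem holds single (discrete) values and is identified by the
    order's key plus an assumed line number.
    """
    repeated = zip(order["descriptions"], order["prices"], order["quantities"])
    return [
        {
            "order_id": order["order_id"],
            "line_number": number,
            "description": description,
            "price": price,
            "quantity": quantity,
        }
        for number, (description, price, quantity) in enumerate(repeated, start=1)
    ]


line_items = to_line_items(order)
for item in line_items:
    print(item)
```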
2NF separates all non-dependent components
Split off into separate (new) sets any components that do not wholly depend on the entire key
Focus on sets of components with composite keys
For example, the Quantity depends on the precise LineItem on a specific Order. It applies only to this particular occurrence of a LineItem.
In contrast, Description and Price may be the same each time
They are not dependent on the LineItem (they more likely depend on a product catalogue that lists item descriptions and prices).
2NF would separate these non-dependent components into their own set, possibly called Product or Item
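A sketch of this 2NF split, assuming each LineItem is identified by the composite key (order, item) and that description and price depend on the item alone. The `item_code` component and all values are hypothetical.

```python
# Hypothetical LineItems keyed by (order_id, item_code).
line_items = [
    {"order_id": "O-100", "item_code": "W1", "description": "Widget", "price": 2.50, "quantity": 4},
    {"order_id": "O-100", "item_code": "G1", "description": "Gadget", "price": 10.00, "quantity": 1},
    {"order_id": "O-101", "item_code": "W1", "description": "Widget", "price": 2.50, "quantity": 7},
]

products = {}          # new set: components that depend on item_code only
line_items_2nf = []    # remaining set: components that depend on the whole key

for row in line_items:
    products[row["item_code"]] = {
        "description": row["description"],
        "price": row["price"],
    }
    line_items_2nf.append(
        {
            "order_id": row["order_id"],
            "item_code": row["item_code"],  # also acts as a foreign key to Product
            "quantity": row["quantity"],
        }
    )

print(products)
print(line_items_2nf)
```

The description and price of the Widget are now stored once, however many orders include it.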
3NF ensures that all non-key components are independent of one another
As for 2NF but for components that are not keys.
For example, Order may contain a CustomerName and their AccountCode
These non-key values may have some dependency on each other and should be separated into another set, e.g. Customer.
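The Customer example might look like this in code. It assumes that AccountCode determines CustomerName, which is an assumption about the meaning of the data rather than something the sample proves. All values are hypothetical.

```python
# Hypothetical Orders in which customer_name depends on account_code,
# not on the order's own key.
orders = [
    {"order_id": "O-100", "account_code": "A7", "customer_name": "Acme Ltd"},
    {"order_id": "O-101", "account_code": "A7", "customer_name": "Acme Ltd"},
    {"order_id": "O-102", "account_code": "B2", "customer_name": "Birch Co"},
]

# New Customer set, keyed by account_code.
customers = {
    row["account_code"]: {"customer_name": row["customer_name"]} for row in orders
}

# Order keeps only its key and a foreign key to Customer.
orders_3nf = [
    {"order_id": row["order_id"], "account_code": row["account_code"]}
    for row in orders
]

print(customers)
print(orders_3nf)
```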
3NF also involves removing any derived or calculated components
Also remove derived components
- calculated from other components, or derived through logic from other values
- they are a type of duplication through transitive dependency
- examples: TotalAmount, NumberofItems, Discount% with DiscountAmount
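One way to keep derived components out of the model is to compute them when needed. The sketch below assumes TotalAmount is the sum of price times quantity, and interprets NumberofItems as the sum of quantities; both interpretations, and the sample values, are assumptions.

```python
from decimal import Decimal

# Hypothetical LineItems; Decimal avoids binary floating-point rounding in money sums.
line_items = [
    {"price": Decimal("2.50"), "quantity": 4},
    {"price": Decimal("10.00"), "quantity": 1},
]


def total_amount(items):
    """Derive the total from the stored components instead of storing it."""
    return sum((item["price"] * item["quantity"] for item in items), Decimal("0"))


def number_of_items(items):
    """Derive the item count (assumed here to be the sum of quantities)."""
    return sum(item["quantity"] for item in items)


print(total_amount(line_items))
print(number_of_items(line_items))
```

Because the totals are never stored, they cannot disagree with the line items they are calculated from.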
Assume we've analyzed my lecture notes to yield a set of candidate components
We can represent a populated set of content components for a lecture in a single data set (as a table or "relation")



If I taught the same lectures every year the content of my slides could be arranged like this
Lots of redundancy here: we start by analyzing how Lecture number, Theme, Year of Creation, Author Name and Class Identifier repeat
They repeat for each unique combination of Date of Presentation and Learning Material Theme
That makes that combination the Primary Key for that set of information components

We can use keys to factor the original table into two tables with less redundancy
The LECTURE table, which contains the content and has a primary key of Lecture number
The PRESENTATION table, which specifies when the lecture was given


The PRESENTATION table is joined to the LECTURE table by including its primary key (lecture number) as a foreign key
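The join through a foreign key can be sketched as below. The original tables are not reproduced here, so the lecture themes and presentation dates are invented placeholders, and only a few of the components are shown.

```python
# Hypothetical LECTURE set, keyed by lecture number (its primary key).
lectures = {
    1: {"theme": "Introduction"},
    2: {"theme": "Normalization"},
}

# Hypothetical PRESENTATION set; lecture_number is the foreign key to LECTURE.
presentations = [
    {"date_of_presentation": "2005-09-01", "lecture_number": 1},
    {"date_of_presentation": "2006-09-05", "lecture_number": 1},
    {"date_of_presentation": "2006-09-12", "lecture_number": 2},
]


def join(presentations, lectures):
    """Rebuild the original wide rows by following the foreign key."""
    return [{**p, **lectures[p["lecture_number"]]} for p in presentations]


# Every foreign key value must match a primary key value in LECTURE.
dangling = [p for p in presentations if p["lecture_number"] not in lectures]

for row in join(presentations, lectures):
    print(row)
```

Lecture 1 was presented twice, but its theme is stored once and recovered by the join.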

However, PRESENTATION still has repeating sets of values (Learning Material Number and Learning Material Description) so it is not yet in First Normal Form

We need to separate out these components into their own set, which we will call LEARNING MATERIAL
- the Primary Key of LEARNING MATERIAL could be Learning Material Number

After LEARNING MATERIAL is extracted from PRESENTATION, PRESENTATION must include the Primary Key of LEARNING MATERIAL (Learning Material Number) as a Foreign Key
PRESENTATION now consists only of keys

The set we called LECTURE still has repeating sets of values (Year of Creation, Author Name and Class Identifier)
We need to separate out these components into their own set, which we will call CLASS

- the Primary Key of CLASS could be Class Identifier

The remaining components of LECTURE must now include the Primary Key of CLASS (Class Identifier) as a Foreign Key
In Second Normal Form we look at data sets with composite keys to ensure all components depend on the entire key
In our model, only PRESENTATION has a composite key
- but neither of the non-key components (Learning Material Number and Lecture number) is dependent on just Date of Presentation or just Learning Material Theme; they depend on both.
So our model is already in Second Normal Form as well
In Third Normal Form we ensure that all non-key components are independent of one another
In our model, only CLASS has more than one non-key component
- but neither of the non-key components (Year of Creation, Author Name) is dependent on the other. An author could have other creative years and vice versa.
So our model is already in Third Normal Form as well


The Primary Key Path shows how we can navigate among the relations / aggregates in the model
Transform the model into a UML Class Diagram (data sets/tables become Object Classes, individual components become Attributes, Primary Key Paths become Associations)
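As a rough illustration of that transformation, the four sets could be written as Python dataclasses, with object references standing in for the Associations. The class and attribute names follow the sets above, but the Python naming, the name `CourseClass` (used because `class` is a reserved word), and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CourseClass:          # the CLASS set
    class_identifier: str
    year_of_creation: int
    author_name: str


@dataclass(frozen=True)
class Lecture:              # the LECTURE set
    lecture_number: int
    theme: str
    course_class: CourseClass   # association replacing the Class Identifier foreign key


@dataclass(frozen=True)
class LearningMaterial:     # the LEARNING MATERIAL set
    number: int
    description: str


@dataclass(frozen=True)
class Presentation:         # the PRESENTATION set
    date_of_presentation: str
    lecture: Lecture                      # association to LECTURE
    learning_material: LearningMaterial   # association to LEARNING MATERIAL


# Hypothetical instances, navigated along the primary key path.
course = CourseClass("C-1", 2005, "Author A")
lecture = Lecture(1, "Introduction", course)
material = LearningMaterial(10, "Slides")
presentation = Presentation("2006-09-05", lecture, material)

print(presentation.lecture.course_class.author_name)
```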

Chapter 14 of Document Engineering