As an Information Architect, I have built Conceptual and Logical Data Models (CDM and LDM, respectively) for different business applications and infrastructures for many years. On these projects, I have experienced many different kinds of data abstraction and normalization scenarios. The purpose of this post is to share some of my data modeling best practices with beginner data modelers on data abstraction and normalization.
One of the most important strategies when forming a Logical Data Model is to represent the data requirements in a scalable and flexible structure to support the current business requirements and any possible future changes. Occasionally, the architectural design becomes complex due to the need of balancing between the presentation of business rules and the standard of data modeling rules.
When it comes to designing a Logical Data Model for a transactional system, the first lesson is to understand data abstraction so that each data requirement is documented in 3rd normal form (3NF). As we know, a normalized model is required to reduce the duplication of data definitions and ensure referential integrity across the model entities. There are multiple levels of normalization based on data modeling rules, and a Logical Data Model is usually documented in 3NF.
In 3NF, every attribute or combination of attributes that can uniquely identify an instance is identified as a business key (primary key). All the normalization rules are closely tied to the business key, so that the data is stored in one, and only one, place to reduce data redundancy and increase data referential integrity (data consistency).
Normalization is the process of applying rules to organize business complexity into a stable and flexible data structure. A deeper understanding of each data element (attribute) is required, to see each data element in relationship to every other data element. The basic goal of normalization is to keep each data element in only one place. A level of normalization characterizes the relationships between data attributes. The definitions below are summarized from multiple places, and only the high level rules are restated:
- 1st Normal Form (1NF): An entity type is in first normal form when it contains no repeating groups of data. It ensures each entity has a valid primary key, every data element depends on the primary key, removes repeating groups, and ensures each data element is atomic (not multi-valued).
- 2nd Normal Form (2NF): An entity type is in second normal form when it is in 1NF and when every non-key attribute (any attribute that is not part of the primary key) is fully dependent on the primary key. It ensures each entity has the minimal primary key and that every data element depends on the complete primary key.
- 3rd Normal Form (3NF): An entity type is in third normal form when it is in 2NF and when all of its attributes are directly dependent on the primary key. A better way to word this rule might be that the attributes of an entity type must depend on all portions of the primary key. It ensures each entity has no hidden primary keys and that each data element depends on no data element outside the key.
Examples of Data Normalization Process
So how do we start the thinking about performing data normalization in a real data modeling problem? Let’s use some simple examples to explain the process.
Example 1: Product
To define all the business-required data about a Product, we first create the entity then associate all the required elements (attributes) together in the entity.
The next step is to check which data elements (attributes) in this entity repeat for each Product instance, and do not directly depend on the Product Number (the primary key). It is obvious that Product Group and Product Type information may repeat for multiple Products when those Products fall into the same Group or same Type (see Figure #1 to the left).
Also, the Product Group and Product Type concepts can exist without any actual Product instance. Therefore, both Product Group and Product Type are independent business concepts from the Product primary key (Product Number) and need to be abstracted out from the Product entity (see Figure #2).
Since the Product Type and Product Group are still related information to a Product, we need to create foreign key relationships between Product Group/Product Type and Product entity. The relationships specify the business rules between them, such as a Product Group is the grouping of multiple Products, and a Product can only be categorized into one Product Group.
But we are still not finished – refer back to Figure #1. We see that a Product instance can have a different number of components.
For example, Office 365 contains multiple components including Word, Excel, Outlook, and OneNote; SQL Server has SSIS, SSAS and SSRS service components. Even though each Product Component directly depends on its parent Product, it also has its own attributes.
In this case, we need to extract the Product Component elements out of the Product entity, and tie it to the Product as a child entity. See the lower center of Figure #2. In the normalized structure, it is flexible to support as many components as needed for a Product, and each Product Component attribute is clearly defined in the model as the most granular level of detail.
Example 2: Service Request Status
Another typical example of data normalization is tracking the history of a business transaction. For example, a Service Request can have multiple statuses for the duration of the request, but it only has one status at a given point in time.
When we collect all the required status information together along with the Service Request (SR) entity, we see the statuses are repeating for different SR’s, and the Service Request Status concept can exist without an SR instance. See Figure #3 below which shows the SR entity before the normalization process.
It is obvious that we need to normalize a Service Request Status out of the Service Request entity by following basic normalization rules. Figure #4 below represents Service Request with its current status and the date when the SR changed into its current status. If a business wants to track the full history of the status changes for an SR, another level of abstraction is needed.
In Figure #5, we use an intermediate entity to represent the association between a Service Request and its respective status changes. The Service Request Status Track entity documents which SR is in which Status during what time period. This design supports the most granular level of details on SR status history.
The examples above provide some basic ideas on data abstraction and normalization processes when performing Logical Data Modeling. There are more complicated situations when more complex business rules are applied during a transactional process, which requires a data modeler to have a thorough understanding of the processes, terminologies, and relationships between data elements. Data abstraction and normalization strategy may be presented in a hierarchical design or a self-join. Sometimes the sub-typing is needed or the Bill of Material (BOM) pattern etc. We will discuss these deeper level scenarios in an upcoming post in our data column.