Shashank Saxena

View Original

Case Study: Designing a dimensional model for a cargo shipper

We’ll draw on a case study of a transoceanic shipping company to explore the process of designing a data warehouse schema. A schema or a dimensional model is a logical description of the entire data warehouse. We’ll consider a star schema, which is perhaps the most straightforward data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. It is characterized by one or more extensive fact tables that contain the primary information in the data warehouse and several much smaller dimension tables, each of which includes information on the entries for a particular attribute in the fact table. We will use the dimensional modeling techniques introduced by Ralph Kimball and will reference his seminal book ‘The Data Warehouse Toolkit,’ Third Edition (coauthored by Marygy Ross, 2013).

Gather Business Requirements

Before launching a dimensional modeling effort, we need to understand the Shipping Company's needs and the underlying source data's realities. We can uncover the company's requirements via sessions with business representatives to understand their objectives based on key performance indicators, compelling business issues, decision-making processes, and supporting analytic needs.

We cannot design the dimensional model in isolation as we don't fully understand the business. The required dimensional model should be designed in collaboration with subject matter experts and data governance representatives from the company. As data modelers, we will remain in charge, but the model should unfold via a series of highly interactive workshops with business representatives. So, collaboration is critical.

Understanding the case after gathering data realities

After gathering the requirements, we can list them clearly to understand the case. In this case, the shipping company serves a limited group of customers, all registered with the company. It transports bulk goods in containers between ports from one customer to another. The trip can have multiple stops at intermediate ports.

A shipment invoice contains the following information: invoice number, date picked up, date of delivery, ship from customer, ship to customer, ship from warehouse, ship from city, ship to city, shipment mode (like air/sea/truck/train), shipment class (codes like 1,2,3 which translate to express, expedited, standard), contract ID, total shipping charges, deal ID (refers to a discount deal in effect), discount amount, taxes, total billed amount, total ship weight, total ship volume. Besides, each invoice line contains product ID, product name, quantity, ship weight, ship volume, charge. Also, the same product will not show up in multiple lines in the same invoice.

The customer table also contains information on customer size (small, medium, or large), customer ship frequency (low, medium, or high), credit status(excellent, good, average). These tend to be correlated; for example, large customers tend to be high frequency and have excellent or good credit status. Each customer can have multiple contracts, but the number of contracts is relatively small.

The number of deals is relatively small, and some shipments may have no applicable discount. The company may need to query information like total revenue by deal. It may also need to identify deals in existence for a given product, source-city, destination-city, and date. Moreover, we can find an Estimated Date of Delivery for each shipment from operational systems, even though it is not printed on the invoice. There is some correlation present among Shipment Mode and Shipment Class. Products roll up to brand and category, while city/warehouses roll up to subregion, region, and territory.

Some queries that the company may be interested in are query by customer type, product category, region, year, shipment mode, contract terms for discount, ship weight, volume, delay in delivery, and revenue.  

Modeling 

The Kimball method is a four-step approach to dimensional modeling. It guides the DW design and is as follows:

Step 1: Select the Business Processes

Step 2: Declare the Grain

Step 3: Identify the Dimensions

Step 4: Identify the Facts

The business process, in our case, is Shipment Invoicing. Granularity is the highest level of detail made available in the dimensional model. The more detail we have about the fact measurement, i.e., the smaller the atomic grain, the more flexibility we have as we know more things. The grain in our case is one row per invoice line item.

The third step is the identification of the dimensions. We will represent the dimensions in table form below:

Fig.1 The Dimension Tables

The fourth step in the Kimball method is identifying the facts. The fact table is represented below:

Fig.2 Shipment Invoice Fact Table

Further notes about the dimensional model:

  • The role-playing dimensions in the form of views are in City, Date, Customer Type & Customer dimensions. The roles are mentioned in the respective tables. These role-playing dimensions serve as permissible outriggers in the model.

  • Shipment mode and Shipment Class are correlated and have low cardinality of attributes. So, a Type 4 Mini-Dimension approach is taken for Shipment Dimension.

  • The discount descriptions are found in the Deal dimension, which is a causal dimension. We will include a special record in the dimension table with a descriptive string such as 'No Deal Applicable' when a deal is not in place to avoid null keys. Discounts are allocated per invoice line item.

  • Composite key in bridge dimension: Bridge table with dual keys to capture the many-to-many relationship between customers and contracts is used in conjunction with the measurement fact table. The bridge table's primary key is a composite key made up of customer ID and contract ID.

  • A star query is a join between a fact table and several dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.

The final dimensional model can be represented as below:

We should remember that our fact grain is the invoice line item. So, in cases where the products are not sold, the Fact Table will not give us the deal existing on a given date. We need to add attributes for Deals to gain this functionality. Moreover, since this will be a slowly changing dimension (SCD) with changing deals, we can implement a type 2 SCD approach. We will need to add attributes for deals – Valid-From date, Valid-To date. Further, adding attributes for the product on which the deal is applicable and source-destination city where the deal is applicable will help track the required deal details for any given date.

This case highlights the importance of understanding the business needs and the four-step process for dimensional modeling. This leads to a model which not only solves the business requirements but also ensures business buy-in.

References:

  1. ‘The Data Warehouse Toolkit’, Third Edition, by Ralph Kimball and Marygy Ross, 2013.

  2. Oracle Data Warehousing guide, https://docs.oracle.com/cd/B10500_01/server.920/a96520/toc.htm

< Back to Data Warehousing

See this content in the original post