Attacking Quality Issues in Data Warehousing

[magazine]
Volume-Issue: 
Summary:

To fully detect, isolate, and resolve quality issues in a traditional, large-scale data warehouse requires that several approaches be used together. Wayne identifies types of data quality issues and then  illustrates how to best attack and resolve those pesky issues.

Implementation and growth of data warehouses continue to gain attraction as organizations become more aware of the benefits of decision and analytic-oriented databases. Nevertheless, there is often one important obstacle to the rapid development of commercial data warehouses: data quality. Serious problems are often discovered when planning and populating a warehouse that, if not resolved, can delay or eventually result in terminating the project.

During the past twenty years, researchers have contributed to the understanding of data quality issues, yet little research has been collectively compiled to identify root causes of data quality problems that occur throughout major phases of data warehousing.

Based on my experience, the following are primary causes of data quality defects in data warehousing [1]:

• Flaws in the data warehouse modeling and schema design
• Defects in data sources used as input to the data warehouse
• Failure to effectively profile source and target data
• Weaknesses in the design and implementation process for data warehouse staging and extract, transform, and load processes

Using Early-Phase Defect Prevention Methods

This article highlights the reasons for data deficiencies related to the root causes listed above together with timely quality assurance efforts that can be implemented for discovery and correction. It is hoped that data warehouse designers, developers, and testers cooperate and benefit by examining these quality issues before moving forward with data integration into the data warehouse.

Figure 1 displays a high-level view of the common data warehousing extract, transform, and load process where data quality and functional testing are recommended.

Figure 1

Data Quality Issues Related to Data Model and Schema Design

Design of the data model for the data warehouse greatly influences the quality of the analysis by programs that use the data. A flawed schema will negatively impact information quality.

Data modeling is the process used to define and analyze data requirements needed to support business processes within the scope of application needs. The data modeling process should involve trained data modelers working closely with business stakeholders, developers, quality assurance, and potential users of the information system. Data modeling defines not only data elements, but also their  structures and the relationships between them.

Data modeling methodologies should be used to model data in a standard, consistent, and predictable manner in order to manage the data as a resource. The use of modern data modeling standards and tools is strongly recommended for all projects.

Table 1 shows highlights of how a proper data warehouse design review can make or break your data warehouse. [2]

Quality Issues in the Data Warehouse Source Data

A leading cause of data warehousing and business intelligence project failures is finding and then loading incorrect or poor-quality data. The source system often consists of transaction and production raw  data, which is where the details are pulled from and made suitable for the data warehouse. Each of these data sources usually has its own diverse methods of storing data, which may contribute to data  quality problems if proper care is not taken.

Data warehouse environments provide the source of information used by business units to make strategic decisions. However, much of that data is created outside the warehouse. That means data quality  problems can originate at the source and can therefore persist due to faulty data acquisition and delivery processes, or interpretation and transformation glitches.

Upcoming Events

Nov 09
Nov 09
Nov 09
Nov 09