Business Intelligence and Data Quality

Informed companies make better decisions, and business intelligence (BI) solutions are vital to the health of enterprises of all sizes. The need for BI is not about the technology but about the information those solutions deliver, so business owners should drive the demand for BI solutions. The technology team does not create the data or use it in a business context, but it does provide the mechanisms to store, retrieve, and archive the data accurately and securely.

BI tools organize dissimilar and diverse data from across the enterprise, exposing multidimensional relationships that yield keener insights and support decisions with on-demand intelligence. More than just spreadsheets on steroids, BI lets users tailor information to their own perspective.

In today's economic climate, the focus is on reducing waste and maximizing the value of information already known about consumers and end users. A business that invests time and money in its BI initiative has built and substantiated a business case for why that investment will yield a positive return from both revenue and cost perspectives. It has a solid framework and data architecture, good governance, aligned leadership, and the right tools. But the analysis is only as good as the quality of the data. If the testing process is weak and the data quality and data integrity tests are suspect, the business is at risk.

Where to Start
As with any strategy, BI initiatives must be defined, organized, developed, and tested. It is especially important to define a data stewardship role—a person or committee who will own and police data integrity between key functions in the company. This role becomes essential when integrating data across conflicting business rules.

The transformation of data into information and then into intelligence (or "intelligent decision making") is the primary goal of BI tools. Because the source data comes from multiple systems, the scope of testing must account for the many combinations of migrations, conversions, and transformations of data across the enterprise. Add to these combinations the many relationships and dimensions within the BI tool, and the number of test combinations runs into the billions; for complex systems, it could easily approach the trillions.


Figure 1: Data flow

Extract, Transform, and Load
In computing, extract, transform, and load (ETL) refers to a process in database usage, and especially in data warehousing, that involves: 1) extracting data from outside sources, 2) transforming it to fit operational needs (which can include quality levels), and 3) loading it into the end target (the database or, more specifically, the operational data store, data mart, or data warehouse). [1]
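
To make the three steps concrete, here is a minimal sketch of an ETL job, assuming Python with only the standard library; the file, column, and table names are hypothetical.

```python
import csv
import sqlite3

# Extract: read a hypothetical source-system export.
with open("customers_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: apply simple cleansing rules so the data fits the target's needs.
for row in rows:
    row["customer_id"] = int(row["customer_id"])
    row["country"] = row["country"].strip().upper() or "UNKNOWN"

# Load: write the transformed rows into a staging table in the target store.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS stg_customer (customer_id INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT INTO stg_customer (customer_id, country) VALUES (?, ?)",
    [(r["customer_id"], r["country"]) for r in rows],
)
conn.commit()
conn.close()
```

Real ETL jobs are driven by the mappings and business rules discussed below, but even a toy pipeline like this has points where records can be lost, truncated, or mis-transformed, and each of those points needs a test.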

Begin by confirming that validation is performed consistently across the business's source systems. Investigate each source system's business data and understand the relationships (metadata management) across the enterprise; therein lies the root cause of potential data problems. Examine the mappings for outbound data files. Ask questions. What were the data relationships, and which business rules were applied? How and where were they applied during the ETL process? Did the tests cover stored procedures and functions, triggers, indexing, field size, field type, and formats? Do the data marts or data extracts need referential integrity tests to ensure the reliability and integrity of the data?
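
The sketch below shows what a few of these checks might look like, assuming the extracts are available as flat files and that pandas is used; the file and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts from the source system and the loaded data mart.
orders = pd.read_csv("orders_extract.csv")
customers = pd.read_csv("customers_extract.csv")

# Referential integrity: every order must reference an existing customer.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
assert orphans.empty, f"{len(orphans)} orders reference missing customers"

# Field size: postal codes must fit the target field (at most 10 characters here).
bad_length = customers[customers["postal_code"].astype(str).str.len() > 10]
assert bad_length.empty, f"{len(bad_length)} postal codes exceed the field size"

# Format: order dates must parse as valid dates.
parsed = pd.to_datetime(orders["order_date"], errors="coerce")
assert parsed.notna().all(), "some order dates are not valid dates"
```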

Ensure that a common data dictionary and set of definitions is in place to facilitate certification efforts. Be concerned about legacy data, as the application has likely undergone many transformations throughout its life. With new applications or business rules, data gets appended and enhanced and might not resemble its original state. Look for the originating event, which could be embedded in the ETL process. These data anomalies should be clearly vetted across the organization and mitigated through a common approach.
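
One way to put the data dictionary to work in testing is to check each extract against the agreed definitions. The sketch below assumes a simple dictionary of expected types and nullability; the structure, fields, and file name are illustrative, not a prescribed format.

```python
import pandas as pd

# An illustrative slice of a shared data dictionary: expected type and nullability.
data_dictionary = {
    "customer_id": {"dtype": "int64", "nullable": False},
    "country": {"dtype": "object", "nullable": False},
    "signup_date": {"dtype": "datetime64[ns]", "nullable": True},
}

extract = pd.read_csv("customers_extract.csv", parse_dates=["signup_date"])

# Flag columns that are missing from the extract or undefined in the dictionary.
missing = set(data_dictionary) - set(extract.columns)
undefined = set(extract.columns) - set(data_dictionary)

# Check type and nullability for the columns that are present.
violations = []
for col, rule in data_dictionary.items():
    if col not in extract.columns:
        continue
    if str(extract[col].dtype) != rule["dtype"]:
        violations.append(f"{col}: expected {rule['dtype']}, got {extract[col].dtype}")
    if not rule["nullable"] and extract[col].isna().any():
        violations.append(f"{col}: contains nulls but is defined as not nullable")

print("missing:", missing, "undefined:", undefined, "violations:", violations)
```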

Data Loading
Once certification of a set of data loads or controlled batches is complete, take image copies to re-baseline the database. Similarly, source systems should take image copies of their data to facilitate reconciliation. Quality and integrity issues are often found at this point, and restoring to a certified copy enables more efficient testing.
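
A lightweight way to confirm that a restored or reloaded table matches its certified image copy is to compare record counts and a checksum. The sketch below assumes SQLite-style files and hypothetical database and table names; any table-level fingerprint would serve the same purpose.

```python
import hashlib
import sqlite3

def table_fingerprint(db_path, table):
    """Return the row count and a checksum of the table, for reconciliation."""
    conn = sqlite3.connect(db_path)
    rows = sorted(conn.execute(f"SELECT * FROM {table}").fetchall(), key=repr)
    conn.close()
    return len(rows), hashlib.sha256(repr(rows).encode()).hexdigest()

# Compare the certified image copy against the database after a reload or repair.
baseline = table_fingerprint("warehouse_baseline.db", "stg_customer")
current = table_fingerprint("warehouse.db", "stg_customer")
print("match" if baseline == current else f"mismatch: {baseline[0]} vs {current[0]} rows")
```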

One of the key questions to ask when issues are found is whether to repair the data or reload it. If a data anomaly is the result of a flaw in the ETL logic, the business should typically correct the ETL logic and reload the data, or run a script to repair it. Also investigate the source system for database schema changes; when the database structure changes, analyze the impact and modify any affected tests.

While your response will depend on the degree of severity, it is good practice to make a determination up front as to the general rules regarding such events—another reason to have data stewardship.

Testing that the data has been transformed according to specifications and business rules can be complicated. Depending on how well an application was designed, data can go through several transformations in the transaction lifecycle. Once initial testing is complete, create a first load of data for validation testing. This is necessary because the reports that the end users will use to validate will contain the same data. It is prudent to consider whether several years of data are needed, so decide how far back to go. Pay close attention to year-end processing and reporting, as there could be some skeletons in those closets as well.
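
One practical pattern is to recompute a business rule independently of the ETL code and compare the result against what was actually loaded. The rule, files, and columns below are hypothetical; the point is the side-by-side comparison.

```python
import pandas as pd

# Hypothetical business rule: accounts are banded by annual revenue.
def expected_band(revenue):
    if revenue >= 1_000_000:
        return "LARGE"
    if revenue >= 100_000:
        return "MEDIUM"
    return "SMALL"

source = pd.read_csv("accounts_source.csv")   # pre-transformation extract
loaded = pd.read_csv("accounts_loaded.csv")   # rows as loaded into the mart

# Recompute the rule independently and compare it to what the ETL produced.
merged = source.merge(loaded, on="account_id")
merged["expected_band"] = merged["annual_revenue"].apply(expected_band)
mismatches = merged[merged["expected_band"] != merged["revenue_band"]]
print(f"{len(mismatches)} rows violate the banding rule")
```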

How Much Data?
There are many different answers to this question. One approach is to treat the data that will feed the BI process as the end goal: it is the data supporting the entire enterprise. Whatever the grouping will be (line of business, portfolio, business critical, etc.), start with a baseline of data, know the business's relationships, and build tests that control complexity, variables, and size. Also, try to obtain production reports from the source systems to validate baseline data and facilitate comparison.

Data Completeness
Data completeness testing validates that all expected data loads are integrated into the data mart, data store, or warehouse. This means validating that all records, all fields, and the full contents of each field are loaded. Checks include (a sketch of several of these follows the list):

  • Record counts between source data and data loaded to the receiving database repository
  • Tracking rejected records and writing them out to a suspense file
  • Unique values of key fields between source data and data loaded to the repository
  • Populating the entire contents of each field to validate that no truncation occurs at any step in the process
  • Testing the boundaries of each field to find any database limitations
  • Testing blank and invalid data
  • Data profiling (range and distribution of values) between source and target
  • Parent-to-child relationships (part of referential integrity)
  • Handling and processing orphaned records
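
As a sketch of several of the checks above, assuming pandas is available and using hypothetical file and column names:

```python
import pandas as pd

source = pd.read_csv("policies_source.csv")      # hypothetical source extract
target = pd.read_csv("policies_loaded.csv")      # rows loaded into the repository
rejected = pd.read_csv("policies_suspense.csv")  # rejected records written to suspense

# Record counts: loaded plus rejected should account for every source record.
assert len(source) == len(target) + len(rejected), "record counts do not reconcile"

# Unique key values: no policy number should vanish between source and target.
missing_keys = set(source["policy_no"]) - set(target["policy_no"]) - set(rejected["policy_no"])
assert not missing_keys, f"keys lost in the load: {sorted(missing_keys)[:10]}"

# Truncation: text fields should not be shorter in the target than in the source.
joined = source.merge(target, on="policy_no", suffixes=("_src", "_tgt"))
truncated = joined[
    joined["holder_name_tgt"].str.len() < joined["holder_name_src"].str.len()
]
assert truncated.empty, f"{len(truncated)} holder names were truncated"

# Profiling: compare the distribution of a key attribute between source and target.
print(source["status"].value_counts(normalize=True))
print(target["status"].value_counts(normalize=True))
```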

The Business Data Layer
Two examples of BI constructs are the semantic layer and the OLAP cube. A semantic layer (developed by the company Business Objects) is defined by Wikipedia as "a business representation of corporate data that helps end-users access data autonomously using common business terms." [2] The key point is that it creates a consolidated view of the data across the enterprise, in which all the data elements, table joins, relationships, and other configurations are defined.

OLAP is short for "online analytical processing," and an OLAP cube, also according to Wikipedia, "is an array of data understood in terms of its 0 or more dimensions." [3] In this case, a business needs to design tests focused on certifying parent-child data relationships. It is these multidimensional capabilities that give the tool its power.


Figure 2: An OLAP cube
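
The parent-child roll-ups are what these tests need to certify: every child-level value must aggregate to its parent total. The sketch below builds a tiny cube with pandas pivot_table (the fact table is invented) and asserts that the margins reconcile with the underlying facts.

```python
import pandas as pd

# A tiny fact table with three dimensions (illustrative data only).
facts = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "sales":   [100, 250, 175, 300],
})

# Build a small cube: sales by region and product, with margin totals.
cube = pd.pivot_table(
    facts, values="sales", index="region", columns="product",
    aggfunc="sum", margins=True, margins_name="All", fill_value=0,
)

# Parent-child consistency: each child level must roll up to its parent total.
assert cube.loc["East", "All"] == facts.loc[facts["region"] == "East", "sales"].sum()
assert cube.loc["All", "All"] == facts["sales"].sum()
print(cube)
```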

Conclusion
Database, warehouse, and BI testing are all much more challenging than testing a standard GUI application. This area of testing requires deep technical knowledge, so it is not uncommon for ETL developers or DBAs to perform these tests.

The investment in BI tools and data analytics is deep, so ensure that the investment in quality and testing is commensurate with the risks the business faces. Where increased business risk warrants it, formalize methodologies that cover the full spectrum of data warehousing and BI testing. A business that already deals with regulated data or environments will already be aware of the more stringent controls that need to be in place.

Finally, the investment made in BI tools will not yield a good return if the quality of the data is suspect. The cost of directions and decisions based on poor-quality data can be substantial and can negatively impact your company.


References

  1. "Extract, Transform, Load," Wikipedia, http://en.wikipedia.org/wiki/Extract,_transform,_load.
  2. "Semantic Layer," Wikipedia, http://en.wikipedia.org/wiki/Semantic_layer.
  3. "OLAP Cube," Wikipedia, http://en.wikipedia.org/wiki/OLAP_cube.
