A corporate goliath is buying out your company in a record-breaking, zillion-dollar deal and will be combining business and technology operations. The two companies’ back-end databases will be merged together into one immense, super-sized whole, and you are responsible for testing the database combination process. Your thoughts dart out in many directions at once: Where do I start? What do I need to do? How much time do I have? Will I even have a job in the new firm when I'm done?
Don't panic. Tell yourself that whether there's a spot for you in the new firm or not, you want to get this experience under your belt before the takeover. You just might prove your worth to the powers that be, and if not, consider this project a résumé builder. Once you've adopted a positive attitude, you're ready to begin.
And don't assume that if you are not involved in a database merger of this magnitude this article does not apply to you. While it's true that multi-million-dollar companies don't merge on a daily basis, companies are constantly striving to find ways to make their data better—a company is nothing without its data. This often entails combining data or porting data. Many of the principles presented in this article can be applied to those activities.
To make this discussion more concrete, we will use the following fictional example throughout the article:
Archrivals Spacely Sprockets and Cogwell Cogs have buried the hatchet and are combining forces in a new partnership to form the galaxy's largest manufacturing organization. Leveraging each other's strengths in an ever-competitive marketplace, the new organization, called S & C Inc., is combining existing products and customer accounts from each company into the new organization.
Orders for sprockets and cogs will be fully automated, processed through a series of software products from Order Management through Inventory Control to Accounting Systems. The Spacely and Cogwell databases are being merged to drive S & C's proprietary technology system.
A snapshot of the Spacely and Cogwell databases is displayed in Figure 1.
As you can tell from these snapshots there are many differences between the two databases, even though they contain similar data. You could expect similar differences in your project.
At the highest level, every database merger has two objectives: accuracy and completeness of the new database. But just stating that doesn't mean much. Accuracy and completeness mean different things to different people—at a minimum you could assume the following is required:
Who is involved in meeting these requirements? The business expert is responsible for answering questions like which data fields are actually needed. The database administrator's (DBA's) duty is to plan and perform the merger. The DBA team will face various potential problems, decide on solutions, and implement them. The test team has two jobs: 1) to test that the DBA's solutions were correctly implemented and 2) to find the problems the DBA team overlooked—before the users do. These problems are ones that nobody—neither the DBAs nor the business experts—knew needed to be solved.
And don't forget the test or QA lead at the other company. Reach out to them and try to compare ideas, strategies, tools, documentation, or whatever else may be useful. After all, it is also in their best interest that the merger/conversion goes well. And, it may establish a good working relationship for the two teams who may become co-workers after the companies complete the megamerger.
Your best friend on a project like this is information. Knowing about a particular development can mean the difference between a smooth-running project and a disaster. A wrong assumption that isn't corrected until it's too late will cause you many needless sleepless nights. So don't be shy: ask plenty of questions. Once you find answers to as many of them as possible, you can start laying out a plan.
Planning the Tests
In this article, we will tackle four different problems that you will probably have to overcome in a database merger:
The first two are problems that have proposed solutions, so you will be testing the solution. The last two are tests of potential problems that have not been addressed, so you will be testing for the problem's existence, rather than testing the solution.
Problem 1: Duplicate Records
Duplicates can present one of the greatest challenges in a database merger. In spite of the challenges, most business-savvy CTOs will use a database merger as an opportunity to clean up the data and start with as "clean" a database as possible. Data pollution can be costly, as it can obscure the data’s potential value.
When you merge two companies that at one time competed, there will no doubt be duplicate data, as customers probably will have used both companies as vendors. These duplicates should be eliminated because 1) they take up valuable space, 2) you will waste money sending the same customer multiple mailings, which will 3) make you appear as if you don’t know who your customers are.
The Proposed Solution
To eliminate duplicate data from a database, DBAs will apply heuristic and deterministic business rules. In this context, a heuristic rule is basically fuzzy data matching. For example, a heuristic rule might say that the name "Bernie" is the same as the name "Bernard." Deterministic rules run after heuristic rules to check whether records are identical. For example, the heuristic rules might determine that the name "Larry Nuanez" in one record is the same as "Lawrence Nuanez" in the other, but a deterministic rule might say that two records are duplicates only if the name and the date of birth match. Because the deterministic rule checks date of birth, Lawrence Nuanez is not considered a duplicate of his father, Larry. Search Software America (SSA) makes a suite of software that can help you implement heuristic and deterministic business rules into your project.
Since data is essentially the lifeblood of any business, you rarely want to lose data. When heuristic and deterministic business rules are applied to data, you would usually combine duplicates instead of keeping one record and dropping the rest. For example, refer to the database snapshots in Figure 1. Notice that Acme Manufacturing appears in both databases. There are some differences, though. One contains "Corp" in the name description and the other does not. The heuristic rules still should consider those to be matching names. Suppose the deterministic rules consider two records to match if both the names and the phone numbers match. After applying that rule, these two records match. However, the addresses do differ, perhaps because they refer to two separate manufacturing plants. The plant on Saturn's third ring uses sprockets while the one on the fourth ring uses cogs. The phone number is the same because that's the main number at their corporate headquarters. Therefore, you would want to keep all the data from both records, even though there's an overlap. The merged record would look like the one in Figure 2.
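To make the two kinds of rules concrete, here is a minimal Python sketch of the matching logic described above. The nickname table, suffix list, and field names are invented for illustration; a commercial product such as SSA's would implement far richer heuristics.

```python
# Sketch of heuristic-then-deterministic duplicate matching.
# NICKNAMES and the suffix list are illustrative assumptions.

NICKNAMES = {"bernie": "bernard", "larry": "lawrence"}

def canonical_name(name):
    """Heuristic rule: lowercase, strip suffixes like 'Corp',
    and expand common nicknames (a stand-in for fuzzy matching)."""
    parts = [p for p in name.lower().split() if p not in ("corp", "inc")]
    return " ".join(NICKNAMES.get(p, p) for p in parts)

def is_duplicate(rec_a, rec_b):
    """Deterministic rule: two records match only if the heuristically
    normalized names AND the phone numbers both match."""
    return (canonical_name(rec_a["name"]) == canonical_name(rec_b["name"])
            and rec_a["phone"] == rec_b["phone"])

a = {"name": "Acme Manufacturing Corp", "phone": "555-0100"}
b = {"name": "Acme Manufacturing", "phone": "555-0100"}
print(is_duplicate(a, b))  # True: names normalize to the same value
```

Note how the deterministic rule is what finally declares a duplicate; the heuristic rule only widens what counts as "the same name."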
You can always delete nonessential data later if necessary, or create new records if that is the better course of action. But it's difficult to get data back after you have deleted it. To get it back, you’d have to ask the customer for it again. And the customer would most likely say something like, "Didn't I already give you that information?" making you look like you can't manage your data. So careful planning would include additional columns like those in Figure 2 to act as placeholders for this crucial data.
Testing the Proposed Solution
When heuristic and deterministic business rules are used, you must test them twice. First, you have to test that the rules are set up correctly. The DBAs will set up the rules according to what they consider duplicates (or their interpretation of the business expert's standards), and it is vitally important that you test their theories. For example, the DBA may set up a business rule to check that both the name and the social security number match before declaring that two records are duplicates. You, as the tester, point out that since the merger includes customers from all over the galaxy, many of them will not have social security numbers. You suggest adding more logic to the business rule such as checking date of birth, address, or telephone number. This underscores the need for good communication and planning throughout the project.
The second test (after you've verified that the rules have been set up correctly) is to make sure they act as expected when run against your database. You can insert sample test data to make sure the business rules work correctly. In this part of the test, you are essentially checking for two things: 1) that the business rules correctly identify duplicates and 2) that the correct actions are taken when a duplicate is found.
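As a concrete sketch of that second test, you might seed a throwaway table with planted duplicates and verify both that they are identified and that the dedup pass removes them. This uses an in-memory SQLite database purely as a stand-in; the table, columns, and simplified rule (match on phone number only) are assumptions for illustration.

```python
import sqlite3

# Seed a test table with a planted duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Name TEXT, Phone TEXT)")
conn.executemany("INSERT INTO Customer VALUES (?, ?)", [
    ("Acme Manufacturing", "555-0100"),
    ("Acme Manufacturing Corp", "555-0100"),  # planted duplicate
    ("Cogswell Cogs", "555-0200"),
])

# Check 1: the rule identifies the planted duplicate.
total = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
distinct = conn.execute("SELECT COUNT(DISTINCT Phone) FROM Customer").fetchone()[0]
print(total - distinct, "duplicate(s) detected")  # 1

# Check 2: a stand-in for the DBA's dedup pass takes the right action
# (here, simply keeping one row per phone number).
conn.execute("""DELETE FROM Customer
                WHERE rowid NOT IN
                    (SELECT MIN(rowid) FROM Customer GROUP BY Phone)""")
remaining = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
print(remaining == distinct)  # True: every planted duplicate was removed
```

In a real test you would, of course, run the DBAs' actual rules rather than this stand-in, and you would verify that matched records are combined (as in Figure 2) rather than simply dropped.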
Once you know what you are trying to accomplish, it is time to actually start doing it. The first thing you have to do is realize that there are probably going to be duplicates in each database before you even merge them. We have never encountered a single database (in existence for more than a couple years) that didn't contain at least a few duplicate records. You might need to apply the business rules to each database before the merger. This is commonly referred to as data scrubbing. You want your data to be in the best possible condition before the merger. You will also have nice, clean data to run scripts against.
The business rules could now be tested against real data by applying the following techniques:
You could wholly or partially automate the previous tests. When two large companies merge all of their data, you will have an overwhelming amount of data to test for completeness and accuracy, and automating certain pieces of your test process can save time and money. However, automating part of your database merger testing is practical mainly if your team already has expertise with automated test tools. Since you plan to do a merger only once and will more than likely never reuse the same scripts, a project of this magnitude is not the place to adopt automation for the first time; the learning curve would simply be too long. There are, of course, other factors to consider, such as available time and size of staff. If you decide that automated testing is necessary and you have neither the tools nor the skills on your existing team, consider bringing in outside resources to get you started in the right direction.
Problem 2: Mismatched Columns
Another potential problem is that the table and column names and definitions in one database will probably not match those in the other database exactly. Therefore, the table and column names and definitions in the merged database will be different from the ones each company originally used. For instance, Spacely's CustomerID was originally defined as an Alpha 32; Cogwell's Cust_ID was originally defined as Char 20. The merged database's CustomerID will be defined as Char 32 (see Figure 3).
The Proposed Solution
Make sure that you know exactly how tables and columns are going to be mapped after the merger so you can test for possible data validation errors. The DBA team might keep track of the pre- and post-merger database naming conventions and column definitions in a spreadsheet like the one in Figure 3. If they don't, then you should do so yourself. There are a few approaches to testing the DBA's solution.
Testing the Proposed Solution
Before the merger takes place, review the particulars of the planned database. It might be convenient to plot it out on a giant schematic and hang it up on a wall. Study the fields being merged and the proposed new ones. This might point out some obvious problems, such as length chopping. For example, Cogwell's Cust_Tel is Char 30 and Spacely’s CustomerPhone is Char 20. If the new field length follows Spacely’s Char 20, as indicated in Figure 3, it is potentially too small for some of Cogwell's phone numbers. Sometimes looking at the big picture helps you recognize issues before they occur.
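A length-chopping risk like the one above can be checked mechanically before the merger by scanning the source column for values that won't fit the merged definition. The sketch below uses an in-memory SQLite table with the Figure 3 names; the limit of 20 follows the example, and the sample phone numbers are invented.

```python
import sqlite3

NEW_LEN = 20  # merged CustomerPhone is Char 20, per the Figure 3 example

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CogwellCustomer (Cust_Tel TEXT)")  # Char 30 at Cogwell
conn.executemany("INSERT INTO CogwellCustomer VALUES (?)", [
    ("555-0200",),                               # fits the new Char 20
    ("+99-555-0200 ext 40404 ring-4 plant",),    # too long: would be chopped
])

# Flag every source value that exceeds the merged column's length.
at_risk = conn.execute(
    "SELECT Cust_Tel FROM CogwellCustomer WHERE LENGTH(Cust_Tel) > ?",
    (NEW_LEN,)).fetchall()
print(len(at_risk), "value(s) would be truncated")  # 1
```

Any rows this query returns are defects to raise before the merger runs, not after.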
After a test merger is run, check that the new column definitions actually match the specifications. There is a view within many SQL tools that lists the definitions of a particular table. You can print table reports for each table, and then compare "actual" data definition results to the "expected" data definitions in the specification or spreadsheet.
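That "actual versus expected" comparison can itself be scripted. The sketch below reads SQLite's `PRAGMA table_info` as a stand-in for whatever catalog view or table report your own database offers, and diffs it against the expected definitions from the mapping spreadsheet; the column names and types follow the Figure 3 example.

```python
import sqlite3

# Expected definitions, as transcribed from the mapping spreadsheet.
EXPECTED = {"CustomerID": "CHAR(32)", "CustomerPhone": "CHAR(20)"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerID CHAR(32), CustomerPhone CHAR(20))")

# PRAGMA table_info returns (cid, name, type, ...) for each column.
actual = {row[1]: row[2].upper()
          for row in conn.execute("PRAGMA table_info(Customer)")}

mismatches = {col for col in EXPECTED if actual.get(col) != EXPECTED[col]}
print(mismatches)  # empty set: actual definitions match the spreadsheet
```

On a merger with hundreds of tables, a loop like this beats eyeballing printed table reports.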
In addition to checking that columns are properly defined, you should also test that these newly defined columns actually operate as anticipated. For example, there are testing implications from converting a char to alpha. Oracle databases define valid chars to contain any combination of upper- and lowercase letters, digits, spaces, or punctuation, while valid alphas only contain letters. Obviously, since alpha is a subset of char, the new alpha column could have problems if there were numbers or punctuation in the old char fields.
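A simple way to probe for that char-to-alpha hazard is to scan the converted column for any value containing a non-letter. The sketch below uses SQLite's `GLOB` character class; other databases offer `LIKE` or `REGEXP` equivalents. Table, column, and sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerID TEXT)")
conn.executemany("INSERT INTO Customer VALUES (?)", [
    ("ACMEMFG",),    # pure letters: a valid alpha value
    ("ACME-42",),    # digits and punctuation: should be flagged
])

# Flag any value containing a character outside A-Z / a-z.
bad = conn.execute(
    "SELECT CustomerID FROM Customer WHERE CustomerID GLOB '*[^A-Za-z]*'"
).fetchall()
print(bad)  # [('ACME-42',)]
```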
Problem 3: Data Corruption
Another issue that usually comes up when working with databases is data corruption. When data from different sources and in different formats is transformed and moved around, there is always a small chance that it will become corrupt. This is especially true if the databases you are merging are of two different types (e.g., Oracle with DB2). Although there's a low probability of this happening during your merger, if it does occur you will have big problems. So it's a good idea to check for corrupt data before the combined database goes into production.
How would you know if your data is corrupt? Corrupt data is usually simple to spot because it just doesn't look right. If your text is peppered with vowels from the Cyrillic alphabet and other assorted symbols, your data's character encoding probably got mangled somewhere along the way, and it is being displayed as a jumble of extended ASCII characters. So how do you test for this?
Here's a bare-bones way to test for corrupt tables in our sample database. In the post-merger database, sort the CustomerName column alphabetically (this can usually be accomplished with the "Order By" SQL command). What this will do, besides alphabetizing your customer table, is group records whose customer name begins with a non-letter and place them in the same general area in the database view. This makes it much easier to eyeball the table for groups of strange-looking data.
If the customer table is so large that it would take hours to even scroll through it, instead of alphabetizing the table, you could extract the records beginning with nonprintable characters. Using SQL's "ascii" and "substring" functions, you can construct a query like the following, which returns those records whose first character of the CustomerName field is outside of the set of printable ASCII characters:
select * from CustomerTable
where ascii(substring(CustomerName, 1, 1)) < 32
   or ascii(substring(CustomerName, 1, 1)) > 126
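If your SQL dialect lacks the `ascii`/`substring` pair (SQLite, for instance, does), the same first-character check can be done client-side. This sketch reuses the article's table and column names; the planted control character stands in for real corruption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustomerTable (CustomerName TEXT)")
conn.executemany("INSERT INTO CustomerTable VALUES (?)", [
    ("Acme Manufacturing",),
    ("\x07Garbled Widgets",),   # starts with a control character
])

# Keep only names whose first character falls outside printable ASCII (32-126).
suspect = [name for (name,) in
           conn.execute("SELECT CustomerName FROM CustomerTable")
           if name and not (32 <= ord(name[0]) <= 126)]
print(len(suspect), "suspect record(s)")  # 1
```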
Don't be too concerned if this tactic doesn't catch all corrupt data—finding any corrupt data is probably good enough. If any of the data looks garbled, it's possible that it’s in more than one place. It's also likely that garbled-looking data in multiple areas of the database is a symptom of one root cause. So it doesn't particularly matter if you miss some records that might be corrupt. As long as you find one group of corrupt data, you've got yourself a bug to report.
It's also possible that your entire database structure could become corrupt, not just the data residing in it. If this happens, it would also be easy to spot: you might get a SQL runtime error simply from trying to run your queries! Needless to say, a corrupt database is at least as severe a bug as bad data. Whether it's the data or the database that is corrupt, finding out about it beforehand will save needless embarrassment later on.
Problem 4: Front-end Assumptions
Let's say your back-end tests have passed with flying colors. You’ve successfully tested your DBA’s workarounds: You've verified that duplicates have been removed correctly, Spacely tables have been married to the correct corresponding Cogwell tables, and data hasn't become corrupt. Are you finished? What could go wrong?
Here comes the fun part. You get to discover new problems that no one else has thought of. One such problem might be assumptions the front end makes about the data feeding it. Suppose a validation clause hard-coded in the front-end Order Management System states that a particular GUI field, say the CustomerID field, must be Alpha 32. Since that field now gets its value from a new database, whose column definitions changed as a result of the merger, your application might behave abnormally when processing an order. This is significant because even though the application's functionality has not changed, you still need to ensure that your front-end applications are unaffected by the database merger.
To set up this phase of testing, simply configure your front-end application to talk to a test database containing post-merger data. That is, point your GUI application to a test database (database C), which is the result of combining database A and database B. (This might be accomplished through an .ini file setting or a command-line argument; ask your development team if you're not sure how.) You could then execute your standard suite of test cases against the merged database, either manually or with an automated GUI script, depending on your ordinary testing process.
The key to making this part of the testing process successful, however, lies in the test data that is used. To get the most bang for your buck, you need to select your test data wisely by including test transactions from an intermingling of data sources. In our example, we not only want former Spacely customers to continue to be able to order sprockets and Cogwell customers to still be able to order cogs, we also want Spacely customers to order cogs and Cogwell customers to order sprockets. These tests, derived from what is known as a Cartesian product, look like those in Figure 4. A Cartesian product occurs when each unit of one group is matched with each unit of another group so that every possible combination of units is achieved across all groups.
Needless to say, this strategy could get out of hand very easily as soon as you add additional parameters into the mix. For example, adding one binary parameter (like Overnight Delivery Yes/No) doubles the test cases to eight. In a real-world example, there could be ten or twenty parameters, each with multiple values. The good news is you probably don’t have to test all these combinations.
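Generating these combinations is a one-liner in most languages; the sketch below builds the Figure 4-style product and shows how one binary parameter doubles the case count. Parameter names and values are illustrative.

```python
from itertools import product

customers = ["Spacely customer", "Cogwell customer"]
items = ["sprocket", "cog"]

# The base Cartesian product: every customer origin with every item origin.
base_cases = list(product(customers, items))
print(len(base_cases))   # 4 combinations, as in Figure 4

# Adding one binary parameter doubles the case count.
delivery = ["overnight", "standard"]
expanded = list(product(customers, items, delivery))
print(len(expanded))     # 8 combinations
```

With ten binary parameters the product exceeds four thousand cases, which is why you will usually sample from it rather than run it exhaustively.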
There are additional data sets to consider if you can't assume that your customers and products uniquely belong to their respective source databases. If a customer record in the post-merger S & C database is the result of reconciling variant values from the parent databases (i.e., it wasn't simply copied over from one of the two parent customer tables), the four data sets in Figure 4 aren't good enough to test placing an order. In this case, a more sophisticated test data set would have to include combinations involving those merged records. For example, a customer such as Acme (whose customer data is derived from both source databases) would order a particular sprocket or cog. Similar test data sets should be developed for other test cases, such as modifying an order, deleting an order, ordering multiple items, and whatever other functionality the system supports.
One more word to the wise about front-end testing: If you plan to test the GUI front end with existing automation scripts, make sure that they are data driven. You don’t want these input variables hard-coded in your script. When the script is ready to make an entry into the GUI, it should call the data from a designated input file.
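The data-driven pattern looks like this in miniature: the script reads each transaction from an input file and feeds it to the GUI. Here `place_order` is a stub standing in for your automation tool's GUI steps, and the embedded CSV stands in for the designated input file.

```python
import csv
import io

# Stand-in for the designated input file of test transactions.
INPUT = io.StringIO("""customer,item
Spacely customer,cog
Cogwell customer,sprocket
""")

def place_order(customer, item):
    # In a real script this would drive the GUI; here it just records the call.
    return f"{customer} orders a {item}"

# The script loops over the data; no input values are hard-coded in the logic.
results = [place_order(row["customer"], row["item"])
           for row in csv.DictReader(INPUT)]
print(results)
```

Swapping in a new test data set then means editing the file, not the script.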
If you are ever faced with a project that involves combining or porting data, testing the process can seem daunting. Completing these often high-priority projects successfully can seem like a faraway dream. It is best not to view it as one gargantuan project, but rather as several smaller projects that can be completed. The best way to start is by asking questions. Ask questions until you get the answers. The more answers you get, the more complete your test plan will be.
Remember to test the developed solutions to problems that will likely occur, and also to test for problems that perhaps no one has uncovered. If you follow your plan, automate where possible, and use the tools that already exist in your QA toolbox, you can be confident that no matter what else happens with the merger, your database testing will be smooth and successful.