The Next Generation: Software Metrics

2. Data mining capability
OK.  Now that I have all that data, all in one place, how do I get at it?  I may have tens of thousands of files, hundreds of thousands of file revisions, thousands of problem reports, work breakdown structures, requirements trees, test cases, documents, etc.  Each has its large object components and each has its data component.  (I don't like using the term meta-data here - that implies that the data is used only to describe the large objects, rather than the large objects having the same rank as the rest of the data for an object.)

So how do I make sense of it all?  I need a data mining capability - a powerful data query language, plus tools to explore and navigate the data and to give hints on how to mine it.  My job will be simpler if the data schema maps onto my real world of CM/ALM data and my query language lets me express things in real world terms:  staff of <team member>, problems addressed by changes of <build record>, test cases of <requirement>.  Even better, it will let me aggregate these things:  test cases of <requirements tree members>, files affected by <changes>, etc.  Then I don't have to iterate over a set of items myself.

A good data mining capability requires:

    • high-performance data engine
    • sophisticated query language
    • schema mapped closely to the real world objects
    • data summary presentation capability that can be used to zoom in to details from a higher level data summary

If I have a high-performance data engine, I can work through the process without having to wait around and lose my train of thought.  Better yet, I can have point-and-click navigation tools that let me navigate data summaries instantly, crossing traceability links rapidly.  The data can even be analysed between the right-click and the pop-up menu appearing, so that only the appropriate actions are selected for display.

The query language must let me work with data sets easily, including the ability to do Boolean set algebra and to transform data from one domain to another along the traceability (or other reference) links (e.g. from "changes" to "file revisions").  In the CM world, I need to be able to do some special operations - that is, operations that you may not normally find in a database.

These include (1) taking a file (or file set) and identifying an ordered list of the revisions in the history of a particular revision, (2) taking a set of changes and identifying the list of affected files, (3) taking the root of a WBS or of a source code tree and expanding it to get all of the members, etc. The query language must also be sensitive to my context view.  A change to a header file might affect a file in one context, but not in a different context.  A next generation CM-based query language must be able to work with context views.  It must both work within a consistent view, and between views, such as when I want to identify the problems fixed between the customer's existing release and the one that we're about to ship to them.
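To make a few of these operations concrete, here is a minimal Python sketch of revision history, tree expansion, and a between-views comparison done with set algebra. All of the data, identifiers, and function names below are hypothetical, and a "view" is reduced here to nothing more than a named set of fixed problems.

```python
# Hypothetical data: each revision points at its predecessor (None for the first).
PREDECESSOR = {
    "main.c;1": None,
    "main.c;2": "main.c;1",
    "main.c;3": "main.c;2",
    "main.c;2.1": "main.c;2",  # a branch that is not in the history of main.c;3
}


def history_of(revision):
    """Return the ordered list of revisions leading up to and including `revision`."""
    chain = []
    while revision is not None:
        chain.append(revision)
        revision = PREDECESSOR[revision]
    chain.reverse()  # oldest first
    return chain


# Hypothetical source tree: parent -> ordered list of children.
CHILDREN = {
    "src": ["src/lib", "src/main.c"],
    "src/lib": ["src/lib/util.c", "src/lib/util.h"],
}


def expand(root):
    """Return the root followed by all of its descendants, depth first."""
    members = [root]
    for child in CHILDREN.get(root, []):
        members.extend(expand(child))
    return members


# Hypothetical views: release -> problems fixed as of that release.
PROBLEMS_FIXED_IN = {
    "rel1": {"pr.7"},
    "rel2": {"pr.7", "pr.9", "pr.12"},
}


def problems_fixed_between(old_view, new_view):
    """Set difference across two views: fixed in the new one but not in the old."""
    return PROBLEMS_FIXED_IN[new_view] - PROBLEMS_FIXED_IN[old_view]


print(history_of("main.c;3"))
print(expand("src"))
print(sorted(problems_fixed_between("rel1", "rel2")))
```

A real CM repository would resolve each view from stream, release, and branch rules rather than from a lookup table, but the shape of the query stays the same.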

And the schema must map closely to real world objects.  I'd like to say: give me the members of this directory, and not: go through all objects and find those objects whose parent is this directory.  I want to ask for the files of a change, not all files whose change number matches a particular change identifier.  I want to be able to ask for the files modified by the changes implementing this feature - not the more complex relational equivalent.  And I would prefer not to have to instruct the repository how to maintain inverted index lists so that the queries go rapidly.
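One way to picture a repository that maintains its own inverted lists is the toy Python class below. It is only a sketch under simplifying assumptions: the class name `Repository` and its methods are hypothetical, and each file revision is assumed to belong to exactly one change.

```python
from collections import defaultdict


class Repository:
    """Toy store that keeps the inverse of a reference up to date on every write."""

    def __init__(self):
        self._change_of = {}               # file revision -> change (forward link)
        self._files_of = defaultdict(set)  # change -> file revisions (inverse link)

    def record(self, file_revision, change):
        """Attach a file revision to a change, updating both directions."""
        previous = self._change_of.get(file_revision)
        if previous is not None:
            self._files_of[previous].discard(file_revision)
        self._change_of[file_revision] = change
        self._files_of[change].add(file_revision)

    def files_of(self, change):
        """Answer "files of <change>" directly, without scanning every file."""
        return set(self._files_of.get(change, set()))


repo = Repository()
repo.record("main.c;3", "chg.101")
repo.record("util.h;2", "chg.101")
repo.record("parser.c;5", "chg.102")
print(sorted(repo.files_of("chg.101")))
```

The user never declares an index; the inverse list exists because the schema knows that a change has files.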

A data summary capability should allow me to easily select and summarize the data.  For example, it should graph problems for the current product and stream, showing priority by status, and let me zoom into the most interesting areas with a click on the appropriate bar of the graph.
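Behind such a graph sits a summarize-then-zoom pattern, sketched below in Python. The problem records, field names, and function names are hypothetical sample data, and the "click" is simulated by passing in the priority and status of the chosen bar.

```python
from collections import Counter

# Hypothetical problem reports for the current product and stream.
PROBLEMS = [
    {"id": "pr.1", "priority": "high", "status": "open"},
    {"id": "pr.2", "priority": "high", "status": "open"},
    {"id": "pr.3", "priority": "high", "status": "fixed"},
    {"id": "pr.4", "priority": "low", "status": "open"},
    {"id": "pr.5", "priority": "low", "status": "closed"},
]


def summarize(problems):
    """Count problems for each (priority, status) pair: one count per bar."""
    return Counter((p["priority"], p["status"]) for p in problems)


def zoom(problems, priority, status):
    """Return the ids of the problems behind one bar of the summary."""
    return [
        p["id"]
        for p in problems
        if p["priority"] == priority and p["status"] == status
    ]


summary = summarize(PROBLEMS)
for (priority, status), count in sorted(summary.items()):
    print(f"{priority:5} {status:7} {count}")

# Simulate clicking the high/open bar to see the underlying records.
print(zoom(PROBLEMS, "high", "open"))
```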

About the author

Joe Farah

Joe Farah is the President and CEO of Neuma Technology and is a regular contributor to the CM Journal. Prior to co-founding Neuma in 1990 and directing the development of CM+, Joe was Director of Software Architecture and Technology at Mitel, and in the 1970s a Development Manager at Nortel (Bell-Northern Research) where he developed the Program Library System (PLS) still heavily in use by Nortel's largest projects. A software developer since the late 1960s, Joe holds a B.A.Sc. degree in Engineering Science from the University of Toronto. You can contact Joe at farah@neuma.com