A reasonable sized enterprise ($500 million to $1 billion annual sales), where there is a application-centric database design strategy across all its functional areas, has about 200 application-centric databases that would include involve at a very minimum:
- Customer Information
- Sales and ordering Information
- Marketing Information
- Human Resources Information
- Finance (GL, AP, AR, etc) Information
- Manufacturing, Inventory Information
- Warehouse Information
- Distribution Information
- Middle management Information
- Senior Management Information
- Strategic, Tactical and Operational Planning
The common overlap among these data models is about 20 percent. That is, about 20 of every 100 tables is somewhat to exactly the same as tables in one or more other databases. Because of these duplications,"data harmony" can exist only when all the following statements are completely true individually and in unison:
- All the columns of the duplicated tables are exactly alike, and
- The granularity of the duplicated table rows are exactly the same, and
- The business events or time slices when the data is re-created/re-collected are exactly the same, and
- The semantic meanings, definitions, data types and value domains are exactly alike.
Alternatively, if any or all the "data harmony" statements above are false, the problems associated with data dis-integration though tedious, error-prone, fragile, and expensive can be managed. Collectively, successful data dis-integration management is called "data governance." Otherwise it is just called "chaos."
There are three solutions employed for these kinds of data overlaps/duplication. The first is to re-collect/re-create the data in each of these common tables such that all the "data harmony" statements are true, and the second is to create an ETL (Extract-Transform-Load) application system environment that extracts the data from one database, transforms it, and loads it into another. Both solutions have real and obvious challenges.
The first solution, re-collection and re-creation, requires that every data collection system has to be modified to continuously store data in multiple application databases whenever that data is either collected or updated. Such application efforts cause application information system bloat, increased architecture, design, implementation, testing, and maintenance. Such systems require continuous evolution as there are ever more databases to deal with.
The second solution is ETL. That is, Extract-Transform-Load. These are a kind of business information systems in their own right, one for each different instance of the process of Extracting, Transforming, and Loading. Companies abound that address ETL with software packages/systems that regularize, standardize, and coordinate all the executions of the various ETL processes throughout the enterprise.
To give dimension to this data interchange problem, consider that if there are the 200 application-centric databases, and there is a need to share data across 20% of them. The quantity of interfaces for either of either solution is computed to be: ( N * (N-1)/2 ) * 0.2.
In this case, N = 200. The quantity of data-exchange interfaces is thus, 3,980.
The first alternative, re-collection and re-creation requires that each business information system touched by one or more of these 3980 interfaces has to be modified to get the redundant data from an authoritative source business information system and store it into the interfaced system. Each such business information system modification becomes a mini-maintenance cycle that requires its own requirements through user-acceptance cycle of phases. If the interface affects a single table, the cost of the modification is $16,000. That's 80 function points for the table with a per function point cost of $200. Over the 3880 such mini-business information system modifications, the overall cost is $63 million.
Under the ETL alternative, if each ETL interface need requires only ONE staff week to establish an ETL process through the application of a very sophisticated ETL tool set (e.g., requirements, architecture, design, construction, testing, and documentation), and then, over 5 years, another staff week to maintain it, the cost is almost $32 million. That represents the full time staff of about 30 (3980 interfaces * 80 hours per interface / 5 years / 2080 staff hours per year).
Regardless of the increased efficiencies from either the re-collection/re-creation solution or the ETL solution, the issue is very large, and the solution very expensive.
The third solution is the most beneficial solution. It is to not have the problem in the first place. That requires an overall data architecture, and prior to that a data architecture reference model that enables the problem to either never be created in the first place, or that provides a pathway to eliminate the problem altogether over time.
|