Archive pour 6 février 2008

OLAP architectures

Data staging

Most data in OLAP applications originates in other systems. However, in some applications (such as planning and budgeting), the data might be captured directly by the OLAP application. When the data comes from other applications, it is usually necessary for the active data to be stored in a separate, duplicated, form for the OLAP application. This may be referred to as a data warehouse or, more commonly today, as a data mart. For those not familiar with the reasons for this duplication, this is a summary of the main reasons:

Performance
OLAP applications are often large, but are nevertheless used for unpredictable interactive analysis. This requires that the data be accessed very rapidly, which usually dictates that it be kept in a separate, optimized structure which can be accessed without damaging the response from the operational systems.

Multiple data sources
Most OLAP applications require data sourced from multiple feeder systems, possibly including external sources and even desktop applications. The process of merging these multiple data feeds can be very complex, because the underlying systems probably use different coding systems and may also have different periodicities. For example, in a multinational company, it is rare for subsidiaries in different countries to use the same coding system for suppliers and customers, and they may well also use different ERP systems, particularly if the group has grown by acquisition.

Cleansing data
It is depressingly common for transaction systems to be full of erroneous data which needs to be ‘cleansed’ before it is ready to be analyzed. Apart from the small percentage of accidentally mis-coded data, there will also be examples of optional fields that have not been completed. For example, many companies would like to analyze their business in terms of their customers’ vertical markets. This requires that each customer (or even each sale) be assigned an industry code; however, this takes a certain amount of effort on the part of those entering the data, for which they get little return, so they are likely, at the very least, to cut corners. There may even be deliberate distortion of the data if sales people are rewarded more for some sales than others: they will certainly respond to this direct temptation by ‘adjusting’ (ie distorting) the data to their own advantage if they think they can get away with it.

Adjusting data
There are many reasons why data may need adjusting before it can be used for analysis. In order that this can be done without affecting the transaction systems, the OLAP data needs to be kept separate. Examples of reasons for adjusting the data include:

    • Foreign subsidiaries may operate under different accounting conventions or have different year-ends, so the data may need modifying before it can be used.
    • The source data may be in multiple currencies that must be translated.
    • The management, operational and legal structures of a company may be different.
    • The source applications may use differenrt codes for products and customers.
    • Inter-company trading effects may need to be eliminated, perhaps to measure true added value at each stage of trading.
    • Some data may need obscuring or changing for reasons of confidentiality.
    • There may be analysis dimensions that are not part of the operational data (such as vertical markets, television advertising regions or demographic characteristics).

Timing
If the data in an OLAP application comes from multiple feeder systems, it is very likely that they are updated on different cycles. At any one time, therefore, the feeder applications may be at different stages of update. For example, the month-end updates may be complete in one system, but not in another and a third system may be updated on a weekly cycle. In order that the analysis is based on consistent data, the data needs to be staged, within a data warehouse or directly in an OLAP database.

History
The majority of OLAP applications include time as a dimension, and many useful results are obtained from time series analysis. But for this to be useful it may be necessary to hold several years’ data on-line in this way — something that the operational systems feeding the OLAP application are very unlikely to do. This requires an initial effort to locate the historical data, and usually to adjust it because of changes in organizational and product structures. The resulting data is then held in the OLAP database.

Summaries
Operational data is necessarily very detailed, but most decision-making activities require a much higher level view. In the interests of efficiency, it is usually necessary to store merged, adjusted information at summary level, and this would not be feasible in a transaction processing system.

Data Updating
If the application allows users to alter or input data, it is obviously essential that the application has its own separate database that does not over-write the ‘official’ operational data.

Continuer la lecture ‘OLAP architectures’