OLAP architectures

Data staging

Most data in OLAP applications originates in other systems. However, in some applications (such as planning and budgeting), the data might be captured directly by the OLAP application. When the data comes from other applications, it is usually necessary for the active data to be stored in a separate, duplicated, form for the OLAP application. This may be referred to as a data warehouse or, more commonly today, as a data mart. For those not familiar with the reasons for this duplication, this is a summary of the main reasons:

Performance
OLAP applications are often large, but are nevertheless used for unpredictable interactive analysis. This requires that the data be accessed very rapidly, which usually dictates that it be kept in a separate, optimized structure which can be accessed without damaging the response from the operational systems.

Multiple data sources
Most OLAP applications require data sourced from multiple feeder systems, possibly including external sources and even desktop applications. The process of merging these multiple data feeds can be very complex, because the underlying systems probably use different coding systems and may also have different periodicities. For example, in a multinational company, it is rare for subsidiaries in different countries to use the same coding system for suppliers and customers, and they may well also use different ERP systems, particularly if the group has grown by acquisition.

Cleansing data
It is depressingly common for transaction systems to be full of erroneous data which needs to be ‘cleansed’ before it is ready to be analyzed. Apart from the small percentage of accidentally mis-coded data, there will also be examples of optional fields that have not been completed. For example, many companies would like to analyze their business in terms of their customers’ vertical markets. This requires that each customer (or even each sale) be assigned an industry code; however, this takes a certain amount of effort on the part of those entering the data, for which they get little return, so they are likely, at the very least, to cut corners. There may even be deliberate distortion of the data if sales people are rewarded more for some sales than others: they will certainly respond to this direct temptation by ‘adjusting’ (ie distorting) the data to their own advantage if they think they can get away with it.

Adjusting data
There are many reasons why data may need adjusting before it can be used for analysis. In order that this can be done without affecting the transaction systems, the OLAP data needs to be kept separate. Examples of reasons for adjusting the data include:

    • Foreign subsidiaries may operate under different accounting conventions or have different year-ends, so the data may need modifying before it can be used.
    • The source data may be in multiple currencies that must be translated.
    • The management, operational and legal structures of a company may be different.
    • The source applications may use differenrt codes for products and customers.
    • Inter-company trading effects may need to be eliminated, perhaps to measure true added value at each stage of trading.
    • Some data may need obscuring or changing for reasons of confidentiality.
    • There may be analysis dimensions that are not part of the operational data (such as vertical markets, television advertising regions or demographic characteristics).

Timing
If the data in an OLAP application comes from multiple feeder systems, it is very likely that they are updated on different cycles. At any one time, therefore, the feeder applications may be at different stages of update. For example, the month-end updates may be complete in one system, but not in another and a third system may be updated on a weekly cycle. In order that the analysis is based on consistent data, the data needs to be staged, within a data warehouse or directly in an OLAP database.

History
The majority of OLAP applications include time as a dimension, and many useful results are obtained from time series analysis. But for this to be useful it may be necessary to hold several years’ data on-line in this way — something that the operational systems feeding the OLAP application are very unlikely to do. This requires an initial effort to locate the historical data, and usually to adjust it because of changes in organizational and product structures. The resulting data is then held in the OLAP database.

Summaries
Operational data is necessarily very detailed, but most decision-making activities require a much higher level view. In the interests of efficiency, it is usually necessary to store merged, adjusted information at summary level, and this would not be feasible in a transaction processing system.

Data Updating
If the application allows users to alter or input data, it is obviously essential that the application has its own separate database that does not over-write the ‘official’ operational data.

Storing active OLAP data

Given the necessity to store active OLAP data in an efficient, duplicated form, there are essentially three options. Many products can use more than one of these, sometimes simultaneously. Note that ’store’ in this context means holding the data in a persistent form (for at least the duration of a session, and often shared between users), not simply for the time required to process a single query.

Relational database
This is an obvious choice, particularly if the data is sourced from an RDBMS (either because a data warehouse has been implemented using an RDBMS or because the operational systems themselves hold their data in an RDBMS). In most cases, the data would be stored in a denormalized structure such as a star schema, or one of its variants, such as snowflake; a normalized database would not be appropriate for performance and other reasons. Often, summary data will be held in aggregate tables.

Multidimensional database
In this case, the active data is stored in a multidimensional database on a server. It may include data extracted and summarized from legacy systems or relational databases and from end-users. In most cases, the database is stored on disk, but some products allow RAM based multidimensional data structures for greater performance. It is usually possible (and sometimes compulsory) for aggregates and other calculated items to be pre-computed and the results stored in some form of array structure. In a few cases, the multidimensional database allows concurrent multi-user read-write access, but this is unusual; many products allow single-write/multi-read access, while the rest are limited to read-only access.

Client-based files
In this case, relatively small extracts of data are held on client machines. They may be distributed in advance, or created on demand (possibly via the Web). As with multidimensional databases on the server, active data may be held on disk or in RAM, and some products allow only read access.

These three locations have different capacities, and they are arranged in descending order. They also have different performance characteristics, with relational databases being a great deal slower than the other two options.

 

Processing OLAP data

Just as there are three possible locations for OLAP data, exactly the same three options are available for processing the data. As will be seen, the multidimensional calculations do not need to occur in the place where the data is stored.

SQL
This is far from being an obvious choice to perform complex multidimensional calculations, even if the live OLAP data is stored in an RDBMS. SQL does not have the ability to perform multidimensional calculations in single statements, and complex multi-pass SQL is necessary to achieve more than the most trivial multidimensional functionality. Nevertheless, this has not stopped vendors from trying. In most cases, they do a limited range of suitable calculations in SQL, with the results then being used as input by a multidimensional engine, which does most of the work, either on the client or in a mid-tier server. There may also be a RAM resident cache which can hold data used in more than one query: this improves response dramatically.

Multidimensional server engine
This is an obvious and popular place to perform multidimensional calculations in client/server OLAP applications, and it is used in many products. Performance is usually good, because the engine and the database can be optimized to work together, and the availability of plenty of memory on a server can mean that large scale array calculations can be performed very efficiently.

Client multidimensional engine
On the assumption that most users have relatively powerful PCs, many vendors aim to take advantage of this power to perform some, or most, of the multidimensional calculations. With the expected rise in popularity of thin clients, vendors with this architecture are having to move most of the client based processing to new Web application servers.

Source

0 Réponses vers “OLAP architectures”


  1. Pas de commentaire

Laisser un commentaire