27 January 2012

The differences between a transactional database and an analytical one

Imagine the following scenario:
A business decision has been made to create a BI project. After the ETL process has been completed and the decisions have been made about what's in and what's out of the BI project; after all the data has been incorporated from several different systems (including the Excel spreadsheets in the "My Documents" folder); after the cube has been processed and reports have been created to please the business, somebody (else) from the business comes along with a question about one of the good old transactional systems, and expects the new flashy BI project to answer it. Now you need to come up with a good reason why "no can do" is actually the correct answer.
So here's an extract from an email I sent regarding this issue. I've changed the names and some of the details to avoid identification, so the transactional database is called "DB" and the analytical database is called "BI". If you decide to cut & paste this post and send it to your business analyst, change "DB" to the name of your particular database, and do the same with "BI".
More homework: compare the DB with the BI project. What are the questions that the DB is supposed to answer quickly and efficiently? What are the questions that the BI project is supposed to answer quickly and efficiently? Once done, write them down as "Examples" and spread them across your email.
Dear Business Analyst, 

Enclosed please find thorough information about DB, BI and the differences between them.

1.       Purpose:
a.       DB is a transactional database. It is meant to answer questions about the here & now. It is optimised for UPDATE, INSERT & DELETE operations.
b.      BI is an analytical database. It is optimised for aggregations (GROUP BY with SUM, AVG, COUNT etc.) and for reporting.
2.       Time:
a.       DB is a transactional database, and therefore it doesn't retain historic information. While some data is kept for auditing purposes, if you want to find out how many red bicycles were sold in the month of March 2011 you'd look in AdventureWorks_DW, not in AdventureWorks (a sketch of such a query appears after the email).
b.     BI is an analytical database and therefore it maintains historic information across the time dimension. The question in (a) is considered pretty easy for BI querying.
c.       Granularity – DB changes constantly; BI changes according to the frequency of the feed. While you can change the frequency of the feed, changes are not captured by the minute, so we'll always "miss" some changes in order to take a snapshot of the bigger picture.
3.       Differences in values:
Theoretically, you'd expect that if you download the data from BI at a certain point in time and compare it to the DB, the values would be identical. Unfortunately, that is not true. It could be because the DB contains values that are incorrect; it could be because the DB keeps some fields for the purpose of answering questions relevant to it, which are not exported to the BI as they convey very little meaning in terms of the bigger picture. Since the BI incorporates data from more than one system, differences of that kind are quite likely.
In summary:
If you compare DB and BI at a certain point in time you'll find the differences vast, and you'll probably get pretty frustrated by them. This comparison is useless.
What BI can give you (if we put the correct data inside) are changes and trends across time. BI isn’t meant for an exact snapshot at a certain point in time; that's the purpose of the DB. 

Sincerely Yours, 
Your BI developer
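Back outside the email: for the technically inclined, the "red bicycles sold in March 2011" question from point 2 becomes a single aggregate query on the analytical side. Here's a minimal sketch, assuming the standard AdventureWorksDW sample schema (FactInternetSales, DimProduct, DimDate; internet sales only, for brevity):

-- How many red bicycles were sold in March 2011?
-- One aggregation over the fact table, filtered by two dimensions.
-- (Strictly, you'd also join DimProductSubcategory/DimProductCategory to keep only bikes.)
SELECT SUM(f.OrderQuantity) AS RedBikesSold
  FROM dbo.FactInternetSales AS f
  JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
  JOIN dbo.DimDate    AS d ON d.DateKey    = f.OrderDateKey
 WHERE p.Color = 'Red'
   AND d.CalendarYear = 2011
   AND d.MonthNumberOfYear = 3;

Against the transactional AdventureWorks database the same question means stitching together order headers, order details and products, and hoping nothing has been archived away in the meantime - which is exactly the point.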

18 January 2012

Errors in the OLAP storage engine: The attribute key cannot be found when processing

The following error message appeared while I was trying to create the "bridge table" that features so nicely in the Many to Many examples by Marco Russo & Alberto Ferrari:
Errors in the OLAP storage engine: The attribute key cannot be found when processing: Table: '', Column: '', Value: ''. The attribute is ''. Errors in the OLAP storage engine: The attribute key was converted to an unknown member because the attribute key was not found. Attribute Client Group of Dimension: from Database: , Cube: TestMtoM, Measure Group: Bridge Client Adviser A, Partition: v Bridge Client Adviser A, Record: 97.


The view:

CREATE VIEW [dbo].[vBridge_ClientAdviserA]
AS
SELECT DISTINCT
      [ClientGroup]
     ,[dimAdviserId]
  FROM [dbo].dimClient



As I've explained in my previous post, I needed to create those "bridge tables" with the ClientGroup (the unique ID or "natural key" of the Client, which was necessary for finding the changes in the SCD) and the Adviser Id. The problem was with the linkage between the dimension and the measure group: SSAS must have that linkage (a foreign key - primary key relationship) set up. If there's a violation, it stops and gives you this error.
Of course, I had set up keys for indexing, and that should have done the trick, correct? The problem was that I had created the bridging view based on the source (dimension) table, which means that errors that would otherwise have been raised weren't.
So where did the problem lie? In the simple fact that in my factless fact table (which is another view that refers to dimClient) some "client group" values were missing. Although I was sure I had data integrity, it was missing for some of the clients. I was sure that a certain structure of the data existed (that's what I was specifically told!) and I didn't bother to set up keys/checks/conditions to verify it.
So I had to check it. I did that using a LEFT JOIN on the factless fact table, and found out that some rows (it's enough to have one, but I believe there were several cases) weren't strictly adhering to the business rules that were set up so carefully by my Business Analyst. As he kindly told me to "ignore it", I did, and created the bridge table as follows:

SELECT DISTINCT
      [ClientGroup]
     ,[dimAdviserId]
  FROM [dbo].factClient


Please keep in mind that factClient is a view based on dimClient: it picks up all the clients matching a certain condition (a WHERE clause) and cross joins them with the dimDate table.
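By the way, the LEFT JOIN check itself was nothing sophisticated. Here's a minimal sketch of that kind of orphan check - the direction shown (clients in dimClient that never make it into factClient) is my reading of the problem, and joining on ClientGroup alone is an assumption:

-- Which ClientGroup values exist in the dimension but have no row at all
-- in the factless fact view?
SELECT DISTINCT d.[ClientGroup]
  FROM [dbo].dimClient AS d
  LEFT JOIN [dbo].factClient AS f
         ON f.[ClientGroup] = d.[ClientGroup]
 WHERE f.[ClientGroup] IS NULL;

One such row is enough for SSAS to convert the key to an unknown member and raise the error above.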

Conclusion:
The moral of the story is obvious: don't ever assume data integrity! Always check it yourself, either with a check constraint, a key, or a simple query. If you don't, SSAS will do it for you - and you know how descriptive those error messages are!
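And since the moral mentions keys: had the bridge been a real table rather than a view, a plain foreign key would have refused the orphan rows long before SSAS complained. A purely hypothetical sketch - these tables don't exist in my DW (everything there is a view) and the data types are guesses:

-- Hypothetical: one row per distinct client natural key.
CREATE TABLE [dbo].[clientGroups]
(
    [ClientGroup] NVARCHAR(50) NOT NULL PRIMARY KEY
);

-- Hypothetical: a materialised bridge that can only reference known clients.
CREATE TABLE [dbo].[bridgeClientAdviser]
(
    [ClientGroup]  NVARCHAR(50) NOT NULL,
    [dimAdviserId] INT          NOT NULL,
    CONSTRAINT FK_bridgeClientAdviser_ClientGroup
        FOREIGN KEY ([ClientGroup]) REFERENCES [dbo].[clientGroups] ([ClientGroup])
);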

13 January 2012

How I've implemented a Many to Many relationship in SSAS


Figure 1 - the data source view.


A Many to Many relationship example in SSAS: an implementation with a difference.
It all started with a pretty innocent request:
at my work we have advisers, and they serve clients.
Sometimes clients move between advisers, and we would like to keep track of that.
So, how do you do that? Or, in technical terms:

How do you measure changes in a measure across a Slowly Changing Dimension?

(You can use a Many to Many relationship for Distinct Count as well: more about it here).
In my case there are actually two types of references to "clients". The first is the reference to dimClient, which is the original Slowly Changing Dimension (SCD), and the second is the factless fact table: or, to be exact, a view that picks up the relevant clients from the dimClient table and cross joins them with the date table, so that I have the changes per client per month.
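Schematically, that factless fact view looks something like the sketch below. This is a simplification: the real WHERE clause (which clients and dates count as "relevant") is omitted, and dimDateId as the key of the dimDate table is my assumption:

-- One row per relevant client per relevant date: dimClient cross joined with dimDate.
CREATE VIEW [dbo].[factClient]
AS
SELECT  c.[ClientGroup]
       ,c.[dimAdviserId]
       ,d.[dimDateId]
  FROM [dbo].dimClient AS c
 CROSS JOIN [dbo].dimDate AS d
 WHERE 1 = 1;   -- placeholder for the real business conditions on clients and dates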
As the clients sit in a factless fact table, I googled "factless fact table", and before long (actually, at link #7) I reached this invaluable article: The Many-to-Many Revolution. It was written by Marco Russo & Alberto Ferrari and it is brilliant. I'm not even trying to closely imitate it - I'm just giving a short example of how it worked in my cube, that's all.

So the problem:
There's the SCD (Slowly Changing Dimension) called dimClient. For all sorts of reasons (you can read about them here if you're really interested) I've made a "factless" fact table out of it. Now the end user (or the business analyst, whatever you call them) would like to know what happened to different clients over time. The first question referred to Advisers: clients move between advisers, so how do you keep track of that?

The SQL part of the solution

After reading the article closely, printing the relevant part (in this case pages 63-69), and going over it thoroughly several times, I created 3 new views:

The original view, factClient, became the measure group Adviser A.
A new view, called v_bridge_Adviser, with the following structure:

SELECT DISTINCT
      [dimDateId]
     ,[dimAdviserId]
     ,[ClientGroup]
  FROM [dbo].[factClient]



This view gives me another link between the ClientGroup (the natural key for every client), the Adviser Id and the Date Id. It became the measure group Adviser B.

The next two views are identical apart from their names:


CREATE VIEW [dbo].[vBridge_ClientAdviserA]
AS
SELECT DISTINCT
      [ClientGroup]
     ,[dimAdviserId]
  FROM factClient


and


CREATE VIEW [dbo].[vBridge_ClientAdviserB]
AS
SELECT DISTINCT
      [ClientGroup]
     ,[dimAdviserId]
  FROM factClient



Brilliant!
The Data Modelling part of the solution
Now all I had to do was:
add the tables to the dsv
make sure that they are linked correctly
create the new measure groups ("Adviser A", "Adviser B", "Bridge Client Adviser A" and "Bridge Client Adviser B")
process the data
find an incompatibility in the data source
sort it out (details in the next post)
And this is how it looks in the "measure group".





Figure 2 - the cube and the measures usage
  Please note:
Measure group "Adviser A" is simply factClient
Measure group "Adviser B" is v_bridge_Adviser
Measure group "Bridge Client Adviser A" is vBridge_ClientAdviserA, and
Measure group "Bridge Client Adviser B" is vBridge_ClientAdviserB

The most important tip is:
Make sure that the dimensions and the measure groups are all linked correctly.
If you don't, you'll get "weird" numbers that you'll have no way of reconciling with each other, and that's a very frustrating experience.
As the entire process is done with DSVs and cubes, you need to be able to read the visual diagrams rather than code.
Another tip (sounds obvious, but still): first link all the "regular" measure groups, and only after that the many-to-many ones.

The Dimensions:
dimDate 1 and dimDate are the "dear old" date dimensions. I had to add dimDate to the cube a second time, but there was no need to redefine the dimension itself (hierarchy and all of that stuff).
dimAdviser 1 and dimAdviser are the Adviser dimensions, referring to the dimAdviser table from the dsv (see Figure 1).
vBridgeClient 1 is the vBridgeClient view (I don't know why I ended up with two of them in the dimensions; some things I found too difficult to deal with). It gives you a list of the distinct ClientGroup members (the natural key - there's only one natural key for clients in my DW; a sketch of the view follows this list). Getting the distinct members enables me not only to get the distinct count, but also to find out who exactly moved where - very important data, as this is an SCD type 2.
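The vBridgeClient view itself isn't shown in this post. For reference, it is essentially just the distinct list of client natural keys - something along these lines (that it selects from factClient rather than dimClient is my assumption, in line with the fix from the previous post):

CREATE VIEW [dbo].[vBridgeClient]
AS
SELECT DISTINCT
      [ClientGroup]
  FROM [dbo].factClient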

And after all is said & done, this is the very complicated MDX I wrote to use it (crap, the user can do it by drag & drop themselves!):

SELECT
// one member of the Adviser A dimension (Id 20) on columns
[Adviser A].[Id].&[20] ON 0,
// all members of the Adviser B dimension on rows, empty rows suppressed
NON EMPTY [Adviser B].[Id].allmembers ON 1
FROM [TestMtoM]
// sliced by Aug 2011 on the Date A dimension and Oct 2011 on the Date B dimension
WHERE ([Date A].[Id].[Aug 2011],
[Date B].[Year-Month].[Month Year].&[Oct 2011])




Then I was asked to do the same, not with the "Adviser" dimension but with the Category dimension. It's a different dimension, but when it came to implementation, I realized something very important:
I don't need the "bridge Category A" and "Bridge Category B" measure groups!
They could be useful in Marco & Alberto's example, but not in mine. Their example deals with checking changes in a dimension linked to a fact table - I'm dealing with changes in a factless fact table. In my case there is far less data every month, and the view that links the advisers, clients & dates isn't large; the total number of rows is less than half a million - that's not a lot! (And that's for a year and a half's worth of data, when we don't need to retain history for more than 3 years.)
Therefore, I deleted the "Bridge Client Adviser A" and "Bridge Client Adviser B" measure groups, and this is how the end result looks:

As you can clearly see, while we still need the many to many relationship, we don't need it in as many measure groups, nor in the complicated manner that was used before.
Yes, it works brilliantly, and yes, the data is accurate; it gives you the end results just as it should!

One last thing: yes, it is worth your while to set up the measure groups correctly. Of course you could write those complicated SQL queries instead, but why should you bother? This way your Business Analyst can create the reports him/herself, in the best interest of the business, without you worrying about the details of the display.