26 October 2011

Recursive Documented MDX

The challange:

I was asked by my business analyst to group the end of month results like this:
FUM
Number of clients (CG ID)
Sum of FUM
Up to $100,000


$100,001 - $250,000


$250,001 - $500,000


$500,001 - $1,000,000


$1,000,001 plus


Summarise the value of the portfolio per client and group them according to their values. In a way it means to give a dimension value to the measures. That has to be done, of course, while not losing the granuality of the data.

The data, obviously, looks like this:
Client id, Security id, Security value, date id

(and we insert the data on a monthly basis).
In order to achieve this I've googled, and googled, and googled some more, and reached the following links:
First, there is a term called Recursive MDX, and thanks to Mosha's excellent work you can read about it. Another thanks are due to Richard Lee, who explains when not to use it (hint: there are plenty of date functions in MDX that exist right now, so please learn them if you need to summarise "by month up till now"). I also read his post regarding Group by Measure Range Dynamically and the code below is based on his. The only difference you'll find below is that I've added much needed documentation, as I wasn't even sure about what I was doing (a hint: lots of copy, paste, try & error...)So here's the code, and below please find the documentation:



WITH
MEMBER Measures.FumRange  AS
CASE
    WHEN [Measures].[FUM Value] <= 0 THEN 'Negative'
   WHEN [Measures].[FUM Value] > 0 AND  [Measures].[FUM Value]<=100000 THEN '0-100K'
   WHEN [Measures].[FUM Value] >100000 AND [Measures].[FUM Value] <=250000 THEN '100K-250K'
   WHEN [Measures].[FUM Value]> 250000 AND [Measures].[FUM Value] <= 500000 THEN '250K-500K'
   WHEN [Measures].[FUM Value] > 500000 AND [Measures].[FUM Value] <= 1000000 THEN '500K-1M'
   WHEN [Measures].[FUM Value] > 1000000 THEN 'Over 1M'
END   
SET [OrderedFUM] AS (
    FILTER(   
    ORDER (([Dim Portfolio Client].[Dim Client Group Id].children,
        [Dim Date].[Id].&[20110630]),
        [Measures].[FUM Value], BASC),
        ISEMPTY ([Measures].[FUM Value] )=0)
       
    )   

MEMBER Measures.FumRank AS RANK (([Dim Portfolio Client].[Dim Client Group Id].currentmember,
                                    [Dim Date].[Id].&[20110630]),
                                    [OrderedFum])
                                   
MEMBER MEASURES.RangeTot AS
    IIF((OrderedFUM.item(Measures.FumRank-2), Measures.FumRange)= Measures.FumRange,
             Measures.[FUM Value] + ( Measures.RangeTot ,OrderedFUM.item(Measures.FumRank-2)),
             Measures.[FUM Value])
   
MEMBER MEASURES.RangeCountCG  AS
    IIF((OrderedFUM.item(Measures.FumRank-2), Measures.FumRange)= Measures.FumRange,
            1 + (Measures.RangeCountCG, OrderedFUM.item(Measures.FumRank-2)),
            1)
MEMBER MEASURES.RangeTotal AS
    IIF((OrderedFUM.item(measures.FumRank),Measures.FumRange)=Measures.FumRange,
        NULL,
        Measures.RangeTot),
    FORMAT_STRING = "$#,##0.00;-$#,##0.00"                
   
   
SELECT {[Measures].FumRange, Measures.RangeTotal, Measures.RangeCountCG} ON 0,
NONEMPTY(OrderedFUM, Measures.RangeTotal) ON 1                                   
FROM [xplan DWH]


So, what's going on in here?
(The definitions are from Kevin S. Goff Article: The Baker's Dozen. I recommend reading it as he writes clearly and the information is invaluable)
First of all, we're defining our own measures, and that's why I had to begin with

WITH

keyword, tells that this is what comes below is the list of measures to use.
Second, the measures are:
A list of the values to group by, a "CASE" statement: the same in MDX as it would be in SQL.


MEMBER Measures.FumRange  AS
CASE
    WHEN [Measures].[FUM Value] <= 0 THEN 'Negative'
   WHEN [Measures].[FUM Value] > 0 AND  [Measures].[FUM Value]<=100000 THEN '0-100K'
   WHEN [Measures].[FUM Value] >100000 AND [Measures].[FUM Value] <=250000 THEN '100K-250K'
   WHEN [Measures].[FUM Value]> 250000 AND [Measures].[FUM Value] <= 500000 THEN '250K-500K'
   WHEN [Measures].[FUM Value] > 500000 AND [Measures].[FUM Value] <= 1000000 THEN '500K-1M'
   WHEN [Measures].[FUM Value] > 1000000 THEN 'Over 1M'
END   



Then I need to define the SET. The Set statement 

SET [OrderedFUM] AS (
    FILTER(   
    ORDER (([Dim Portfolio Client].[Dim Client Group Id].children,
        [Dim Date].[Id].&[20110630]),
        [Measures].[FUM Value], BASC),
        ISEMPTY ([Measures].[FUM Value] )=0)
       
    )     


enables me to group all the Sec Value (we use the term FUM for this measure for some reasons) by client id according to the month of June. Not all the clients have FUM value every month, and that's why I FILTER to check if there is value, using the ISEMPTY keyword. I order BASC by the measures and please note that
[Dim Portfolio Client].[Dim Client Group Id].children
is Actually equivalent to
[Dim Portfolio Client].[Dim Client Group Id]
I added it for readability only. 

Now I needed to Rank every member of the set according to their value, using the RANK Keyword.

  MEMBER Measures.FumRank AS RANK (([Dim Portfolio Client].[Dim Client Group Id].currentmember,
                                    [Dim Date].[Id].&[20110630]),
                                    [OrderedFum])



And now the real fun begin.
First of all, I need to summarize all the members of a certain set that falls in a certain category, as defined in the first measure. This is where the recursive MDX is defined:
MEMBER MEASURES.RangeTot AS
    IIF((OrderedFUM.item(Measures.FumRank-2), Measures.FumRange)= Measures.FumRange,
             Measures.[FUM Value] + ( Measures.RangeTot ,OrderedFUM.item(Measures.FumRank-2)),
             Measures.[FUM Value])



Or, in English:
The Measure "RangeTot" gives you for each member the previous [FUM VALUE] in the  measures. The IIF statement says: If the the measure falls within the FUM range, then add it to the previous one (note the recursion defined), if it doesn't fall than this is our stopping point (the false condition).

But if you try to run it you wouldn't get much, you actually need to call this measure using
MEMBER MEASURES.RangeTotal AS
    IIF((OrderedFUM.item(measures.FumRank),Measures.FumRange)=Measures.FumRange,
        NULL,
        Measures.RangeTot),
    FORMAT_STRING = "$#,##0.00;-$#,##0.00"                 



And I've formatted the string to show "currency" as give it much required readability.
The same recursion is required in order to count the number of clients in the client group, and this is done like this:

MEMBER MEASURES.RangeCountCG  AS
    IIF((OrderedFUM.item(Measures.FumRank-2), Measures.FumRange)= Measures.FumRange,
            1 + (Measures.RangeCountCG, OrderedFUM.item(Measures.FumRank-2)),
            1)

Once again, the condition for stopping is for not being in the range, then you just add one, otherwise continue counting.

And now in order to display the data:

SELECT {[Measures].FumRange, Measures.RangeTotal, Measures.RangeCountCG} ON 0,
NONEMPTY(OrderedFUM, Measures.RangeTotal) ON 1                                   
FROM [xplan DWH]



The measures RangeTotal must appear both on Rows and on Columns (0 and 1 Axys), otherwise it isn't summed up. I could have used Measures.RangeTot instead of RangeTotal ON 0, but that wouldn't have been formatter. Of course, there's no need to "format" a recursive expression and yes, this code works and displays the data as it should!

Comments:
  1. Write the same code in SQL for a certain month just in order to validate your data; I love the way you can give more power to the business analyst by well defined measures in MDX
  2. I'll write in my next post about how to parameterise the month and how to display the data.

18 October 2011

Dimension Relationship in a Recursive Hierarchy

I'm now working on a data warehouse project with SSAS 2008 R2 (Sql Server Analysis Services). It is very refreshing to do something different after doing MS-Access for so long.

While designing the dimensions in Analysis Services I had the following problem: working in a Financial Company, every client has a client group. Sometimes it is the same client group as the client, and sometimes it isn't. Pretty classic material:
John & Jane Smith (account id: 1) and John Smith (account id 2) both belong to the same "client group" of John & Jane Smith. So far nothing special: the dimClient table has a self referencing attribute to dimClient with the id of dimClientGroupId.  The problem arises when I need to group by properties of the client group, and not by the client itself. Sometimes the adviser for "dimClient" is different then the adviser of "dimClientGroup", and the business needs to view the data by the Client Group, and not by the Client.
So I have quariable attributes that are different in different parent-child levels and I need to present the data properly. I tried using the "client group adviser" as an attribute to the client but it didn't work. It also didn't group the data correctly.

So, how do I use the attributes of the parent dimension in a self-referencing dimension?(Please note: using "self referencing" is better than "parent-child", because then you get to all sort of parenting material; useful, but not what I'm looking for right now!)

Generally, it is advised against using too many parent-child hierarchies in Analysis Services, unless "there's a genuine business requirement" (which happens to be the case).

So, after designing all the dimensions I had to link the "adviser" dimension to the "client" dimension. That was easy (and maybe even done automatically) as every client has an adviser. Now I had to link the "adviser" dimension to the "client group" dimension and this is where I got stuck (after all, "client group" is a self-referencing dimension and is treated by SSAS differently!).
My solution has two levels. I might change between them in the future, as there are no guarantees that my first level will continue to work.

Using "referenced dimensions" I linked the "adviser" dimension to "client".
Now, I've linked the "client" dimension to "client" again and - voila! - I have the "client group" dimension and I can view in the cube the "client group.adviser" dimension and filter by it. A warning: trying to deploy the cube generates an error: "Errors in the metadata manager. The 'Dim Client Group Id' intermediate granularity attribute of the 'Dim Client' measure group dimension does not have an attribute hierarchy enabled.". Nevertheless the cube can still be processed and view from the SSMS - Analysis services, and the numbers are correct.
There is a different solutions to the same problem that is suggested in here (under the "parent-child Hierarchies" short section):
As the numbers of Hierarchies in this business model is fixed (every client has one and only one client group), Creating a bridge table of "portfolio client" which includes only the client and the client group (both are foreign keys to the "dim client" table) and again, using it as a reference dimension as follwos:
[fact]portfoio -> [dim]portfolio client ->[dim] Client Group ->[dim]Adviser sorts this problem in an even clearer manner and without ugly error messages. A good explanation about bridge table could be found in here (they bought me with the term "Graph Theory").

10 October 2011

Change a dimension to a fact

Life don't always follow the "best practicing" methodologies of Life. "Star Schema" is the best and most preferred way to design the data warehouse; unfortunately, it's not always possible. In the industry I'm at (financial planning) you would expect the measures to be some financial product valuable (i.e. money); so the major feed is actually clients and their financial products. The portfolio is the "fact" table (easy to decide by the size of it) while the clients and their attributes are the "dimension" table. Sounds OK, until they asked for the number of clients. Easy enough, you'd say, every client must have a representation in the "portfolio" fact table, so all we have to do is to use "distinct count" on the "client id" in the fact table, right? Wrong. Some clients have financial products, but not in the feed of the portfolio but in some other file which hasn't been created yet. Again, before you scream "bad design!" may I add that the original intention was to create a "source of truth" which lays in an outsourced system, and only later to create the date warehouse itself?
So here I am, trying to create count of clients per month per adviser, or in DW/BI terms: trying to use a dimension table of a slowly moving dimension (that has the "date" only as time attributes) as a fact table (which would have a link to the date as a dimension). Or, as my Business Analyst put it: Before, the "portfolio" was in the center of what we want. Now, the center of attention is the client.
The solution?

Using indexed view from dimClient, created a view called "factClient". The view takes only the necessary fields from the dimClient table (no "date inserted" and no "valid till" - which is defaulted to 31 of December, 9999) and joins them to the dimDate table.
The view is created like this:


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

SET CONCAT_NULL_YIELDS_NULL ON
GO

SET ANSI_WARNINGS ON
GO




CREATE VIEW [dbo].[factClient]
WITH SCHEMABINDING
AS
SELECT d.[id] AS dimDateId
      ,1 AS ClientGroupCount
        ,C.[id]

      ,C.dimAdviserId
      -- Other dimension fields and other necessary fields
      --
      --
  FROM [dbo].[dimDate] d INNER JOIN
    [dbo].[dimClient] C ON c.DateInserted <= d.DateFormat AND c.ValidTill > d.DateFormat
  WHERE d.HasPast <> 0






As you can see I deleted (for the sake of this view) the columns that are particular to my database, but what I've kept are the followings:
First, I've added a field in the Date dimension table that relates weather this data has past or not. This fields get updated after the ETL process completes. The reason for that is that I didn't want to inner join with the (large) fact Portfolio table, and the other is that any other way of querying that client would result with an error of "Cannot create index on view 'dbo.factClient'. The function 'getdate' yields nondeterministic results. Use a deterministic system function, or modify the user-defined function to return deterministic results."
The second thing that happens is that it gives you the user id, and a dimension, like this:
Member Id   dimDateID      dimAdviser Id    Count
1                  1 Dec 2010     1                            1
1                  2 Dec 2010     2                            1
......
etc,
Which gives me the ability to query by adviser on a different date dimension. Giving every line the value of "1" enables me to "sum" without using a "distinct count" (or any count, for that matter) as distinct count requires a unique measure group.
Please note that I'm using SQL Server 2008 R2 which enables me to use the "date" data type instead of "datetime" datatype.
Please note that the "dateFormat" field in the "dimDate" table is in order to give the "date" version of the date id: so when the key is 20101130 (data type: int) the "date format" is '20101130' (data type: date).
(More about "Indexed Views" in here)

Now there's a new fact table the following stages should be made in the Analysis Services project:
Add the new fact table to the DSV (data source view) file
Create the necessary relationships between the fact table and the dimensions surrounding it
Create a new measure group in the cube. Make sure that the measures are defined correctly (the [Member id] field  is an INT, but I don't want it to be summed!)
Hand it over to your Business Analyst and hope that s/he approves...