15 January 2014

Distinct Count using Many to Many relationship - SSAS

I've blogged here about implementing a many to many relationship. That was in order to view changes in a dimension across time (like Clients changing Advisers between one month to another). This time I want to blog about implementing the same technique in order to view the distinct count of Clients, as obviously it is a very important measure for an advisory company.
The problem with Distinct Count for SSAS 2008 R2 is that it is heavy to use. Of course you can use some methods to improve the performance, but nevertheless it is a heavy task. In my not-too-big data warehouse eliminating the measure altogether gave me 1 minute of benefit in the processing, which is a lot. So instead of going Measures -> New Measure... -> and choosing Distinct Count for the appropriate measure (in my case: the natural key of the Client, as Clients are a Slowly Changing Dimension).
I recommend reading the wonderful document The many to many revolution. This article was written by Marco Russo & Alberto Ferrari and I base this blog entry on their work, and obviously on what works for me. You can read an example in this blog entry, but I think he ignores Slowly Changing nature of the Slowly Changing Dimension.
So this is how it's done properly:

Distinct Count on SCD type 2 using Many to Many relationship

The SQL part

First, you need to create a view on the SQL server. This view bridge the Client and the Fact Table.
Fact Table: FactCommission
Dim Table: DimClient
I don't want the (surrogate) key which links the Fact Commission to Dim Client: I want the Natural key!
Therefore I created a view as follows:

SELECT DISTINCT [Client Natural Key]
FROM         dbo.factCommission

Since I can't create an Index on a view which uses DISTINCT, I had to change the view as follows:

CREATE VIEW [dbo].[v_bridge_fCommissionClient]
WITH SCHEMABINDING
AS
SELECT  [Client Natural Key]
COUNT_BIG(*) AS totalNumber
FROM         dbo.factCommission
GROUP BY [Client Natural Key]


I also created a Unique Clustered Index on the [Client Natural Key] field. I am not certain that this part is necessary when you use M2M, on the other hand I don't think it hurts...

The Data Modelling Part

After you add the v_bridge_fCommissionClient table to the data source view, and add it to the Cube Structure. First it is a Dimension, and then you create a measure out of it (when it is based as a fact).
The dimension has obviously only one attribute: [Client Natural Key]. When you add it as a dimension it is important to make sure the the link to the Fact Commission is based on the [Client Natural Key].
And now in order to make this work across the board you must link *all* of the relevant dimensions from Fact Commission to the new dimension using the Many To Many relationship. Many would advise you to link the Time dimension via the Fact Table using the Many To Many: true, but not enough. All the dimensions that are needed to be queried are queried via the Fact Commission measure. Therefore, you need to link them as well using Many To Many relationship (see below). 


This way when you browse the data you get the numbers correctly for all dimensions.

Conclusion

I'm happy to say that:
  1. It does give you the correct data across the board just like the good old distinct count 
  2. It works like a charm and there seems to be no overhead for the SSAS engine on the extra data. I believe this new Measure helps even helps the Distinct Count measure (but maybe it's just a belief).
  3. Once you get the hang of it, it is actually even simpler than using Distinct Count...
  4. Marco & Alberto claims that it is easier on the Caching. I'll have to see about that.








5 comments:

  1. Hello,
    Thank you for post. I've applied your method, the measure calculation seems correct by when I drag with time dimension the value of current week shows the distinct count total of this week for each day of current week. The values of last week are correct.
    Did you have an idea?
    Thanks,
    Best regards,
    AG

    ReplyDelete
    Replies
    1. Hi AG,
      If I understand correctly you have a situation in which you sum up by week. If you show
      week 1
      week 2
      week 3
      current week:
      Monday,
      Tuesday,
      etc
      Then the calculation shows what it asks of it (to show the calculation by date; it doesn't understand that if you ask "by week" and you show each day you want to see it "by day").
      Other option is that you're missing a data linkage somewhere. But I am not sure of that.
      Let me know if it works!

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Hi Hila DG,
    Thank you for your answer. I'm checking my data linkage and let you know. In parallel, I'm showing you the comparison between the distinct count from your method and the SSAS native discount count :

    Week Day "Nb distinct Blog" "Nb native distinct count"
    W24 13/6/2016 4148 4148
    14/6/2016 4119 4119
    15/6/2016 3920 3920
    16/6/2016 3414 3414
    17/6/2016 4040 4040
    18/6/2016 2193 2193
    19/6/2016 3 3
    Total 21837 21837
    W25 20/6/2016 12645 4080
    21/6/2016 12104 4226
    22/6/2016 12104 3733
    23/6/2016 12104 606
    Total 12645 12645
    Total 576162 576162
    Grand Total 576162 576162


    As you see, on the week "W24", I have the same results, the measure is correct. On "W25", I have just the total of distinct count of the week (12645) in one day ( 20/6/2016) and other incorrect values.
    The last process date of cube was 24/06/2016. Only values under the current week are wrong.

    PS: "Nb distinct Blog" is the measure calculated with your method

    ReplyDelete
  4. Hi Hila,
    I'm investigating the SQL server version used, on my development environment, I was on SQL server 2008 R2 RTM and deploy it in a SQL server 2008 R2 SP2 version. There is a lot of bugs correction after this RTM version. I'm installing a patch with the last release of SQL Server 2008 R2 and test it again.

    Best regards,
    AG

    ReplyDelete