The problem with Distinct Count for SSAS 2008 R2 is that it is heavy to use. Of course you can use some methods to improve the performance, but nevertheless it is a heavy task. In my not-too-big data warehouse eliminating the measure altogether gave me 1 minute of benefit in the processing, which is a lot. So instead of going Measures -> New Measure... -> and choosing Distinct Count for the appropriate measure (in my case: the natural key of the Client, as Clients are a Slowly Changing Dimension).
I recommend reading the wonderful document The many to many revolution. This article was written by Marco Russo & Alberto Ferrari and I base this blog entry on their work, and obviously on what works for me. You can read an example in this blog entry, but I think he ignores Slowly Changing nature of the Slowly Changing Dimension.
So this is how it's done properly:
Distinct Count on SCD type 2 using Many to Many relationship
The SQL part
First, you need to create a view on the SQL server. This view bridge the Client and the Fact Table.Fact Table: FactCommission
Dim Table: DimClient
I don't want the (surrogate) key which links the Fact Commission to Dim Client: I want the Natural key!
Therefore I created a view as follows:
SELECT DISTINCT [Client Natural Key]
FROM dbo.factCommission
Since I can't create an Index on a view which uses DISTINCT, I had to change the view as follows:
CREATE VIEW [dbo].[v_bridge_fCommissionClient]
WITH SCHEMABINDING
AS
SELECT [Client Natural Key]
COUNT_BIG(*) AS totalNumber
FROM dbo.factCommission
GROUP BY [Client Natural Key]
I also created a Unique Clustered Index on the [Client Natural Key] field. I am not certain that this part is necessary when you use M2M, on the other hand I don't think it hurts...
The Data Modelling Part
After you add the v_bridge_fCommissionClient table to the data source view, and add it to the Cube Structure. First it is a Dimension, and then you create a measure out of it (when it is based as a fact).
The dimension has obviously only one attribute: [Client Natural Key]. When you add it as a dimension it is important to make sure the the link to the Fact Commission is based on the [Client Natural Key].
And now in order to make this work across the board you must link *all* of the relevant dimensions from Fact Commission to the new dimension using the Many To Many relationship. Many would advise you to link the Time dimension via the Fact Table using the Many To Many: true, but not enough. All the dimensions that are needed to be queried are queried via the Fact Commission measure. Therefore, you need to link them as well using Many To Many relationship (see below).
This way when you browse the data you get the numbers correctly for all dimensions.
Conclusion
I'm happy to say that:
- It does give you the correct data across the board just like the good old distinct count
- It works like a charm and there seems to be no overhead for the SSAS engine on the extra data. I believe this new Measure helps even helps the Distinct Count measure (but maybe it's just a belief).
- Once you get the hang of it, it is actually even simpler than using Distinct Count...
- Marco & Alberto claims that it is easier on the Caching. I'll have to see about that.
Hello,
ReplyDeleteThank you for post. I've applied your method, the measure calculation seems correct by when I drag with time dimension the value of current week shows the distinct count total of this week for each day of current week. The values of last week are correct.
Did you have an idea?
Thanks,
Best regards,
AG
Hi AG,
DeleteIf I understand correctly you have a situation in which you sum up by week. If you show
week 1
week 2
week 3
current week:
Monday,
Tuesday,
etc
Then the calculation shows what it asks of it (to show the calculation by date; it doesn't understand that if you ask "by week" and you show each day you want to see it "by day").
Other option is that you're missing a data linkage somewhere. But I am not sure of that.
Let me know if it works!
This comment has been removed by the author.
ReplyDeleteHi Hila DG,
ReplyDeleteThank you for your answer. I'm checking my data linkage and let you know. In parallel, I'm showing you the comparison between the distinct count from your method and the SSAS native discount count :
Week Day "Nb distinct Blog" "Nb native distinct count"
W24 13/6/2016 4148 4148
14/6/2016 4119 4119
15/6/2016 3920 3920
16/6/2016 3414 3414
17/6/2016 4040 4040
18/6/2016 2193 2193
19/6/2016 3 3
Total 21837 21837
W25 20/6/2016 12645 4080
21/6/2016 12104 4226
22/6/2016 12104 3733
23/6/2016 12104 606
Total 12645 12645
Total 576162 576162
Grand Total 576162 576162
As you see, on the week "W24", I have the same results, the measure is correct. On "W25", I have just the total of distinct count of the week (12645) in one day ( 20/6/2016) and other incorrect values.
The last process date of cube was 24/06/2016. Only values under the current week are wrong.
PS: "Nb distinct Blog" is the measure calculated with your method
Hi Hila,
ReplyDeleteI'm investigating the SQL server version used, on my development environment, I was on SQL server 2008 R2 RTM and deploy it in a SQL server 2008 R2 SP2 version. There is a lot of bugs correction after this RTM version. I'm installing a patch with the last release of SQL Server 2008 R2 and test it again.
Best regards,
AG