MDX offers many powerful statistical functions, however they have their limitations.  One of the things I’ve had to learn this past year is what types of statistical functions are available, and what they can and can’t do.  And when they can’t do what I want, I’ve had to come up with alternatives.

One such case of this came up with calculating medians.  In case you need a reminder of what a median is, the median of a set of data is the datapoint that falls in the middle, dependent on some ordering.  If there are an odd number of datapoints, it’s the point in the exact middle, if it’s an even number of points, it’s the average of the two points in the middle.  For example, if you have the set of points {4,1,3,5,6} the median point (ordered by magnitude) is 4.  If the set was {4,1,3,5} then the median point is the average of 3 and 4, or 3.5.

There is a median function in MDX, with syntax Median(<<set>>, <<numeric expression>>) which would handle our above example easily.  However, what happens if we need the median of a set of something other than numbers.  Perhaps we want the median name ordered alphabetically.  Or in the case that sparked this point, what if we wanted the median date of an event?

We have to create our own calculation, based on what we know about medians. I will actually show and explain two ways of doing this.  The first way works regardless of the data type, so long as you have an ordering you wish to use.  The second way is specifically for date types (and is a bit more clever in my opinion), but you may be able to use the idea behind it to come up with other examples.

The nice thing about using dates is they are already ordered if you have your Date dimension built correctly, so we don’t even have to specify an ordering.  If you have to though, you can always choose to use Order() to order your set however you choose to.

I will be using the AdventureWorks 2008 cube.  Here is our query:

with
member [Measures].[MedianDate] as

Tail(
Head(
Nonempty(([Ship Date].[Calendar].[Date],
[Product].[Product Categories].CurrentMember)
, [Measures].[Order Count])
,Nonempty(([Ship Date].[Calendar].[Date],
[Product].[Product Categories].CurrentMember)
, [Measures].[Order Count]).Count/2+1)
).Item(0).Item(0).Name

select [Measures].[MedianDate] on columns,

Nonempty([Product].[Product Categories].Children
, [Measures].[Order Count]) on rows

from [Adventure Works]

Let’s dissect our calculation.  We have our set of Ship Dates, sliced by the Product Categories member that is on the row.  The set is already ordered since Dates are ordered in the cube already.  So which date, of this set do we want?  Well, we want the middle date.  So we count the number of data points and divide by 2.  We add the one because since we are using the Head() function, if our count is odd, our count divided by 2 will be .5 less than the ordinal of the middle number (I’ll leave this as an exercise to the reader 😉 ).  Since Head() just takes the integer part, if we didn’t add 1 we’d be taking the number one below the median number.  We then take the Tail() of this set in order to get down to one date.  The Item(0).Item(0).Name bit is just to get us down to the part we’re interested in displaying – the date.

The fun bit about using the Product Categories hierarchy is that you can change the level easily in the query and see how the median date changes with it.

Now, technically, the above calculation isn’t a TRUE median.  Do you see why?  We are taking the middle value, but only if the set is odd.  If the set is even, then there is no middle data point and we should take the average of the two points in the middle.  For the purposes of the case that inspired this blog, I made a decision that it made more sense to have the median be a data point within the set, which it is not guaranteed to be if the set is even.  However, while it is more code, it should be easy enough based on the above query to figure out how to find the other data point you would need if the set is even, and to take the “average” date of those two dates using the DateDiff() function and working from there.

Speaking of DateDiff(), this leads me to the second way we can calculate the median date.  The reason we didn’t use the Median() function in the first example is because the Median() function requires a numerical expression, which date isn’t.  But we can transform date into an integer that preserves the order of the dates, thanks to the DateDiff() function.

If we pick some arbitrary anchor date, then take the difference in date between that anchor point and our data point date, then we will get an integer and those integers will be in the same order as the dates.  For example, if our anchor point is 11/30/2009 and our set of data points is the dates in December 2009, then the difference between 12/1/2009 and 11/30/2009 is 1, the difference between 12/2/2009 and 11/30/2009 is 2, etc.  We can then apply the Median() function to that set of integers, then once we have that integer we can figure out what date it is by adding the median integer to our anchor point with DateAdd().

Here is our query:

with
member [Measures].[DateInt] as
DateDiff(“d”, CDate(“1/1/2007”), [Ship Date].[Calendar].CurrentMember.Name)

member [Measures].[MedianDate] as

DateAdd(“d”,CDate(“1/1/2007”),
Median(
Nonempty(([Ship Date].[Calendar].[Date],
[Product].[Product Categories].CurrentMember)
,[Measures].[Order Count])
,[Measures].[DateInt]))

select [Measures].[MedianDate] on columns,

Nonempty([Product].[Product Categories].Children
, [Measures].[Order Count]) on rows

from [Adventure Works]

Our first calculated measure, DateInt, gives us our integer, using the anchor point 1/1/2007.  Then, we take the same set we saw before in the first query, and take our median using the DatInt measure.  Then we add that number to our anchor point and voila!
If you run the above queries side by side you’ll notice they give you the same results except for on the Components row, which displays 12/24/2006 12:00:00 PM.  This is a dead giveaway that this set was even and the average of the middle dates resulted in something plus half a day.  A clear illustration of the difference between our true median calculation in this example, versus our first example where we cheated a little bit.