v0.1 of my Visit/PageView Analysis Services Cube

So I created a cube using the visit/pageview data that I recently started collecting.  The cube is VERY simple - only 4 dimensions {App, Page, Source, Time} and just a single measure - {Hit}.  I created the cube using the 2005 versions of Visual Studio and Microsoft Analysis Services.  The IDE's wizards pretty much walk you through the process, which is great because creating an Analysis Services project from scratch is more than a little intimidating - especially if you're a web developer like me and you don't know a whole lot about querying a cube, let alone designing one.

Anyway, like I said, my cube is very simple.  The App dimension only contains 2 members: 'mattberseth' for this site, and 'mattberseth2' for my live demo site.  The Page dimension contains all of the unique URLs for both sites as its members.  The Source dimension is essentially a bit field for determining whether the traffic was the result of a direct hit or came from a referring site.  And finally, the Time dimension represents the calendar and is used for counting hits by time interval (i.e. Days, Weeks, Months, etc...). 

In my simple cube, all 4 dimensions and the Hit measure currently come from a single table.  I have simulated a standard star schema by generating the keys for the different dimensions using the following 4 SQL queries.  The first query in the SQL snippet populates my measure group and the other three are responsible for populating the App, Source and Page dimensions.

-- builds the fact_pageview measure group ***
select
    -- if the referring url is empty then we know the source was a direct hit
    (case when ref_url = '' then 1 else 0 end) as source_id,
    -- the app_id - either mattberseth or mattberseth2
    app_id,
    -- turn the page url into an int
    checksum(url) as page_id,
    -- strip the time portion of the datetime, leaving just the date
    convert(datetime, floor(convert(float, date))) as date,
    -- each row is a single hit
    1 as hit
from
    -- my visit table
    visit_load with(nolock)

-- builds the app dimension ***
select
    'mattberseth' as app_id,
    'My Blog' as app_name
union
select
    'mattberseth2',
    'My Live Demo Site'

-- builds the source dimension ***
select
    0 as source_id,
    'Referrer' as source_name
union
select
    1,
    'Direct'

-- builds the page dimension ***
-- distinct so each url shows up as a dimension member only once
select distinct
    checksum(url) as page_id,
    url as url
from
    visit_load with(nolock)

 

Next, I created views for these 4 queries, let my Analysis Services project know about them and used them as the data source for my cube.  Conceptually, this diagram shows how these 4 queries are related.

[Diagram: the fact_pageview measure group related to the App, Source, Page and Time dimensions in a star schema]
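If you are curious what that step looks like, the views are nothing more than the queries above wrapped in create view statements.  Here is a minimal sketch - the vw_* names are just placeholders I made up, and the three dimension views all follow the same pattern:

-- hypothetical view names - yours may differ
create view vw_fact_pageview as
    select
        (case when ref_url = '' then 1 else 0 end) as source_id,
        app_id,
        checksum(url) as page_id,
        convert(datetime, floor(convert(float, date))) as date,
        1 as hit
    from visit_load with(nolock)
go

create view vw_dim_app as
    select 'mattberseth' as app_id, 'My Blog' as app_name
    union
    select 'mattberseth2', 'My Live Demo Site'
go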

 

Browsing and Querying the Cube

And amazingly, after only a few minutes of Next-ing through wizards and some drag and drop design work, I had deployed and processed the cube on my local Analysis Server instance.  Now I can start taking a look at the data.

Browsing the Cube

Once the cube is deployed and processed, you can start browsing it.  The first screen shot below shows the structure of my cube on the left, and hit counts for my two sites (mattberseth.com and mattberseth2.com) segmented by traffic source (either direct traffic or a referring site).  The second screen shot shows these same counts as a percentage of the grand total.  Looks like direct traffic to my demo site only makes up 2% of my total traffic ;(

[Screenshot: cube structure and hit counts by site and traffic source]

 

[Screenshot: hit counts by site and traffic source as a percentage of the grand total]
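The percentage view the browser builds for you can also be expressed as a calculated member.  Here is a rough sketch - the [All] member references are my assumption about how the hierarchies are named:

with
    -- each cell's share of total traffic (the [All] member names are a guess)
    member [Measures].[Hit Share] as
        [Measures].[Hit] / ([App].[All], [Source].[All], [Measures].[Hit]),
        format_string = 'Percent'
select
    [Source].children on 0,
    [App].children on 1
from
    [PageView]
where
    [Measures].[Hit Share]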

 

Querying the Cube

And if you can stomach writing a little MDX, you can write custom queries to extract even more useful information.  Below is a sample MDX query and result set that shows the average traffic per day for both mattberseth.com and mattberseth2.com, segmented by week.  The numbers are a little deceiving because only Week 39 consists of a full 7 days, but I think you get the picture.

with 
    -- define the Weekday and Weekend sets
    set [Weekday] as 
    {
        [Time].[Day Of Week].[Day 2], 
        [Time].[Day Of Week].[Day 3], 
        [Time].[Day Of Week].[Day 4], 
        [Time].[Day Of Week].[Day 5], 
        [Time].[Day Of Week].[Day 6]
    }
    set [Weekend] as 
    {
        [Time].[Day Of Week].[Day 1], 
        [Time].[Day Of Week].[Day 7]
    }
    -- create a few calculated measures that make use of these sets
    member [Measures].[Weekday Average] as avg([Weekday], [Measures].[Hit]), format_string = '#'
    member [Measures].[Weekend Average] as avg([Weekend], [Measures].[Hit]), format_string = '#'
    member [Measures].[Weekly Average] as avg({[Weekday], [Weekend]}, [Measures].[Hit]), format_string = '#'
select
{
    [Measures].[Weekday Average],
    [Measures].[Weekend Average],
    [Measures].[Weekly Average]
} on 0,
non empty
{
    [App].children * [Time].[Week Of Year].children
} on 1
from
    [PageView]

[Result set: weekday, weekend and weekly hit averages for each app, segmented by week]

 

What's Next?

Well, I am pretty excited.  I only have a handful of development hours invested in my visit cube (it honestly took longer to write this post than it did to create the cube) and I can already tell I made the right decision in maintaining my own pageview/visit database.  Of course, there is still a lot to do ...

  • My pageview JavaScript tracking code needs some work.  I have been tweaking it over the past 2 weeks to play around with different techniques to keep my tracking request from getting cached.  I have finally come within a few percent of what Google Analytics is reporting, so I am happy.  I think I will clean the handler up and write a quick post describing what I did. 
  • My Source dimension only has 2 members - Direct and Referrer.  I would like to break the Referrer category down further into Search Engines, Community (DNK, Reddit, Digg, DZone, etc...) and Forums (forums.asp.net, stackoverflow, etc...) - see the sketch after this list.
  • I need to extract keywords from the Search Engine sources and get them into the cube.
  • Look up geography information based on IP address.
  • IP + User Agent sessionization.  I would like to track time on site, navigation paths, etc...
  • Incorporate additional dimensional data from my Movable Type database.
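For that Source breakdown, I am picturing a classification query along these lines - just a sketch, and the url patterns and category ids are made up for illustration:

-- hypothetical referrer classification - patterns and ids are illustrative only
select
    (case
        when ref_url = '' then 0                            -- Direct
        when ref_url like '%google.%'
          or ref_url like '%yahoo.%'
          or ref_url like '%live.com%' then 2               -- Search Engine
        when ref_url like '%reddit.com%'
          or ref_url like '%digg.com%'
          or ref_url like '%dzone.com%'
          or ref_url like '%dotnetkicks.com%' then 3        -- Community
        when ref_url like '%forums.asp.net%'
          or ref_url like '%stackoverflow.com%' then 4      -- Forum
        else 1                                              -- Other Referrer
    end) as source_id,
    url,
    ref_url
from
    visit_load with(nolock)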

 

At some point I plan on sharing the solution: JavaScript tracking code, HttpHandler, OLTP and OLAP databases as well as the Analysis Services Project ...

 

That's it.  Enjoy!



Comments


Posted by: Jakub on September 30, 2008 05:04 AM

Hi Matt,
I really enjoy your posts. It's incredible how you can easily use common technologies and tools to get excellent results. Can't wait for the next post. You rock!

Posted by: Asheesh Soni

Can't wait to see your solution: JavaScript tracking code, HttpHandler, OLTP and OLAP databases as well as the Analysis Services Project.

Keep up the good work!

Posted by: ambercat

You may like to reconsider using the IP address for geographic analysis. This is notoriously inaccurate.
You may find a way to use the localisation of the client machine using JavaScript. This will be much more accurate.

Remember the IP address is based on the location of the ISP and not the client. When using AOL, all UK users were counted as US visitors.

How does Google know which country you are in? Their method appears pretty accurate and permits them to serve ads according to geographical location, even down to state level.

Posted by: Ian Patrick Hughes on October 3, 2008 05:39 PM

I think I have a nerd crush.

Posted by: Simon K Chegeh on October 6, 2008 11:00 AM

When i grow up, i now know what i want to be!!! excellent work. keep it up

Posted by: Shahar Nechmad

Hi Matt.
I'm the CEO of NuConomy.
We developed a next generation web analytics product and also published a plugin for WordPress (you can find it at http://wordpress.org/extend/plugins/nuconomy-insights/). We also have plugins for Community Server, DasBlog and Movable Type.

I'm writing about this because we use Analysis Services (with many cubes :)) on our backend.
I'll be happy to share with you the good and the very hard experiences we had to go through with it...

Posted by: some guy on October 25, 2008 08:30 PM

Why use JS instead of just importing your log files to your db?

Posted by: güvenlik

Keep up the good work!

Posted by: hekimboard

I really enjoy your posts.


