v0.2 of my Visit/PageView Cube - Creating a Hierarchy for the Source Dimension

So I am still playing around with building an Analysis Services cube from the pageview data I recently started collecting.  Over this past weekend I added a hierarchy to the Source dimension of my cube that gives me a bit more insight about how visitors find their way to my site.  If you read my last post, you saw that I could view my pageview data broken down by two pretty general traffic sources: direct traffic and referring links ...

  image

Well, now with my new hierarchy, I can not only view hit counts by direct traffic, referring site, or search engine ...

image

... but I can also drill into each of these sources and view my pageviews at a more granular level.

image

I can do this because I have added a bit of preprocessing logic to my cube creation process that sends all referring urls through a very simple rule engine that assigns the referring url a category and subcategory.  There is some source code a little further below, but the rules look something like this ...

   1: -- Search Engine Rules
   2:  
   3: if ref_url's domain like google.com and ref_url querystring contains the q token then Category=SearchEngine, SubCategory=Google 
   4:  
   5: if ref_url's domain like yahoo.com and ref_url querystring contains the p token then Category=SearchEngine, SubCategory=Yahoo
   6:  
   7: -- Community Referrers
   8:  
   9: if ref_url's domain like dotnetkicks.com then Category=Referrer, SubCategory=Community
  10:  
  11: if ref_url's domain like digg.com then Category=Referrer, SubCategory=Community
  12:  
  13: -- Internal Referrers
  14:  
  15: if ref_url's domain like mattberseth.com then Category=Referrer, SubCategory=Internal
  16:  
  17: if ref_url's domain like mattberseth2.com then Category=Referrer, SubCategory=Internal
  18:  
  19: ...

 

So I took these rules (and a few others) and created a quick and dirty console app that rips through all of the my pageviews and assigns a category and subcategory based on these rules.  The console app is a total of 250 LOC, so I am not going to post all of the source, but this is the gist of it (I removed the SqlClient stuff where I do the getting and putting)

   1: class Program
   2: {
   3:     static void Main(string[] args)
   4:     {
   5:         List<TrafficSource> trafficSources = new List<TrafficSource>()
   6:         {
   7:             //  Search Engines
   8:             new SearchEngine(){ Name="google", SourceID=3, QueryToken="q", IsMatch = uri => uri.Host.IndexOf("google", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },
   9:             new SearchEngine(){ Name="yahoo", SourceID=4, QueryToken="p", IsMatch = uri => uri.Host.IndexOf("search.yahoo", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },
  10:  
  11:             //  Social
  12:             new TrafficSource(){ Name="dotnetkicks", SourceID=6, IsMatch = uri => uri.Host.IndexOf("dotnetkicks.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  13:             new TrafficSource(){ Name="digg", SourceID=6, IsMatch = uri => uri.Host.IndexOf("digg.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  14:             new TrafficSource(){ Name="reddit", SourceID=6, IsMatch = uri => uri.Host.IndexOf("reddit.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  15:             new TrafficSource(){ Name="stumbleupon", SourceID=6, IsMatch = uri => uri.Host.IndexOf("stumbleupon.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  16:  
  17:             //  Community
  18:             new TrafficSource(){ Name="weblogs.asp.net", SourceID=6, IsMatch = uri => uri.Host.IndexOf("weblogs.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },
  19:  
  20:             //  Forums
  21:             new TrafficSource(){ Name="forums.asp.net", SourceID=7, IsMatch = uri => uri.Host.IndexOf("forums.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },
  22:             new TrafficSource(){ Name="expertsexchange", SourceID=7, IsMatch = uri => uri.Host.IndexOf("experts-exchange.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  23:             new TrafficSource(){ Name="stackoverflow", SourceID=7, IsMatch = uri => uri.Host.IndexOf("stackoverflow.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  24:  
  25:             //  Internal
  26:             new TrafficSource(){ Name="mattberseth.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth.com", StringComparison.OrdinalIgnoreCase) >= 0 },
  27:             new TrafficSource(){ Name="mattberseth2.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth2.com", StringComparison.OrdinalIgnoreCase) >= 0 }
  28:         };
  29:  
  30:         //  default uncategorized traffic source
  31:         TrafficSource defaultTrafficSource = new TrafficSource() { Name = "Other", SourceID = 9, IsMatch = uri => true };
  32:         //  direct traffic
  33:         TrafficSource directTrafficSource = new TrafficSource() { Name = "Direct", SourceID = 0, IsMatch = uri => true };
  34:  
  35:         bool done = false;
  36:         while (!done)
  37:         {
  38:             //  fetch the next batch of page views
  39:             List<PageView> pageviews = PageView.Fetch();
  40:  
  41:             //  get all of the visits that contain referring urls
  42:             foreach (PageView pageview in pageviews)
  43:             {
  44:                 if (pageview.RefUri == null)
  45:                 {
  46:                     //  direct traffic, assign the direct traffic source
  47:                     pageview.Source = directTrafficSource;
  48:                 }
  49:                 else
  50:                 {
  51:                     foreach (TrafficSource source in trafficSources)
  52:                     {
  53:                         //  see if it matches any of the existing rules we have setup
  54:                         if (source.IsMatch(pageview.RefUri))
  55:                         {
  56:                             //  add it to the group
  57:                             pageview.Source = source;
  58:                             break;
  59:                         }
  60:                     }
  61:  
  62:                     if (pageview.Source == null)
  63:                     {
  64:                         //  else its just some random ref we don't care enough about
  65:                         //  to further categorize
  66:                         pageview.Source = defaultTrafficSource;
  67:                     }
  68:                 }
  69:  
  70:  
  71:                 //  update the record
  72:                 pageview.Update();
  73:             }
  74:  
  75:             done = pageviews.Count == 0;
  76:         }
  77:     }
  78: }
  79:  
  80: class TrafficSource
  81: {
  82:     public int SourceID { get; set; }
  83:     public string HostName { get; set; }
  84:     public string Name { get; set; }
  85:  
  86:     public Func<Uri, bool> IsMatch { get; set; }
  87: }
  88:  
  89: class SearchEngine : TrafficSource
  90: {
  91:     public string QueryToken { get; set; }
  92:  
  93:     public string ParseKeywords(Uri uri)
  94:     {
  95:         return HttpUtility.ParseQueryString(uri.Query)[this.QueryToken];
  96:     }
  97: }
  98:  
  99: class PageView
 100: {
 101:     public int ID { get; set; }
 102:     public Uri Uri { get; set; }
 103:     public Uri RefUri { get; set; }
 104:     public TrafficSource Source { get; set; }
 105:  
 106:     public void Update()
 107:     {
 108:         // update the pageview with the category/subcategory
 109:     }
 110:  
 111:     public static List<PageView> Fetch()
 112:     {
 113:         // get the next batch of pageviews that need to be categorized
 114:     }
 115: }

 

And the output of this preprocessing a table that contains the set of domains that link to my blog, as well as that referring domains category and subcategory.  I then take this table and use it as the data source for my Source dimension.

  image

And after reprocessing the cube, you can write what ever custom MDX queries you want.  The one below calculates the what percent of the daily total the traffic category makes up.

   1: with
   2:     member [Measures].[Direct Traffic Hits] as ([Source].[Hierarchy].[Category Name].&[Direct], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
   3:     member [Measures].[Search Engines Hits] as ([Source].[Hierarchy].[Category Name].&[Search Engine], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'    
   4:     member [Measures].[Community Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Community], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
   5:     member [Measures].[Forums Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Forums], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'    
   6:     member [Measures].[Internal Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Internal], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'    
   7:     member [Measures].[Unknown Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Unknown], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
   8:     member [Measures].[All Hits] as ([Measures].[Hit]), format_string = 'Standard'
   9: select
  10: non empty
  11: {
  12:     [Measures].[Direct Traffic Hits],
  13:     [Measures].[Search Engines Hits],    
  14:     [Measures].[Community Hits],
  15:     [Measures].[Forums Hits],
  16:     [Measures].[Internal Hits],
  17:     [Measures].[Unknown Hits],
  18:     [Measures].[All Hits]
  19: } on 0,
  20: non empty
  21: {
  22:     [Time].[Date].children
  23: } on 1
  24: from PageView

 

Executing the above MDX produces the following result set ...

image

 

And you can pipe this right into Excel to chart the results (thanks to ScottGu for the huge jump in referring traffic last Thursday) ...

image

 

Conclusion

I have got a ton of questions on how I am collecting my pageview data.  I don't have all of the kinks worked out yet, but I will probably write up a quick post next on how I am doing it.

 

That's it.  Enjoy!


TrackBack

TrackBack URL for this entry:
http://mattberseth.com/blog-mt/mt-tb.fcgi/155

Comments


Wow... look at that referring traffic!
Obviously... 7 out of 13 links on Gu's blog were yours!

And based on the quantity and quality of posts you churn out, you very well deserve it Matt.

Good on ya mate!

Posted by: Jody Brooks on October 8, 2008 08:30 AM

Matt,

I rarely come to your site but get your blog entries via a forwarder like feedburner (specifically, it's rssfwd.com I use)... Is that type of traffic counted as well somehow? Maybe it's in your referrers. Just thought I'd bring it up 'cause I read all your posts, but not directly on your site.


Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

Consulting Services

Yep - I also offer consulting services. And heck, I'll do just about anything. If you enjoy my blog just drop me an email describing the work you need done.

Recent Comments

  • Jody Brooks wrote: Matt, I rarely come to your site but get your blog entries via a forwarder like feedburner (specifi...
  • Asheesh Soni wrote: Wow... look at that referring traffic! Obviously... 7 out of 13 links on Gu's blog were yours! And ...