v0.2 of my Visit/PageView Cube - Creating a Hierarchy for the Source Dimension
So I am still playing around with building an Analysis Services cube from the pageview data I recently started collecting. Over this past weekend I added a hierarchy to the Source dimension of my cube that gives me a bit more insight about how visitors find their way to my site. If you read my last post, you saw that I could view my pageview data broken down by two pretty general traffic sources: direct traffic and referring links ...
Well, now with my new hierarchy, I can not only view hit counts by direct traffic, referring site, or search engine ...
... but I can also drill into each of these sources and view my pageviews at a more granular level.
I can do this because I have added a bit of preprocessing logic to my cube creation process that sends all referring urls through a very simple rule engine that assigns the referring url a category and subcategory. There is some source code a little further below, but the rules look something like this ...
1: -- Search Engine Rules 2: 3: if ref_url's domain like google.com and ref_url querystring contains the q token then Category=SearchEngine, SubCategory=Google
4: 5: if ref_url's domain like yahoo.com and ref_url querystring contains the p token then Category=SearchEngine, SubCategory=Yahoo
6: 7: -- Community Referrers 8: 9: if ref_url's domain like dotnetkicks.com then Category=Referrer, SubCategory=Community
10: 11: if ref_url's domain like digg.com then Category=Referrer, SubCategory=Community
12: 13: -- Internal Referrers 14: 15: if ref_url's domain like mattberseth.com then Category=Referrer, SubCategory=Internal
16: 17: if ref_url's domain like mattberseth2.com then Category=Referrer, SubCategory=Internal
18: 19: ...
So I took these rules (and a few others) and created a quick and dirty console app that rips through all of the my pageviews and assigns a category and subcategory based on these rules. The console app is a total of 250 LOC, so I am not going to post all of the source, but this is the gist of it (I removed the SqlClient stuff where I do the getting and putting)
1: class Program
2: {3: static void Main(string[] args)
4: {5: List<TrafficSource> trafficSources = new List<TrafficSource>()
6: {7: // Search Engines
8: new SearchEngine(){ Name="google", SourceID=3, QueryToken="q", IsMatch = uri => uri.Host.IndexOf("google", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },
9: new SearchEngine(){ Name="yahoo", SourceID=4, QueryToken="p", IsMatch = uri => uri.Host.IndexOf("search.yahoo", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },
10: 11: // Social
12: new TrafficSource(){ Name="dotnetkicks", SourceID=6, IsMatch = uri => uri.Host.IndexOf("dotnetkicks.com", StringComparison.OrdinalIgnoreCase) >= 0 },
13: new TrafficSource(){ Name="digg", SourceID=6, IsMatch = uri => uri.Host.IndexOf("digg.com", StringComparison.OrdinalIgnoreCase) >= 0 },
14: new TrafficSource(){ Name="reddit", SourceID=6, IsMatch = uri => uri.Host.IndexOf("reddit.com", StringComparison.OrdinalIgnoreCase) >= 0 },
15: new TrafficSource(){ Name="stumbleupon", SourceID=6, IsMatch = uri => uri.Host.IndexOf("stumbleupon.com", StringComparison.OrdinalIgnoreCase) >= 0 },
16: 17: // Community
18: new TrafficSource(){ Name="weblogs.asp.net", SourceID=6, IsMatch = uri => uri.Host.IndexOf("weblogs.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },
19: 20: // Forums
21: new TrafficSource(){ Name="forums.asp.net", SourceID=7, IsMatch = uri => uri.Host.IndexOf("forums.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },
22: new TrafficSource(){ Name="expertsexchange", SourceID=7, IsMatch = uri => uri.Host.IndexOf("experts-exchange.com", StringComparison.OrdinalIgnoreCase) >= 0 },
23: new TrafficSource(){ Name="stackoverflow", SourceID=7, IsMatch = uri => uri.Host.IndexOf("stackoverflow.com", StringComparison.OrdinalIgnoreCase) >= 0 },
24: 25: // Internal
26: new TrafficSource(){ Name="mattberseth.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth.com", StringComparison.OrdinalIgnoreCase) >= 0 },
27: new TrafficSource(){ Name="mattberseth2.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth2.com", StringComparison.OrdinalIgnoreCase) >= 0 }
28: }; 29: 30: // default uncategorized traffic source
31: TrafficSource defaultTrafficSource = new TrafficSource() { Name = "Other", SourceID = 9, IsMatch = uri => true };
32: // direct traffic
33: TrafficSource directTrafficSource = new TrafficSource() { Name = "Direct", SourceID = 0, IsMatch = uri => true };
34: 35: bool done = false;
36: while (!done)
37: {38: // fetch the next batch of page views
39: List<PageView> pageviews = PageView.Fetch(); 40: 41: // get all of the visits that contain referring urls
42: foreach (PageView pageview in pageviews)
43: {44: if (pageview.RefUri == null)
45: {46: // direct traffic, assign the direct traffic source
47: pageview.Source = directTrafficSource; 48: }49: else
50: {51: foreach (TrafficSource source in trafficSources)
52: {53: // see if it matches any of the existing rules we have setup
54: if (source.IsMatch(pageview.RefUri))
55: {56: // add it to the group
57: pageview.Source = source;58: break;
59: } 60: } 61: 62: if (pageview.Source == null)
63: {64: // else its just some random ref we don't care enough about
65: // to further categorize
66: pageview.Source = defaultTrafficSource; 67: } 68: } 69: 70: 71: // update the record
72: pageview.Update(); 73: } 74: 75: done = pageviews.Count == 0; 76: } 77: } 78: } 79: 80: class TrafficSource
81: {82: public int SourceID { get; set; }
83: public string HostName { get; set; }
84: public string Name { get; set; }
85: 86: public Func<Uri, bool> IsMatch { get; set; }
87: } 88: 89: class SearchEngine : TrafficSource
90: {91: public string QueryToken { get; set; }
92: 93: public string ParseKeywords(Uri uri)
94: {95: return HttpUtility.ParseQueryString(uri.Query)[this.QueryToken];
96: } 97: } 98: 99: class PageView
100: {101: public int ID { get; set; }
102: public Uri Uri { get; set; }
103: public Uri RefUri { get; set; }
104: public TrafficSource Source { get; set; }
105: 106: public void Update()
107: {108: // update the pageview with the category/subcategory
109: } 110: 111: public static List<PageView> Fetch()
112: {113: // get the next batch of pageviews that need to be categorized
114: } 115: }
And the output of this preprocessing a table that contains the set of domains that link to my blog, as well as that referring domains category and subcategory. I then take this table and use it as the data source for my Source dimension.
And after reprocessing the cube, you can write what ever custom MDX queries you want. The one below calculates the what percent of the daily total the traffic category makes up.
1: with
2: member [Measures].[Direct Traffic Hits] as ([Source].[Hierarchy].[Category Name].&[Direct], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
3: member [Measures].[Search Engines Hits] as ([Source].[Hierarchy].[Category Name].&[Search Engine], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
4: member [Measures].[Community Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Community], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
5: member [Measures].[Forums Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Forums], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
6: member [Measures].[Internal Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Internal], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
7: member [Measures].[Unknown Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Unknown], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
8: member [Measures].[All Hits] as ([Measures].[Hit]), format_string = 'Standard'
9: select
10: non empty 11: { 12: [Measures].[Direct Traffic Hits],13: [Measures].[Search Engines Hits],
14: [Measures].[Community Hits], 15: [Measures].[Forums Hits], 16: [Measures].[Internal Hits],17: [Measures].[Unknown Hits],
18: [Measures].[All Hits]
19: } on 0,
20: non empty 21: {22: [Time].[Date].children
23: } on 1
24: from PageView
Executing the above MDX produces the following result set ...
And you can pipe this right into Excel to chart the results (thanks to ScottGu for the huge jump in referring traffic last Thursday) ...
Conclusion
I have got a ton of questions on how I am collecting my pageview data. I don't have all of the kinks worked out yet, but I will probably write up a quick post next on how I am doing it.
That's it. Enjoy!
Comments
Wow... look at that referring traffic!
Obviously... 7 out of 13 links on Gu's blog were yours!
And based on the quantity and quality of posts you churn out, you very well deserve it Matt.
Good on ya mate!
Matt,
I rarely come to your site but get your blog entries via a forwarder like feedburner (specifically, it's rssfwd.com I use)... Is that type of traffic counted as well somehow? Maybe it's in your referrers. Just thought I'd bring it up 'cause I read all your posts, but not directly on your site.