Hit Me Baby One More Time: Sabermetrics and Web Analytics


via Steve Burns

November 14th, 2005 marks the beginning of an era. On that date, Google first offered a free analytics tool to the public.

The Google Analytics Era

In April 2005, Google had acquired Urchin Software Corporation, a leading web analytics platform. Urchin calculated traffic by analyzing log data, a rather primitive method of collection. As early as 1993, programmers had realized that web servers recorded the interactions of users on websites in a special archive called a log. When a visitor lands on a page, his browser makes “requests” to a server to access certain files. Those requests are stored in a log file. A crafty analytics program can parse the requests stored in a log file to determine the number of “hits” on a web page.
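To make the log file idea concrete, here is a minimal sketch of how such a program might count hits. It assumes the server writes the widely used Common Log Format; the sample lines, the `LOG_LINE` pattern, and the `count_hits` helper are all invented for illustration and are not taken from Urchin or any other product.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [time] "METHOD path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def count_hits(lines):
    """Count successful (2xx) requests per requested path."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        if match.group("status").startswith("2"):
            hits[match.group("path")] += 1
    return hits

# Hypothetical log entries, made up for this example.
SAMPLE_LOG = [
    '203.0.113.5 - - [14/Nov/2005:10:00:00 -0800] "GET /index.html HTTP/1.0" 200 2326',
    '203.0.113.5 - - [14/Nov/2005:10:00:01 -0800] "GET /logo.gif HTTP/1.0" 200 512',
    '198.51.100.7 - - [14/Nov/2005:10:02:13 -0800] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.7 - - [14/Nov/2005:10:02:40 -0800] "GET /missing.html HTTP/1.0" 404 209',
]

hits = count_hits(SAMPLE_LOG)
for path, count in hits.most_common():
    print(path, count)
```

Note that the image request counts as a hit just like the page request does. Every file a browser asks for leaves its own line in the log, which is one reason a raw hit count says so little about what a reader actually did.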

According to the Oxford English Dictionary, the first recorded appearance of “hits” (in a computer science context, of course) was in Charles Sippl’s Computer Dictionary and Handbook, published in 1967: “In file maintenance, the finding of a match between a detail record and a master record.” The OED formally defines a computer “hit” as a “match” or “the percentage of records in a file which are accessed in the course of a processing task; also used analogously in other computing contexts (esp. memory caching).” Today, the vernacular “hits,” as in “my blog got 987 hits today!”, means something a little different. Instead of a simple record of interaction, a hit refers to a page view. With the advent of more sophisticated browsing instruments, like proxies and dynamic IP addresses, log file analysis became obsolete. In the place of log files, JavaScript tags allowed analytics providers to more accurately track user behavior. Although Google bought Urchin’s log file system, it opted to develop its free service with a JavaScript snippet.

The slippage of “hit” from a log file context into a more casual reference to page views reflects the rise of Google Analytics. We began to fetishize the page view because of log file logic, and even though we have moved beyond that technology, we still hold “hits” supreme.  

GA Was A Called Shot

Google Analytics disrupted the web analytics industry because it made previously premium services available to all. Paul Muret, then a Google engineering director and one of Urchin’s founders, told Business Wire on November 14th, 2005:

we want to give all online marketers and publishers access to powerful web analytics to help them better understand what their customers want. With this knowledge, businesses can create more accurate advertising and build better websites. By making this powerful service free, we aim to give all websites—large and small—the tools they need to better serve their customers, make more money, and improve the web experience for everyone.

In combination with Google AdWords and search engine optimization strategies, Google Analytics fulfilled Muret’s promises. All those tempting but tired lists, like top ten ways to peel an orange or change your motor oil, seven best Tony Danza cameos, or 27 rap songs from Portland? The segmentation of content into advertising-rich and search-engine-sticky sections (as in this article)? The proliferation of internal links (again, here, here, and here)? These content innovations, not-so-subtle shifts in the way media gets made, are the children of the Google Analytics era. Analytics is not just retroactive knowledge; that is, it does not only tell you about past behavior on a website. Rather, it allows designers and content generators to make inferences about which content is likely to drive the most traffic, produce the most hits, and, most importantly, make the most profit. Yet Google Analytics set its own expiration date. Once everyone had Google Analytics, everyone had essentially the same competitive advantage. Armed with equal data, developers could differentiate themselves only on the basis of insight and web smarts. A once radical idea, a new way of handling statistics, has become so mainstream that it is a prerequisite to competition in the marketplace, but it no longer constitutes any meaningful edge.

Enter Sabermetrics  

Bill James has become something of a baseball myth. While working the night shift at a pork and beans factory, he started publishing The Bill James Baseball Abstracts, which approached complex baseball problems, like “how can you tell who is a good fielder” (edition the first, circa 1977), from an elegant statistical perspective. The Bill James story has been so readily mythologized because it is a variation on a classic American trope: average Joe, secretly brilliant, labors in obscurity and, after terrible tribulations, achieves recognition and success. That trope in turn adheres to a more general structure of heroism: mysterious man brings knowledge and power to the world, is rejected, and finally triumphs. See Moneyball for a transformation of the Bill James myth into the Billy Beane myth. (Britney Spears or “the pop star” is another example.) “Sabermetrics,” the field of statistics that James cooked up wandering around the cannery after hours, is now accepted by the powers that baseball be; in effect, the sabermetrics approach, which privileges the value of players, measured in runs or wins, over conventional stats, has become conventional wisdom. Having incorporated marginal elements, the mainstream swallows the disruptive and innovative capacity of the outlier. The man, no longer radical, slips into legend, the stuff of aspiration for sports geeks. As with “outsider art” that gets wall space in major galleries, there is a point at which a dominant system absorbs its counter-narratives. In the moments before incorporation, the outsider achieves its greatest disruptive potential. In the aftermath, the revolution is first a memory, and then a tall tale.

Google Analytics Is Not Sabermetric

Given a convoluted analogy between baseball and the Internet, Google Analytics is not the equivalent of sabermetrics. No matter how “disruptive” contemporary analytics platforms claim their products to be, they have never reoriented our fundamental conceptions of Internet traffic. We still slavishly worship the hit. In our analogy, conventional analytics providers, like Google Analytics, correspond to the most traditional scouts and managers, who are fans of baseball card stats like batting average and earned run average. What are the baseball card stats of the Internet? Page views and visitors. They’ve been around forever, we collected and compared them in our teens (“how’s your blogspot doing these days?”), and now they are a comfortable yardstick for website performance. Not surprisingly, advertisement revenue is typically paid out in impressions and clicks, rough correlates for “hits.” It seems as though the idea of a “hit” has been so thoroughly entrenched in our analytics frameworks that we cannot reimagine Internet behavior.

Unfortunately, hits, page views, and visitors are not particularly useful ways of thinking about user interactions with web content. If the goal of a website is quality traffic, meaning sustained user engagement, then volume of traffic is a poor measure of performance. Although joining the 3,000-hit club is an impressive feat for a baseball player and indicates an increased likelihood of other offensive records, it is not a reflection of direct contribution to a team’s wins. If all those hits came with no one on base and were followed by strikeouts, then they’re all vanity, all personal glory, no practical value. Hits alone, sabermetrics tells us, do not contribute to wins. Likewise, Internet hits alone do not contribute to sustained user engagement, wins for advertisers, or wins for e-commerce. Advertisers need new ways of understanding how to extract value from their hosts, and content generators, the host bodies, need new ways of understanding how to provide value to their advertising partners. In such a mutualistic relationship, it is the advertiser who usually loses. Imagine a world in which bees pollinate flowers but obtain no nutrition in return. The asymmetric exchange of resources between content generators and advertisers will be rectified by the next iteration of analytics tools.

Publishing Sabermetrics

What will the sabermetrics of web publishing look like? Baseball sabermetrics both defined new types of statistics, like the “ultimate zone rating,” a measure of fielding performance, and reconfigured old statistics, as with “runs created,” a complex combination of various baseball card stats into a more direct estimation of value. At Parse.ly, we’ve developed some unique statistics, like “momentum,” which helps publishers understand the flow of traffic into their site. We’re also extracting information on topics (what an article is about) with a little help from natural language processing. And the simple addition of real-time data, the lack of which cripples Google Analytics, is changing how publishers think about content distribution. Eventually, user behavior analytics, like YouTube’s audience retention statistic and new e-book analytics platforms that evaluate reading preferences, will supplant Google Analytics’s baseball card stats. In combination with those conventional metrics (the “hits” paradigm), user behavior analytics will grant advertisers a better understanding of how websites contribute value to campaigns. Although the “hits” paradigm has defined advertising value thus far, web analytics has the potential to demonstrate deeper and more profit-driven value to advertisers. Consequently, digital publishers will need to produce content that persistently immerses and engages readers, exposing them to advertising systems more efficiently.
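For readers who have never seen “runs created” spelled out, here is a rough sketch of its simplest form, the basic version commonly attributed to Bill James: hits plus walks, multiplied by total bases, divided by at-bats plus walks. Later refinements add many more terms, and nothing here says which version anyone has in mind; the `runs_created` function name and the season line are made up for illustration.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic runs created: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # no plate appearances counted, so nothing created
    return (hits + walks) * total_bases / opportunities

# Hypothetical season line, invented for this example.
season = {"hits": 150, "walks": 50, "total_bases": 250, "at_bats": 500}

rc = runs_created(**season)
print(round(rc, 1))  # (150 + 50) * 250 / (500 + 50), roughly 90.9
```

Every input is a baseball card stat. The novelty lies in how they are combined: getting on base is multiplied by advancing runners, so a pile of hits with nothing around them is worth less than the raw count suggests.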