Parse.ly Recommendation Engine: An Explainer (part 1)
Helping site operators pick the right links
Parse.ly’s mission is to provide the best analytics-driven tools for running large content websites. As part of that effort, we provide a rich real-time and historical analytics dashboard, which understands content and the key engagement metrics around it. Typically, this dashboard is used for visitor growth strategies or to understand engagement levels around different kinds of content.
Parse.ly data indicates that most visits are single page view visits, and most visitors to large sites will only visit once in a given month. To build higher engagement on site and more loyalty, you need to offer your visitors experiences around content that are relevant, tailored, and data-driven.
A starting point for using data to increase on-site recirculation is to use our /analytics API to surface content that is getting the most traffic, or our /shares API to surface content that is getting widely shared.
However, a second, deeper use of the API involves our content recommendation engine, which is exposed through four endpoints: /search, /related, /profile, and /history.
Understanding that Parse.ly has indexed your site content fully
The /search endpoint shows that Parse.ly has indexed 100 percent of your site’s content, including titles, authors, sections, tags, publication dates, images, and even the full text. For example, issuing a query like /search?q=ipad will pull back all content from your site that mentions the keyword “ipad.”
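As a rough sketch of the query above, the following Python builds a /search request URL. The base URL and the apikey credential parameter are assumptions that do not come from this post, so confirm both against the API documentation before relying on them; only the q parameter is taken from the example above.

```python
from urllib.parse import urlencode

# Assumed API root; not stated in this post.
API_ROOT = "https://api.parsely.com/v2"


def build_search_url(query: str, apikey: str) -> str:
    """Return a /search request URL for a keyword query.

    `apikey` is an assumed credential parameter; `q` comes from the post.
    """
    params = {"apikey": apikey, "q": query}
    return f"{API_ROOT}/search?{urlencode(params)}"


# "example.com" is a placeholder site key.
print(build_search_url("ipad", apikey="example.com"))
```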
Parse.ly’s recommendation engine builds upon this store of content to provide a simple mechanism for rolling out better article-to-article recommendations. Using /related, you can take any URL on your site and find a set of several “related” stories. Stories are related to one another by a semantic relevance algorithm that draws on the post metadata and the keywords and topics discussed in the piece. For example, if the current URL discusses Apple and iPads, the /related endpoint will likely surface other content from your archive discussing the same topics.
Tuning recommendation strategies
If you add the days parameter, you can limit this to only recently published, related content. This raises the question of content “freshness.” Our research shows that fresher content tends to perform better, so our ranking algorithm for “related” stories automatically uses a notion of “date decay.” That is, the older the story, the less likely it is to appear at the top of a recommendation set.
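To make the idea of date decay concrete, here is a minimal sketch. The post does not describe the actual decay function, so the exponential form and the 30-day half-life below are illustrative assumptions, not Parse.ly's formula.

```python
import math


def decayed_score(relevance: float, age_days: float,
                  half_life_days: float = 30.0) -> float:
    """Scale a relevance score so it halves every `half_life_days`.

    Both the exponential shape and the default half-life are assumptions.
    """
    return relevance * math.pow(0.5, age_days / half_life_days)


# (title, raw relevance, age in days); all values are made up.
stories = [
    ("archive piece", 0.9, 365),
    ("last month's piece", 0.8, 30),
    ("today's piece", 0.7, 0),
]

# Rank by decayed score, highest first.
ranked = sorted(stories, key=lambda s: decayed_score(s[1], s[2]), reverse=True)
for title, relevance, age in ranked:
    print(f"{title}: {decayed_score(relevance, age):.3f}")
```

In this toy data the year-old story has the highest raw relevance but still lands last, which is the behavior the paragraph above describes.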
If you add an author or section parameter, you can limit recommendations to specific authors or sections. However, note that if an article is tagged similarly, written by the same author, or appears in the same section, it will automatically be considered “more relevant” to the current story. But since topical relevance is always prioritized over metadata matches, these parameters give you some extra control. You can likewise use the exclude parameter to hide content tagged in certain ways, e.g. to hide sponsored content.
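The tuning options above could be combined into a single /related request along these lines. This is a sketch under assumptions: the base URL and the apikey parameter are not given in this post, and the exact value syntax that exclude expects is not specified here, so it is passed through unchanged.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed API root; not stated in this post.
API_ROOT = "https://api.parsely.com/v2"


def build_related_url(apikey: str, url: str,
                      days: Optional[int] = None,
                      section: Optional[str] = None,
                      author: Optional[str] = None,
                      exclude: Optional[str] = None) -> str:
    """Return a /related request URL with optional tuning filters."""
    params = {
        "apikey": apikey,    # assumed credential parameter
        "url": url,          # the article to find related stories for
        "days": days,        # only recently published content
        "section": section,  # restrict to one section
        "author": author,    # restrict to one author
        "exclude": exclude,  # value format not specified in the post
    }
    # Drop any filter that was not supplied.
    params = {key: value for key, value in params.items() if value is not None}
    return f"{API_ROOT}/related?{urlencode(params)}"


print(build_related_url("example.com", "https://example.com/ipad-review",
                        days=7, section="Tech"))
```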
Going from contextual to personalized
Our personalized recommendation engine is a layer built atop the /related endpoint and two other helpers, the /profile and /history endpoints. If you fire a request against /profile, it will “train” a URL against a given visitor UUID. This tells the Parse.ly system to “remember” that this user visited that URL. You can then pass the UUID, in lieu of a URL, as the uuid parameter to /related. This instructs the recommendation engine to look up the user’s profile and, based on the articles they visited, recommend some related stories. You can see a history of the articles they visited at the /history endpoint.
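The personalization flow can be sketched as three requests made in order: train, recommend, inspect. The base URL, the apikey parameter, and the assumption that /profile takes url and uuid while /history takes uuid are guesses based on the description above, so treat the parameter names as hypothetical.

```python
from typing import List
from urllib.parse import urlencode

# Assumed API root; not stated in this post.
API_ROOT = "https://api.parsely.com/v2"


def endpoint_url(endpoint: str, **params: str) -> str:
    """Build a request URL for one endpoint (hypothetical helper)."""
    return f"{API_ROOT}/{endpoint}?{urlencode(params)}"


def personalization_requests(apikey: str, uuid: str, visited_url: str) -> List[str]:
    """Return the three URLs for the train / recommend / inspect flow."""
    return [
        # 1. Train: remember that this visitor read this URL.
        endpoint_url("profile", apikey=apikey, uuid=uuid, url=visited_url),
        # 2. Recommend: pass the UUID in lieu of a URL.
        endpoint_url("related", apikey=apikey, uuid=uuid),
        # 3. Inspect: list the articles recorded for this visitor.
        endpoint_url("history", apikey=apikey, uuid=uuid),
    ]


for request_url in personalization_requests(
        "example.com", "visitor-123", "https://example.com/tennis-upset"):
    print(request_url)
```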
A personalized recommendation is similar to a contextual one, but may get slightly higher click-through rates due to having more personalized information about the visitor. For example: let’s say a visitor reads three stories, one about soccer, one about golf, and one about tennis. Their most recent story is about tennis. Which stories will the recommender pick for this user?
The answer is: it will find all stories that are relevant to any of the stories the user has visited, but will rank them based on the most recently visited content. So, in this case, the user will see tennis-related stories first, because tennis is the most recent topic. Overall, though, most of the user’s recommendations will likely be sports-related.
Now suppose the site has only one story about tennis, and it’s the one the user already read. The personalized recommender is smart enough to hide already-visited articles, so in this case the user will see recommendations for other sports: perhaps golf and soccer, or perhaps others altogether, as determined by semantic relevance or metadata matching. The lack of other tennis-related stories in the archive does not stop the recommender from making a good recommendation based on the other topics known to interest this user.
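The soccer, golf, and tennis scenario can be played out with a toy in-memory recommender. This is only an illustration of the behavior described above, not Parse.ly's algorithm: the hand-assigned topic sets, the overlap score, and the 1, 1/2, 1/3 recency weights are all assumptions.

```python
from typing import Dict, List, Set


def recommend(history: List[str], archive: Dict[str, Set[str]],
              limit: int = 3) -> List[str]:
    """Rank unvisited stories by topic overlap with the visitor's history.

    `history` is ordered oldest first; more recent visits weigh more.
    """
    visited = set(history)
    scores: Dict[str, float] = {}
    # Most recent visit gets weight 1, the one before 1/2, then 1/3, ...
    for steps_back, seen_url in enumerate(reversed(history)):
        weight = 1.0 / (steps_back + 1)
        seen_topics = archive[seen_url]
        for url, topics in archive.items():
            if url in visited:
                continue  # hide already-visited articles
            overlap = len(seen_topics & topics)
            if overlap:
                scores[url] = scores.get(url, 0.0) + weight * overlap
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda u: (-scores[u], u))[:limit]


# A made-up archive containing exactly one tennis story.
archive = {
    "/soccer-final": {"sports", "soccer"},
    "/golf-open": {"sports", "golf"},
    "/tennis-upset": {"sports", "tennis"},
    "/golf-tips": {"sports", "golf"},
    "/soccer-transfer": {"sports", "soccer"},
    "/basketball-draft": {"sports", "basketball"},
    "/budget-vote": {"politics"},
}
history = ["/soccer-final", "/golf-open", "/tennis-upset"]
print(recommend(history, archive))
```

Because the only tennis story has already been read, the toy falls back to other sports stories and never surfaces the politics piece.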
What kind of recommendation engine is Parse.ly?
Within the computer science literature, our recommendation engine falls into the category of a “content-based recommender,” rather than one based on “collaborative filtering.” Concretely, our recommender is closer to Pandora-style than to Netflix-style — although, as we’ve learned over time, all recommenders eventually become “hybrid” in one way, shape, or form. In the case of Pandora, musical tastes are inferred from “metadata” around music — e.g. if you like a folk song, you will probably like other folk songs. In the case of Netflix, movie tastes are inferred from the crowd behavior around movie ratings — e.g. if you rate a Woody Allen film 5 stars, you might also like a Coen Brothers movie, because Woody Allen fans tend to rate those movies 5 stars, as well.
We found that a content-based recommender works better for content sites because our signal (a page/URL visit) is weak but our content metadata (the information about articles) is rich.
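To show what “content-based” means in practice, the sketch below ranks articles purely by how similar their text is to the current one, with no visitor data involved. Cosine similarity over raw word counts is a common textbook choice used here for illustration; the post does not say which similarity measure Parse.ly uses, and the article texts are made up.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


articles = {
    "ipad-review": "apple ipad review tablet screen",
    "ipad-rumors": "apple ipad rumors new tablet",
    "garden-tips": "tomato garden soil tips",
}

# Rank every other article by similarity to the one being read.
current = vectorize(articles["ipad-review"])
ranked = sorted(
    (slug for slug in articles if slug != "ipad-review"),
    key=lambda slug: cosine_similarity(current, vectorize(articles[slug])),
    reverse=True,
)
print(ranked)
```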
Especially with proper tuning, we doubt you’ll find a better content recommender on the market that is as easy to integrate, use, and customize, which is why ours has been adopted by the web’s largest content destinations like The New Yorker, Ars Technica, and Slate.