Crawler information

To minimize the bandwidth impact of tracking, the default tracker only sends a minimal amount of data about the page itself in each pageview request.

To collect the remaining metadata about the page, the Crawler makes followup requests to URLs submitted by the Tracking Code. Since both systems know the URL of a given page, that's how we associate metadata with the URL.
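
As a rough illustration of that association, the sketch below joins lightweight pageview records with crawled metadata using the page URL as the key. The field names and data are illustrative assumptions, not the tracker's actual payload format.

```python
# Minimal sketch: joining lightweight pageview records with crawled
# metadata by URL. Field names are illustrative assumptions, not the
# tracker's actual payload format.
from collections import defaultdict

pageviews = [
    {"url": "https://example.com/story-1", "ts": "2024-01-01T12:00:00Z"},
    {"url": "https://example.com/story-1", "ts": "2024-01-01T12:05:00Z"},
]

crawled_metadata = {
    # Keyed by the same URL the Tracking Code reported.
    "https://example.com/story-1": {"title": "Story 1", "author": "A. Writer"},
}

views_by_url = defaultdict(int)
for pv in pageviews:
    views_by_url[pv["url"]] += 1

for url, count in views_by_url.items():
    meta = crawled_metadata.get(url, {})
    print(url, count, meta.get("title", "<not crawled yet>"))
```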

How does the crawler work?

  • The Crawler identifies itself with the User-Agent string: Mozilla/5.0 (compatible; scraper/0.14; +. Note that the version field can change over time.
  • Each unique URL is crawled about once per hour for the first 24 hours after sending its first pageview, as long as it continues to send pageviews. After that period, you must request a recrawl to update a URL's metadata.
  • The set of IP addresses used by the Crawler worker machines is available in JSON format here: [crawler-ips.json](data/crawler-ips.json)
  • The Crawler is a respectful web citizen. It does a number of things to limit the load that it puts on your servers (a minimal sketch of this throttling behavior appears after this list):

    • The number of concurrent requests it opens to your server is limited, so crawling does not eat into your server's available capacity.
    • It caches articles, by URL and Site ID, that it has already seen.
    • It introduces a small delay between HTTP requests to ensure the load is spread out.
    • It does not proactively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.
  • In the first month of integration you will see more crawling activity than in future months, as the Crawler will be crawling both new articles and existing articles as they receive visits. This will wane over time.
  • Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects the page load performance for visitors coming to your site. It is an entirely asynchronous process carried out by our servers "after the fact."
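
The sketch referenced in the list above shows one way the throttling behaviors described there could be structured: a concurrency cap, a small delay after each request, and a cache keyed by Site ID and URL. The numbers, names, and structure are assumptions for illustration, not the actual Crawler implementation.

```python
# Illustrative sketch of a "polite" fetcher: limited concurrency, a small
# delay between requests, and a cache keyed by (site_id, url). The numbers
# and structure here are assumptions, not the real crawler.
import asyncio
import urllib.request

MAX_CONCURRENT = 2            # cap on simultaneous connections to one site
DELAY_BETWEEN_REQUESTS = 1.0  # seconds to wait after each request

cache: dict[tuple[str, str], bytes] = {}  # (site_id, url) -> page body

async def fetch(semaphore: asyncio.Semaphore, site_id: str, url: str) -> bytes:
    key = (site_id, url)
    if key in cache:                      # article already seen: skip refetch
        return cache[key]
    async with semaphore:                 # limit concurrent requests
        body = await asyncio.to_thread(
            lambda: urllib.request.urlopen(url, timeout=10).read()
        )
        await asyncio.sleep(DELAY_BETWEEN_REQUESTS)  # spread out the load
    cache[key] = body
    return body

async def crawl(site_id: str, urls: list[str]) -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    pages = await asyncio.gather(*(fetch(semaphore, site_id, u) for u in urls))
    print([len(p) for p in pages])

# asyncio.run(crawl("example-site", ["https://example.com/a", "https://example.com/b"]))
```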

How does the crawler handle posts with multiple URLs?

It's common for a single post or piece of content to have multiple URLs associated with it. For instance, it might have both a web URL and a Google AMP URL, or it might be a gallery with multiple pages (/page/1, /page/2, etc.). Unlike many other analytics systems, ours is built to reconcile these various locations and representations of the same content. The way we do that is by always retrieving metadata from the canonical URL for a post.

When the Crawler visits a URL, the first thing it checks is whether that URL actually matches the url property specified in the page metadata. If the URLs match, it collects the rest of the metadata on that page. If they don't, the Crawler follows the url specified and attempts to crawl that page, repeating until the URL it fetched and the url declared in the metadata match.
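
A minimal sketch of that canonical-following loop is below. The fetch_page and get_metadata_url callables are hypothetical helpers; how a real implementation locates the url property in the page metadata is not shown here.

```python
# Sketch of following a page's declared metadata URL until the fetched URL
# and the declared canonical URL agree. fetch_page and get_metadata_url are
# hypothetical helpers supplied by the caller.
MAX_HOPS = 5  # guard against canonical chains that never converge

def resolve_canonical(start_url, fetch_page, get_metadata_url):
    url = start_url
    for _ in range(MAX_HOPS):
        page = fetch_page(url)
        declared = get_metadata_url(page)
        if declared is None or declared == url:
            return url          # match found (or no metadata url to follow)
        url = declared          # follow the declared canonical and retry
    return url                  # give up after MAX_HOPS and use the last URL
```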

Importantly, the Crawler keeps track of the original URL, and once it finds the canonical page, it associates the URL as an "alias" of that canonical. Once aliased, pageviews to either URL will be aggregated together by default.
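
The roll-up described above might look roughly like the following; the alias map and counts are illustrative data, not an actual API.

```python
# Illustrative roll-up of pageviews recorded against alias URLs into the
# canonical URL's totals. The data structures are assumptions.
aliases = {
    "https://example.com/story-1/amp": "https://example.com/story-1",
    "https://example.com/story-1/page/2": "https://example.com/story-1",
}

raw_counts = {
    "https://example.com/story-1": 120,
    "https://example.com/story-1/amp": 45,
    "https://example.com/story-1/page/2": 10,
}

totals = {}
for url, count in raw_counts.items():
    canonical = aliases.get(url, url)   # map alias -> canonical, else itself
    totals[canonical] = totals.get(canonical, 0) + count

print(totals)  # {'https://example.com/story-1': 175}
```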

How do I validate that a page is being crawled correctly?

To determine if the crawler is working correctly on a specific page, you can visit Validate Integration. This service checks to see if the Tracking Code will be able to track the page and if the crawler can extract the required metadata.

How do I find pages that have not been crawled correctly?

To find pages that have not been crawled correctly, navigate to your Dashboard, then click on Posts and filter by a Page Type of "Non-Post Pages": <SITE_ID>/posts/?page_type=nonpost

This will display a list of pages that the Tracking Code has submitted that the Crawler has either identified as non-post pages or has yet to crawl. Most commonly, these are section fronts or list pages that do not represent an individual post. Occasionally, a page will be displayed that appears to be a post. This can mean that the crawler received an error from the server or the page did not contain the required metadata.
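
For reference, the filtered listing above is just the posts view with a page_type query parameter. A trivial sketch of building that path for a given Site ID follows; the dashboard's base domain is omitted here, as it is in the path above.

```python
# Build the "Non-Post Pages" filter path for a given Site ID.
from urllib.parse import urlencode

def nonpost_pages_path(site_id: str) -> str:
    return f"{site_id}/posts/?{urlencode({'page_type': 'nonpost'})}"

print(nonpost_pages_path("example.com"))  # example.com/posts/?page_type=nonpost
```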
