Crawler information

To minimize the bandwidth impact of Parse.ly tracking, the default tracker only sends a minimal amount of data about the page itself in each pageview request.

To collect the remaining metadata about the page, the Parse.ly Crawler makes follow-up requests to URLs submitted by the Parse.ly Tracking Code. Because both systems know the URL of a given page, the URL is the key we use to associate the crawled metadata with the tracked pageviews.

#How does the crawler work?

  • The Parse.ly Crawler identifies itself with the User-Agent string: Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com). Note that the version field can change over time.
  • Each unique URL is crawled about once per hour for the first 24 hours after sending its first pageview, as long as it continues to send pageviews. After that period, you must request a recrawl to update a URL's metadata.
  • The set of IP addresses for the Parse.ly Crawler worker machines is listed below in JSON format (last updated October 7, 2019):
[
  "54.209.175.114",
  "107.21.43.157",
  "107.21.47.230",
  "54.86.145.133",
  "54.172.177.129",
  "54.4.124.120",
  "52.5.226.42",
  "52.44.47.254",
  "54.165.238.239",
  "52.200.241.230"
]
  • The Parse.ly Crawler is a respectful web citizen. It does a number of things to limit the load that it puts on your servers:

    • The number of concurrent requests it opens to your server is limited to ensure it doesn't affect your concurrency throughput.
    • It caches articles, by URL and Parse.ly Site ID, that it has already seen.
    • It introduces a small delay between HTTP requests to ensure the load is spread out.
    • It does not proactively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.
  • In the first month of integration you will see more crawling activity than in future months, as Parse.ly will be crawling both new articles and existing articles as they receive visits. This will wane over time.
  • Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects page-load performance for visitors to your site. It is an entirely asynchronous process done by Parse.ly's servers "after the fact."
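
If you want to recognize crawler traffic in your own logs or firewall rules, the User-Agent and IP list above can be combined into a simple check. The following is a minimal sketch, not an official Parse.ly tool: the function name is_parsely_crawler and the regular expression are assumptions made for illustration, and the IP addresses are copied from the list above, which may be out of date.

```python
import ipaddress
import re

# IP addresses copied from the list above (last updated October 7, 2019).
# Refresh this set if Parse.ly publishes a newer list.
PARSELY_CRAWLER_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "54.209.175.114",
        "107.21.43.157",
        "107.21.47.230",
        "54.86.145.133",
        "54.172.177.129",
        "54.4.124.120",
        "52.5.226.42",
        "52.44.47.254",
        "54.165.238.239",
        "52.200.241.230",
    )
)

# Match the "parse.ly scraper/<version>" token without pinning the version,
# because the version field can change over time.
_CRAWLER_UA = re.compile(r"parse\.ly scraper/\d+(\.\d+)*", re.IGNORECASE)


def is_parsely_crawler(user_agent: str, remote_addr: str) -> bool:
    """Return True only if both the User-Agent and the source IP match."""
    if not _CRAWLER_UA.search(user_agent or ""):
        return False
    try:
        return ipaddress.ip_address(remote_addr) in PARSELY_CRAWLER_IPS
    except ValueError:
        # remote_addr was not a valid IP address.
        return False
```

Checking both values is a deliberate choice in this sketch: a User-Agent header is easy to spoof, so the IP list acts as a second signal.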

#How does the crawler handle posts with multiple URLs?

It's common for a single post or piece of content to have multiple URLs associated with it. For instance, it may have both a web URL and a Google AMP URL, or it might be a gallery with multiple pages (/page/1, /page/2, etc.). Unlike many other analytics systems, Parse.ly is built to group together these various locations and representations of the same content. We do that by always retrieving metadata from the post's Parse.ly canonical URL.

When the Parse.ly Crawler visits a URL, the first thing it checks is whether that URL actually matches the Parse.ly canonical URL specified in the page metadata. If the URLs match, it will collect the rest of the metadata on that page, because it knows that it's found a Parse.ly canonical URL. If they don't, the crawler will instead navigate to the Parse.ly canonical URL and attempt to crawl that page, until it finds a page where there is a match.

Importantly, the Parse.ly Crawler keeps track of the original URL, and once it finds the Parse.ly canonical URL, it records the original URL as an "alias" of that canonical URL. Once aliased, pageviews to either URL will be grouped together by default.

The Parse.ly Crawler will also follow 301 or 302 redirects automatically, and will update the stored canonical or aliased URL as needed. When moving or changing a Parse.ly canonical URL, it's important that the original URL still resolves or redirects to the new location.
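
The canonical-matching and aliasing steps described in this section can be sketched roughly as follows. This is an illustration of the described behavior, not Parse.ly's actual implementation: the function resolve_canonical, the get_declared_canonical callback (which stands in for fetching a page and reading its declared Parse.ly canonical URL), and the max_hops limit are all hypothetical. Redirect handling is not modeled.

```python
from typing import Callable, List, Optional, Tuple


def resolve_canonical(
    start_url: str,
    get_declared_canonical: Callable[[str], Optional[str]],
    max_hops: int = 5,  # hypothetical safety limit, not a documented value
) -> Tuple[Optional[str], List[str]]:
    """Follow declared canonical URLs until a page declares itself canonical.

    Returns (canonical_url, aliases). If no self-matching page is found,
    returns (None, []).
    """
    visited: List[str] = []
    url = start_url
    for _ in range(max_hops):
        if url in visited:
            break  # the pages point at each other in a loop
        declared = get_declared_canonical(url)
        if declared is None:
            break  # the page declares no canonical URL
        if declared == url:
            # Match found: every URL visited on the way becomes an alias.
            return url, visited
        visited.append(url)
        url = declared
    return None, []


# Example with an in-memory stand-in for real page fetches.
pages = {
    "https://example.com/post/amp": "https://example.com/post",
    "https://example.com/post": "https://example.com/post",
}
canonical, aliases = resolve_canonical("https://example.com/post/amp", pages.get)
```

In this example, the AMP URL ends up as an alias of https://example.com/post, so pageviews to either URL would be grouped together.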

#How do I validate that a page is being crawled correctly?

To determine if the crawler is working correctly on a specific page, you can visit Validate Integration. This service checks to see if the Parse.ly Tracking Code will be able to track the page and if the crawler can extract the required metadata.

#How do I find pages that have not been crawled correctly?

To find pages that have not been crawled correctly, navigate to your Parse.ly Dashboard, then click on Posts and filter by a Page Type of "Non-Post Pages."

https://dash.parsely.com/<SITE_ID>/posts/?page_type=nonpost

This will display a list of pages that the Parse.ly Tracking Code has submitted that the Parse.ly Crawler has either identified as non-post pages or has yet to crawl. Most commonly, these are section fronts or list pages that do not represent an individual post. Occasionally, a page will be displayed that appears to be a post. This can mean that the crawler received an error from the server or the page did not contain the required metadata.
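
If you manage several sites, the dashboard link above can be generated from each Site ID. A minimal sketch, assuming the URL pattern shown above stays stable; the helper name nonpost_pages_url is hypothetical:

```python
from urllib.parse import quote, urlencode


def nonpost_pages_url(site_id: str) -> str:
    """Build the dashboard URL that filters Posts to non-post pages."""
    query = urlencode({"page_type": "nonpost"})
    # Percent-encode the Site ID so unexpected characters cannot alter the path.
    return f"https://dash.parsely.com/{quote(site_id, safe='')}/posts/?{query}"
```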
