To minimize the bandwidth impact of Parse.ly tracking, the default tracker only sends a minimal amount of data about the page itself in each pageview request.
To collect the remaining metadata about the page, the Parse.ly Crawler makes follow-up requests to URLs submitted by the Parse.ly Tracking Code. Because both systems know the URL of a given page, the URL is what links the crawled metadata to the pageview data.
How does the crawler work?
- The Parse.ly Crawler uses the User-Agent string `Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com)`. Note that the version field can change over time.
- Each unique URL is crawled about once per hour for the first 24 hours after it sends its first pageview, as long as it continues to send pageviews. After that period, you must request a recrawl to update a URL's metadata.
- The set of IP addresses for the Parse.ly Crawler worker machines is available in JSON format here: [https://www.parse.ly/static/data/crawler-ips.json](https://www.parse.ly/static/data/crawler-ips.json)
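If you need to allow the crawler through a firewall or identify its requests in server logs, you can match on the User-Agent string above. Since the version field changes over time, match on the stable portion rather than an exact string. This is a minimal sketch; the function name is our own, not part of any Parse.ly library:

```python
import re

# Matches the stable portion of the Parse.ly Crawler's User-Agent,
# e.g. "Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com)".
# The version number ("0.14") can change, so it is matched loosely.
PARSELY_UA = re.compile(r"parse\.ly scraper/[\d.]+")

def is_parsely_crawler(user_agent: str) -> bool:
    """Return True if a request's User-Agent belongs to the Parse.ly Crawler."""
    return bool(PARSELY_UA.search(user_agent))
```

For stricter verification, combine this check with the published crawler IP list linked above.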
The Parse.ly Crawler is a respectful web citizen. It does a number of things to limit the load that it puts on your servers:
- The number of concurrent requests it opens to your server is limited to ensure it doesn't affect your server's throughput.
- It caches articles, by URL and Parse.ly Site ID, that it has already seen.
- It introduces a small delay between HTTP requests to ensure the load is spread out.
- It does not proactively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.
- In the first month of integration you will see more crawling activity than in future months, as Parse.ly will be crawling both new articles and existing articles as they receive visits. This will wane over time.
- Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects the pageload performance of your visitors coming to your site. It is an entirely asynchronous process done by Parse.ly's servers "after the fact."
How does the crawler handle posts with multiple URLs?
It's common for a single post or piece of content to have multiple URLs associated with it. For instance, it might have both a web URL and a Google AMP URL, or it might be a gallery with multiple pages. Unlike many other analytics systems, Parse.ly is built to reconcile these various locations and representations of the same content. It does so by always retrieving metadata from the canonical URL for a post.
When the Parse.ly Crawler visits a URL, the first thing it checks is whether that URL matches the canonical URL specified in the page metadata. If the URLs match, it collects the rest of the metadata on that page. If they don't, the crawler follows the canonical URL specified and attempts to crawl that page, repeating the process until it finds a page whose URL matches its own canonical URL.
Importantly, the Parse.ly Crawler keeps track of the original URL, and once it finds the canonical page, it associates the URL as an "alias" of that canonical. Once aliased, pageviews to either URL will be aggregated together by default.
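The canonical-following and aliasing behavior described above can be sketched as a short loop. This is an illustrative outline, not Parse.ly's actual implementation; `fetch_declared_canonical` is a hypothetical helper that would return the canonical URL declared in a page's metadata:

```python
def resolve_canonical(url, fetch_declared_canonical, max_hops=5):
    """Follow declared canonical URLs until a page is its own canonical.

    Returns (canonical_url, aliases): the final canonical URL plus every
    URL visited along the way, each of which would be recorded as an
    "alias" of the canonical so their pageviews aggregate together.
    """
    aliases = []
    current = url
    for _ in range(max_hops):  # guard against canonical-link loops
        declared = fetch_declared_canonical(current)
        if declared is None or declared == current:
            return current, aliases  # this page is its own canonical
        aliases.append(current)  # remember the original URL as an alias
        current = declared  # follow the declared canonical and re-check
    return current, aliases
```

For example, an AMP URL whose metadata points at the web URL would resolve to the web URL, with the AMP URL recorded as an alias.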
How do I validate that a page is being crawled correctly?
To determine whether the crawler is working correctly on a specific page, you can visit Validate Integration. This service checks whether the Parse.ly Tracking Code will be able to track the page and whether the crawler can extract the required metadata.
How do I find pages that have not been crawled correctly?
To find pages that have not been crawled correctly, navigate to your Parse.ly Dashboard, then click on Posts and filter by a Page Type of "Non-Post Pages."
This will display a list of pages that the Parse.ly Tracking Code has submitted that the Parse.ly Crawler has either identified as non-post pages or has yet to crawl. Most commonly, these are section fronts or list pages that do not represent an individual post. Occasionally, a page will be displayed that appears to be a post. This can mean that the crawler received an error from the server or the page did not contain the required metadata.
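One piece of metadata the crawler depends on, as described above, is the page's declared canonical URL. A quick local check like the following sketch can confirm whether a page exposes one. This is a minimal illustration using the standard-library HTML parser, not the crawler's actual extraction logic:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

def find_canonical(html: str):
    """Return the page's declared canonical URL, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

If a post page returns `None` here, the crawler would have no canonical URL to follow, which is one possible reason it appears under "Non-Post Pages."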