Crawler information

To minimize the impact of our tracking code on page-load performance, the Parse.ly Tracking Code sends only the minimum amount of data with each request.

To extract the remaining metadata about the page, the Parse.ly Crawler makes follow-up requests to the URLs submitted by the Parse.ly Tracking Code. Since both the Parse.ly Tracking Code and the Parse.ly Crawler know the URL of the page, they use it to correlate the data.
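As a rough illustration of this URL-based correlation, the sketch below joins lightweight pageview events with separately crawled metadata, using the URL as the shared key. The function name `correlate` and the field names (`url`, `title`) are hypothetical and chosen only for this example; this is not Parse.ly's actual implementation.

```python
from collections import Counter


def correlate(pageviews, crawled_metadata):
    """Join pageview events with crawled metadata, keyed by URL.

    pageviews: iterable of dicts with at least a "url" key (hypothetical shape).
    crawled_metadata: dict mapping URL -> metadata dict (hypothetical shape).
    """
    # Count how many pageview events were reported for each URL.
    counts = Counter(event["url"] for event in pageviews)

    report = {}
    for url, views in counts.items():
        # Metadata is None when the page has not been crawled yet.
        report[url] = {"views": views, "metadata": crawled_metadata.get(url)}
    return report
```

The point of the design is that the tracking request only needs to carry the URL; everything else about the page can be attached later, once the crawl has completed.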

How does the crawler work?

  • The Parse.ly Crawler identifies itself with the User-Agent string: Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com). Note that the version number can change over time.

  • Each unique URL is crawled about once per hour for the first 24 hours after being published. After that period, you must request a recrawl to update a URL's metadata.

  • The set of IP addresses for the Parse.ly Crawler worker machines is available in JSON format here: https://www.parse.ly/static/data/crawler-ips.json

  • The Parse.ly Crawler is a respectful web citizen. We do a number of things to limit the load that it puts on your servers:

    • The number of concurrent requests it opens to your server is limited so that it doesn't reduce your server's capacity to handle other requests.

    • It caches articles, by URL and API_KEY, that it has already seen.

    • It introduces a small delay between HTTP requests to ensure the load is spread out.

    • It does not proactively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.

  • In the first month of integration you will see more crawling activity than in future months, as Parse.ly will be crawling both new articles and existing articles as they receive visits. This will wane over time.

  • Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects the page-load performance experienced by visitors to your site. It is an entirely asynchronous process done by Parse.ly's servers "after the fact."
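If you want to recognize crawler traffic in your own server logs or firewall rules, the User-Agent string and the published IP list described above can be combined. The following is a minimal sketch, not an official Parse.ly tool. The function name `is_parsely_crawler` is hypothetical, and `crawler_ips` is assumed to be a list of IP address strings (or CIDR ranges) that you have already extracted from the JSON file; the layout of that file is not described here, so parsing it is left out.

```python
import ipaddress
import re

# Matches the documented User-Agent token while allowing the version to vary.
CRAWLER_UA = re.compile(r"parse\.ly scraper/\d+(\.\d+)*", re.IGNORECASE)


def is_parsely_crawler(user_agent, remote_ip, crawler_ips):
    """Return True if a request looks like it came from the Parse.ly Crawler.

    crawler_ips: list of address or CIDR strings (assumed input format).
    """
    # A User-Agent header alone can be spoofed, so it is only the first check.
    if not CRAWLER_UA.search(user_agent or ""):
        return False
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # Not a valid IP address.
    # ip_network() accepts both single addresses and CIDR ranges.
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in crawler_ips
    )
```

Requiring both checks to pass is a deliberately conservative choice: the pattern tolerates future version changes in the User-Agent, while the IP check guards against other clients that copy the string.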

How do I validate that a page is being crawled correctly?

To determine if the crawler is working correctly on a specific page, you can visit Validate Integration. This service checks to see if the Parse.ly Tracking Code will be able to track the page and if the crawler can extract the required metadata.

How do I find pages that have not been crawled correctly?

To find pages that have not been crawled correctly, navigate to your Parse.ly Dashboard, then click on Posts and filter by a Page Type of "Non-Post Pages."

https://dash.parsely.com/<API_KEY>/posts/?page_type=nonpost

This will display a list of pages that the Parse.ly Tracking Code has submitted that the Parse.ly Crawler has either identified as non-post pages or has yet to crawl. Most commonly, these are section fronts or list pages that do not represent an individual post. Occasionally, a page will be displayed that appears to be a post. This can mean that the crawler received an error from the server or the page did not contain the required metadata.
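If you manage several sites, it can be convenient to generate the dashboard link above for each API key. A small sketch follows; the helper name `nonpost_pages_url` is hypothetical, and the URL pattern is simply the one shown above with the API key filled in.

```python
from urllib.parse import quote


def nonpost_pages_url(api_key):
    """Build the dashboard URL that lists non-post pages for one API key."""
    # Percent-encode the key so unexpected characters cannot break the path.
    return (
        f"https://dash.parsely.com/{quote(api_key, safe='')}"
        "/posts/?page_type=nonpost"
    )
```

For example, `nonpost_pages_url("example.com")` produces the link for a site whose API key is `example.com` (a placeholder value).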
