Importing Post Metadata

When importing raw logs, Parse.ly needs to be able to join this data to post metadata. Normally, we would crawl a customer's site to find this, but this is sometimes either difficult or impossible due to URLs changing or breaking over time. Parse.ly is able to import metadata to help with the overall data import process.

Each line in this file should represent a post for which there is pageview data in the pageview data file. URLs in this file should match URLs in the pageview data file. We do basic URL cleaning, like removing campaign querystring arguments, so this file does not need to contain every single URL found in the pageview logs. Instead, each cleaned URL from the pageview file must match a URL in this file, in either the canonical_url or the urls field.

For example:

http://example.com/story_1?utm_medium=email

would be cleaned to:

http://example.com/story_1

This is because the utm_medium querystring argument isn't used to identify the post. In this case, the argument is used for campaign tracking and is therefore irrelevant to matching pageviews to metadata.

However, if your site uses querystring arguments to identify the post, we would not remove that argument as part of our cleaning. For example:

http://example.com/stories?storyid=1&utm_medium=email

would be cleaned to:

http://example.com/stories?storyid=1
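
To check your own URLs against this behavior before building the metadata file, the cleaning step can be approximated locally. The following is a minimal sketch, not Parse.ly's actual cleaning logic: the function name clean_url is hypothetical, and it assumes that campaign arguments are exactly those whose names start with utm_.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: only "utm_"-prefixed arguments are campaign tracking arguments.
# The real cleaning rules may cover more (or different) arguments.
CAMPAIGN_PREFIXES = ("utm_",)


def clean_url(url):
    """Drop campaign querystring arguments and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(CAMPAIGN_PREFIXES)
    ]
    # Rebuild the URL with only the arguments that identify the post.
    return parts._replace(query=urlencode(kept)).geturl()


print(clean_url("http://example.com/story_1?utm_medium=email"))
print(clean_url("http://example.com/stories?storyid=1&utm_medium=email"))
```

The two calls reproduce the examples above: the first URL loses its whole querystring, while the second keeps storyid=1.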

Please note, this only applies to post pages. Section pages, front pages, and other non-post pages do not need to be included in this dataset.

#Example

The following is an example of an article which was published on the Parse.ly blog. Due to the complex nature of this data format, we use Newline Delimited JSON instead of CSV.

{
	"apikey": "blog.parsely.com",
	"authors": ["Andrew Montalenti",  "Matthew Carrigan"],
	"canonical_url": "http://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
	"full_content": "Machine learning for news: ... ",
	"image_url": "https://i1.wp.com/blog.parse.ly/wp-content/uploads/2018/09/currents-nlp.png?resize=150%2C150&ssl=1",
	"post_id": "http://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
	"pub_date": 1537797645000,
	"section": "Parse.ly Tech",
	"tags": ["currents", "machine learning", "natural language processing", "parse.ly tech"],
	"title": "Machine learning for news: the NLP engine behind Parse.ly Currents",
	"urls": [
		"http://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
		"https://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
		"https://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/amp/"
	]
}

(Note: the line breaks above are for readability. The data import file should have one JSON object per line.)
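
One way to produce a file in this shape is to serialize each post with a JSON library and write one object per line. This is a minimal sketch under stated assumptions: the helper name to_ndjson and the output filename post_metadata.ndjson are hypothetical, and the sample records are abbreviated placeholders.

```python
import json


def to_ndjson(posts):
    """Serialize each post dict as one compact JSON object per line."""
    # json.dumps without `indent` never emits raw line breaks; newlines
    # inside string values are escaped, so each post stays on one line.
    return "".join(json.dumps(post, ensure_ascii=False) + "\n" for post in posts)


posts = [
    {
        "apikey": "blog.parsely.com",
        "canonical_url": "http://example.com/story_1",
        "title": "Story 1",
    },
    {
        "apikey": "blog.parsely.com",
        "canonical_url": "http://example.com/stories?storyid=1",
        "title": "Story with\na line break in the title",
    },
]

# Hypothetical output filename.
with open("post_metadata.ndjson", "w", encoding="utf-8") as handle:
    handle.write(to_ndjson(posts))
```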

#Fields

#apikey [required]

A string representing the name of the Parse.ly apikey for which the event should be counted. This value will be provided by Parse.ly for each site tracked.

#authors

A list of one or more authors for the article.

#canonical_url [required]

The main URL to use for this post. Since a post can have multiple URLs that point to it, we need to have a single URL which identifies the post.

#full_content

The full text of the post.

#image_url

URL pointing to the main image for this post.

#post_id

String that uniquely identifies this post. This defaults to canonical_url and should only be specified if your site's integration is providing a post_id different from that default. The value for post_id here should match what we scrape from your website.

#pub_date

Publish date of the post, in seconds since the epoch.

#section

The section the post was published under.

#tags

A list of any tags associated with the post.

#title [required]

The title of the post.
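
Before sending a file, it can help to confirm that every line carries the three required fields (apikey, canonical_url, and title). The check below is a minimal sketch: the function name missing_required is hypothetical, and treating an empty value as missing is an assumption rather than a documented rule.

```python
import json

REQUIRED_FIELDS = ("apikey", "canonical_url", "title")


def missing_required(line):
    """Return the required fields that are absent or empty in one line."""
    record = json.loads(line)
    # Assumption: an empty string or null counts the same as a missing key.
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


line = '{"apikey": "blog.parsely.com", "title": "Story 1"}'
print(missing_required(line))
```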

#urls

A list of URLs which share the same canonical_url. These may be additional pages to the post, different domains the post was published under, mobile versions of the post, etc.
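
The relationship between canonical_url and urls can be pictured as a lookup table from every known URL of a post to its canonical URL. The sketch below is illustrative only and is not a description of how Parse.ly performs the join internally; the function name build_url_index is hypothetical.

```python
def build_url_index(records):
    """Map every URL listed for a post to that post's canonical_url."""
    index = {}
    for record in records:
        canonical = record["canonical_url"]
        # The canonical URL always resolves to itself; urls is optional.
        for url in [canonical, *record.get("urls", [])]:
            index[url] = canonical
    return index


records = [
    {
        "canonical_url": "http://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
        "urls": [
            "https://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/",
            "https://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/amp/",
        ],
    }
]

index = build_url_index(records)
# A cleaned pageview URL (here, the AMP version) resolves to the post.
print(index.get("https://blog.parse.ly/post/7790/machine-learning-nlp-parse-ly-currents/amp/"))
```

A cleaned pageview URL that is missing from such an index has no matching metadata line, which is a quick way to spot posts absent from the file.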
