Raw Data Schema

JSON Format

Whether you access your raw data via S3 (bulk) or Kinesis (streaming), you are going to be dealing with lines of JSON objects, aka JSONLines.

This format is generally easy to parse in most programming languages, cloud SQL engines, and big data tools.

This page describes the schema of these JSON records (keys and values) so that you can interpret the raw events as they come in.
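As a minimal sketch of what "lines of JSON objects" means in practice, the Python below parses one record per line. The two sample records are hypothetical and abbreviated to a few of the fields described on this page; the helper name parse_jsonlines is ours, not part of any Parse.ly library.

```python
import json


def parse_jsonlines(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical, abbreviated records (real events carry many more keys).
sample = [
    '{"action": "pageview", "apikey": "mashable.com", "session_id": 1}',
    '{"action": "pageview", "apikey": "mashable.com", "session_id": 2}',
]

events = list(parse_jsonlines(sample))
print(events[0]["action"])
```

The same function works on an open file handle, since iterating over a file yields one line at a time.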

Example JSON Page View Record

We'll start with an example pageview record, with the keys simply alphabetically sorted, from one of our sites.

You'll notice that for the most part, these are straightforward key-value pairs, typically strings, but occasionally numbers, null, or booleans (true / false).

Base Event Fields

| name | description | example value |
| --- | --- | --- |
| action | event type identifier | "pageview" |
| apikey | site identifier | "mashable.com" |
| referrer | raw referring URL | "http://facebook.com/instantarticles#v1" |
| session_id | Session identifier | 1 |
| user_agent | Raw User-Agent (UA) string | "Mozilla/5.0 (iPhone; CPU ... Safari/601.1" |
| url | Raw URL on which action occurred | "http://mashable.com/1234#d3d" |
| visitor_site_id | Visitor first-party site identifier | "0beabdd1-7b0c-423b-9fae-660101fc8953" |

These are the raw required fields we get from your integration with our data collection infrastructure.

They will be present in every event, regardless of event type or source, with two possible exceptions: the session_id and visitor_ip fields may be excluded, though all of our integrations attempt to support these fields to the best of their ability.

On One-Time Historical Imports

Customers often ask us whether it might be possible to do a one-time import of historical pageview (or other) event data from legacy web analytics systems. The answer to this question is "yes", though it does require some custom work on Parse.ly's side. We also need to have equivalents for the above "Base Event" fields in order to make sense of your historical data.

Timestamp Fields

We automatically record two raw timestamps per event: one comes from our data collection servers and one comes from our client-side trackers. A third, client-side override timestamp may also be supplied. These are stored as numbers that represent milliseconds since the UNIX epoch (UNIX time, expressed in milliseconds). Our server clocks are in UTC.

  • timestamp_info_nginx_ms is an automatic server-side event timestamp
  • timestamp_info_pixel_ms is an automatic client-side event timestamp
  • timestamp_info_override_ms is a client-side override timestamp

In general, Parse.ly's internal attitude is, "the client-side timestamp cannot be trusted". However, there are situations in which it can make sense to trust it over the server timestamp.

Our nginx (server-side) timestamp is at second resolution, whereas our pixel (client-side) timestamp is at millisecond resolution. If a pixel timestamp is within a few seconds of the corresponding nginx timestamp, it is likely more accurate: it represents when the event was sent, at millisecond resolution, rather than when the event was received, at second resolution. With our standard JavaScript tracker, both nginx and pixel are always captured together, so combining them lets us make JavaScript tracker-based events as accurate as possible.

In mobile SDKs for iOS and Android, it is common to "batch" events if devices are offline. These are also known as "late-arriving" events. In these cases, neither the auto-generated server-side timestamp (in nginx) nor the auto-generated client-side timestamp (in pixel) can be trusted; instead, the client-side override timestamp may be a more accurate representation of reality. The mobile SDK populates this override by filling a ts field in the data key-value object sent with every event.
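One way to apply the reasoning above in your own ETL is sketched below. It is an illustration, not Parse.ly's internal logic: the function name best_timestamp_ms, the decision to prefer the override whenever it is present, and the 5-second skew threshold (our reading of "within a few seconds") are all assumptions you should tune for your own data.

```python
def best_timestamp_ms(event, max_skew_ms=5000):
    """Pick the most plausible event time, in milliseconds since the epoch.

    Assumed policy (not confirmed by Parse.ly):
      1. use the client-side override when one was sent;
      2. otherwise use the pixel timestamp if it is close to nginx time;
      3. otherwise fall back to the server-side nginx timestamp.
    """
    override = event.get("timestamp_info_override_ms")
    if override is not None:
        return override

    nginx = event.get("timestamp_info_nginx_ms")
    pixel = event.get("timestamp_info_pixel_ms")
    if nginx is not None and pixel is not None:
        if abs(pixel - nginx) <= max_skew_ms:
            return pixel  # millisecond resolution, and plausibly accurate
    return nginx if nginx is not None else pixel


event = {
    "timestamp_info_nginx_ms": 1466215404000,
    "timestamp_info_pixel_ms": 1466215403412,
}
print(best_timestamp_ms(event))
```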

On Timezones

Parse.ly's JavaScript tracker populates the client-side timestamp using new Date().getTime(), which means that it is in UTC. Our server clocks are also in UTC. So, these should be comparable. However, note that UNIX time itself does not embed any timezone information. It simply represents the time elapsed since a specific UTC time in the past, the UNIX epoch. You could try to infer the user's local timezone from their IP address, based on their estimated geography. If you combine the timestamp with that inferred timezone, you can interpret the event in the user's local time.
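To make the timezone point concrete, here is a small sketch that turns a millisecond timestamp into a UTC datetime and then shifts it to a local offset. The UTC-04:00 offset is purely hypothetical, standing in for whatever timezone you might infer from the visitor's geography; the helper name to_utc is ours.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts_ms):
    """Convert milliseconds since the UNIX epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


utc_time = to_utc(1466215404000)

# Hypothetical: suppose geolocation suggested the visitor was at UTC-04:00.
inferred_offset = timezone(timedelta(hours=-4))
local_time = utc_time.astimezone(inferred_offset)

print(utc_time.isoformat())
print(local_time.isoformat())
```

Both datetimes describe the same instant; only the wall-clock rendering differs.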

Event ID

| name | description | example value |
| --- | --- | --- |
| event_id | unique event identifier string | "0xe6508eda93d5598367b18555ae9b828d" |

A unique, hex-encoded ID string is also generated for each Event. This property can be used to deduplicate events for easier ingestion and processing.

This unique ID is generated by hashing the values of apikey, action, url, timestamp (internal, generated property), visitor_site_id, and timestamp_info_pixel_ms. To ensure that each event_id is truly unique, make sure that all events sent to Parse.ly provide all of these required fields (excluding timestamp, which is generated on our side) at an appropriate level of cardinality and granularity.

For example, if visitor_site_id is not provided for a series of events, then the only properties able to generate unique values for those events are the event type and the timestamp.
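A minimal sketch of deduplicating on event_id during ingestion follows. The function name dedupe_events is ours, and keeping every seen ID in an in-memory set is an assumption that only suits modest batch sizes; a large pipeline would more likely deduplicate in the downstream database.

```python
def dedupe_events(events):
    """Yield each event once, keyed on event_id; keep events that lack an ID."""
    seen = set()
    for event in events:
        event_id = event.get("event_id")
        if event_id is None:
            # Nothing to deduplicate on, so pass the event through untouched.
            yield event
            continue
        if event_id in seen:
            continue
        seen.add(event_id)
        yield event


batch = [
    {"event_id": "0xe6508eda93d5598367b18555ae9b828d", "action": "pageview"},
    {"event_id": "0xe6508eda93d5598367b18555ae9b828d", "action": "pageview"},
]
unique = list(dedupe_events(batch))
print(len(unique))
```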

Session Enrichments

Parse.ly's JavaScript tracker automatically creates some useful session information that can help with user session analysis. For one thing, Parse.ly's session_id also doubles as a "number of visits" value, since it's an auto-incrementing integer that starts at 1 and moves up by one for every new visit by a visitor with the same visitor_site_id.

Note that these enrichments are done client-side by Parse.ly's JavaScript tracker; they will not apply to events that arrive via other integrations.

The other fields stored with the session are described below:

| name | description | example value |
| --- | --- | --- |
| session_id | auto-incrementing session identifier, unique to visitor_site_id | 1 |
| session_initial_referrer | the raw referring URL of the first pageview event of this session | "http://facebook.com" |
| session_initial_url | the raw URL of the first pageview event of this session | "http://mashable.com/1234#d3d" |
| session_last_session_timestamp | Timestamp of the last visit, or 0 if none | 0 |
| session_timestamp | Timestamp of first pageview event of this session | 1466214847371 |
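As a rough sketch of how these session fields can support loyalty analysis, the helpers below flag first visits and compute the gap since the previous visit. The helper names are ours, and we assume both session timestamps are in milliseconds, as the example value above suggests.

```python
def is_first_visit(event):
    """session_id starts at 1, so 1 marks a visitor's first recorded visit."""
    return event.get("session_id") == 1


def ms_since_last_visit(event):
    """Gap between this session's start and the previous visit, or None if unknown."""
    last = event.get("session_last_session_timestamp") or 0
    current = event.get("session_timestamp")
    if last == 0 or current is None:
        return None  # 0 means there was no earlier visit
    return current - last


returning = {
    "session_id": 3,
    "session_timestamp": 1466214847371,
    "session_last_session_timestamp": 1466128447371,
}
print(is_first_visit(returning), ms_since_last_visit(returning))
```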

Timestamp Enrichments

Based on the above timestamp fields, we also create an important field called ts_action. This is timestamp_info_nginx_ms (our server time) re-interpreted as a formatted date string that is highly compatible with a number of systems. For example, it is the same format expected by Amazon Redshift and Google BigQuery's JSON value parsers.

  • ts_action: "2016-06-18 02:03:24"

The value above is derived from epoch time 1466215404000; it also lacks timezone information but can be interpreted as a UTC time. It turns out that including timezone information, as one might for the "full" ISO8601 standard, makes this string incompatible with some SQL engines, so we chose a maximally compatible format instead.
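The relationship between the epoch value and the formatted string can be reproduced with Python's standard library. This is a sketch of the conversion, not Parse.ly's implementation; the constant and function names are ours.

```python
from datetime import datetime, timezone

# Same layout as the ts_action example: no "T" separator, no timezone suffix.
TS_ACTION_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_ts_action(nginx_ms):
    """Render a millisecond epoch timestamp as a UTC date string."""
    moment = datetime.fromtimestamp(nginx_ms / 1000, tz=timezone.utc)
    return moment.strftime(TS_ACTION_FORMAT)


def parse_ts_action(ts_action):
    """Parse a ts_action string back into a UTC-aware datetime."""
    naive = datetime.strptime(ts_action, TS_ACTION_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


print(format_ts_action(1466215404000))
```

Because the string carries no timezone, parse_ts_action attaches UTC explicitly rather than letting the result be treated as local time.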

URL and Referrer Enrichments

Based on the url, referrer, session_initial_url and session_initial_referrer fields, we provide a number of enrichments. For the sake of illustration, we'll assume the following values:

| field | value |
| --- | --- |
| url | "https://www.example.com/article-1234?campaignid=1234#fragment" |
| referrer | "https://www.google.ca/" |
| session_initial_url | "https://www.example.com/article-1234?campaignid=1234#fragment" |
| session_initial_referrer | "https://www.google.ca/" |

On URL Parsing

Attributes added to parsed URLs, such as fragment, netloc, params, query, and scheme, adhere to RFC 1808.

| name | description | example value |
| --- | --- | --- |
| url_clean | Cleaned url (strip query/fragment) | "https://www.example.com/article-1234" |
| url_domain | url parsed domain, matched against TLD list | "example.com" |
| url_fragment | Fragment portion of url | "fragment" |
| url_netloc | Netloc portion of url | "www.example.com" |
| url_params | Params portion of url | "" |
| url_path | Path portion of url | "/article-1234" |
| url_query | Query portion of url | "campaignid=1234" |
| url_scheme | Scheme portion of url | "https" |
| ref_category | referrer category (traffic source categorization) | "search" |
| ref_clean | Clean referrer URL (strip query/fragment) | "https://www.google.ca/" |
| ref_domain | referrer parsed domain, matched against TLD list | "google.ca" |
| ref_fragment | Fragment portion of referrer | "" |
| ref_netloc | Netloc portion of referrer | "www.google.ca" |
| ref_params | Params portion of referrer | "" |
| ref_path | Path portion of referrer | "/" |
| ref_query | Query portion of referrer | "" |
| ref_scheme | Scheme portion of referrer | "https" |
| surl_clean | Cleaned session_initial_url (strip query/fragment) | "https://www.example.com/article-1234" |
| surl_domain | session_initial_url parsed domain, matched against TLD list | "example.com" |
| surl_fragment | Fragment portion of session_initial_url | "fragment" |
| surl_netloc | Netloc portion of session_initial_url | "www.example.com" |
| surl_params | Params portion of session_initial_url | "" |
| surl_path | Path portion of session_initial_url | "/article-1234" |
| surl_query | Query portion of session_initial_url | "campaignid=1234" |
| surl_scheme | Scheme portion of session_initial_url | "https" |
| sref_category | Session referrer category (traffic source categorization) | "search" |
| sref_clean | Clean session referrer URL (strip query/fragment) | "https://www.google.ca/" |
| sref_domain | session_initial_referrer parsed domain, matched against TLD list | "google.ca" |
| sref_fragment | Fragment portion of session_initial_referrer | "" |
| sref_netloc | Netloc portion of session_initial_referrer | "www.google.ca" |
| sref_params | Params portion of session_initial_referrer | "" |
| sref_path | Path portion of session_initial_referrer | "/" |
| sref_query | Query portion of session_initial_referrer | "" |
| sref_scheme | Scheme portion of session_initial_referrer | "https" |
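Most of the fields in this table can be reproduced from a raw URL with Python's urllib.parse, which happens to use the same component names (scheme, netloc, path, params, query, fragment). The sketch below is an approximation rather than Parse.ly's implementation, and the function name parse_url_fields is ours. It deliberately omits the _domain and _category fields, since those depend on a TLD list and a traffic source categorization that the standard library does not provide.

```python
from urllib.parse import urlparse, urlunparse


def parse_url_fields(raw_url, prefix="url"):
    """Split a URL into prefixed component fields similar to the table above."""
    parts = urlparse(raw_url)
    # "Clean" here means scheme + netloc + path, with query and fragment dropped.
    clean = urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))
    return {
        f"{prefix}_clean": clean,
        f"{prefix}_fragment": parts.fragment,
        f"{prefix}_netloc": parts.netloc,
        f"{prefix}_params": parts.params,
        f"{prefix}_path": parts.path,
        f"{prefix}_query": parts.query,
        f"{prefix}_scheme": parts.scheme,
    }


url_fields = parse_url_fields(
    "https://www.example.com/article-1234?campaignid=1234#fragment"
)
ref_fields = parse_url_fields("https://www.google.ca/", prefix="ref")
print(url_fields["url_clean"], ref_fields["ref_path"])
```

Passing prefix="surl" or prefix="sref" gives the session-level equivalents.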

Metadata

Whether metadata was crawled via JSON-LD or passed directly in pixels (as is the case in Parse.ly's video integration), metadata associated with the url field is passed along in a series of metadata_ fields:

| name | description | example value |
| --- | --- | --- |
| metadata_authors | Array of authors for the post/video | ["Albert Einstein", "Richard Feynman"] |
| metadata_canonical_url | The canonical URL of a post, or in the case of videos, the video ID | "http://www.example.com/article-1234" |
| metadata_pub_date_tmsp | Publish date of the post in milliseconds since the UNIX epoch | 1471392000000 |
| metadata_custom_metadata | String of optional custom metadata (for more information, see the integration docs) | "{\"internal_post_id\": \"2134\"}" |
| metadata_section | Section the post/video was published in | "Physics" |
| metadata_tags | Array of tags associated with the post/video | ["science", "physics", "quantum mechanics"] |
| metadata_title | Title of the post/video | "Thoughts on Quantum Electrodynamics" |
| metadata_image_url | URL to image for the post/video | "https://www.evernote.com/l/AAFSrhKOoExCqKji3f9BS9YKfZEC-yerafgB/image.png" |
| metadata_full_content_word_count | Word count of the post (irrelevant for videos) | 1562 |
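Note that metadata_custom_metadata arrives as a string, so in the example above the JSON it contains needs a second decoding step. The sketch below assumes the string holds a JSON object, as in that example; since custom metadata is site-defined, it falls back to an empty dict for anything else. The function name parse_custom_metadata is ours.

```python
import json


def parse_custom_metadata(event):
    """Decode metadata_custom_metadata into a dict, or return {} if that fails."""
    raw = event.get("metadata_custom_metadata")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON: treat as absent.
        return {}
    return parsed if isinstance(parsed, dict) else {}


event = {"metadata_custom_metadata": '{"internal_post_id": "2134"}'}
custom = parse_custom_metadata(event)
print(custom["internal_post_id"])
```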

UA and Device Enrichments

Based on the user_agent field, we enrich the following:

| name | description | example value |
| --- | --- | --- |
| ua_browser | Browser derived from UA | "Mobile Safari" |
| ua_browserversion | Browser version derived from UA | "9.1.2" |
| ua_devicebrand | Device Brand derived from UA | "Apple" |
| ua_devicemodel | Device Model derived from UA | "iPhone" |
| ua_devicetouchcapable | Flag to indicate if device is touch capable | true |
| ua_devicetype | Device Type (mobile/tablet/desktop) from UA | "mobile" |
| ua_os | Device Operating System from UA | "iOS" |
| ua_osversion | Device Operating System version from UA | "9.3" |

We also provide information regarding the display of the device:

| name | description | example value |
| --- | --- | --- |
| display_avail_height | available height of the display, in pixels (equivalent to JavaScript's screen.availHeight property) | 877 |
| display_avail_width | available width of the display, in pixels (equivalent to JavaScript's screen.availWidth property) | 1436 |
| display_pixel_depth | color resolution (in bits per pixel) | 24 |
| display_total_height | total height of the display, in pixels | 900 |
| display_total_width | total width of the display, in pixels | 1440 |

UTM Parameter Enrichments

Based on the url field, we enrich the following from its query parameters. Note that "UTM parameters" are a web-wide de facto standard for campaign tracking that was first introduced by Urchin and Google Analytics. Google runs a free tool called the URL builder to build URLs with this format, but many tools will automatically add these parameters to allow for easier tracking, especially in places where HTTP referrers are not automatically set.

In this example, we take the above article URL, http://mashable.com/1234, and we assume that it was clicked from an email newsletter. It might then have had query parameters like the following:

http://mashable.com/1234?utm_source=newsletter_2016-06-01&utm_medium=email&utm_term=footer&utm_content=template_a&utm_campaign=subscriber_newsletter

Which would be parsed as follows:

| name | description | example value |
| --- | --- | --- |
| utm_campaign | Campaign identifier or name | "subscriber_newsletter" |
| utm_content | Template or style (e.g. for A/B tests) | "template_a" |
| utm_medium | Medium campaign ran on (e.g. email, social) | "email" |
| utm_source | The specific identifier for the source content | "newsletter_2016-06-01" |
| utm_term | A keyword or term associated with the click | "footer" |

UTM parameter tracking is powerful because it allows you to do grouping, rollup, and slice-and-dice of your campaigns, which often have associated costs and thus can be part of an ROI calculation. It also helps tremendously with decoding "direct" traffic; e.g. in many email service providers, the above click from an email newsletter would have no HTTP referrer set, and thus UTM parameters would be the only way to understand this traffic.
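If you ever need to re-derive these values yourself (for example, from historical URLs), the parsing can be sketched with Python's standard library. The function name extract_utm is ours, and taking the first value when a parameter is repeated is an assumption rather than documented Parse.ly behavior.

```python
from urllib.parse import parse_qs, urlparse

UTM_KEYS = ("utm_campaign", "utm_content", "utm_medium", "utm_source", "utm_term")


def extract_utm(raw_url):
    """Return the UTM parameters found in a URL's query string."""
    query = parse_qs(urlparse(raw_url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


utm = extract_utm(
    "http://mashable.com/1234?utm_source=newsletter_2016-06-01"
    "&utm_medium=email&utm_term=footer&utm_content=template_a"
    "&utm_campaign=subscriber_newsletter"
)
print(utm["utm_medium"])
```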

Extra Data

Arbitrary key-value pairs can be passed to Parse.ly's dynamic tracking. In these cases, your key/value pairs will appear as a nested JSON object in the extra_data field.

As part of your own ETL, you can "flatten" these fields up into your root document format if you wish to include them in whatever downstream database you use to store Parse.ly raw data.

  • "action": "_scroll"
  • "extra_data": {"_y": 1430}

In this example, a custom event, _scroll, was sent to our data pipeline, and it had associated custom data, {"_y": 1430}, which represents 1,430 pixels on the y-axis of scroll-depth within the browser. This kind of raw data can be used to implement scroll depth tracking.
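Flattening extra_data during ETL might look like the sketch below. The function name flatten_extra_data and the "extra_" column prefix are our own choices, made so that custom keys cannot collide with the built-in fields; pick whatever naming suits your downstream schema.

```python
def flatten_extra_data(event, prefix="extra_"):
    """Return a copy of the event with extra_data keys lifted to the top level."""
    flat = {key: value for key, value in event.items() if key != "extra_data"}
    for key, value in (event.get("extra_data") or {}).items():
        flat[prefix + key] = value
    return flat


scroll_event = {"action": "_scroll", "extra_data": {"_y": 1430}}
flat_event = flatten_extra_data(scroll_event)
print(flat_event)
```

Only one level is flattened here; nested objects inside extra_data are carried over as-is.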

Other Possibilities

This raw data schema is already quite rich and allows for quite a large number of queries that are not supported in Parse.ly's dashboard or APIs. Nonetheless, you may want some help thinking through the possibilities of "what else" to store in your raw data events. For example:

  • subscriber identifiers, to do detailed loyalty analysis
  • more granular information about on-page or in-app activities
  • a specialized set of query parameters for social virality modeling
  • ad impression or revenue data
  • and anything else you can think up!

Next Steps

Read on for our Code Examples.

Or, get help from our team:

  • If you are already a Parse.ly customer, get in touch with us, and we'll be happy to consult with you on advanced use cases for your raw data.

  • If you are not a Parse.ly customer, you'll first need to go through our basic integration, but we are glad to schedule a demo where we can share some of the awesome things our existing customers have done with this unlimited flexibility.
