Getting Started Playbook

You are here because you have the Parse.ly Data Pipeline, or you have the publicly available demo Data Pipeline data, and you are ready to get your hands dirty with some clean data!

#Watch the Data Pipeline getting started video

#Downloading Data Pipeline data using AWS CLI

Getting the AWS CLI set up locally is simple. Start with the installation instructions in the AWS documentation.
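If you already have the AWS CLI, you can confirm it from a terminal; if not, pip is one common way to install it (the AWS documentation also covers platform-specific installers):

# check for an existing install
aws --version

# one common install route, assuming Python and pip are available
pip install awscli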

Setting up credentials (only necessary for accessing a private Data Pipeline S3 bucket, not the demo data public S3 bucket)

aws configure --profile parsely_dpl
AWS Access Key ID [None]: ENTER ACCESS ID
AWS Secret Access Key [None]: ENTER SECRET KEY
Default region name [None]: us-east-1
Default output format [None]: json
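To confirm the profile works before moving on, you can ask AWS which identity it authenticates as:

aws sts get-caller-identity --profile parsely_dpl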

Download one file

Once AWS CLI is installed, it can be used to download a file from your Parse.ly Data Pipeline S3 bucket, or from the public Parse.ly S3 bucket with demo Data Pipeline data.

For the demo Parse.ly Data Pipeline data

aws s3 cp s3://parsely-dw-parse-ly-demo/events/file_name.gz .

For your private customer-specific S3 bucket (make sure to use the --profile flag)

aws s3 cp s3://parsely-dw-bucket-name-here/events/file_name.gz . --profile parsely_dpl
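If you don't already know a file name, you can list what's under the events/ prefix first (this assumes the same events/ layout shown in the commands above; add the --profile flag for your private bucket):

aws s3 ls s3://parsely-dw-parse-ly-demo/events/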

Download all the files

If you want to download all of the files in an S3 bucket, follow the instructions below. Caution: this can be a large amount of data!

For the demo Parse.ly Data Pipeline data

aws s3 cp s3://parsely-dw-parse-ly-demo . --recursive

For your private customer-specific S3 bucket (make sure to use the --profile flag)

aws s3 cp s3://parsely-dw-bucket-name-here . --recursive --profile parsely_dpl
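Because a full copy can be large, it can help to check the total size of the bucket before downloading anything:

aws s3 ls s3://parsely-dw-bucket-name-here --recursive --summarize --human-readable --profile parsely_dpl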

Copy the data to your own S3 bucket

There are a few ways to do this! Using the AWS CLI locally is one of the simplest.

For the demo Parse.ly Data Pipeline data

aws s3 cp s3://parsely-dw-parse-ly-demo s3://your-bucket-here --recursive

For your private customer-specific S3 bucket (make sure to use the --profile flag)

aws s3 cp s3://parsely-dw-bucket-name-here s3://your-bucket-here --recursive --profile parsely_dpl
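aws s3 sync is an alternative to a recursive cp: it only copies objects that are missing or changed in the destination, which is convenient if you re-run the copy on a schedule:

aws s3 sync s3://parsely-dw-bucket-name-here s3://your-bucket-here --profile parsely_dpl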

#Copy the Data Pipeline data to Redshift or Google BigQuery

Parse.ly has made a GitHub repository specific to these use cases!

The README.md in the repository linked above has the most detailed information on how to install and use it. Below are a few examples of handy tasks you can run once you have the parsely_raw_data repo installed.
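As a rough sketch, installing the repo from source usually looks something like the commands below; defer to the README for the exact steps, required configuration, and command-line arguments (the clone URL here is assumed from the repository name):

git clone https://github.com/Parsely/parsely_raw_data.git
cd parsely_raw_data
pip install .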

Copy the S3 data to your Redshift database

This command will create an Amazon Redshift table using the specified Parse.ly schema and load the Data Pipeline data into the new table.

python -m parsely_raw_data.redshift

Copy the S3 data to your Google BigQuery project

This command will create a Google BigQuery table using the specified Parse.ly schema and load the Data Pipeline data into the new table.

python -m parsely_raw_data.bigquery

#Query the Data Pipeline data using AWS Athena

Don't want to move any data? We've got you covered! AWS Athena is a SQL interface that allows you to query S3 files directly.

  1. Create an Athena table using the Parse.ly Data Pipeline Athena schema.
  2. Load the data into the recommended year-month partitions:
ALTER TABLE table_name_here ADD PARTITION (year='YYYY', month='MM') location 's3://parsely-dw-bucket-name-here/events/YYYY/MM'
  3. Use Athena to query the Data Pipeline data!
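If you would rather script the partition loads than paste statements into the Athena console, the AWS CLI can submit the same ALTER TABLE statement; the database name and query-results bucket below are placeholders you would replace with your own, and the command should run with credentials for the AWS account where your Athena tables live:

aws athena start-query-execution \
    --query-string "ALTER TABLE table_name_here ADD PARTITION (year='YYYY', month='MM') location 's3://parsely-dw-bucket-name-here/events/YYYY/MM'" \
    --query-execution-context Database=your_database_here \
    --result-configuration OutputLocation=s3://your-athena-results-bucket/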

#Getting started queries to answer common questions

These queries are formatted as if you are using Athena to query the Data Pipeline data.

Retrieve all records

This query will retrieve all records inside the Athena table that is reading from the S3 files. Please note that it will only retrieve data from partitions that have been loaded (see the section above). The more specific you are about the partitions you are querying, the cheaper your Athena query will be.

select *
from parsely_data_pipeline_table_name
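For example, restricting the same query to a single year-month partition (and capping the number of rows returned) keeps the scan, and the cost, much smaller:

select *
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and --this makes the query cheaper!
    month = 'mm' --this makes the query cheaper!
limit 100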

Bot traffic investigation

Bot traffic is getting smarter! To investigate the user agents and IP addresses hitting a specific post on a certain day, you can use the following query as a template:

select
    user_agent,
    visitor_ip,
    count(action) as pageviews
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and --this makes the query cheaper!
    month = 'mm' and --this makes the query cheaper!
    action = 'pageview' and
    url like '%only-include-unique-url-path-here%' and
    date(ts_action) = date 'yyyy-mm-dd'
group by 1,2
order by 3 desc
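If you want a site-wide view rather than a single post, a variant like the one below (still partition-filtered) surfaces the noisiest user agents for a month:

select
    user_agent,
    count(action) as pageviews
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and
    month = 'mm' and
    action = 'pageview'
group by 1
order by 2 desc
limit 50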

Engaged-time by referrer type

This is a template query to retrieve engaged time by referrer category.

select
   channel,
   ref_category,
   sum(engaged_time_inc) as engaged_time_seconds,
   sum(engaged_time_inc)/60.0 as engaged_time_minutes --60.0 avoids integer division
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and
    month = 'mm'
group by 1,2
order by 3 desc
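A similar shape works per post: swapping the grouping columns for url shows the most-engaging pages in the partition:

select
    url,
    sum(engaged_time_inc)/60.0 as engaged_time_minutes
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and
    month = 'mm'
group by 1
order by 2 desc
limit 25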

View conversions

Your conversions data is included in your Data Pipeline data! Use the template below to get started with the conversions data.

select
    *
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and --this makes the query cheaper!
    month = 'mm' and --this makes the query cheaper!
    action = 'conversion'
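To see which pages drive the most conversions, the same filters can feed a simple aggregation by url:

select
    url,
    count(*) as conversions
from parsely_data_pipeline_table_name
where
    year = 'yyyy' and
    month = 'mm' and
    action = 'conversion'
group by 1
order by 2 desc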

#Use dbt and a pre-formatted star schema to organize the Data Pipeline data in Redshift

Coming soon! Contact Parse.ly Support to let us know you're interested in receiving updates on this effort!
