Handling invalid timestamps in your Parse.ly data pipeline
When using the Parse.ly data pipeline, you may occasionally see events with odd or invalid timestamps. These timestamps are generated by the reader’s web browser, so they can be wrong when the reader’s device has an incorrectly set clock. In your data pipeline events, they’ll show up in the session_ts_current and/or session_ts_previous fields.
While we can’t control the values that user browsers generate for these fields, you can fix them before loading events into your database by using ETL checks and transforms. At a high level, the approach is as follows:
- When processing events as part of your ETL pipeline, include a function to compare the value of session_current_ts (the browser timestamp) to timestamp_info_nginx_ms (the server timestamp).
- If a significant gap exists (the exact time interval is up to you), run a transformation on session_current_ts to create a new field. The field name is up to you; in this example, we’ll call it session_current_ts_trans. Set the transformed field to min(timestamp_info_nginx_ms) across the events in your table that share the current event’s session_id, visitor_site_id, and apikey values.
- For events that fall within your acceptable time range, run the transformation to set session_current_ts_trans = session_current_ts.
Additional transformations that require the browser timestamp should then rely on the session_current_ts_trans field.
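The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes events arrive as a list of dictionaries, that session_current_ts and timestamp_info_nginx_ms are both epoch milliseconds, and that a 24-hour gap counts as "significant". The function name add_transformed_ts and the MAX_GAP_MS threshold are hypothetical; adjust them to fit your own pipeline.

```python
# Assumption: both timestamps are epoch milliseconds.
# Assumption: a gap of more than 24 hours marks the browser timestamp as invalid.
MAX_GAP_MS = 24 * 60 * 60 * 1000


def session_key(event):
    """Group events by the fields named in the steps above."""
    return (event["session_id"], event["visitor_site_id"], event["apikey"])


def add_transformed_ts(events, max_gap_ms=MAX_GAP_MS):
    """Return copies of events with a session_current_ts_trans field added."""
    # First pass: find min(timestamp_info_nginx_ms) for each session.
    earliest_server_ts = {}
    for event in events:
        key = session_key(event)
        server_ts = event["timestamp_info_nginx_ms"]
        if key not in earliest_server_ts or server_ts < earliest_server_ts[key]:
            earliest_server_ts[key] = server_ts

    # Second pass: compare browser and server timestamps for each event.
    transformed = []
    for event in events:
        browser_ts = event.get("session_current_ts")
        server_ts = event["timestamp_info_nginx_ms"]
        fixed = dict(event)  # leave the original event untouched
        if browser_ts is None or abs(browser_ts - server_ts) > max_gap_ms:
            # Gap too large (or value missing): fall back to the earliest
            # server timestamp seen for this session.
            fixed["session_current_ts_trans"] = earliest_server_ts[session_key(event)]
        else:
            # Within the acceptable range: keep the browser timestamp.
            fixed["session_current_ts_trans"] = browser_ts
        transformed.append(fixed)
    return transformed
```

In a warehouse-based ETL, the same logic could be expressed as a grouped minimum joined back to the events table; the two-pass approach here simply keeps the example self-contained.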