Blog

How to Query Shredded Snowplow Events Using AWS Athena: A Comprehensive Guide

By
Snowplow Team
&
October 2, 2024
Share this post

What is AWS Athena and Why Use It for Shredded Events?

AWS Athena is a serverless query service that allows you to run SQL queries directly on data stored in S3 without the need for complex data loading or ETL processes. While Athena is often used to query ‘bad’ data (data that fails validation), it also presents a cost-effective alternative for querying ‘good’ Snowplow data.

For Snowplow users managing large volumes of data with variable query workloads, Athena is particularly attractive because it eliminates the need for a permanent Redshift cluster. This post focuses on querying the shredded data rather than the enriched data in S3.

How Is Snowplow Data Stored in S3?

Snowplow stores processed data in S3 in two main formats:

  1. Enriched Data:

    • Tab-delimited text files.
    • Structured similarly to the atomic.events table in Redshift.
    • Self-describing events and custom contexts are stored as JSON strings in specific columns.

  2. Shredded Data:

    • New-line delimited JSON files.
    • Each event and context is shredded into its own table, optimizing for efficient loading and querying in Redshift.

Why Query Shredded Data in Athena Instead of Redshift?

  • Cost Efficiency: No need for a persistent Redshift cluster. You pay only for the data scanned.
  • Scalability: Ideal for variable query workloads.
  • Data Exploration: Quickly validate new events or contexts without creating Redshift tables.

How to Create a Table for Shredded Data in Athena?

To query shredded data in Athena, each shredded table requires its own DDL (Data Definition Language). Below, we outline the steps using the webPage context table as an example.

Example Row from Shredded Data:

{
  "schema": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "web_page",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "data": {
    "id": "ef499475-70c6-4ad9-a289-82b953adaa20"
  },
  "hierarchy": {
    "rootId": "c1b6e9c2-b04f-4b80-a993-9e37503fde08",
    "rootTstamp": "2017-08-04 08:51:19.000",
    "refRoot": "events",
    "refTree": ["events", "web_page"],
    "refParent": "events"
  }
}

DDL for Creating the Shredded Table in Athena:

CREATE EXTERNAL TABLE IF NOT EXISTS com_snowplowanalytics_snowplow_web_page (
  schema STRUCT    <
    vendor: STRING,
    name: STRING,
    format: STRING,
    version: STRING
  >,
  data STRUCT      <
    id: STRING
  >,
  hierarchy STRUCT <
    rootId: STRING,
    rootTstamp: TIMESTAMP,
    refRoot: STRING,
    refTree: STRING,
    refParent: STRING
  >
)
PARTITIONED BY(run STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
STORED AS TEXTFILE
LOCATION 's3://your-bucket/shredded/good/';

Load Partitions:

MSCK REPAIR TABLE com_snowplowanalytics_snowplow_web_page;

Note: This step can take considerable time, as Athena scans all files under the specified location to identify new partitions.

What Are the Key Challenges When Querying Shredded Data in Athena?

  • Partitioning Issues: Shredded data is partitioned by run timestamp, vendor, and schema version. This can lead to unwanted data being loaded into the table, resulting in empty rows or irrelevant events.

  • Data Filtering: Athena lacks sophisticated partition pruning, leading to higher query costs if irrelevant data is not filtered out effectively.

How to Clean Up the Data in Athena?

To eliminate empty rows caused by irrelevant data, run this query:

SELECT * 
FROM com_snowplowanalytics_snowplow_web_page 
WHERE schema IS NOT NULL 
  AND data IS NOT NULL 
  AND hierarchy IS NOT NULL 
LIMIT 10;

Filtering by Schema Name:

To specifically target web_page events:
SELECT * 
FROM com_snowplowanalytics_snowplow_web_page 
WHERE schema.name = 'web_page' 
LIMIT 10;

Extracting Specific Fields in Athena

Since the columns are JSON structures, extracting specific fields requires specifying the JSON key. For example, to extract the page_view_id:

SELECT data.id 
FROM com_snowplowanalytics_snowplow_web_page 
WHERE schema.name = 'web_page' 
LIMIT 10;

How to Structure the Data to Match Redshift’s Atomic Events Table?

Athena’s table structure will differ from the atomic.events table in Redshift. To align the two, use a transformation query:

SELECT
  schema.vendor AS schema_vendor,
  schema.name AS schema_name,
  schema.format AS schema_format,
  schema.version AS schema_version,
  hierarchy.rootId AS root_id,
  hierarchy.rootTstamp AS root_tstamp,
  hierarchy.refRoot AS ref_root,
  hierarchy.refTree AS ref_tree,
  hierarchy.refParent AS ref_parent,
  data.id AS page_view_id
FROM com_snowplowanalytics_snowplow_web_page
WHERE schema.name = 'web_page'
LIMIT 10;

Joining Shredded Data with Atomic Events in Athena

To join the shredded web_page table with the atomic.events table:

SELECT 
  a.event_id, 
  wp.data.id AS page_view_id 
FROM atomic_events AS a
JOIN com_snowplowanalytics_snowplow_web_page AS wp 
  ON a.event_id = wp.hierarchy.rootId 
WHERE wp.hierarchy.refTree = '["events","web_page"]' 
LIMIT 10;

Tip: Avoid joining on timestamps unless they are properly formatted and aligned between both tables.

Next Steps and Recommendations

  • Optimize Partitioning: Consider restructuring data in S3 to isolate events and contexts into separate partitions. This reduces scan costs and query times.

  • Testing and Debugging: Experiment with Athena’s partitioning strategies to identify the most efficient way to store shredded data.

  • Stay Updated: Snowplow continues to evolve its data structure and querying capabilities. Keep an eye on new releases for optimizations related to Athena.

Conclusion

AWS Athena is a powerful tool for querying shredded Snowplow data, offering cost-effective data exploration without the need for a dedicated Redshift cluster. While challenges exist around data partitioning and filtering, adopting best practices in data structuring and query optimization can significantly enhance performance and reduce costs.

Subscribe to our newsletter

Get the latest content to your inbox monthly.

Get Started

Whether you’re modernizing your customer data infrastructure or building AI-powered applications, Snowplow helps eliminate engineering complexity so you can focus on delivering smarter customer experiences.