How to Query Shredded Snowplow Events Using AWS Athena: A Comprehensive Guide
What is AWS Athena and Why Use It for Shredded Events?
AWS Athena is a serverless query service that allows you to run SQL queries directly on data stored in S3 without the need for complex data loading or ETL processes. While Athena is often used to query ‘bad’ data (data that fails validation), it also presents a cost-effective alternative for querying ‘good’ Snowplow data.
For Snowplow users managing large volumes of data with variable query workloads, Athena is particularly attractive because it eliminates the need for a permanent Redshift cluster. This post focuses on querying the shredded data rather than the enriched data in S3.
How Is Snowplow Data Stored in S3?
Snowplow stores processed data in S3 in two main formats:
- Enriched Data:
- Tab-delimited text files.
- Structured similarly to the atomic.events table in Redshift.
- Self-describing events and custom contexts are stored as JSON strings in specific columns.
- Tab-delimited text files.
- Shredded Data:
- New-line delimited JSON files.
- Each event and context is shredded into its own table, optimizing for efficient loading and querying in Redshift.
- New-line delimited JSON files.
Why Query Shredded Data in Athena Instead of Redshift?
- Cost Efficiency: No need for a persistent Redshift cluster. You pay only for the data scanned.
- Scalability: Ideal for variable query workloads.
- Data Exploration: Quickly validate new events or contexts without creating Redshift tables.
How to Create a Table for Shredded Data in Athena?
To query shredded data in Athena, each shredded table requires its own DDL (Data Definition Language). Below, we outline the steps using the webPage context table as an example.
Example Row from Shredded Data:
{
"schema": {
"vendor": "com.snowplowanalytics.snowplow",
"name": "web_page",
"format": "jsonschema",
"version": "1-0-0"
},
"data": {
"id": "ef499475-70c6-4ad9-a289-82b953adaa20"
},
"hierarchy": {
"rootId": "c1b6e9c2-b04f-4b80-a993-9e37503fde08",
"rootTstamp": "2017-08-04 08:51:19.000",
"refRoot": "events",
"refTree": ["events", "web_page"],
"refParent": "events"
}
}
DDL for Creating the Shredded Table in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS com_snowplowanalytics_snowplow_web_page (
schema STRUCT <
vendor: STRING,
name: STRING,
format: STRING,
version: STRING
>,
data STRUCT <
id: STRING
>,
hierarchy STRUCT <
rootId: STRING,
rootTstamp: TIMESTAMP,
refRoot: STRING,
refTree: STRING,
refParent: STRING
>
)
PARTITIONED BY(run STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
STORED AS TEXTFILE
LOCATION 's3://your-bucket/shredded/good/';
Load Partitions:
MSCK REPAIR TABLE com_snowplowanalytics_snowplow_web_page;
Note: This step can take considerable time, as Athena scans all files under the specified location to identify new partitions.
What Are the Key Challenges When Querying Shredded Data in Athena?
- Partitioning Issues: Shredded data is partitioned by run timestamp, vendor, and schema version. This can lead to unwanted data being loaded into the table, resulting in empty rows or irrelevant events.
- Data Filtering: Athena lacks sophisticated partition pruning, leading to higher query costs if irrelevant data is not filtered out effectively.
How to Clean Up the Data in Athena?
To eliminate empty rows caused by irrelevant data, run this query:
SELECT *
FROM com_snowplowanalytics_snowplow_web_page
WHERE schema IS NOT NULL
AND data IS NOT NULL
AND hierarchy IS NOT NULL
LIMIT 10;
Filtering by Schema Name:
To specifically target web_page events:
SELECT *
FROM com_snowplowanalytics_snowplow_web_page
WHERE schema.name = 'web_page'
LIMIT 10;
Extracting Specific Fields in Athena
Since the columns are JSON structures, extracting specific fields requires specifying the JSON key. For example, to extract the page_view_id:
SELECT data.id
FROM com_snowplowanalytics_snowplow_web_page
WHERE schema.name = 'web_page'
LIMIT 10;
How to Structure the Data to Match Redshift’s Atomic Events Table?
Athena’s table structure will differ from the atomic.events table in Redshift. To align the two, use a transformation query:
SELECT
schema.vendor AS schema_vendor,
schema.name AS schema_name,
schema.format AS schema_format,
schema.version AS schema_version,
hierarchy.rootId AS root_id,
hierarchy.rootTstamp AS root_tstamp,
hierarchy.refRoot AS ref_root,
hierarchy.refTree AS ref_tree,
hierarchy.refParent AS ref_parent,
data.id AS page_view_id
FROM com_snowplowanalytics_snowplow_web_page
WHERE schema.name = 'web_page'
LIMIT 10;
Joining Shredded Data with Atomic Events in Athena
To join the shredded web_page table with the atomic.events table:
SELECT
a.event_id,
wp.data.id AS page_view_id
FROM atomic_events AS a
JOIN com_snowplowanalytics_snowplow_web_page AS wp
ON a.event_id = wp.hierarchy.rootId
WHERE wp.hierarchy.refTree = '["events","web_page"]'
LIMIT 10;
Tip: Avoid joining on timestamps unless they are properly formatted and aligned between both tables.
Next Steps and Recommendations
- Optimize Partitioning: Consider restructuring data in S3 to isolate events and contexts into separate partitions. This reduces scan costs and query times.
- Testing and Debugging: Experiment with Athena’s partitioning strategies to identify the most efficient way to store shredded data.
- Stay Updated: Snowplow continues to evolve its data structure and querying capabilities. Keep an eye on new releases for optimizations related to Athena.
Conclusion
AWS Athena is a powerful tool for querying shredded Snowplow data, offering cost-effective data exploration without the need for a dedicated Redshift cluster. While challenges exist around data partitioning and filtering, adopting best practices in data structuring and query optimization can significantly enhance performance and reduce costs.