Debugging Bad Rows in Athena: A Practical Guide for Snowplow Users
Snowplow’s non-lossy pipeline architecture is a key differentiator for modern data teams. Instead of silently dropping malformed or invalid events, Snowplow preserves them as bad rows, complete with descriptive error messages that help you trace the root cause of data quality issues.
When rolling out new tracking or monitoring the integrity of your pipelines over time, these bad rows become invaluable. In this tutorial, we walk through how to query and debug Snowplow bad rows using Amazon Athena, a fast, serverless, SQL-based query service that makes analyzing data in S3 simple.
Target audience: This guide is written for data engineers, architects, and analytics professionals already familiar with AWS, Snowplow, and SQL.
Why Debugging Bad Rows Matters
Bad rows contain structured information about failed validation checks during Snowplow's enrichment process. These failures can result from:
- Schema mismatches
- Unexpected or malformed field values
- Improperly structured payloads
- Out-of-date tracking configurations
Instead of discarding this data, Snowplow stores it in a dedicated S3 path (typically enriched/bad). Analyzing this data allows you to improve tracking reliability and maintain a high standard of data quality.
Step 1: Create an Athena Table for Bad Rows
First, create an external table in Athena that maps to the bad row structure in S3. Navigate to the Athena console and execute the following SQL, updating LOCATION with your Snowplow bad rows S3 path:
CREATE EXTERNAL TABLE IF NOT EXISTS bad_rows (
line string,
errors array<struct<level: string, message: string>>,
failure_tstamp string
)
PARTITIONED BY (
run string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/to/enriched/bad/';
Then load all partitions:
MSCK REPAIR TABLE bad_rows;
⚠️ If you're working with large volumes of data, consider manually adding specific partitions instead to reduce scan costs and query time (see below).
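Either way, you can confirm which run partitions Athena has registered:
SHOW PARTITIONS bad_rows;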
Step 2: Analyze and Investigate Bad Rows
Count Bad Rows by Run
To monitor bad row volume and spot anomalies:
SELECT
run,
COUNT(*) AS bad_rows_count
FROM bad_rows
GROUP BY run
ORDER BY run;
This gives you a timeline of error frequency by pipeline run.
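Because each run value encodes a run timestamp (e.g. 2025-05-19-12-00-00), you can also roll the counts up to a daily view. Here is a minimal sketch assuming that run format:
SELECT
substr(run, 1, 10) AS run_date,
COUNT(*) AS bad_rows_count
FROM bad_rows
GROUP BY 1
ORDER BY run_date;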
Preview a Sample of Bad Rows
To get a feel for the kinds of errors occurring:
SELECT
errors
FROM bad_rows
WHERE run > '2025-01-01'
LIMIT 100;
Use filters on run to limit scan size and reduce Athena costs.
Filter Out Known Benign Errors
You may want to exclude common, non-critical validation failures:
SELECT
errors[1].message
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
LIMIT 100;
These filters remove routine schema version and null event name errors that don't need immediate action.
Isolate and Measure Specific Errors
To identify and quantify a specific recurring issue, such as extra properties not defined in the schema:
SELECT
run,
COUNT(*)
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
GROUP BY run
ORDER BY run;
This lets you measure the impact of a bug over time.
Step 3: Debug Individual Bad Events
To inspect the raw line associated with an error:
SELECT
line
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
LIMIT 100;
From here, you can base64 decode the line payload to trace the error back to the original tracker or source system.
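Athena can handle the decoding in-query. The sketch below assumes the line field is a single base64-encoded string; depending on your collector format, line may instead be a tab-separated payload in which only certain fields are encoded:
SELECT
from_utf8(from_base64(line)) AS decoded_line
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
LIMIT 10;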
Step 4: Filter Known Issues and Focus on New Errors
As you resolve common issues, refine your filters:
SELECT
errors[1].message
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) = 0
LIMIT 100;
This enables ongoing triage and helps your team prioritize remaining unknown or unexpected errors.
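To prioritize what remains, you can rank the surviving error messages by frequency, reusing the same exclusion filters:
SELECT
errors[1].message AS error_message,
COUNT(*) AS occurrences
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) = 0
GROUP BY 1
ORDER BY occurrences DESC
LIMIT 20;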
Manually Adding Partitions in Athena
When working with large datasets, MSCK REPAIR TABLE can be time-consuming. Instead, add individual partitions manually:
ALTER TABLE bad_rows ADD PARTITION (run='2025-05-19-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-19-12-00-00/';
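If you need several runs at once, ALTER TABLE ... ADD accepts multiple PARTITION clauses in a single statement (the run values below are illustrative):
ALTER TABLE bad_rows ADD
PARTITION (run='2025-05-19-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-19-12-00-00/'
PARTITION (run='2025-05-20-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-20-12-00-00/';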
Query the partitioned data directly:
SELECT
COUNT(*),
errors
FROM bad_rows
WHERE run = '2025-05-19-12-00-00'
GROUP BY errors
ORDER BY COUNT(*) DESC;
Beyond Athena: Visualizing and Recovering Bad Rows
Athena is only the start. Here are a few ways to build on it:
Visualization Tools
- Amazon QuickSight or Redash: Both support Athena and can be used to visualize trends and error categories over time.
- Snowplow Console (Managed Service): Offers built-in monitoring for bad rows and schema violations (depending on plan).
Recovery Options
- Hadoop Event Recovery: Useful for reprocessing events with known fixes (e.g., missing or incorrect schema references).
- Custom JS recovery functions: Many teams develop and reuse simple JS functions to correct common error patterns before re-ingesting.
Final Thoughts
Snowplow’s commitment to data transparency through preserved bad rows empowers teams to build more trustworthy data pipelines. Athena makes it easy to explore and triage these bad rows, reducing time-to-insight when something goes wrong.
By combining Athena queries, robust filtering, and Snowplow's structured error messages, your data team can maintain a reliable, high-integrity event stream — even as tracking evolves and scales.