Debugging Bad Rows in Athena: A Practical Guide for Snowplow Users
Snowplow’s non-lossy pipeline architecture is a key differentiator for modern data teams. Instead of silently dropping malformed or invalid events, Snowplow preserves them as bad rows, complete with descriptive error messages that help you trace the root cause of data quality issues.
When rolling out new tracking or monitoring the integrity of your pipelines over time, these bad rows become invaluable. In this tutorial, we walk through how to query and debug Snowplow bad rows using Amazon Athena, a fast, serverless, SQL-based query service that makes analyzing data in S3 simple.
Target audience: This guide is written for data engineers, architects, and analytics professionals already familiar with AWS, Snowplow, and SQL.
Why Debugging Bad Rows Matters
Bad rows contain structured information about failed validation checks during Snowplow's enrichment process. These failures can result from:
- Schema mismatches
- Unexpected or malformed field values
- Improperly structured payloads
- Out-of-date tracking configurations
Instead of discarding this data, Snowplow stores it in a dedicated S3 path (typically enriched/bad). Analyzing this data allows you to improve tracking reliability and maintain a high standard of data quality.
Step 1: Create an Athena Table for Bad Rows
First, create an external table in Athena that maps to the bad row structure in S3. Navigate to the Athena console and execute the following SQL, updating LOCATION with your Snowplow bad rows S3 path:
CREATE EXTERNAL TABLE IF NOT EXISTS bad_rows (
line string,
errors array<struct<level: string, message: string>>,
failure_tstamp string
)
PARTITIONED BY (
run string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/to/enriched/bad/';
Then load all partitions:
MSCK REPAIR TABLE bad_rows;
⚠️ If you're working with large volumes of data, consider manually adding specific partitions instead to reduce scan costs and query time (see below).
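Either way, you can confirm which run partitions Athena has registered:
SHOW PARTITIONS bad_rows;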
Step 2: Analyze and Investigate Bad Rows
Count Bad Rows by Run
To monitor bad row volume and spot anomalies:
SELECT
run,
COUNT(*) AS bad_rows_count
FROM bad_rows
GROUP BY run
ORDER BY run;
This gives you a timeline of error frequency by pipeline run.
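Because each run value encodes a run timestamp (e.g. 2025-05-19-12-00-00), you can also roll the counts up to a daily view. Here is a minimal sketch assuming that run format:
SELECT
substr(run, 1, 10) AS run_date,
COUNT(*) AS bad_rows_count
FROM bad_rows
GROUP BY 1
ORDER BY run_date;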
Preview a Sample of Bad Rows
To get a feel for the kinds of errors occurring:
SELECT
errors
FROM bad_rows
WHERE run > '2025-01-01'
LIMIT 100;
Use filters on run to limit scan size and reduce Athena costs.
Filter Out Known Benign Errors
You may want to exclude common, non-critical validation failures:
SELECT
errors[1].message
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
LIMIT 100;
These filters remove routine schema version and null event name errors that don't need immediate action.
Isolate and Measure Specific Errors
To identify and quantify a specific recurring issue, such as extra properties not defined in the schema:
SELECT
run,
COUNT(*)
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
GROUP BY run
ORDER BY run;
This lets you measure the impact of a bug over time.
Step 3: Debug Individual Bad Events
To inspect the raw line associated with an error:
SELECT
line
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
LIMIT 100;
From here, you can base64 decode the line payload to trace the error back to the original tracker or source system.
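Athena can handle the decoding in-query. The sketch below assumes the line field is a single base64-encoded string; depending on your collector format, line may instead be a tab-separated payload in which only certain fields are encoded:
SELECT
from_utf8(from_base64(line)) AS decoded_line
FROM bad_rows
WHERE run > '2025-01-01'
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) > 0
LIMIT 10;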
Step 4: Filter Known Issues and Focus on New Errors
As you resolve common issues, refine your filters:
SELECT
errors[1].message
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) = 0
LIMIT 100;
This enables ongoing triage and helps your team prioritize remaining unknown or unexpected errors.
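To prioritize what remains, you can rank the surviving error messages by frequency, reusing the same exclusion filters:
SELECT
errors[1].message AS error_message,
COUNT(*) AS occurrences
FROM bad_rows
WHERE run > '2025-01-01'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('object instance has properties which are not allowed by the schema: ["urgency"]' in errors[1].message) = 0
GROUP BY 1
ORDER BY occurrences DESC
LIMIT 20;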
Manually Adding Partitions in Athena
When working with large datasets, MSCK REPAIR TABLE can be time-consuming. Instead, add individual partitions manually:
ALTER TABLE bad_rows ADD PARTITION (run='2025-05-19-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-19-12-00-00/';
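If you need several runs at once, ALTER TABLE ... ADD accepts multiple PARTITION clauses in a single statement (the run values below are illustrative):
ALTER TABLE bad_rows ADD
PARTITION (run='2025-05-19-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-19-12-00-00/'
PARTITION (run='2025-05-20-12-00-00') LOCATION 's3://your-bucket/path/to/enriched/bad/run=2025-05-20-12-00-00/';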
Query the partitioned data directly:
SELECT
COUNT(*),
errors
FROM bad_rows
WHERE run = '2025-05-19-12-00-00'
GROUP BY errors
ORDER BY COUNT(*) DESC;
Beyond Athena: Visualizing and Recovering Bad Rows
Athena is only the start. Here are a few ways to build on it:
Visualization Tools
- Amazon QuickSight or Redash: Both support Athena and can be used to visualize trends and error categories over time.
- Snowplow Console (Managed Service): Offers built-in monitoring for bad rows and schema violations (depending on plan).
Recovery Options
- Hadoop Event Recovery: Useful for reprocessing events with known fixes (e.g., missing or incorrect schema references).
- Custom JS recovery functions: Many teams develop and reuse simple JS functions to correct common error patterns before re-ingesting.
Final Thoughts
Snowplow’s commitment to data transparency through preserved bad rows empowers teams to build more trustworthy data pipelines. Athena makes it easy to explore and triage these bad rows, reducing time-to-insight when something goes wrong.
By combining Athena queries, robust filtering, and Snowplow's structured error messages, your data team can maintain a reliable, high-integrity event stream — even as tracking evolves and scales.