Snowplow Python Analytics SDK 0.1.0 released
Following in the footsteps of the Snowplow Scala Analytics SDK, we are happy to announce the release of the Snowplow Python Analytics SDK! This library makes your Snowplow enriched events easier to work with in Python-compatible data processing frameworks such as Apache Spark and AWS Lambda.
Some good use cases for the SDK include:
- Performing event data modeling in PySpark as part of our Hadoop batch pipeline
- Developing machine learning models on your event data using PySpark (e.g. using Databricks)
- Performing analytics-on-write in AWS Lambda as part of our Kinesis real-time pipeline
Snowplow’s ETL process outputs enriched events as TSV (tab-separated values). Each event currently has 131 fields, which can make the format difficult to work with directly. The Snowplow Python Analytics SDK currently supports one transformation: turning this TSV into a more tractable JSON.
The transformation algorithm used to do this is the same as the one used in the Kinesis Elasticsearch Sink and the Snowplow Scala Analytics SDK, with one exception: when a field of the input TSV is empty, we leave that field out of the output JSON entirely rather than using a field with the value `null`. Here is an example output JSON:
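To illustrate just the empty-field rule, here is a minimal sketch in plain Python. It is not the SDK's implementation: the function `tsv_to_json` and the three-column `FIELD_NAMES` list are hypothetical, chosen only to keep the example short (a real enriched event has 131 fields, many of them typed).

```python
import json

# Hypothetical three-column subset; a real enriched event has 131 fields.
FIELD_NAMES = ["app_id", "platform", "user_id"]


def tsv_to_json(line, field_names=FIELD_NAMES):
    """Turn one TSV line into a JSON string, omitting empty fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(field_names):
        raise ValueError(
            "expected %d fields, got %d" % (len(field_names), len(values))
        )
    # Empty fields are dropped entirely instead of being emitted as null.
    event = {
        name: value
        for name, value in zip(field_names, values)
        if value != ""
    }
    return json.dumps(event, sort_keys=True)


# The empty "platform" column does not appear in the output at all.
print(tsv_to_json("shop\t\tuser-42"))
# prints {"app_id": "shop", "user_id": "user-42"}
```

Omitting the key rather than writing `null` keeps the output compact and lets downstream code test for presence with a simple `in` check.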
There are special rules for how custom contexts and unstructured events are added to the JSON. For example, if an enriched event contained a `com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1` unstructured event, then the final JSON would contain:
For more examples and detail on the algorithm used, check out the Kinesis Elasticsearch Sink wiki page.
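As a rough sketch of how a schema path like the one above could map to a JSON key, the snippet below assumes the key is built from a prefix, the schema vendor and name in snake case, and the schema's major (model) version. That naming convention is an assumption made for illustration, and the helper `unstruct_event_key` is hypothetical; the wiki page remains the authority on the actual rules.

```python
import re


def unstruct_event_key(schema_path, prefix="unstruct_event"):
    """Derive a flat JSON key from a vendor/name/format/version schema path.

    Assumed convention (not confirmed by this post): prefix + vendor with
    dots replaced by underscores + snake_case name + major version.
    """
    # Tolerate an optional "iglu:" scheme in front of the path.
    if schema_path.startswith("iglu:"):
        schema_path = schema_path[len("iglu:"):]
    vendor, name, _format, version = schema_path.split("/")
    model = version.split("-")[0]  # "1-0-1" -> "1"
    snake_vendor = vendor.replace(".", "_").lower()
    # Insert an underscore at lower-to-upper boundaries, then lowercase.
    snake_name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()
    return "_".join([prefix, snake_vendor, snake_name, model])


print(unstruct_event_key(
    "com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1"
))
# prints unstruct_event_com_snowplowanalytics_snowplow_link_click_1
```

Under this assumed convention only the major version reaches the key, so events validated against `1-0-0` and `1-0-1` of the same schema would share one field.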
The SDK is available on PyPI:
Use the SDK like this:
If there are any problems in the input TSV (such as unparseable JSON or numeric fields), the `transform` method will raise a `SnowplowEventTransformationException`. This exception contains a list of error messages, one for every problematic field in the input.
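The error-reporting pattern described above (collect one message per bad field, then fail once) can be sketched without the SDK installed. Everything in this block is a hypothetical stand-in: `ToyTransformationError` plays the role of `SnowplowEventTransformationException`, `toy_transform` the role of `transform`, and the three-field `CONVERTERS` table is invented for the example.

```python
import json


class ToyTransformationError(Exception):
    """Hypothetical stand-in for SnowplowEventTransformationException."""

    def __init__(self, error_messages):
        super().__init__("; ".join(error_messages))
        self.error_messages = error_messages


# Hypothetical schema: field name -> converter raising ValueError on bad input.
CONVERTERS = {
    "app_id": str,
    "txn_id": int,
    "contexts": json.loads,
}


def toy_transform(line):
    """Convert one TSV line to a dict, reporting every bad field at once."""
    names = list(CONVERTERS)
    values = line.split("\t")
    if len(values) != len(names):
        raise ToyTransformationError(
            ["expected %d fields, got %d" % (len(names), len(values))]
        )
    output, errors = {}, []
    for name, value in zip(names, values):
        if value == "":
            continue  # empty fields are simply left out
        try:
            output[name] = CONVERTERS[name](value)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            errors.append("field %s: %s" % (name, exc))
    if errors:
        # Raise once, carrying a message for each problematic field.
        raise ToyTransformationError(errors)
    return output


try:
    toy_transform("shop\tnot-a-number\t{broken")
except ToyTransformationError as e:
    for message in e.error_messages:
        print(message)
```

Accumulating errors before raising means a caller sees every problem with an event in one pass, rather than fixing fields one failure at a time.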
For more information, check out the Python Analytics SDK wiki page.
4. Getting help
If there’s another Snowplow Analytics SDK you’d like us to prioritize creating, please let us know on the forums!