Snowplow Python Tracker 0.4.0 released
We are happy to announce the release of the Snowplow Python Tracker version 0.4.0.
This version introduces the Subject class, which lets you keep track of multiple users at once, and several Emitter classes, which let you send events asynchronously, pass them to a Celery worker, or even send them to a Redis database. We have added support for sending batches of events in POST requests, although the Snowplow collectors do not yet support POST requests.
We have also made changes to the format of unstructured events and custom contexts, to support our new work around self-describing JSON Schemas.
In the rest of the post we will cover:
- The Subject class
- The Emitter classes
- Tracker method return values
- Logging
- Pycontracts
- The RedisWorker class
- Self-describing JSONs
- Upgrading
- Support
1. The Subject class
An instance of the Subject class represents a user who is performing an event in the Subject–Verb–Direct Object model proposed in our Snowplow event grammar. Although you can create a Tracker instance without a Subject, you won’t be able to add information such as user ID and timezone to your events without one.
If you are tracking more than one user at once, create a separate Subject instance for each. An example:
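A minimal sketch (the collector endpoint and user IDs are placeholders; `set_user_id` and `set_timezone` are the Subject setters for the fields mentioned above):

```python
from snowplow_tracker import Subject, Emitter, Tracker

e = Emitter("collector.example.com")   # placeholder collector endpoint
t = Tracker(e)

# One Subject per user being tracked
user_1 = Subject()
user_1.set_user_id("user-1")
user_1.set_timezone("Europe/London")

user_2 = Subject()
user_2.set_user_id("user-2")

t.set_subject(user_1)
t.track_page_view("http://www.example.com")   # recorded against user-1
t.set_subject(user_2)
t.track_page_view("http://www.example.com")   # recorded against user-2
```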
It is also possible to set the subject during Tracker initialization:
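Again a sketch, assuming the constructor accepts a `subject` keyword argument:

```python
from snowplow_tracker import Subject, Emitter, Tracker

s = Subject()
s.set_user_id("user-1")

e = Emitter("collector.example.com")   # placeholder collector endpoint
t = Tracker(e, subject=s)              # subject attached at construction time
```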
2. The Emitter classes
Trackers must be initialized with an Emitter.
This is the signature of the constructor for the base Emitter class:
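In outline (the defaults shown are indicative, based on the parameter descriptions below):

```python
def __init__(self, endpoint,
             protocol="http", port=None, method="get",
             buffer_size=None, on_success=None, on_failure=None):
    ...
```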
The only field which must be set is `endpoint`, which is the collector to which the emitter logs events. `port` is the port to connect to, `protocol` is either `"http"` or `"https"`, and `method` is either `"get"` or `"post"`.
When the emitter receives an event, it adds it to a buffer. When the buffer is full, all of the events in it are sent to the collector. The `buffer_size` argument allows you to customize the buffer's capacity. By default, it is 1 for GET requests and 10 for POST requests. If the emitter is configured to send POST requests, then instead of sending one for every event in the buffer, it will send a single request containing all of those events in JSON format.
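For instance, to batch events into POST requests of 50 (placeholder endpoint):

```python
from snowplow_tracker import Emitter

# Accumulate events and send one POST request per 50 events
e = Emitter("collector.example.com", method="post", buffer_size=50)
```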
`on_success` is an optional callback that will execute whenever the buffer is flushed successfully, that is, whenever every request sent has status code 200. It will be passed one argument: the number of events that were sent. `on_failure` is similar, but executes when the flush is not wholly successful. It will be passed two arguments: the number of events that were successfully sent, and an array of unsent requests.
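For example (a sketch; the endpoint is a placeholder and the handler names are arbitrary):

```python
from snowplow_tracker import Emitter

def success_handler(sent_count):
    print("Successfully sent %d events" % sent_count)

def failure_handler(sent_count, failed_events):
    # failed_events is an array of the requests that were not sent
    print("Sent %d events before a failure" % sent_count)

e = Emitter("collector.example.com",   # placeholder collector endpoint
            on_success=success_handler,
            on_failure=failure_handler)
```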
AsyncEmitter
The AsyncEmitter class works just like the base Emitter class, but uses threads, allowing it to send HTTP requests in a non-blocking way.
CeleryEmitter
The CeleryEmitter class works just like the base Emitter class, but it registers sending requests as a task for a Celery worker. If there is a module named `snowplow_celery_config.py` on your `PYTHONPATH`, it will be used as the Celery configuration file; otherwise, a default configuration will be used. You can run the worker using this command:
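(The task module path here, `snowplow_tracker.tasks`, is an assumption; substitute the module where the tracker registers its Celery task.)

```bash
celery -A snowplow_tracker.tasks worker --loglevel=info
```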
Note that `on_success` and `on_failure` callbacks cannot be supplied to this emitter.
RedisEmitter
Use a RedisEmitter instance to store events in a Redis database for later use. This is the RedisEmitter constructor function:
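In outline (based on the parameter descriptions below):

```python
def __init__(self, rdb=None, key="snowplow"):
    ...
```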
`rdb` should be an instance of either the `Redis` or `StrictRedis` class, found in the `redis` module. If it is not supplied, a default will be used. `key` is the key used to store events in the database. It defaults to `"snowplow"`. The format for event storage is a Redis list of JSON strings.
Flushing
You can flush the buffer of an emitter associated with a tracker instance `t` like this:
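```python
t.flush()   # flush the buffer of the tracker's emitter
```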
This synchronously sends all events in the emitter’s buffer.
Custom emitters
You can create your own custom emitter class, either from scratch or by subclassing one of the existing classes. The only requirement for compatibility is that it must have an `input` method which accepts a Python dictionary of name-value pairs.
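A toy sketch of a from-scratch emitter (the class name is arbitrary):

```python
from snowplow_tracker import Tracker

class PrintEmitter(object):
    """A custom emitter that prints each event rather than sending it."""

    def input(self, payload):
        # payload is a dictionary of name-value pairs describing one event
        print(payload)

t = Tracker(PrintEmitter())
```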
3. Tracker method return values
If you are using the synchronous Emitter and call a tracker method which causes the emitter to send a request, that tracker method will return the status code for the request:
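Something like this (placeholder endpoint and URL):

```python
from snowplow_tracker import Emitter, Tracker

e = Emitter("collector.example.com")   # synchronous emitter with default buffer_size of 1
t = Tracker(e)

status = t.track_page_view("http://www.example.com")
print(status)   # e.g. 200 if the request succeeded
```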
This is useful for initial testing.
Otherwise, the tracker method will return the tracker instance, allowing tracker methods to be chained:
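For example, using the tracker `t` from above:

```python
t.track_page_view("http://www.example.com/page1") \
 .track_page_view("http://www.example.com/page2")
```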
The `set_subject` method will always return the Tracker instance.
4. Logging
The `emitters.py` module has Python logging turned on. The logger prints messages about what emitters are doing. By default, only messages with priority "INFO" or higher will be logged.
To change this:
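A sketch using the standard `logging` module (the logger name is our assumption, based on the module path):

```python
import logging

# Lower the threshold so DEBUG messages from the emitters module are logged too
logging.getLogger("snowplow_tracker.emitters").setLevel(logging.DEBUG)
```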
5. Pycontracts
The Snowplow Python Tracker uses the Pycontracts module for runtime type checking. The option to turn type checking off has been moved out of the Tracker constructor:
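It is now a module-level switch (a sketch, assuming the `disable_contracts` helper):

```python
from snowplow_tracker import disable_contracts

disable_contracts()   # turn off Pycontracts type checking
```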
Switch off Pycontracts to improve performance in production.
6. The RedisWorker class
The tracker comes with a RedisWorker class which sends Snowplow events from Redis to an emitter. The RedisWorker constructor is similar to the RedisEmitter constructor:
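In outline (assuming it takes the emitter to forward events to, plus the same `rdb` and `key` parameters as RedisEmitter):

```python
def __init__(self, emitter, rdb=None, key="snowplow"):
    ...
```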
This is how it is used:
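A sketch; the collector endpoint is a placeholder, and `run` is our assumed name for the method that starts the loop:

```python
from snowplow_tracker import AsyncEmitter, RedisWorker

e = AsyncEmitter("collector.example.com")     # placeholder collector endpoint
w = RedisWorker(e, key="snowplow_redis_key")

w.run()   # consume events from Redis until interrupted
```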
This will set up a worker which will run indefinitely, taking events from the Redis list with key “snowplow_redis_key” and inputting them to an AsyncEmitter, which will send them to a Collector. If the process receives a SIGINT signal (for example, due to a Ctrl-C keyboard interrupt), cleanup will occur before exiting to ensure no events are lost.
7. Self-describing JSONs
Snowplow unstructured events and custom contexts are now defined using JSON schema, and should be passed to the Tracker using self-describing JSONs. Here is an example of the new format for unstructured events:
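Something along these lines, assuming the `track_unstruct_event` method (the schema URI and properties here are hypothetical):

```python
t.track_unstruct_event({
    "schema": "iglu:com.example_company/viewed_product/jsonschema/1-0-0",  # hypothetical schema URI
    "data": {
        "product_id": "ASO01043",   # hypothetical flat event properties
        "price": 49.95
    }
})
```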
The `data` field contains the actual properties of the event and the `schema` field points to the JSON schema against which the contents of the `data` field should be validated. The `data` field should be flat, rather than nested.
Custom contexts work similarly. Since an event can have multiple contexts attached, the `context` argument of each tracking method must (if provided) be a non-empty array:
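A sketch; the schema URIs and fields are hypothetical:

```python
t.track_page_view("http://www.example.com", context=[
    {
        "schema": "iglu:com.example_company/page/jsonschema/1-0-0",   # hypothetical page context
        "data": {
            "page_type": "test"
        }
    },
    {
        "schema": "iglu:com.example_company/user/jsonschema/1-0-0",   # hypothetical user context
        "data": {
            "user_type": "tester"
        }
    }
])
```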
The above example shows a page view event with two custom contexts attached: one describing the page and another describing the user.
As part of this change we have also removed type hint suffixes from unstructured events and custom contexts. Now that JSON schemas are responsible for type checking, there is no need to include types as part of field names.
8. Upgrading
The release version of this tracker (0.4.0) is available on PyPI, the Python Package Index repository, as snowplow-tracker. Download and install it with pip:
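```bash
pip install snowplow-tracker
```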
Or with setuptools:
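```bash
easy_install snowplow-tracker
```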
For more information on getting started with the Snowplow Python Tracker, see the setup page.
9. Support
Please get in touch if you need help setting up the Snowplow Python Tracker or want to suggest a new feature. The Snowplow Python Tracker is still young, so of course do raise an issue if you find any bugs.
For more details on this release, please check out the 0.4.0 Release Notes on GitHub.