Modelling page view events as a graph

In the previous post in this series we started exploring options for modelling event data as a graph in general. We looked at three ways of modelling atomic event data:
- The event grammar approach.
- The event graph approach.
- The denormalised graph approach.
Ultimately we chose to model events as a denormalised graph, where the same things are represented multiple times in different ways (eg as both a node and a relationship). That adds redundancy to the model but makes querying it easier.
In our chosen model, we also made use of reification, or the turning of relationships into things. In our event grammar, events naturally exist in the relationships between entities. However, in the graph model, each event will be its own node with outgoing HAS
relationships to its properties, for instance (event)-[:HAS]->(user)
. Events will be connected to each other by a NEXT
relationship. There will also be relationships between the various properties of the event, such as (user)-[:VIEWS]->(page)
.
In this post, we will explore how we could model a page_view
event with all its expected properties for the Snowplow pipeline. We’ll discuss creating the necessary schemas and implementing them in code.
In this post we’ll cover:
- Self-describing events
- Custom contexts
- Composable schemas
- Identifying entities for the
page_view
event - Drawing the relationships
- Composing the
page_view
schema - Implementing the contract in code
Self-describing events
In Neo4j and other graph databases, nodes, relationships and patterns are schemaless. This means the database does not distinguish between different types of nodes, relationships or patterns. It’s up to developers to enforce any types.
At Snowplow Analytics, our preferred way of establishing the contract between the app sending the events and any process that consumes them (for validation, enrichment, loading, modelling, etc) is through schemas. However, typically, a Snowplow-authored self-describing JSON Schema will not describe an entire event. A couple of examples illustrate this point.
Here is the JSON schema for a submit_form
event from Iglu Central:
{ "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for a form submission event", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "submit_form", "format": "jsonschema", "version": "1-0-0" }, "type": "object", "properties": { "formId": { "type": "string" }, "formClasses": { "type": "array", "items": { "type": "string" } }, "elements": { "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string" }, "value": { "type": ["string", "null"] }, "nodeName": { "enum": ["INPUT", "TEXTAREA", "SELECT"] }, "type": { "enum": ["button", "checkbox", "color", "date", "datetime", "datetime-local", "email", "file", "hidden", "image", "month", "number", "password", "radio", "range", "reset", "search", "submit", "tel", "text", "time", "url", "week"] } }, "required": ["name", "value", "nodeName"], "additionalProperties": false } } }, "required": ["formId"], "additionalProperties": false }
This schema only covers some of the entities and relationships that constitute the submit_form
event. The data that matches this schema will end up as just one of many fields in the enriched event.
Custom contexts
It’s a similar story with any of our custom entity context schemas. In fact, all contexts sent with the event only populate a single contexts
field in the enriched event; and the event transformer then parses out each context into its own column or table (depending on where the data will be stored).
{ "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for custom contexts", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "contexts", "format": "jsonschema", "version": "1-0-1" }, "type": "array", "items": { "type": "object", "properties": { "schema": { "type": "string", "pattern": "^iglu:[a-zA-Z0-9-_.]+/[a-zA-Z0-9-_]+/[a-zA-Z0-9-_]+/[0-9]+-[0-9]+-[0-9]+$" }, "data": {} }, "required": ["schema", "data"], "additionalProperties": false } }
And for some events, such as page_view
, there isn’t a specific JSON schema. The Elasticsearch enriched event schema probably comes closest, being a projection of the canonical event model (of which the atomic.events
table in Redshift is another projection). But that schema also applies to the other events in the canonical model.
Composable schemas
For this project we’d like to explore a new approach to building schemas, based on individual schemas for each of the event’s components, be they nodes or relationships. The schema for the event itself will then reference those building blocks through JSON pointers. This will allow us to reuse entities and relationships across all events they are part of. It will also make maintaining the schemas much easier. For example, if an entity such as a user is part of many events and we want to add a new property to the User
node, we’ll only have to make one change: in the user
schema. (Currently our self-describing JSON schemas do not support JSON pointers, so this functionality will have to be added if we decide to go with this approach.)
To illustrate how this could work, let’s look at a simplified page_view
definition, consisting of two nodes – User
and Page
– connected by a VISITED
relationship.
We start by defining the schemas for each of the two nodes and the relationship:
{ "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0
-0#", "description": "Schema for a User node", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "user", "format": "jsonschema", "version": "1-0-0" }, "type": "object", "properties": { "email": { "type": "string" } }, "additionalProperties": false }, { "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for a Page node", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "page", "format": "jsonschema", "version": "1-0-0" }, "type": "object", "properties": { "pageUrl": { "type": "string" } }, "additionalProperties": false }, { "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for a VISITED relationship", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "visited", "format": "jsonschema", "version": "1-0-0" }, "type": "object", "properties": { "derivedTstamp": { "type": "string", "format": "date-time" }, "startNode": { "$ref": "iglu:com.snowplowanalytics.graph/user/jsonschema/1-0-0" }, "endNode": { "$ref": "iglu:com.snowplowanalytics.graph/page/jsonschema/1-0-0" } }, "additionalProperties": false }
We can then use these schemas to compose a schema for the page_view
event:
{ "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#", "description": "Schema for a page view event", "self": { "vendor": "com.snowplowanalytics.snowplow", "name": "page_view", "format": "jsonschema", "version": "1-0-0" }, "type": "object", "properties": { "nodes": { "type": "object", "properties": { "user": { "$ref": "iglu:com.snowplowanalytics.graph/user/jsonschema/1-0-0" }, "page": { "$ref": "iglu:com.snowplowanalytics.graph/page/jsonschema/1-0-0" } } }, "relationships": { "type": "object", "properties": { "visited": { "$ref": "iglu:com.snowplowanalytics.graph/visited/jsonschema/1-0-0" } } } }, "additionalProperties": false }
Identifying entities for the page_view
event
Let’s figure out which atomic
properties should go into the page_view
event. The atomic.events
table is pretty wide but all fields are not required for a page_view
. For example, all fields that start with the tr_
prefix only apply to transaction
events.
After removing all the irrelevant fields, we end up with this list of properties. I’ve broken them down into rough categories for ease of reading:
Application
- app_id
- platform
Browser
- br_colordepth
- br_cookies
- br_family
- br_features_director
- br_features_flash
- br_features_gears
- br_features_java
- br_features_pdf
- br_features_quicktime
- br_features_realplayer
- br_features_silverlight
- br_features_windowsmedia
- br_lang
- br_name
- br_renderengine
- br_type
- br_version
- br_viewheight
- br_viewwidth
- user_fingerprint
Timestamps
- collector_tstamp
- derived_tstamp
- dvce_created_tstamp
- dvce_sent_tstamp
- etl_tstamp
- true_tstamp
Document
- doc_charset
- doc_height
- doc_width
Session
- domain_sessionid
- domain_sessionidx
Device
- domain_userid
- dvce_ismobile
- dvce_screenheight
- dvce_screenwidth
- dvce_type
- network_userid
- refr_domain_userid
- useragent
- refr_dvce_tstamp
OS
- os_family
- os_manufacturer
- os_name
- os_timezone
Infrastructure metadata
- etl_tags
- name_tracker
- v_collector
- v_etl
- v_tracker
Event metadata
- event_fingerprint
- event_format
- event_id
- event
- event_name
- event_vendor
- event_version
Geo
- geo_city
- geo_country
- geo_latitude
- geo_longitude
- geo_region
- geo_region_name
- geo_timezone
- geo_zipcode
Network
- user_ipaddress
- ip_domain
- ip_isp
- ip_netspeed
- ip_organization
Traffic source
Campaign attribution
- mkt_campaign
- mkt_clickid
- mkt_content
- mkt_medium
- mkt_network
- mkt_source
- mkt_term
Referrer
- page_referrer
- refr_medium
- refr_source
- refr_term
- refr_urlfragment
- refr_urlhost
- refr_urlpath
- refr_urlport
- refr_urlquery
- refr_urlscheme
Page
- page_title
- page_url
- page_urlfragment
- page_urlhost
- page_urlpath
- page_urlport
- page_urlquery
- page_urlscheme
User
- user_id
That’s a lot of properties! To make the graph a bit simpler and more manageable, it’s best to model most of them as properties of the nodes and relationships that constitute the page_view
event, rather than as nodes in their own right. To help us decide which fields should be nodes and which ones should be node properties, we’ll fall back on another idea from our event grammar.
In the real world as in software, entities change over time. If we can view our entities as consisting of properties, we can divide these properties into three approximate buckets:
- Permanent or static properties: properties that don’t change over the lifetime of the entity. For example, my first language;
- Infrequently changing properties: properties that change infrequently, for instance my email address;
- Frequently changing properties: properties that frequently change. For example, my geographical location.
The exact taxonomy is not set in stone. For example, a person’s height will start as a frequently changing property, become a static property in adulthood, and then switch to an infrequently changing property as she grows older. Nor do these properties necessarily change in particularly directional or meaningful ways over time: my geographical location has patterns in it (such as my daily commute), but there’s no overarching trend across all the data.
Which bucket a property falls into also depends on which we node we choose to assign it to. For example, the geographical location is a frequently changing property if it’s assigned to me, the user; but it’s a static or infrequently changing property if it’s assigned to the network that I’m using.
Let’s try to model frequently changing properties as nodes; and static / stable properties as node properties. Also, we can use the categories outlined above as a broad indication of what should be a node, eg we’ll have an Application
node with all the associated properties, etc:
Node | Categories |
---|---|
Application | Application |
Browser | Browser |
Device | Device, OS |
Session | Session |
Page | Page, Document |
User | User |
Network | Network, Geo |
Source | Traffic source |
Event | Event metadata, Infrastructure metadata, Timestamps |
We’ll model the different timestamps as properties of the Event
node. We can also model them as properties on the relationships within the event.
You can find all of our proposed node schemas in this GitHub repo.
Drawing the relationships
The next step is to figure out all the relationships between our nodes. We already know what two of them will be:
(Event)-[:NEXT]->(Event)
(Event)-[:HAS]->(Non-event node)
.
We can consider all possible combinations of two nodes from the list above and see if any “natural” relationships exist between them. In that way, we can expand the list of relationships. To keep things simple for now, let’s assume all constituent nodes are linked by a [:WITH]
relationship, eg (Application)<-[:WITH]-(Browser)
, (Browser)<-[:WITH]-(Session)
. Admittedly, this is not very insightful for the casual observer. We will revisit relationships in a future post to expand on that. For now, we’re just using the [:WITH]
relationship to denormalise the graph, so we can easily run queries on it.
The proposed schemas for these relationships are in this repo.
Composing the page_view
schema
We now have all the building blocks to compose a schema for the page_view
event. You can find the schema here.
If we use this model to represent a series of page_view
events, we end up with a graph that looks something like this (most nodes are omitted, for clarity):
Event nodes (in grey) are connected via NEXT
relationships. Each one of them has outgoing HAS
relationships to constituent nodes, such as Page
(magenta) and User
(blue). In turn, the constituent nodes are linked via WITH
relationships.
Now we can easily query all the pages visited by a user, without having to go via the event node thanks to the [:WITH]
relationships:
MATCH (u:User)-[:WITH]-(p:Page) WHERE u.user_id = 'someone@mail.com' RETURN p.page_url;
And we can also easily query the path a user has taken through the website:
MATCH p1 = (u:User)<-[:HAS]-(e:Event) WHERE u.user_id = 'someone@mail.com' WITH p1 MATCH path = (e)-[:NEXT*1..5]-() RETURN path;
Implementing the contract in code
Now that we’ve schemaed everything, we can think about how we can implement this in actual code.
A great thing about schemas is that the code falls naturally out of them. For example, here’s how we might represent the Application
node in Scala, along with a method that will create the node in the database in pseudocode:
class Node: def create(**options) class Application(Node): def create(app_id, platform): statement = "MERGE (application:Application {app_id: $app_id, platform: $platform})" neo4j.execute(statement, {'platform': platform, 'app_id': app_id})
Next up in the series
We take a break from trying to model events as a graph and explore another use case for graphs in event analytics – identity resolution.