Building a “Propensity to Convert” machine learning model with Snowpark and Snowplow web tracking – Part one
When a user first visits a website or application, you can begin to track their behavior. This behavior, however limited during a first visit, can then be used to predict their future actions.
One possible future action is a ‘conversion’, which could be defined as a purchase, sign-up, or subscription (among many other things). Understanding how likely a user is to ‘convert’ is invaluable to digital companies because it allows marketers to:
- Assess the effectiveness of different digital ad campaigns, especially where the acquired users have not yet converted but their data already suggests that many are likely to.
- Identify an audience for retargeting; i.e. focus ad spend on those users who are likely to convert (and save money by not reacquiring users who are not likely to convert).
- Understand which parts of a user’s initial experience predict conversion, and which parts of the user journey are essential.
- Identify parts of the user journey that have not received the attention they need. For example, time spent browsing documentation might be highly predictive of conversion, yet the marketing team has neglected to invest in documentation because its role was not recognized. The same analysis can also highlight experiences that make a user less likely to convert, which need to be dealt with urgently.
First-touch attribution
The first-touch approach to marketing attribution assigns all credit for a user’s conversion to the first advertisement on the user journey. Last-touch attribution, on the other hand, assigns all credit to the last advertisement, whereas multi-touch attribution takes into account all interactions along the user journey.
First-touch attribution is less common than last- or multi-touch attribution [1], [2], and is generally used to measure the effectiveness of brand building campaigns or to assign credit in short sales cycles.
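To make the difference concrete, here is a minimal sketch (an illustration only; the touchpoint names are made up, and simple linear weighting stands in for multi-touch) of how each scheme distributes credit over the same journey:
# A toy journey of ad touchpoints in chronological order (hypothetical names).
touchpoints = ["display_ad", "search_ad", "retargeting_ad"]

def first_touch(tps):
    # All credit goes to the first touchpoint.
    return {tp: 1.0 if i == 0 else 0.0 for i, tp in enumerate(tps)}

def last_touch(tps):
    # All credit goes to the last touchpoint.
    return {tp: 1.0 if i == len(tps) - 1 else 0.0 for i, tp in enumerate(tps)}

def linear_multi_touch(tps):
    # Credit is split evenly across every touchpoint on the journey.
    return {tp: 1.0 / len(tps) for tp in tps}

print(first_touch(touchpoints))         # {'display_ad': 1.0, 'search_ad': 0.0, 'retargeting_ad': 0.0}
print(last_touch(touchpoints))          # {'display_ad': 0.0, 'search_ad': 0.0, 'retargeting_ad': 1.0}
print(linear_multi_touch(touchpoints))  # each touchpoint gets 1/3 of the credit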
The following guide explores how measuring behavioral data during a user’s initial contact with the website affects propensity model accuracy. The impact of these metrics is less researched, since most web trackers do not collect them; Google Ads campaign data [3], for example, does not include them.
Snowplow for attribution
Snowplow web tracking includes engagement metrics out of the box:
- Scroll depth
- Engaged time (with page)
- Time tab open in browser
Other standard features of the out-of-the-box trackers include referral source, temporal markers, marketing campaign ID, geo-data, and device details.
Modeling for conversion propensity
Marketing teams need to identify the reasons behind predicted conversions, which means that model interpretability is incredibly important. Because of this, we have to put limits on feature preprocessing methods and classifiers; algorithms that aggregate features during preprocessing (such as PCA), for example, render feature importance unusable.
The chosen baseline is logistic regression [4], the most researched and most widely used model in marketing attribution. Gradient boosting trees with LightGBM [5] often perform better than neural networks or linear models on highly categorical data sets, and they remain explainable through decision tree visualization and SHAP’s TreeExplainer [6] (once the number of features has been cut down).
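As a rough sketch of what these two model families could look like in code (assuming scikit-learn and LightGBM’s scikit-learn API; the feature lists here are placeholders, not the final ones used below):
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder feature lists; the real ones are defined later in the article.
categorical_cols = ["REFR_MEDIUM", "MKT_MEDIUM", "DEVICE_FAMILY"]
numeric_cols = ["ENGAGED_TIME_IN_S", "ABSOLUTE_TIME_IN_S", "VERTICAL_PERCENTAGE_SCROLLED"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

# Interpretable baseline: logistic regression over one-hot encoded features.
logreg_model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Gradient boosting alternative; still explainable via tree plots and SHAP's TreeExplainer.
lgbm_model = Pipeline([("prep", preprocess), ("clf", LGBMClassifier())])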
Dataset
The dataset in the example below has been taken from snowplow.io, which receives hundreds of thousands of visitors annually. Less than 0.5% of total visitors convert into prospects via the demo request form.
Users are tracked via cookies set when they first visit the website. Of course, when a user clears their cookies, does not accept the cookie banner, or logs in from a different device, this information is lost, resulting in multiple rows for the same user. Snowplow’s dbt web users model aggregates by these cookies [7].
Model features could be grouped into categories based on their origin:
- Temporal – created from the first event timestamp: hour of the day, day of the week.
- Landing Page – page title of the first URL, comes out of the box
- Device – User Agent enrichment [8]
- Referral – Referral enrichment[9]
- Marketing – Marketing campaign enrichment[10]
- Geographic – IP lookup enrichment[11]
- Robot – IAB enrichment[12]
- Engagement – Accumulated page ping events by dbt page view model [13].
Conversion events are taken from Salesforce, which uses different tracking methods. In practice, however, Snowplow users could send a custom conversion event to avoid joining another data source; read our documentation about setting this up [14].
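For example, a conversion event could be sent server-side with the Snowplow Python tracker along these lines (a rough sketch assuming the pre-1.0 snowplow-tracker package; the collector endpoint and Iglu schema are hypothetical):
from snowplow_tracker import Emitter, Tracker, SelfDescribingJson

emitter = Emitter("collector.example.com")             # hypothetical collector endpoint
tracker = Tracker(emitter, app_id="snowplow-website")  # pre-1.0 Tracker signature

# Track the demo request as a self-describing conversion event (hypothetical schema).
tracker.track_self_describing_event(SelfDescribingJson(
    "iglu:com.example/conversion/jsonschema/1-0-0",
    {"form": "demo_request"}
))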
Snowplow has been able to deliver high-quality behavioral data to Snowflake since earlier this year (2022). For this research, earlier data kept in Redshift was partially migrated into Snowflake.
The most recent data should not be used when evaluating performance, as some of those users have not converted yet. (The purchase cycle for Snowplow BDP can be long.)
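One simple way to enforce this is to hold back visits from a recent maturation window before training; a minimal sketch, assuming the final pandas DataFrame df indexed by START_TSTAMP that is built later in this post, and an arbitrary 90-day window:
import pandas as pd

# Drop users whose first visit falls inside the last 90 days of the data,
# since they may still convert after the extract was taken.
maturation_window = pd.Timedelta(days=90)
cutoff = df.index.max() - maturation_window
df_mature = df[df.index < cutoff]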
Data preprocessing with Snowpark
Snowflake has released a limited-access version of the Snowpark Python library, with the full version to be released soon. This library allows you to work with Snowflake data as lazily evaluated DataFrame classes with an interface similar to PySpark DataFrames; it has complete SQL functionality, with some additional features on top. It is very easy to learn and will feel familiar to SQL or pandas users (I was able to pick it up in a week).
A few essential imports and Jupyter notebook style settings are required initially.
import datetime
import imblearn.pipeline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.core.display import HTML
from IPython.core.display_functions import display
display(HTML("<style>pre { white-space: pre !important; }</style>"))
pd.options.display.max_columns = None
pd.options.display.max_rows = None
First, a session is established with the Snowflake instance.
from snowflake.snowpark import Session
with open("pavel-snowflake-dev.properties") as file:
    props = dict([l.rstrip().split(" = ", 1) for l in file.readlines() if " = " in l])
session = Session.builder.configs(props).create()
session.sql("ALTER SESSION SET TIMEZONE = 'UTC';");
Template of pavel-snowflake-dev.properties file:
account = XXXXXX
region = eu-west-1
user = USER
password = ********
database = DBNAME
schema = SCHEMA_NAME
warehouse = WHNAME
role = ROLE
Data comes from multiple sources. Historical data was imported into Snowflake from our old Redshift instance; to reduce the volume of transferred data, the users and page_views had already been joined in the historic tables.
The users dbt model was slightly customized to record each user’s first page view: an extra FIRST_PAGE_VIEW_ID field alongside the already existing FIRST_PAGE_TITLE. This makes it easier to join users with page_views.
The CONVERTED_DATE field was added by joining the custom conversion event with the Salesforce data.
So our four data sources are defined (see the schemas and description of each field in the snowplow dbt web model docs):
users_df = session.table("ANALYTICS_DEV_DB.DBT_EMIEL_DERIVED.SNOWPLOW_WEB_USERS")
page_view_df = session.table("ANALYTICS_DEV_DB.DBT_EMIEL_DERIVED.SNOWPLOW_WEB_PAGE_VIEWS")
rs2021_df = session.table("ANALYTICS_DEV_DB.DBT_EMIEL_DERIVED.REDSHIFT_FEATURES_2021")
rs2020_df = session.table("ANALYTICS_DEV_DB.DBT_EMIEL_DERIVED.REDSHIFT_FEATURES_2020")
Next, all data sources are joined together. Note how Snowpark DataFrame calls can be chained into code that closely resembles a SQL statement. The result is the list of primary features before preprocessing.
from snowflake.snowpark.functions import coalesce, timestamp_ntz_from_parts, dayofmonth, year, month, minute, second, hour
combined_df = users_df\
.join(page_view_df, page_view_df.PAGE_VIEW_ID == users_df.FIRST_PAGE_VIEW_ID)\
.join(rs2021_df, rs2021_df.DOMAIN_USERID == users_df.DOMAIN_USERID, join_type='full')\
.join(rs2020_df, rs2020_df.DOMAIN_USERID == users_df.DOMAIN_USERID, join_type='full').select([
coalesce(rs2021_df.START_TSTAMP, rs2020_df.START_TSTAMP, users_df.START_TSTAMP).as_("START_TSTAMP"),
coalesce(rs2021_df.REFR_URLHOST, rs2020_df.REFR_URLHOST, users_df.REFR_URLHOST).as_("REFR_URLHOST"),
coalesce(rs2021_df.REFR_MEDIUM, rs2020_df.REFR_MEDIUM, users_df.REFR_MEDIUM).as_("REFR_MEDIUM"),
coalesce(rs2021_df.REFR_TERM, rs2020_df.REFR_TERM, users_df.REFR_TERM).as_("REFR_TERM"),
coalesce(rs2021_df.MKT_MEDIUM, rs2020_df.MKT_MEDIUM, users_df.MKT_MEDIUM).as_("MKT_MEDIUM"),
coalesce(rs2021_df.MKT_SOURCE, rs2020_df.MKT_SOURCE, users_df.MKT_SOURCE).as_("MKT_SOURCE"),
coalesce(rs2021_df.MKT_TERM, rs2020_df.MKT_TERM, users_df.MKT_TERM).as_("MKT_TERM"),
coalesce(rs2021_df.MKT_CAMPAIGN, rs2020_df.MKT_CAMPAIGN, users_df.MKT_CAMPAIGN).as_("MKT_CAMPAIGN"),
coalesce(rs2021_df.MKT_CONTENT, rs2020_df.MKT_CONTENT, users_df.MKT_CONTENT).as_("MKT_CONTENT"),
coalesce(rs2021_df.MKT_NETWORK, rs2020_df.MKT_NETWORK, users_df.MKT_NETWORK).as_("MKT_NETWORK"),
coalesce(rs2021_df.GEO_COUNTRY, rs2020_df.GEO_COUNTRY, page_view_df.GEO_COUNTRY).as_("GEO_COUNTRY"),
coalesce(rs2021_df.GEO_REGION, rs2020_df.GEO_REGION, page_view_df.GEO_REGION).as_("GEO_REGION"),
coalesce(rs2021_df.BR_LANG, rs2020_df.BR_LANG, page_view_df.BR_LANG).as_("BR_LANG"),
coalesce(rs2021_df.DEVICE_FAMILY, rs2020_df.DEVICE_FAMILY, page_view_df.DEVICE_FAMILY).as_("DEVICE_FAMILY"),
coalesce(rs2021_df.OS_FAMILY, rs2020_df.OS_FAMILY, page_view_df.OS_FAMILY).as_("OS_FAMILY"),
coalesce(rs2021_df.OPERATING_SYSTEM_CLASS, rs2020_df.OPERATING_SYSTEM_CLASS,
page_view_df.OPERATING_SYSTEM_CLASS).as_(
"OPERATING_SYSTEM_CLASS"),
coalesce(rs2021_df.OPERATING_SYSTEM_NAME, rs2020_df.OPERATING_SYSTEM_NAME, page_view_df.OPERATING_SYSTEM_NAME).as_(
"OPERATING_SYSTEM_NAME"),
coalesce(rs2021_df.SPIDER_OR_ROBOT, rs2020_df.SPIDER_OR_ROBOT, page_view_df.SPIDER_OR_ROBOT).as_("SPIDER_OR_ROBOT"),
coalesce(rs2021_df.FIRST_PAGE_TITLE, rs2020_df.FIRST_PAGE_TITLE, users_df.FIRST_PAGE_TITLE).as_(
"UNSAFE_FIRST_PAGE_TITLE"),
coalesce(rs2021_df.FIRST_PAGE_URLHOST, rs2020_df.FIRST_PAGE_URLHOST, users_df.FIRST_PAGE_URLHOST).as_(
"FIRST_PAGE_URLHOST"),
coalesce(rs2021_df.FIRST_PAGE_URLPATH, rs2020_df.FIRST_PAGE_URLPATH, users_df.FIRST_PAGE_URLPATH).as_(
"FIRST_PAGE_URLPATH"),
coalesce(rs2021_df.ENGAGED_TIME_IN_S, rs2020_df.ENGAGED_TIME_IN_S, page_view_df.ENGAGED_TIME_IN_S).as_(
"ENGAGED_TIME_IN_S"),
coalesce(rs2021_df.ABSOLUTE_TIME_IN_S, rs2020_df.ABSOLUTE_TIME_IN_S, page_view_df.ABSOLUTE_TIME_IN_S).as_(
"ABSOLUTE_TIME_IN_S"),
coalesce(rs2021_df.VERTICAL_PERCENTAGE_SCROLLED, rs2020_df.VERTICAL_PERCENTAGE_SCROLLED,
page_view_df.VERTICAL_PERCENTAGE_SCROLLED).as_(
"VERTICAL_PERCENTAGE_SCROLLED"),
coalesce(timestamp_ntz_from_parts(
year(users_df.CONVERTED_DATE),
month(users_df.CONVERTED_DATE),
dayofmonth(users_df.CONVERTED_DATE),
hour(users_df.CONVERTED_DATE),
minute(users_df.CONVERTED_DATE),
second(users_df.CONVERTED_DATE)
), rs2020_df.CONVERTED_DATE, rs2021_df.CONVERTED_DATE).as_("CONVERTED_DATE")
])
print(f"{'Column':30} Type")
print(f"--------------------------------------")
for field in combined_df.schema.fields:
print(f"{field.name:30} {field.datatype}")
Column Type
--------------------------------------
START_TSTAMP Timestamp
REFR_URLHOST String
REFR_MEDIUM String
REFR_TERM String
MKT_MEDIUM String
MKT_SOURCE String
MKT_TERM String
MKT_CAMPAIGN String
MKT_CONTENT String
MKT_NETWORK String
GEO_COUNTRY String
GEO_REGION String
BR_LANG String
DEVICE_FAMILY String
OS_FAMILY String
OPERATING_SYSTEM_CLASS String
OPERATING_SYSTEM_NAME String
SPIDER_OR_ROBOT Boolean
UNSAFE_FIRST_PAGE_TITLE String
FIRST_PAGE_URLHOST String
FIRST_PAGE_URLPATH String
ENGAGED_TIME_IN_S Long
ABSOLUTE_TIME_IN_S Long
VERTICAL_PERCENTAGE_SCROLLED Double
CONVERTED_DATE Timestamp
Snowplow data requires very little preparation before the ML pipeline can consume it, but it helps to add a few temporal features: HOUR and DAY_OF_WEEK.
PAGE_TITLE is chosen over the URL because the same page can be reached via different URLs, for example through referral URLs, Google Translate, or after a web admin moves the page to a different address.
Some tracked pages do not capture a title, for example PDFs served directly. For these, it makes sense to fall back to the URL.
All constants must be wrapped in lit, like the empty replacement string in this snippet:
from snowflake.snowpark.functions import concat, dayname, hour, coalesce, is_null, iff, regexp_replace, replace, lit
from snowflake.snowpark.types import StringType
sp_df = combined_df.select(dayname("START_TSTAMP").as_("DAY_OF_WEEK"),
hour("START_TSTAMP").as_("HOUR"),
(~is_null(combined_df.CONVERTED_DATE)).as_("CONVERTED_USER"),
iff(
coalesce(combined_df.UNSAFE_FIRST_PAGE_TITLE, lit("")) == "",
concat(combined_df.FIRST_PAGE_URLHOST, combined_df.FIRST_PAGE_URLPATH),
combined_df.UNSAFE_FIRST_PAGE_TITLE
).as_("FIRST_PAGE_TITLE"),
is_null(coalesce("REFR_URLHOST", "REFR_MEDIUM" , "REFR_TERM")).as_("REFR_ANY"),
is_null(coalesce("MKT_MEDIUM", "MKT_SOURCE" , "MKT_TERM", "MKT_CAMPAIGN", "MKT_CONTENT", "MKT_NETWORK")).as_("MKT_ANY"),
*combined_df.columns
).where(
iff(is_null(combined_df.CONVERTED_DATE), True, combined_df.CONVERTED_DATE >= combined_df.START_TSTAMP)
).sort(combined_df.START_TSTAMP.asc())
To avoid exposing future information about the conversion, the dataset is split into 10 equal time intervals, and a label is created for each cutoff indicating whether the user converted after that date.
from snowflake.snowpark.functions import max, min
begin, end = sp_df.select(min('START_TSTAMP'),max('START_TSTAMP')).collect()[0]
step_duration = (end - begin) / 10
step_names = []
step_full_names = []
step_dates = []
for i in range(4,10):
cutoff = begin + step_duration * i
steps_name = f"CONVERTED_RANGE_{cutoff.date()}"
step_dates.append(cutoff)
step_names.append(steps_name)
step_full_names.append(f" {begin + step_duration * (i-1)} from {cutoff.date()} to ")
sp_df = sp_df.with_column(steps_name, coalesce(sp_df.CONVERTED_DATE > cutoff, lit(False)))
At this point the Snowpark DataFrame has not been materialized yet: the SQL statement has been constructed but not executed. Column names and types can still be inferred before materialization, which makes it possible to define powerful type-dependent transformations. For example, LightGBM requires column names without JSON special characters, so the function below strips them and also replaces spaces with underscores. Most of the features are categorical and will be encoded downstream with OneHotEncoder, so we can define a function to apply to all String columns:
from snowflake.snowpark._internal.type_utils import ColumnOrName
from snowflake.snowpark import Column
from snowflake.snowpark.functions import regexp_replace, replace, substr
def lgbm_trim(col: ColumnOrName) -> Column:
return substr(replace(
regexp_replace(col, """[\"\'\b\f\t\r\n\,\]\[\{\}\"\:\\\\]""")
, " ", "_"), 0, 190).as_(col)
Finally, the lgbm_trim function is applied to every text column. Executed inside Snowflake on an XS warehouse, this transformation ran about 3-5 times faster than the equivalent pd.DataFrame.str.replace approach on an AWS xlarge instance.
Data is exported directly into a pandas DataFrame in a single call. Under the hood this transfers Apache Arrow objects and has less overhead than a SQL driver.
%%time
df = sp_df.to_pandas()
df = df.replace("""[\"\'\b\f\t\r\n\,\]\[\{\}\"\:\\\\]""", '', regex=True).replace(" ", '')
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.slice(0, 200))
df.set_index("START_TSTAMP", inplace=True)
CPU times: user 18.7 s, sys: 1.95 s, total: 20.6 s
Wall time: 26.5 s
%%time
sp_df_clean = sp_df.select(
[lgbm_trim(f.name) if isinstance(f.datatype, StringType) else f.name for f in sp_df.schema.fields])
df = sp_df_clean.to_pandas()
df.set_index("START_TSTAMP", inplace=True)
CPU times: user 3 s, sys: 357 ms, total: 3.35 s
Wall time: 8.55 s
to_pandas infers the pd.DataFrame types during export.
df.dtypes
DAY_OF_WEEK object
HOUR int8
CONVERTED_USER bool
FIRST_PAGE_TITLE object
REFR_ANY bool
MKT_ANY bool
REFR_URLHOST object
REFR_MEDIUM object
REFR_TERM object
MKT_MEDIUM object
MKT_SOURCE object
MKT_TERM object
MKT_CAMPAIGN object
MKT_CONTENT object
MKT_NETWORK object
GEO_COUNTRY object
GEO_REGION object
BR_LANG object
DEVICE_FAMILY object
OS_FAMILY object
OPERATING_SYSTEM_CLASS object
OPERATING_SYSTEM_NAME object
SPIDER_OR_ROBOT object
UNSAFE_FIRST_PAGE_TITLE object
FIRST_PAGE_URLHOST object
FIRST_PAGE_URLPATH object
ENGAGED_TIME_IN_S int16
ABSOLUTE_TIME_IN_S int32
VERTICAL_PERCENTAGE_SCROLLED float64
CONVERTED_DATE datetime64[ns]
CONVERTED_RANGE_2020-12-16 bool
CONVERTED_RANGE_2021-03-14 bool
CONVERTED_RANGE_2021-06-09 bool
CONVERTED_RANGE_2021-09-05 bool
CONVERTED_RANGE_2021-12-02 bool
CONVERTED_RANGE_2022-02-27 bool
dtype: object
The results of the original query can be cached to improve the performance of subsequent queries.
sp_df_cached = sp_df_clean.cache_result()
Exploratory Data Analysis
Snowpark can be used to work with data sets that do not fit into the memory of a single machine. Although this Snowplow data set is not that big, the basic EDA below was done in Snowpark for demonstration.
Another notable mention is Snowpark’s snowpark.DataFrame.sample method, which returns a fraction of the rows in a table. It was very useful for accelerating prototyping in Jupyter, but did not make it into this article.
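For reference, its usage looks roughly like this (an illustration only, not part of the analysis below):
# Pull roughly 1% of rows into pandas while iterating on the notebook;
# only the sampled subset leaves Snowflake.
sample_df = sp_df_cached.sample(frac=0.01).to_pandas()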
All groups are categorical, except for engagement.
ref_cols = ["REFR_URLHOST", "REFR_MEDIUM", "REFR_TERM", "REFR_ANY"]
mkt_cols = ["MKT_MEDIUM", "MKT_SOURCE", "MKT_TERM", "MKT_CAMPAIGN", "MKT_CONTENT", "MKT_NETWORK", "MKT_ANY"]
geo_cols = ["GEO_COUNTRY", "GEO_REGION", "BR_LANG"]
dev_cols = ["DEVICE_FAMILY", "OS_FAMILY", "OPERATING_SYSTEM_CLASS", "OPERATING_SYSTEM_NAME"]
url_cols = ["FIRST_PAGE_TITLE"]
robot_cols = ["SPIDER_OR_ROBOT"]
calendar_cols = ["DAY_OF_WEEK", "HOUR"]
engagement_cols = ["ENGAGED_TIME_IN_S", "ABSOLUTE_TIME_IN_S", "VERTICAL_PERCENTAGE_SCROLLED"]
discrete_col = ref_cols + mkt_cols + geo_cols + dev_cols + robot_cols + calendar_cols + url_cols
continuous_col = engagement_cols
Next, identify the balance of classes (converted against unconverted population) in the full data set, again comparing the materialized data set against Snowpark. Both took about the same time.
%%time
X = df[discrete_col + continuous_col]
y = df.CONVERTED_USER
print(f"Number of users: {y.shape[0]}")
print(f"Number of conversions: {y[y].count()}")
print(f"Percent of conversions: {y[y].count() / y.shape[0] * 100:.3f} %")
Number of users: 659457
Number of conversions: 3120
Percent of conversions: 0.473 %
CPU times: user 256 ms, sys: 150 ms, total: 407 ms
Wall time: 663 ms
%%time
n_conversions = sp_df_cached.where(sp_df.CONVERTED_USER).count()
n_total = sp_df_cached.count()
print(f"Number of users: {n_total}")
print(f"Number of conversions: {n_conversions}")
print(f"Percent of conversions: {n_conversions / n_total * 100:.3f} %")
Number of users: 659457
Number of conversions: 3120
Percent of conversions: 0.473 %
CPU times: user 18.7 ms, sys: 10 µs, total: 18.8 ms
Wall time: 634 ms
It is important to identify whether the data has cyclic patterns or other features such as gaps or spikes.
from snowflake.snowpark import Window
from snowflake.snowpark.functions import to_date, avg
sp_df_w_date = sp_df_cached.with_column("ts_date", to_date(sp_df_cached.START_TSTAMP)).group_by(
"ts_date").count().with_column_renamed("COUNT", "ts_count")
window5d = Window.rows_between(-4, Window.CURRENT_ROW).order_by("ts_date")
ma5d = sp_df_w_date.select('ts_date', avg(sp_df_w_date.ts_count).over(window5d).as_("TS_COUNT_MA5D")).sort(
"ts_date").na.drop().to_pandas().astype({
'TS_COUNT_MA5D': 'float',
'TS_DATE': "datetime64"
}).set_index("TS_DATE").resample('1D').sum().fillna(0)
ma5d.plot(figsize = (13,4), title="5 day conversions moving average")
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b6271a400>
There are a few notable features in this chart:
- short cyclic trend
- large spikes in December 2021 and around July 2021
- data gap since Jan 2022
It could help to zoom into these areas:
from datetime import date
from matplotlib import pyplot as plt
_, axes = plt.subplots(nrows=4, constrained_layout=True, figsize = (13,12))
ma5d[(date(2020,1,1) <= ma5d.index.date) & (ma5d.index.date < date(2020,2,1))].plot(ax=axes[0], title='Conversions with cyclic trend between 2020-1-1 and 2020-2-1')
ma5d[(date(2021,6,1) <= ma5d.index.date) & (ma5d.index.date < date(2021,8,1))].plot(ax=axes[1], title='Spike of conversions between 2021-6-1 and 2021-8-1')
ma5d[(date(2021,10,1) <= ma5d.index.date) & (ma5d.index.date < date(2022,2,1))].plot(ax=axes[2], title='Conversions with missing data 2021-10-1 and 2022-2-1')
ma5d.tail(100).plot(ax=axes[3])
The volume of visitors drops on the weekend, explaining the short cyclic pattern.
Another place of interest is how long visitors take to convert.
from snowflake.snowpark.functions import datediff
data = sp_df_cached.where(sp_df_cached.CONVERTED_USER).select(
datediff('day',sp_df_cached.START_TSTAMP, sp_df_cached.CONVERTED_DATE).as_("lag")
).to_pandas()
data.plot(bins=100, kind='hist', title = "Conversion lag histogram", figsize=(13,7))
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b62da1280>
This shows that many users requested the demo within the first seconds of their visit, which means they already knew about the product. It also shows the imperfection of the tracking: for websites where users log into their accounts, attribution is more precise.
sp_df_cached.select('START_TSTAMP', "CONVERTED_DATE").sort("START_TSTAMP").to_pandas().plot.scatter(x='START_TSTAMP', y='CONVERTED_DATE', figsize=(15, 10), title="Time gap between first visit and conversion")
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b62bd50d0>
It appears that many customers were waiting for some trigger before converting around 2020-07 and 2020-10.
For further investigation of the categorical features, we need a helper function that keeps the classes with the most variance. For one-hot-encoded categorical features, a variance threshold has an effect similar to selecting the most frequent categories.
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold
def enc_and_select_var(X_:pd.DataFrame, var=0.1) -> pd.DataFrame:
"""
Encode with one hot encoder and select features over variance threshold
:param X_: categorical dataframe
:param var: variance threshold
:return: transformed dataframe
"""
oh_enc = OneHotEncoder()
data = oh_enc.fit_transform(X_)
ecn_X_ = pd.DataFrame.sparse.from_spmatrix(data, columns=oh_enc.get_feature_names_out())
thresh = VarianceThreshold(var)
ecn_best_X_ = thresh.fit_transform(ecn_X_)
idx_mask = pd.Series(thresh.variances_)
columns_t = list(pd.Series(ecn_X_.columns)[idx_mask[idx_mask > thresh.threshold].index])
return pd.DataFrame.sparse.from_spmatrix(ecn_best_X_, columns=columns_t, index = X_.index)
df_2021_07_to_08 = sp_df_cached.where((~is_null(sp_df_cached.CONVERTED_DATE)) & sp_df_cached.CONVERTED_DATE.between('2021-07-01', '2021-07-31')).select(*ref_cols, *mkt_cols, 'START_TSTAMP').to_pandas().astype({
"START_TSTAMP": "datetime64"
}).set_index('START_TSTAMP')
ecn_best_df_2021_07_to_08_rolling = enc_and_select_var(df_2021_07_to_08).resample("1D").sum().rolling(10).mean()
ecn_best_df_2021_07_to_08_rolling2 = enc_and_select_var(df_2021_07_to_08, 0.05).resample("1D").sum().rolling(10).mean()
ecn_best_df_2021_07_to_08_rolling3 = enc_and_select_var(df_2021_07_to_08, 0.01).resample("1D").sum().rolling(10).mean()
_, axes = plt.subplots(nrows=10, constrained_layout=True, figsize = (13,40))
ecn_best_df_2021_07_to_08_rolling.filter(regex='MKT_.*(?<!None)$', axis=1).plot(kind='line', ax = axes[0], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.1 threshold)")
ecn_best_df_2021_07_to_08_rolling.filter(regex='REFR_.*(?<!None)$', axis=1).plot(kind='line', ax = axes[1], title="Referrer of the initial touch for the users converted on 2021-07 during spike. (Variance 0.1 threshold)")
ecn_best_df_2021_07_to_08_rolling2.filter(regex='MKT_.*(?<!None)$', axis=1).filter(like='MKT_', axis=1).plot(kind='line', ax = axes[2], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.05 threshold)")
ecn_best_df_2021_07_to_08_rolling2.filter(regex='REFR_.*(?<!None)$', axis=1).plot(kind='line', ax = axes[3], title="Referrer of the initial touch for the users converted on 2021-07 during spike. (Variance 0.05 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='MKT_SOURCE_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[4], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='MKT_CAMPAIGN_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[5], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='MKT_CONTENT_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[6], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='MKT_SOURCE_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[7], title="Marketing attributes of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='REFR_URLHOST_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[8], title="Referrer of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
ecn_best_df_2021_07_to_08_rolling3.filter(regex='REFR_MEDIUM_.*(?<!None)$', axis=1).plot(kind='area', ax = axes[9], title="Referrer of the initial touch for the users converted on 2021-07 during spike. (Variance 0.01 threshold)")
There are a few conclusions that can be drawn from the conversion spike data:
- Most converted visitors came through Google search. (REFR_MEDIUM figure)
- Marketing webinars were successful at attracting visitors that later converted. (MKT_CAMPAIGN figure)
- Google Ads was a significant contributor (MKT_SOURCE figure)
- Email campaigns were effective (MKT_SOURCE figure)
A similar approach could be taken to drill down into other features.
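For instance, the same group-and-count pattern can be applied to the geographic features (illustrative only; output omitted):
from snowflake.snowpark.functions import col

# Top countries among converted users.
sp_df_cached.where(sp_df_cached.CONVERTED_USER).group_by('GEO_COUNTRY').count().sort(col("COUNT").desc()).show(10)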
Checking landing pages:
from snowflake.snowpark.functions import col
sp_df_cached.where(sp_df_cached.CONVERTED_USER).select('FIRST_PAGE_TITLE').group_by('FIRST_PAGE_TITLE').count().sort(col("COUNT").desc()).show(30)
----------------------------------------------------------------
|"FIRST_PAGE_TITLE" |"COUNT" |
----------------------------------------------------------------
|Collect_manage_and_operationalize_behavioral_da... |453 |
|Snowplow_|_The_Data_Collection_Platform_for_Dat... |326 |
|Treating_data_as_a_product |158 |
|White_paper_Introductory_guide_to_data_modeling... |145 |
|Collect_and_operationalize_behavioral_data_at_s... |142 |
|Data_modeling_101 |102 |
|eBook_Using_data_to_develop_killer_products_|_S... |96 |
|White_paper_Transform_behavioral_data_into_acti... |93 |
|Snowplow_|_Introducing_DataOps_into_your_data_m... |76 |
|Market_Guide_for_Event_Stream_Processing_|_Snow... |76 |
|Snowplow_|_Explicit_vs_implicit_tracking |57 |
|Avoid_drowning_in_your_data_lake |52 |
|Webinar_The_roadmap_to_data-infomed_customer_jo... |51 |
|Level_up_your_marketing_attribution_with_the_ad... |48 |
|The_Future_of_Web_Analytics_A_fireside_chat_wit... |46 |
|White_paper_A_guide_to_better_data_quality_|_Sn... |43 |
|Gartner_Report_Introducing_DataOps_into_your_da... |41 |
|White_paper_How_to_manage_your_data_lake_effect... |39 |
|Snowplow_|_Data_modeling_mini_series_-_Data_mod... |38 |
|Snowplow_|_Lessons_learned_from_growing_a_data_... |37 |
|Snowplow_Webinar_How_Strava_built_a_self-serve_... |37 |
|Rethinking_web_analytics_for_the_modern_age |30 |
|Snowplow_webinar_Behavioral_analytics_in_real_time |28 |
|Snowplow_|_Identity_resolution_in_a_privacy_con... |25 |
|Snowplow_Webinar_Whats_in_store_for_behavioural... |23 |
|Snowplow_Webinar_The_new_category_of_the_revers... |23 |
|Get_started_with_your_Snowplow_Insights_demo_|_... |21 |
|Webinar_How_to_get_your_data_team_a_seat_at_the... |21 |
|Map_your_end-to-end_customer_journey_with_behav... |20 |
|Snowplow_|_People_and_process_in_data_-_A_fires... |19 |
----------------------------------------------------------------
sp_df_cached.select('FIRST_PAGE_TITLE').group_by('FIRST_PAGE_TITLE').count().sort(col("COUNT").desc()).show(30)
----------------------------------------------------------------
|"FIRST_PAGE_TITLE" |"COUNT" |
----------------------------------------------------------------
|White_paper_Introductory_guide_to_data_modeling... |103994 |
|Collect_manage_and_operationalize_behavioral_da... |98423 |
|Snowplow_|_The_Data_Collection_Platform_for_Dat... |72360 |
|Treating_data_as_a_product |44614 |
|White_paper_Transform_behavioral_data_into_acti... |17107 |
|Collect_and_operationalize_behavioral_data_at_s... |16499 |
|Data_modeling_101 |14545 |
|The_top_14_open_source_analytics_tools_in_2021_... |10039 |
|eBook_Using_data_to_develop_killer_products_|_S... |9762 |
|Snowplow_Analytics_Privacy_Policy_|_Snowplow |8630 |
|White_paper_A_guide_to_better_data_quality_|_Sn... |8594 |
|Rethinking_modern_web_analytics |6144 |
|Level_up_your_marketing_attribution_with_the_ad... |5730 |
|Avoid_drowning_in_your_data_lake |5497 |
|Work_for_Snowplow_and_join_our_growing_team_|_S... |4798 |
|Snowplow_Careers_|_Come_And_Be_a_Part_of_Our_Team |4369 |
|White_paper_How_to_manage_your_data_lake_effect... |4356 |
|Get_started_with_your_Snowplow_Insights_demo_|_... |3868 |
|An_introduction_to_event_data_modeling |3702 |
|Privacy_policy |3635 |
|Three_Ways_to_Derive_Information_From_Event_Str... |3392 |
|Building_a_model_for_event_data_as_a_graph |3371 |
|Snowplow_|_Introducing_DataOps_into_your_data_m... |3313 |
|Snowplow_Insights_|_A_Data_Collection_Platform_... |2855 |
|Market_Guide_for_Event_Stream_Processing_|_Snow... |2732 |
|Introducing_SchemaVer_for_semantic_versioning_o... |2666 |
|A_guide_to_data_team_structures_with_6_examples... |2664 |
|Snowplow_&_Census_eBook |2528 |
|A_guide_to_better_data_quality |2354 |
|The_leading_open_source_behavioral_data_managem... |2326 |
----------------------------------------------------------------
Landing pages of converted users look different from those of the general population, so we can expect a lift in model performance from adding this feature.
Missing values
One common approach to missing values is to impute them from the other features with KNN or similar methods, but this only makes sense if those features are defined in the same row. Snowpark’s ability to reflect column names comes in useful here again.
na_df = sp_df_cached.select([is_null(f).as_(f) for f in sp_df_cached.columns]).to_pandas()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(na_df.corr(), interpolation='nearest', cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
fig.colorbar(cax)
xaxis = np.arange(len(na_df.columns))
ax.tick_params(axis='x', labelrotation=270)
ax.set_xticks(xaxis)
ax.set_yticks(xaxis)
ax.set_xticklabels(na_df.columns)
ax.set_yticklabels(na_df.columns)
ax.set_title("N/A correlation matrix", y=1.31)
fig.set_size_inches(12, 10)
It would be wrong to fill in the marketing source, because missing values there are meaningful. BR_LANG and GEO_* could be imputed from other features, as their missing values show only a small correlation with the rest.
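As an illustration of what such imputation could look like (a sketch only, not part of the original pipeline; it fills a missing GEO_REGION with the most frequent region seen for the same GEO_COUNTRY, a simple stand-in for KNN-style imputation):
# Most frequent region per country, computed from rows where the region is known.
most_common_region = (
    df.dropna(subset=["GEO_REGION"])
      .groupby("GEO_COUNTRY")["GEO_REGION"]
      .agg(lambda s: s.mode().iloc[0])
)

# Fill only the rows where the region is missing but the country is known;
# marketing columns are deliberately left untouched.
mask = df["GEO_REGION"].isna() & df["GEO_COUNTRY"].notna()
df.loc[mask, "GEO_REGION"] = df.loc[mask, "GEO_COUNTRY"].map(most_common_region)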
from snowflake.snowpark.functions import datediff
_, axes = plt.subplots(2, 3, figsize=(14, 14))
data_converted = sp_df_cached.where(sp_df_cached.CONVERTED_USER).select(
*engagement_cols
).to_pandas()
data = sp_df_cached.select(
*engagement_cols
).to_pandas()
data.VERTICAL_PERCENTAGE_SCROLLED.plot(bins=100, kind='hist', title = "VERTICAL_PERCENTAGE_SCROLLED for all users", ax=axes[0][0])
data.ENGAGED_TIME_IN_S.plot(bins=100, kind='hist', logy=True, title = "ENGAGED_TIME_IN_S for all users", ax=axes[0][1])
data.ABSOLUTE_TIME_IN_S.plot(bins=100, kind='hist', logy=True, title = "ABSOLUTE_TIME_IN_S for all users", ax=axes[0][2])
data_converted.VERTICAL_PERCENTAGE_SCROLLED.plot(bins=100, kind='hist', title = "VERTICAL_PERCENTAGE_SCROLLED for converted users", ax=axes[1][0])
data_converted.ENGAGED_TIME_IN_S.plot(bins=100, kind='hist', logy=True, title = "ENGAGED_TIME_IN_S for converted users", ax=axes[1][1])
data_converted.ABSOLUTE_TIME_IN_S.plot(bins=100, kind='hist',logy=True, title = "ABSOLUTE_TIME_IN_S for converted users", ax=axes[1][2])
Continuous features have a heavily skewed, roughly logarithmic distribution. For linear models they need to be scaled.
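For the linear baseline, one common option is a log transform followed by standardization; a minimal sketch (not necessarily the exact preprocessing used for the final model):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log1p compresses the heavy right tail; StandardScaler then centers and scales the result.
log_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
scaled_engagement = log_scale.fit_transform(df[engagement_cols].fillna(0))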
References
[1] Shao, Xuhui, and Lexin Li. “Data-driven multi-touch attribution models.” Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011.
[2] R. Berman, “Beyond the Last Touch: Attribution in Online Advertising,” Marketing Science, vol. 37, no. 5, pp. 771–792, Sep. 2018, doi: 10.1287/mksc.2018.1104.
[3] https://developers.google.com/ads-data-hub/reference/table-schemas/vendors#google-ads-campaign
[4] Wright, Raymond E. “Logistic regression.” (1995).
[5] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.Y., 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30.
[6] https://shap-lrjball.readthedocs.io/en/latest/index.html
[7] https://snowplow.github.io/dbt-snowplow-web/#!/model/model.snowplow_web.snowplow_web_users
[8] https://docs.snowplow.io/docs/enriching-your-data/available-enrichments/yauaa-enrichment/
[9] https://docs.snowplow.io/docs/enriching-your-data/available-enrichments/ua-parser-enrichment/
[10] https://docs.snowplow.io/docs/enriching-your-data/available-enrichments/cookie-extractor-enrichment/
[11] https://docs.snowplow.io/docs/enriching-your-data/available-enrichments/ip-lookup-enrichment/
[12] https://docs.snowplow.io/docs/enriching-your-data/available-enrichments/iab-enrichment/
[13] https://snowplow.github.io/dbt-snowplow-web/#!/model/model.snowplow_web.snowplow_web_page_views