Improving Machine Learning Models by using Behavioral Data

Federico Castanedo

April 24, 2023

Behavioral data is generated from the actions or behaviors of individuals or groups. In this article, we will demonstrate the benefits of using behavioral data, particularly web sessions data, to improve the accuracy of machine learning models.

We will begin by explaining what we mean by behavioral data, and then delve into the reasons why this type of data is underutilized in machine learning models. Next, we will present a practical example from a Kaggle competition in which the goal was to predict where a new user will book their first travel experience, using a dataset released by Airbnb [1].

Finally, after discussing this example, we will examine how Snowplow simplifies and automates the process of utilizing these types of data in your machine learning pipelines.

What is Behavioral Data?

Behavioral data refers to any type of data that is generated from the actions or behaviors of individuals or groups. This type of data can include things like social media interactions, website clicks, purchase history and other types of interactions. It is a detailed record of interactions with customers, partners, applications and systems, and it can be collected from digital sources, such as the web and social networks, as well as from physical sources, such as sensor data events.

Web data, a type of behavioral data, is a valuable resource for understanding user behavior on websites. It records the actions that users take on a website and can provide a wealth of information about their engagement with the site. With web data, you can track detailed information about a user's interactions with a website, including page views, clicks on specific elements of the page, time spent on different parts of the page, and purchases. Additionally, web data may include information about the user's device, such as the type of device, operating system, and browser being used.

Behavioral data is a valuable resource for understanding and predicting the behavior of customers or users. It gives you a full view of a customer's journey, from their first interaction with a brand to making a purchase, and can be used to inform marketing and product development strategies. By analyzing behavioral data, businesses can gain insights into how to optimize the customer experience and drive conversions. For example, they can identify patterns in customer behavior and use this information to create targeted marketing campaigns or to improve the design and functionality of their website or product.

As shown in Figure 2, the components of behavioral data can be broadly divided into three groups that mirror a real language structure:

Main entity (subject) – The main actor or entity, normally the user or customer;
Event (verb) – Describes the event (e.g. ‘button click’);
Other entities & properties (objects) – Context to better understand the event (e.g. location of event).
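
The subject, verb and objects structure above can be sketched as a small data structure. This is only an illustration: the class name, field names and sample values are assumptions made for this example and do not come from any particular tracker.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class BehavioralEvent:
    """One behavioral record, shaped like a sentence: subject, verb, objects."""

    subject: str         # main entity, e.g. a user or customer ID
    verb: str            # the event itself, e.g. "button_click"
    timestamp: datetime  # when the event happened
    # Other entities and properties that give context to the event.
    objects: dict[str, Any] = field(default_factory=dict)

    def as_sentence(self) -> str:
        """Render the event as a readable subject-verb-object sentence."""
        context = ", ".join(f"{k}={v}" for k, v in sorted(self.objects.items()))
        return f"{self.subject} {self.verb} ({context})"


# Hypothetical event: a user clicking a button on a listing page.
event = BehavioralEvent(
    subject="user_123",
    verb="button_click",
    timestamp=datetime(2023, 4, 24, 10, 30, tzinfo=timezone.utc),
    objects={"button": "book_now", "page": "/listing/42", "country": "US"},
)
print(event.as_sentence())
```

Keeping the context in a separate collection of objects means new properties can be attached to an event without changing what the subject and the verb mean.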

To illustrate behavioral data with a specific example, you can see the granular events provided by Snowplow’s web tracker at this link: https://snowplow.io/explore-snowplow-data-part-1/

Why is it uncommon to use Behavioral Data in Machine Learning models?

There are several reasons why behavioral data is not widely used in machine learning models by the data science community. In this section, we will summarize the main ones.

1. Complexity: behavioral data can be complex and difficult to work with for a number of reasons. One reason is that it can be scattered across multiple sources and systems, such as different devices, web data, social media data, etc. This can make it difficult to extract and consolidate the data into a usable format, especially if the data is stored in different formats or data schemas.

Behavioral data is a collection of actions and events that take place at specific points in time. Therefore, to fully understand the context and significance of these actions and events, it is necessary to analyze the sequence in which they occur. This can be challenging, as it requires a thorough understanding of the application, service, or product in which the engagement takes place. Traditional statistical methods are not always effective for analyzing behavioral data, because simple statistical functions such as counting actions or computing averages and standard deviations ignore the order of events and, on their own, often miss part of the signal. Instead, specialized tools and techniques are often needed to extract, prepare and analyze the data in a meaningful way.

2. Data lineage: another challenge with behavioral data is the issue of data lineage, which refers to the history of the data and how it has been collected and processed. It includes information about the sources of the data, the methods used to collect it, and the transformations and manipulations that have been applied. It is important to understand the data lineage of behavioral data in order to ensure its accuracy and reliability, but this can be difficult due to the complexity of the data and the fact that it may be collected from multiple sources and devices over time. Lack of clear data lineage can lead to issues such as data quality problems and difficulty in tracing errors or inconsistencies in the data. For example, concepts such as "sessions", "marketing channels", and "visitors" are derived values with complex definitions. If we don't fully understand how these concepts are calculated, it can be difficult to understand the generated data and use it to make reliable predictions.

3. Limited availability: another reason why behavioral data is not widely used in machine learning models is that it may be difficult to obtain a sufficient amount of high-quality data to train the model. Behavioral data is not always widely available, and it may be difficult to collect a large and diverse dataset that is representative of the population or user group being studied. This can be especially challenging in cases where the data is being collected from multiple sources or devices, as it may require special tools or techniques to extract and consolidate the data.

Another factor that can limit the availability of behavioral data is the increasing adoption of Intelligent Tracking Prevention (ITP) initiatives by major tech companies [2][3]. ITP is designed to limit the tracking of users' online behavior by third parties, and it can have a significant impact on the availability and quality of behavioral data. ITP limits cross-site tracking and currently affects mostly Apple users, but there are similar initiatives in other browsers (e.g. Firefox, Brave) in the form of tracking protection and prevention. For example, ITP limits tracking to only two days for Safari users, which can heavily skew the resulting data and make it difficult to accurately model user behavior over time.

4. Compliance challenges: compliance challenges associated with the use of behavioral data in machine learning models can be significant. This is because such data may reveal sensitive information about individuals, which must be carefully managed in accordance with local laws and regulations. For example, there may be strict guidelines in place regarding the collection and use of personal data, which can make it difficult to obtain and use behavioral data for machine learning purposes. Additionally, the use of behavioral data in machine learning models may raise concerns about privacy and the potential for sensitive information to be misused. As a result, organizations using behavioral data must be sure to carefully follow all relevant laws and regulations, and to implement appropriate safeguards to protect the privacy of individuals. This may involve seeking consent from individuals before collecting and using their behavioral data, as well as implementing robust security measures to prevent unauthorized access to such data. Overall, the use of behavioral data in machine learning can present significant compliance challenges, but with the right tools, it is possible to successfully navigate these challenges and use this data effectively to drive business value.

Despite the previous challenges, behavioral data has been successfully used by large organizations with significant resources to drive profitability through online advertisements.

One area where behavioral data has played a key role is click-through rate (CTR) prediction. CTR models predict the likelihood that a particular user will click on a specific link or button on a website. CTR prediction is used in digital marketing to optimize the placement and design of ads and other clickable elements on a website. In recent years, deep learning models have been applied extensively to the CTR prediction task. We refer the interested reader to [4] for more details.

Now, let's consider a practical example of how we can incorporate behavioral data in machine learning models in order to improve prediction accuracy.

A practical example: Where will a new user book their first travel experience?

Back in 2015, Airbnb hosted a Kaggle competition [1] with the goal of predicting where a new user would book their first travel experience. With over 34,000 cities in 190 countries available for booking on the platform, this is a complex problem. By accurately predicting where a new user would book their first stay, Airbnb could show more personalized content to that user, reduce the average time it takes for a new user to make their first booking, and improve demand forecasting across different locations.

The data for this example is publicly available at the Kaggle competition website [5]. The dataset contains web users from the USA and includes their demographics, web session records, and summary statistics. There are twelve possible destination countries in the dataset, and participants were asked to predict among them for each user in the test set. While the competition is now closed, you can still generate your own predictions and evaluate their quality using the Kaggle platform, and you can use the attached notebook to replicate our results. Please note that you must register on Kaggle and use your own API key.

For this exercise, we are interested in understanding the impact of using behavioral data on model performance. By behavioral data in this example, we refer to the web sessions data of the users, which include information such as the action taken, the type of action, the details of the action, the device type, and the elapsed time, as shown in the examples below.

So, the session dataset contains multiple records for each user, with each record capturing the user's actions and the time spent on the website performing those actions.

In the next picture you can see the distribution of the different actions in the dataset. For example, you can see that the majority of the time is spent on the "show" activity, while less time is spent on reviewing “similar listings” on the website.

Data Preparation

As with any machine learning project, the first step is to perform data preparation. In this case, we want to make use of web sessions data to evaluate the improvement in model performance. Therefore, we will train two machine learning models: one using only customer demographic and registration features, and another one using the same data plus web sessions data.

You can see the details in the ‘compute_feature_engineering()’ method of the attached Python notebook, but essentially we compute the year, month and day of the account creation as well as of the user's first login. We also address some data quality issues with the age variable and apply one-hot encoding to categorical variables such as gender and signup method.
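
As a rough illustration of this kind of preparation, the sketch below derives date parts, cleans the age and one-hot encodes two categorical columns using only the standard library. It is not the notebook's compute_feature_engineering() code: the column names, the sample rows and the age thresholds (15 to 100) are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical raw rows; column names and values are illustrative only.
users = [
    {"id": "u1", "date_account_created": "2014-01-15", "age": 34.0,
     "gender": "FEMALE", "signup_method": "basic"},
    {"id": "u2", "date_account_created": "2014-03-02", "age": 2014.0,
     "gender": "MALE", "signup_method": "facebook"},
    {"id": "u3", "date_account_created": "2014-03-09", "age": None,
     "gender": "FEMALE", "signup_method": "facebook"},
]


def clean_age(age, low=15, high=100):
    """Treat missing or implausible ages as unknown (assumed thresholds)."""
    if age is None or not (low <= age <= high):
        return None
    return age


def one_hot(rows, column):
    """Return one 0/1 indicator per distinct value of `column`, per row."""
    categories = sorted({row[column] for row in rows})
    return [
        {f"{column}_{cat}": int(row[column] == cat) for cat in categories}
        for row in rows
    ]


def build_features(rows):
    """Date parts, cleaned age and one-hot encoded categoricals per user."""
    gender = one_hot(rows, "gender")
    signup = one_hot(rows, "signup_method")
    features = []
    for row, g, s in zip(rows, gender, signup):
        created = datetime.strptime(row["date_account_created"], "%Y-%m-%d")
        features.append({
            "id": row["id"],
            "created_year": created.year,
            "created_month": created.month,
            "created_day": created.day,
            "age": clean_age(row["age"]),
            **g,
            **s,
        })
    return features


features = build_features(users)
print(features[1])
```

In this sketch an age of 2014 (probably a birth year typed into the wrong field) becomes unknown rather than being dropped, so the row stays available for training.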

After training an XGBoost model and calculating the information gain of the top 20 variables, the following graph is produced.

As the graph shows, the most relevant features for separating signal from noise in this problem are the signup method, signup app, gender, and age. However, we can further enrich these features with web session (behavioral) data. For this example, we compute the total, maximum, minimum, mean, and standard deviation of the elapsed time (in seconds) per user ID and action type. You can see the code for this in the get_sessions_features() method in the notebook. With these new features, the following information gain plot is obtained:
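
The aggregation described above can be sketched in a few lines of plain Python. This is not the notebook's get_sessions_features() implementation: the record layout, the feature naming and the choice to report a standard deviation of 0.0 for a single observation are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical session records: (user_id, action_type, secs_elapsed).
sessions = [
    ("u1", "view", 10.0),
    ("u1", "view", 30.0),
    ("u1", "booking_request", 120.0),
    ("u2", "view", 5.0),
]


def session_features(records):
    """Aggregate elapsed seconds per user and action type into flat features."""
    grouped = defaultdict(list)
    for user_id, action_type, secs in records:
        grouped[(user_id, action_type)].append(secs)

    features = defaultdict(dict)
    for (user_id, action_type), values in grouped.items():
        stats = {
            "sum": sum(values),
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            # Sample standard deviation needs two points; use 0.0 otherwise.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for name, value in stats.items():
            features[user_id][f"{action_type}_secs_{name}"] = value
    return dict(features)


features = session_features(sessions)
print(features["u1"]["view_secs_sum"])
```

Each user ends up with one row of features that can be joined to the demographic features by user ID. With a dataframe library the same result would typically come from a group-by followed by a pivot.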

As we can see, the second most relevant feature for this model is the total amount of time (in seconds) that the user spends on the booking request page, which is intuitive for this use case. Additionally, we can see that 5 out of the top 20 features came from the use of web session data, without the need for complex feature engineering. This demonstrates the value of incorporating web session data in the model.

Model Performance

We now have two models trained on two datasets: a basic one, and a second one enriched with behavioral features. It is time to evaluate their performance on the test set provided in the Kaggle competition and perform an unbiased, blind evaluation of their predictive accuracy.

For this competition, the suggested metric was the Normalized Discounted Cumulative Gain (NDCG) at the top 5, so a maximum of 5 predictions per user in the test set were required. You can see the details in the attached notebook, which generates two different CSV files in the required format. You can see the obtained results in the private and public leaderboards.
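
To make the metric concrete, here is a minimal sketch of NDCG at the top 5 for the case where each user has exactly one correct destination. Under that assumption the ideal DCG is 1, so a user's score reduces to 1 / log2(position + 1) when the correct country appears among the top 5 predictions, and 0 otherwise. The function names are ours and are not taken from the notebook or from Kaggle.

```python
import math


def ndcg_at_k(predictions, truth, k=5):
    """NDCG@k for one user with a single correct label.

    With exactly one relevant item the ideal DCG is 1, so the score is
    1 / log2(position + 1) when the truth is in the top k, otherwise 0.
    """
    for position, label in enumerate(predictions[:k], start=1):
        if label == truth:
            return 1.0 / math.log2(position + 1)
    return 0.0


def mean_ndcg(all_predictions, all_truths, k=5):
    """Average NDCG@k over all users."""
    scores = [ndcg_at_k(p, t, k) for p, t in zip(all_predictions, all_truths)]
    return sum(scores) / len(scores)


# Hypothetical ranked predictions for three users.
predicted = [["US", "FR", "IT"], ["US", "FR", "IT"], ["US", "FR", "IT"]]
actual = ["US", "FR", "ES"]
print(round(mean_ndcg(predicted, actual), 4))
```

A correct first guess scores 1.0, a correct second guess scores about 0.63, and a miss scores 0, so the ordering of the five predictions matters as much as which countries are included.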

To put these numbers in context, the second-place entry in this competition achieved a private score of 0.88682 using more than 1000 features and a combination of over 20 machine learning models [6]. Our example gets close to that score with a much simpler model.

Based on these results, it is clear that using web behavioral data can significantly improve predictions in a real and practical example like this one. However, you may wonder whether this benefit comes with the extra cost of data reconciliation, cleaning, and preprocessing for an organization that wants to do something similar. It is true that the data for this competition was already prepared by Airbnb, so it is not representative of the effort required to prepare something similar in your organization. But there are tools and platforms available on the market that help with data preparation. In the next section, we will explain how Snowplow, an open source event data collection platform, simplifies and automates the process of using behavioral data in machine learning models.

How does Snowplow make it easier to use behavioral data to improve machine learning models?

Snowplow is a platform for creating behavioral data optimized for AI and advanced analytics. It simplifies the process of collecting, storing and analyzing behavioral data, and it provides several features that make the data particularly easy to work with when engineering features for machine learning models.

Creating the richest possible dataset

Snowplow has been designed to create the richest possible dataset by allowing the collection of granular, detailed, event-level data. This means that every time an event occurs, a data record is created that describes that event. This record is made up of an object that describes the event itself, and an array of “entities” or contexts that describe either the things in the world that were involved in the event or the context in which the event took place.

As an example, an event that recorded a user searching for a particular car on an automotive site might include:

An event that describes the search, including the keywords and any filters.
An entity that describes the person performing the search. This might include things like the cookie ID, username, customer type.
An array of entities, one for each of the results that was displayed, including the position of each result, and key data points for each result (key facts about the car that are displayed in the listing).
An entity that describes the web page that the event has occurred on, including the URL, the page title, the type of page, and an ID to distinguish distinct views of the same page URL (e.g. if the user has multiple tabs open).
A series of entities that describe the browser, operating system and device.
An entity that describes the location the search was performed in.

The ability to include unlimited entities is key because the combination of these might be predictive of the likelihood that the user makes a relevant decision. In this example, a relevant decision might be selecting a particular listing, or subsequently booking a test drive and then going on to buy the car.

Delivering consistent, structured and predictable data

Snowplow's use of schemas is a powerful feature that makes it easier to work with behavioral data in a reliable and efficient way. By using schemas, Snowplow ensures that events and entities have a consistent and predictable structure, making it easier to engineer features out of them for machine learning models.

One of the key benefits of schema versioning is that it allows the structure of the data to evolve in a systematic way as the needs of the business change. For example, as the user experience changes, it may become necessary to capture additional properties for existing entities, or to add new entities to a particular event. With schema versioning, these changes can be made in a controlled and transparent way, ensuring that downstream applications continue to function properly.

In addition, Snowplow's use of schemas makes it easier to maintain a consistent understanding of the data across different teams and applications. By providing a clear definition of each entity and event, schemas help ensure that everyone is working with the same data, which can help avoid confusion and errors.
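
The idea of validating events against versioned schemas can be illustrated with a toy example. The registry, the event layout and the validate() function below are hypothetical and deliberately simplified; they do not reproduce Snowplow's actual schema format or validation logic.

```python
# Hypothetical schema registry keyed by (name, version); not Snowplow's format.
SCHEMAS = {
    ("search", 1): {"keywords": str},
    ("search", 2): {"keywords": str, "filters": dict},  # version 2 adds a property
}


def validate(event):
    """Return a list of problems; an empty list means the event is valid."""
    key = (event.get("schema"), event.get("version"))
    schema = SCHEMAS.get(key)
    if schema is None:
        return [f"unknown schema {key}"]

    data = event.get("data", {})
    problems = []
    for name, expected_type in schema.items():
        if name not in data:
            problems.append(f"missing property: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")
    for name in data:
        if name not in schema:
            problems.append(f"unexpected property: {name}")
    return problems


old_event = {"schema": "search", "version": 1, "data": {"keywords": "estate car"}}
new_event = {"schema": "search", "version": 2,
             "data": {"keywords": "estate car", "filters": {"fuel": "electric"}}}
bad_event = {"schema": "search", "version": 2, "data": {"keywords": 42}}

print(validate(old_event))
print(validate(new_event))
print(validate(bad_event))
```

Because every event names the schema version it was written against, old and new events can coexist, and feature engineering code knows exactly which properties it can rely on for each version.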

Ensuring data accuracy and completeness

Behavioral data is often plagued by inaccuracies and incompleteness due to various reasons, including:

It is easy to unintentionally disrupt tracking when making changes to a website or mobile application, as changes can impact the tracking mechanisms without it being immediately apparent.
Implementation of tracking can be complex, and it can be challenging to ensure that all edge cases have been accounted for.
Privacy technologies like ad blockers and Intelligent Tracking Prevention (ITP) can degrade tracking reliability by removing cookies used to identify anonymous users or preventing tracking tags from loading or sending data.

To address these challenges, Snowplow provides a range of features that help organizations ensure the accuracy and completeness of their behavioral data. These include:

Snowplow Micro enables automated tests to be written to check that key events are properly tracked as part of Continuous Integration/Continuous Deployment (CI/CD), and can even fail builds if tracking is inadvertently broken.
All data is validated in real-time against associated schemas, with high-fidelity alerts that not only identify where new problems emerge but also pinpoint where the problem originated so that it can be quickly addressed. A workflow is in place to recover failed events.
As a first-party tracking solution, Snowplow can be integrated tightly with an organization's website, ensuring that cookies persist and tracking is not blocked.

By leveraging these features, Snowplow provides organizations with the assurance they need to rely on their behavioral data, enabling them to make more informed decisions and improve their machine learning models.

Data is automatically aggregated

Aggregating behavioral data to the appropriate level is crucial for effective feature engineering. Snowplow offers pre-built and extensible models that aggregate data to common levels of granularity, including page and screen views, sessions, and users.

These models not only provide data that is ready for use in AI and BI applications, but are also designed to efficiently handle large volumes of high-velocity data in an incremental way.

In addition, common predictive features such as "time spent" are automatically and accurately calculated as part of these models, using a heartbeats-based approach.
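
To show why a heartbeat-based calculation differs from simply subtracting the first timestamp from the last, here is a simplified sketch. It assumes the tracker sends a ping at a fixed interval only while the user is active, and it counts each interval containing at least one ping once. The function name, the 10-second interval and the sample timestamps are assumptions for illustration, not Snowplow's exact calculation.

```python
def engaged_seconds(ping_times, view_start, heartbeat=10):
    """Estimate engaged time from heartbeat pings.

    Each ping is taken as evidence that the user was active during one
    heartbeat interval. Pings are bucketed by interval, so several pings
    in the same interval are counted only once.
    """
    buckets = {
        int((t - view_start) // heartbeat)
        for t in ping_times
        if t >= view_start
    }
    return len(buckets) * heartbeat


# Hypothetical ping timestamps, in seconds since the page view started.
# The user is active for 30 seconds, idle for about a minute, then active again.
pings = [10, 10.2, 20, 30, 95, 100]
print(engaged_seconds(pings, view_start=0))   # time with evidence of activity
print(max(pings) - 0)                         # naive last-minus-first duration
```

In this example the naive duration is 100 seconds, while the heartbeat estimate counts only the 50 seconds for which there is evidence of activity, which is usually the more useful signal for a "time spent" feature.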

Summary and Conclusions

This blog post discusses the benefits of using behavioral data, specifically web sessions data, to improve the accuracy of machine learning models. It provides a definition of behavioral data as any data generated from the actions or behaviors of individuals or groups, including social media interactions, website clicks, and purchase history. Behavioral data is useful for understanding and predicting customer behavior and can be used to inform marketing and product development strategies.

We presented why behavioral data is not commonly used in machine learning. One of the main reasons is the complexity of the data, which may be scattered across multiple sources and systems, making it difficult to extract and consolidate. Data lineage is another challenge with behavioral data, as it is important to understand the history of the data to ensure its accuracy and reliability, which can be difficult to do. There are also compliance challenges associated with the use of behavioral data as it may reveal sensitive information about individuals that must be carefully managed according to local laws and regulations.

To illustrate the benefits of using behavioral data in machine learning models, we presented an example from a Kaggle competition hosted by Airbnb aimed at predicting where a new user would book their first reservation on the platform. The results on this dataset show that enriching the training data with behavioral features improves prediction accuracy.

Finally, we introduced and explained the benefits of using Snowplow, an enterprise Behavioral Data Platform, to simplify and automate the process of utilizing behavioral data in machine learning pipelines. Snowplow provides a standard way to collect, aggregate and create the richest possible behavioral dataset. It also implements the appropriate safeguards to protect the privacy of individuals. For these reasons, Snowplow is an excellent choice to provide data that is ready for use in Machine Learning or AI applications.