Data Discussions: Lodi Elaridi, Data Product Manager from Skyscanner
Treating ‘Data as a Product’ at Skyscanner
Skyscanner is a leading travel search engine, providing travelers with one place to compare prices for flights, accommodation, and car rentals. To serve users with the most relevant results, they collect and operationalize data, shaping the development of features such as ‘multi-city’ search, ‘everywhere’ search, ‘Best Time to Book’, and price alerts. In this post, we speak to Skyscanner’s Data Product Manager, Lodi Elaridi, to learn more about how data is fuelling the growth of the company, and what challenges their data team has encountered.
What is the core focus of your data (or product) team at this point in time?
My team is working on improving the stability and quality of traveler data (data emitted from our product that is used for product and business enhancement) from data emission to data visualization. We’re working across many teams to provide a framework that can be scaled to all traveler data.
What overlooked challenges with data have you seen that you think are likely to be focused on in the near future?
TL;DR – Data management at scale
Data that is used for reporting and decision making has a basic requirement of being stable and consistent; however, product teams will naturally want to make changes to the product, which can lead to breaking changes in the data it emits. The challenge is bridging the gap between the needs of data consumers and the product teams that emit the data, without creating a bottleneck for the data engineers building ETL in the middle. Many interesting things are happening in this space, such as data mesh, but the dust hasn’t settled on a new standard solution.
It seems like a common problem is that data governance (a good example of what I’m referring to is presented here) and data management in modern, fast-paced companies become background processes owned by central teams that can’t keep up with the rate of change of the product. The feedback loop on the ROI of these processes is very slow, which makes it difficult to prioritize spend in these areas. What makes this even slower is the difficulty of tracking metadata around data lineage in an easily consumable way, both upstream to the product engineering teams that emit the data, and downstream to the consumers who use it to report on the business. When breaking changes to the data happen upstream, the data engineers owning the ETL in the middle struggle to adapt quickly due to the complex downstream dependencies, which altering logic or source data tends to break. This impacts consumers of the data; ideally the impact would be identified early on through quality checks and monitoring – another aspect of data management that is gaining increasing attention.
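The quality checks mentioned here can be as simple as validating each emitted event against an agreed contract, so that an upstream rename or type change is flagged before it reaches downstream ETL. A minimal sketch in Python follows; the event fields and the `validate_event` helper are hypothetical illustrations, not Skyscanner’s actual schema or tooling.

```python
# Hypothetical contract for a search event; field names are illustrative.
EXPECTED_SCHEMA = {
    "event_id": str,
    "origin": str,
    "destination": str,
    "timestamp": str,
}

def validate_event(event: dict) -> list:
    """Return a list of contract violations for one emitted event."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in event:
            errors.append("missing field: " + field)
        elif not isinstance(event[field], expected_type):
            errors.append(
                field + ": expected " + expected_type.__name__
                + ", got " + type(event[field]).__name__
            )
    # Flag unexpected fields too: a renamed field shows up as one
    # missing field plus one unknown field, a common breaking change.
    for field in event:
        if field not in EXPECTED_SCHEMA:
            errors.append("unknown field: " + field)
    return errors

# A producer renames "destination" to "dest" -- the check surfaces it.
broken = {"event_id": "e1", "origin": "EDI", "dest": "BCN",
          "timestamp": "2021-01-01T00:00:00Z"}
print(validate_event(broken))
```

Running checks like this at emission time (or in the first ETL stage) turns a silent downstream breakage into an alert the producing team can act on immediately.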
Making it easier for product teams to make changes without breaking downstream dependencies, as well as being able to identify and react to these changes as they happen, is likely going to become a bigger focus than it used to be. This type of work doesn’t get the attention it deserves until the costs of not doing it start to add up, but I think it is finally gaining visibility as an industry-wide problem.
Looking further ahead and more strategically, we’ll likely see data-leading organizations use data in a variety of new ways. What do you think these might be?
TL;DR – Data as a Product
The current picture already sees data-leading organizations doing a lot with data.
However, the bigger and more diversified an organization becomes, the more difficult it is to do all of the above on reliable data. So the future should, and hopefully will, see “data as a product” becoming the standard for data-leading organizations.
This will lead to segmentation of data consumers, and different levels of standards for the data products that serve their different needs. For example, the data needs of product managers and product teams, who need to make decisions quickly based on how users are interacting with their product, may be very different from the needs of business reporting. And that’s ok. When we think of data as a product, and allow the teams owning that product to own the whole data lifecycle, we allow them to move quickly rather than be hindered by the rigidity of centralized data pipelines.

Moving away from insisting on one source of truth to serve all needs may have to happen as we start to realise the bottlenecks this mantra has created, and the fact that it hasn’t necessarily been achievable. It hasn’t been achievable because there are usually many offshoots from the same data source – queries, or siloed notebooks that don’t meet production standards – and this is a symptom of a centralized data model trying to meet all data needs. For business reporting data that needs to be stable over time, one source of truth would lead to fewer inconsistencies across the company and makes sense. However, we tend to focus more on the mantra than on what it means: are we talking about the SDKs used to emit events, the actual datasets, the platform, or the tooling? To which of these are we applying the one-source-of-truth lens?
Data availability, completeness, quality, coverage, etc. will start to be taken more seriously – hopefully as seriously as customer-facing applications. Given the huge dependencies on data in a lot of products, SLAs against data will become the norm rather than the exception, in the same way that SLAs for service availability are the norm. The SLAs would differ depending on the use case and the customer, so that we don’t create unnecessarily stringent rules around all data equally, which could become very costly to maintain as data scales.
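One way to make per-use-case data SLAs concrete is a freshness check with different tolerances per consumer segment. The sketch below is a minimal illustration; the dataset names and thresholds are hypothetical assumptions, not real Skyscanner SLAs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-dataset freshness SLAs: business reporting tolerates
# a daily refresh, while a product team's experiment data needs hourly
# updates. Names and thresholds are illustrative only.
FRESHNESS_SLAS = {
    "business_reporting.bookings": timedelta(hours=24),
    "experiments.search_sessions": timedelta(hours=1),
}

def check_freshness(dataset, last_updated, now):
    """True if the dataset's latest update is within its SLA window."""
    return now - last_updated <= FRESHNESS_SLAS[dataset]

now = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
stale = datetime(2021, 6, 1, 9, 0, tzinfo=timezone.utc)  # 3 hours old

# Within a 24h SLA, but breaches a 1h SLA:
print(check_freshness("business_reporting.bookings", stale, now))   # True
print(check_freshness("experiments.search_sessions", stale, now))   # False
```

The point of the differentiated thresholds is exactly the one made above: the same three-hour-old data is fine for one consumer and an SLA breach for another, so a single blanket rule would be either too lax or needlessly costly.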
(Please note, the above represents the opinions of Lodi Elaridi only, and not Skyscanner)