Snowplow Analytics builds an open source data pipeline and offers it as a managed service (Snowplow BDP) within a customer’s cloud account (what we call “private SaaS”). The managed service console, a proprietary offering, provides visibility and control of the pipeline as well as tools to manage data collection and modeling. Up until a few months ago the permissions system was very simple: a console user was either an administrator with full access everywhere, or a simple user with no permissions to affect the production pipeline. Last spring we decided to change that and create a fine-grained access control system to enable per-user and per-resource policies, so that we can offer the level of control that enterprise customers expect by default.
Now, anyone who has set out to implement a non-trivial policy and its supporting system knows that this is the mother of all rabbit holes. We knew that when we started, and had many opportunities to confirm it along the way. Thankfully we are a practical bunch, and early on we decided that we would not try to build the supporting infrastructure in-house unless absolutely necessary.
Our main requirements were:
- Minimum operational effort: Once the chosen system is integrated, maintenance effort should be as low as possible.
- Flexibility on use cases: Expressing new policies and updating the existing ones should be easy, intuitive, and require no development in our backend if possible.
- Solid community and/or a company to support us if we hit a wall at any point.
- Ideally we wanted to have policies drive both the backend (an action can/cannot be performed) and the UI (the button to execute an action is/isn’t displayed).
We felt the last one was some kind of UX utopia that perhaps wasn’t possible with existing tooling. Nevertheless, it was on our wish list.
Enter Open Policy Agent
We started looking for options and it very quickly became evident that our best bet was the Open Policy Agent (OPA). The proposition was compelling from the start:
- A flexible, declarative language (Rego) to express policies outside of code.
- Testable policies.
- A service that we could deploy side-by-side with our backend, not a library. We run on a purely functional Scala backend, and we avoid introducing new libraries when possible/sensible to minimize the number of imported dependencies.
- Partials. Very early on this feature captured our attention as something we could use towards our “single policy drives backend and UI” nirvana.
- A very active user community *and* a company (Styra, the OPA creators) should we need help.
After a short prototyping phase where all of our hypotheses were successfully validated, we decided to move on with OPA.
Our services run on AWS ECS and we use Terraform for our deployments. Our aws_ecs_task_definition includes both the container for our own backend and a list of sidecars (just OPA for the time being). If you are not familiar with the concept of sidecars, this blog post by AWS explains it intuitively with an example.
This setup minimizes network latency and the surface for connectivity errors, while keeping things tidy with full separation between our backend and the policy engine. Since we wrote that initial Terraform code, no maintenance has been required other than bumping the OPA version every now and then.
Generalizing authorization concepts
The question we need to ask OPA is whether a subject (console user or machine-to-machine application using the console API directly) has access to a particular resource, set of resources, or class of resources. This is intuitively generalized in the following structure:
    sealed abstract class AuthzOp[R, F, RID: Coercible[Refined[String, P], *], P] private (
      val action: Action,
      val resourceType: ResourceType,
      val resourceId: Option[RID],
      val resourceForm: Option[F],
      val resources: List[R]
    )
We have defined an enumeration of generic CRUD-style actions, different classes of resources (ResourceType), optionally an ID of a particular resource, and a potentially empty list of resources with all their details.
For example, if we want to check whether someone is allowed to list event shapes (what is known as “data structures” in Snowplow BDP terminology), we would build an instance of the following record as input to OPA:
    case object ListDataStructure extends AuthzOp(
      Action.List,
      ResourceType.DataStructures,
      none[DataStructure.Id],
      none[DataStructure],
      List.empty[DataStructure]
    )
If, on the other hand, we want to check for one particular data structure then we build an instance of the following record:
    case class ViewDataStructure(dataStructure: DataStructure) extends AuthzOp(
      Action.View,
      ResourceType.DataStructures,
      dataStructure.id.some,
      none[DataStructure],
      List(dataStructure)
    )
These records are combined with an authorization context, which contains information about the subject trying to execute the action and the full list of permissions this subject has on record. The combination is fed to OPA, which responds with whether the action is permitted according to the Rego policies we have loaded into it.
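To make the shape of such a request concrete, here is a minimal sketch of how an operation and an authorization context might be rendered into the body of an OPA query. It relies on one fact about OPA's Data API: the document to evaluate is sent under a top-level "input" key, and the decision comes back under "result". Everything else is an assumption made for illustration: the simplified `Op`, `AuthzContext` and `Permission` types, the JSON field names, and the hand-rolled JSON rendering are hypothetical and are not our production model.

```scala
// Hypothetical, simplified model: the real AuthzOp uses refined IDs and
// carries full resource details, which are omitted here.
object OpaInputSketch {
  final case class Permission(action: String, resourceType: String)

  final case class AuthzContext(subjectId: String, permissions: List[Permission])

  final case class Op(action: String, resourceType: String, resourceId: Option[String])

  // Minimal JSON string escaping (backslash and double quote only),
  // enough for this sketch but not for arbitrary input.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Render the request body for OPA's Data API.
  // The field names inside "input" are assumptions of this sketch.
  def render(ctx: AuthzContext, op: Op): String = {
    val perms = ctx.permissions
      .map(p => s"""{"action":${quote(p.action)},"resourceType":${quote(p.resourceType)}}""")
      .mkString("[", ",", "]")
    val rid = op.resourceId.map(quote).getOrElse("null")
    s"""{"input":{"subject":{"id":${quote(ctx.subjectId)},"permissions":$perms},""" +
      s""""action":${quote(op.action)},"resourceType":${quote(op.resourceType)},"resourceId":$rid}}"""
  }
}
```

A backend would POST this body to the sidecar's Data API at the path of the relevant policy rule and read the decision from the "result" field of the response. A real implementation would more likely use a JSON library with derived encoders than string concatenation.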
Bringing this to the UI
To reuse the same policies in the UI we had a few options. One is to compile policies to WASM that we could run in the browser, but our initial investigations showed those WASM assets to be quite large. We care about front-end performance and weren’t keen to add large assets when alternative approaches were available (it is worth noting, however, that the OPA team has improved this area considerably since then, and we plan to re-evaluate our decision in the future).
That was where partials came into the picture. We built an API that proxies OPA and serves partials, and an interpreter for them that runs inside the web application. Essentially, a partial is a decision tree waiting for certain variables to be filled in with information local to the browser. The interpreter fills them in, then walks the tree, executing all logical operations, to conclude whether a page element should be displayed.
Preliminary testing showed that very complex policies referencing particular resource IDs slow our interpreter to an impractical degree, but so far we have always been able to work around those limitations. We are still relatively early in this journey, and expect to put more effort into making such complex policies workable in the future. That being said, we already reap the benefits of this approach:
The comment above concerns a case where our product manager requested changes to how user permissions are managed for access to pipeline environments. By simply changing the Rego policy, with no code changes whatsoever, everything worked as intended in both the backend and the UI.
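The idea behind the partial interpreter described in this section can be sketched in a few lines. The sketch below is written in Scala only for consistency with the other snippets in this post; it is not the interpreter we run in the web application, and the `Term`, `Expr` and `Partial` types are hypothetical simplifications. It assumes a partial can be reduced to a list of alternative queries (OR), each of which is a list of conditions (AND) over literals and unknowns, which loosely mirrors the list of residual queries returned by OPA's partial evaluation.

```scala
object PartialSketch {
  // A term is either a literal or an unknown the browser fills in locally.
  sealed trait Term
  final case class Lit(value: String) extends Term
  final case class Unknown(name: String) extends Term

  // Residual conditions left over once the server-side inputs are applied.
  sealed trait Expr
  final case class Eq(left: Term, right: Term) extends Expr
  final case class Neq(left: Term, right: Term) extends Expr

  // OR of ANDs: the element is shown if every condition of at least one
  // query holds.
  final case class Partial(queries: List[List[Expr]])

  private def resolve(t: Term, local: Map[String, String]): Option[String] =
    t match {
      case Lit(v)     => Some(v)
      case Unknown(n) => local.get(n)
    }

  // A condition whose unknown has no local value is treated as not satisfied.
  private def holds(e: Expr, local: Map[String, String]): Boolean = {
    def compare(l: Term, r: Term)(f: (String, String) => Boolean): Boolean =
      (for { a <- resolve(l, local); b <- resolve(r, local) } yield f(a, b))
        .getOrElse(false)
    e match {
      case Eq(l, r)  => compare(l, r)(_ == _)
      case Neq(l, r) => compare(l, r)(_ != _)
    }
  }

  // No queries at all means "never visible"; an empty query means
  // "always visible", since it has no remaining conditions to check.
  def isVisible(p: Partial, local: Map[String, String]): Boolean =
    p.queries.exists(query => query.forall(expr => holds(expr, local)))
}
```

For example, a partial with the two alternative queries `environment == "dev"` and `role == "admin"` makes an element visible when the local data contains `environment -> "dev"`, whatever the role. Real partials can contain richer expressions (nested references, built-in function calls, set membership), which is where interpreter cost grows for complex policies.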
A few final words
While it took some time to put all this together and apply it to our first use cases, we strongly believe it was worth the effort. It feels like a superpower to be able to express policies in such a uniform (and testable!) way for all the features we surface in the console, and we expect this setup to serve us well for many years to come.
If all this looks interesting to you, you speak a bit of Scala and care about code quality, we are hiring and would love to hear from you!