Piinguin, Snowplow PII usage management service, released

This is a detailed technical walkthrough of Piinguin; to learn more about what Piinguin does and why we built it, see the Piinguin introduction post.
We are pleased to announce the first release of Piinguin and the associated Snowplow Piinguin Relay. This initial release introduces basic capabilities for managing the usage of personally identifiable information (PII) from Snowplow.
Read on for more information on Piinguin and the Snowplow Piinguin Relay:
1. Overview
Following the release of Snowplow R106 Acropolis, which added the capability to emit a stream of PII transformation events, we have continued to develop tools to support the responsible management of personally identifiable information.
If you want to learn more about PII and how it is managed by the Snowplow PII enrichment, you can read more in the release posts for Snowplow R100 Epidaurus and R106 Acropolis.
Piinguin aims to round out our approach to PII management, by providing a service which stores PII and helps control access by requiring that anyone who reads PII data provides a justification based on the lawful basis for processing PII specified under GDPR.
Piinguin consists of several elements that sit alongside Snowplow to store and serve PII data. Here is an overview of the architecture:
The first component that receives data out of Snowplow’s stream of PII transformation events is the Snowplow Piinguin Relay, an AWS Lambda function which uses the piinguin-client artifact to send data to Piinguin. You can read more details about this relay below, and detailed instructions on how to install and run it in the deploying section.
The second component is the piinguin-server itself, which has to be in the same secure VPC as the Lambda function. In addition, it needs access to an AWS DynamoDB table to store the data. You can read more details about Piinguin below, along with detailed instructions on how to install and run it under deploying.
The final component is the aforementioned piinguin-client, potentially running embedded in your own code to manage your interactions with the PII stored in Piinguin. This client library is discussed in more detail in the upcoming Piinguin section.
2. Piinguin
The Piinguin project consists of three parts:
- Protocol
- Server
- Client
Piinguin is based on gRPC, a Protocol Buffer-based RPC framework. The protocol in the Piinguin project specifies the interface between the client and the server. There is a .proto file which describes the interactions between the client and the server for reading, writing and deleting PII records. That file is used with the excellent scalapb Scala compiler plug-in to generate code stubs for both the server and the client. These can then be used to implement any behavior based on that interface.
The piinguin-server implements the behavior of the server according to the interface, which in this case means writing to and reading from DynamoDB using another excellent library, scanamo. In the highly unlikely event of a hash collision (two different original values producing the same hash), the last seen original value will be kept. (There are thoughts of keeping all values in that case, although their utility is dubious; feel free to discuss in the relevant issue on GitHub.)
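To make this storage behavior concrete, here is a minimal illustrative sketch of writing and reading such a record with scanamo against the default piinguin table (see section 4.3). This is not the server's actual code, and the PiiRecord case class is a hypothetical shape for the stored item:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.gu.scanamo.Scanamo
import com.gu.scanamo.syntax._

// Hypothetical shape of a stored item: the hash (modifiedValue) is the table key
case class PiiRecord(modifiedValue: String, originalValue: String)

val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

// Writing the same modifiedValue twice overwrites the stored original value,
// i.e. the last seen original value wins
Scanamo.put(dynamo)("piinguin")(PiiRecord("e3b0c442", "jane@example.com"))
Scanamo.put(dynamo)("piinguin")(PiiRecord("e3b0c442", "john@example.com"))

Scanamo.get[PiiRecord](dynamo)("piinguin")('modifiedValue -> "e3b0c442")
// => Some(Right(PiiRecord(e3b0c442,john@example.com)))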
Finally, the piinguin-client artifact provides a client API for use from Scala. There are three ways to use the client API: with plain Scala Futures, FS2 IO, and FS2 Streaming. Please note that the FS2 Streaming implementation remains highly experimental and its use is currently discouraged as it is likely to change significantly; any and all comments and PRs are of course welcome.
3. Snowplow Piinguin Relay
The Snowplow Piinguin Relay uses the aforementioned piinguin-client in an AWS Lambda function to forward all PII transformation events to the piinguin-server.
The relay uses the Snowplow Analytics SDK to read the PII transformation enriched events from the Kinesis stream, extract the relevant fields (currently, only the modified and original values) and perform a createRecord operation against the piinguin-server.
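To make the flow concrete, here is a simplified sketch of such a relay (this is not the relay's actual source). The extractPiiPair helper is hypothetical and stands in for the Analytics SDK parsing step; createPiiRecord is the client operation also shown in section 4.6:

import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import scala.concurrent.{Await, ExecutionContext}
import scala.concurrent.duration._
import com.amazonaws.services.lambda.runtime.Context
import com.amazonaws.services.lambda.runtime.events.KinesisEvent
import com.snowplowanalytics.piinguin.client.PiinguinClient

class PiinguinRelaySketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // In practice host, port and timeout come from the environment variables
  // described in section 4.1; they are hard-coded here to keep the sketch short
  private val client  = new PiinguinClient("localhost", 8080)
  private val timeout = 10.seconds

  // Invoked by AWS Lambda for each batch of records from the PII events stream
  def handler(event: KinesisEvent, context: Context): Unit =
    for {
      record <- event.getRecords.asScala
      line    = new String(record.getKinesis.getData.array, UTF_8)
      (modifiedValue, originalValue) <- extractPiiPair(line)
    } Await.result(client.createPiiRecord(modifiedValue, originalValue), timeout)

  // Hypothetical placeholder: the real relay uses the Snowplow Analytics SDK
  // to parse the enriched event and extract the modified/original value pair
  private def extractPiiPair(line: String): Option[(String, String)] = None
}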
4. Deploying
Both the Piinguin Server and the Piinguin Relay currently support AWS only, and they should be deployed to the same VPC.
4.1 Configuring the Snowplow Piinguin Relay
You can obtain the relay artifact from our S3 public assets buckets appropriate for your region.
To create the AWS Lambda function, please follow the detailed developer guide. When creating the Lambda, make sure to:
- Specify as trigger the AWS Kinesis stream that contains your PII transformation events, as produced by Snowplow
- Provide the ID of the VPC where you are running the Piinguin Server
- In the Environment variables section, add the PIINGUIN_HOST, PIINGUIN_PORT and PIINGUIN_TIMEOUT_SEC variables

The PIINGUIN_TIMEOUT_SEC value should be lower than the AWS Lambda timeout in order to get a meaningful error message if the client times out while communicating with the server. Here is an example of that configuration:
PIINGUIN_HOST = ec2-1-2-3-4.eu-west-1.compute.amazonaws.com
PIINGUIN_PORT = 8080
PIINGUIN_TIMEOUT_SEC = 10
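As a rough sketch of how these could be consumed (this is not the relay's actual code), the variables might be read along these lines, with the timeout bounding how long each Piinguin call is awaited:

import scala.concurrent.duration._
import com.snowplowanalytics.piinguin.client.PiinguinClient

// Read the relay configuration from the Lambda environment, with illustrative defaults
val host    = sys.env.getOrElse("PIINGUIN_HOST", "localhost")
val port    = sys.env.getOrElse("PIINGUIN_PORT", "8080").toInt
val timeout = sys.env.getOrElse("PIINGUIN_TIMEOUT_SEC", "10").toInt.seconds

val client = new PiinguinClient(host, port)
// timeout would then bound each Await on the client's Futures, so that a slow
// Piinguin call fails with a clear timeout before the Lambda itself times out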
4.2 Setting up relay permissions to the VPC
As stated before, both the relay and the Piinguin Server need to reside in the same VPC. In addition, the Lambda function needs to have sufficient access from IAM to run. You should create a service role and attach policies that will permit it to run following this guide. Like many Lambda functions, this one also needs permission to send its output to CloudWatch Logs; this IAM policy should cover that:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "logs:CreateLogGroup", "Resource": "arn:aws:logs:<region>:<account-id>:*" }, { "Effect": "Allow", "Action": [ "logs:CreateLogStream", "logs:PutLogEvents" ], "Resource": [ "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/piinguin-relay:*" ] } ] }
As the Lambda will be reading its PII transformation events from Kinesis, it will also need to have permissions to do that, with a policy document such as:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "kinesis:*", "Resource": [ "arn:aws:kinesis:<region>:<account-id>:stream/<pii-events-stream-name>" ] } ] }
4.3 Deploying the Piinguin Server
The simplest way to deploy Piinguin Server is to obtain the Docker image by running the following on your Docker host:
$ docker run snowplow-docker-registry.bintray.io/snowplow/piinguin-server:0.1.1
This will run the server on the default port 8080 and will use the default DynamoDB table piinguin. Both are configurable to other values using PIINGUIN_PORT and PIINGUIN_DYNAMODB_TABLE, if needed. See the relevant readme for more information.
4.4 Setting up server permissions to the VPC
As stated before, both the Relay and the Server need to reside in the same VPC. In addition, the Docker host needs to have sufficient access from IAM to run. You should create a service role and attach policies that will permit it to run following this guide.
As the server writes its data to DynamoDB it will need to have access to it with a policy document such as:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "dynamodb:DeleteItem", "dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Scan", "dynamodb:UpdateItem" ], "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/<table-name>" } ] }
4.5 Setting up the DynamoDB table
You will need to create the appropriate DynamoDB table in order to use Piinguin.
To create a DynamoDB table, log in as normal to your AWS console, type DynamoDB into the services field and select DynamoDB from the list:
From the DynamoDB page, click Create table:
Finally, specify the desired table name, set the primary key to modifiedValue and its type to String, then click Create.
If you are comfortable with the CLI, you can also create the DynamoDB table using the following command:
aws dynamodb create-table \
  --table-name piinguin-prod \
  --attribute-definitions AttributeName=modifiedValue,AttributeType=S \
  --key-schema AttributeName=modifiedValue,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
With the DynamoDB table created, setup is now complete and you can use Piinguin.
4.6 Testing that Piinguin is functioning
One way to verify that your setup works is to check out the Piinguin project on GitHub and try to write and then read back a record:
$ sbt "client/console" scala> import scala.concurrent.{ExecutionContext, Await} scala> import scala.concurrent.duration._ scala> import com.snowplowanalytics.piinguin.client.PiinguinClient scala> implicit val ec = ExecutionContext.global scala> val c = new PiinguinClient("localhost", 8080) scala> val createResult = Await.result(c.createPiiRecord("123", "456"), 10 seconds) createResult: Either[com.snowplowanalytics.piinguin.client.FailureMessage,com.snowplowanalytics.piinguin.client.SuccessMessage] = Right(SuccessMessage(OK)) scala> import com.snowplowanalytics.piinguin.server.generated.protocols.piinguin.ReadPiiRecordRequest.LawfulBasisForProcessing scala> val readResult = Await.result(c.readPiiRecord("123", LawfulBasisForProcessing.CONSENT), 10 seconds) readResult: Either[com.snowplowanalytics.piinguin.client.FailureMessage,com.snowplowanalytics.piinguin.client.PiinguinClient.PiiRecord] = Right(PiiRecord(123,456))
You can also verify that the record is in DynamoDB by clicking on Items in the console and checking that your item is there.
5. Getting help
For more details on working with Piinguin and the Snowplow Piinguin Relay, please check out the documentation in the respective GitHub repositories.
If you have any questions or run into any problems, please visit our Discourse forum.