Snowplow R108 Val Camonica with batch pipeline encryption released
We are pleased to announce the release of Snowplow R108 Val Camonica, named after [the collection of stone carvings in northern Italy][val-camonica].
This release enables the Snowplow batch pipeline to operate fully encrypted. This effort builds upon what is already possible “out of the box” regarding encryption in the Snowplow pipeline, namely:
- Encrypting Kinesis streams at rest
- Encrypting Elasticsearch data at rest
- Encrypting Redshift data at rest
This release brings the ability to have end-to-end encryption for the batch pipeline by making it possible to:
- Encrypt data at rest on S3
- Encrypt data at rest on the local disks in your EMR cluster
- Encrypt data in-transit in your EMR cluster
This release also makes some minor – but important – improvements to the batch pipeline’s Clojure Collector.
Please read on after the fold for:
- Enabling end-to-end encryption for the batch pipeline
- Additional EmrEtlRunner features
- Fixing the Clojure Collector’s cookie path handling
- Upgrading
- Roadmap
- Getting help
![val-camonica][val-camonica-img] The Camunian Rose – Luca Giarelli / CC-BY-SA 3.0
1. Enabling end-to-end encryption for the batch pipeline
It is possible to set up end-to-end encryption for the batch pipeline running in Elastic MapReduce. For context, we recommend checking out Amazon’s dedicated guide to EMR data encryption.
In order to set up end-to-end encryption, you will need a couple of things:
- Encryption of your S3 buckets
- An appropriate EMR security configuration
1.1 Encrypting S3 buckets
For at rest encryption on S3, the buckets with which EmrEtlRunner will interact must have SSE-S3 encryption enabled – this is the only mode we currently support. For reference, you can look at Amazon’s dedicated guide to S3 encryption.
Keep in mind that switching on this setting is not retroactive. If you want only encrypted data in your bucket, you will need to go through the existing data and copy each object in place so that it is rewritten with encryption.
Also, if you are using the Clojure Collector, SSE-S3 encryption needs to be set up at the bucket level, not the folder level, in order to take effect.
Once this is done, you will need to tell EmrEtlRunner that it will have to interact with encrypted buckets through the `aws:s3:buckets:encrypted: true` configuration setting.
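For illustration, this is roughly how the setting sits in your config.yml (a minimal sketch, with all unrelated keys omitted):

```yaml
aws:
  s3:
    buckets:
      encrypted: true   # tell EmrEtlRunner the buckets it interacts with use SSE-S3 encryption
```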
1.2 Setting up an appropriate EMR security configuration
Elastic MapReduce offers EMR security configurations, which let you enforce encryption for various aspects of your job. The options are:
- Encrypt data at rest on S3 when using EMRFS
- Encrypt data at rest on local disks
- Encrypt data in-transit
For a complete guide on setting up an EMR security configuration, you can refer to Amazon’s dedicated guide to EMR security.
Once you’ve performed this setup, you can specify which security configuration EmrEtlRunner should use through the `aws:emr:security_configuration` EmrEtlRunner configuration option, which we will cover in the Upgrading section below.
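As a minimal sketch, the option sits in config.yml roughly as follows (the configuration name is a placeholder for whatever you called yours):

```yaml
aws:
  emr:
    security_configuration: snowplow-security-config  # name of the EMR security configuration you created
```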
Let’s review each of these three EMR encryption options to understand their impact on our Snowplow batch pipeline.
1.2.1 At rest encryption on S3 when using EMRFS
This specifies the strategy to encrypt data when EMR interacts with S3 through EMRFS. By default, even without an encryption setup, data is encrypted while in transit from EMR to S3.
Note that, currently, the batch pipeline does not make use of EMRFS; instead, it copies data from S3 to the HDFS cluster on the EMR nodes, and from HDFS back to S3, through S3DistCp steps – more on that in the next section.
1.2.2 At rest encryption on local disks
When running the Snowplow pipeline in EMR, an HDFS cluster is set up on the different nodes of your cluster. Enabling encryption for the local disks on those nodes will have the following effects:
- HDFS RPC, e.g. between name node and data node, uses SASL
- HDFS block transfers (e.g. replication) are encrypted using AES 256
- Attached EBS volumes are encrypted using LUKS
When enabling this option, please keep the following drawbacks in mind:
- EBS root volumes are not encrypted; you need to use a custom AMI for that: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html
- KMS key usage is subject to pricing: https://aws.amazon.com/kms/pricing/
- It has a performance impact, e.g. when writing your enriched data to HDFS
To set up this type of encryption, you will need to create an appropriate KMS key (refer to Amazon’s KMS guide for more information). This key needs to be in the same region as the EMR cluster.
It is important to note that the role used in `aws:emr:jobflow_role` in the EmrEtlRunner configuration needs to have the `kms:GenerateDataKey` policy for this setting to work.
This policy will be used to generate the necessary data keys using the “master” key created above. Those data keys are, in turn, used to encrypt pieces of data on your local disks.
1.2.3 In-transit encryption (Spark and MapReduce)
When running the Spark jobs of the Snowplow pipeline (Enrich and Shred), and when running some S3DistCp steps (e.g. using `--groupBy` or `--targetSize`), data is shuffled around the different nodes in your EMR cluster. Enabling encryption for those data movements will have the following effects:
- MapReduce shuffles use TLS
- RPC and data transfers in Spark are encrypted using AES 256 if EMR >= 5.9.0; otherwise RPC is encrypted using SASL
- SSL is enabled for all things HTTP in Spark (e.g. history server and UI)
Be aware that this type of encryption also has a performance impact as data needs to be encrypted when sent over the network (e.g. when running deduplication in the Shred job).
To set up this type of encryption, you will need to create certificates per Amazon’s PEM certificates for EMR guidance.
Please note: for this type of encryption to work, you will need to be in a VPC, and the domain name specified in the certificates needs to be `*.ec2.internal` if in us-east-1 or `*.<region>.compute.internal` otherwise.
2. Additional EmrEtlRunner features
This release also brings some ergonomic improvements to EmrEtlRunner:
- There is a new `--ignore-lock-on-start` option which lets you ignore an already-in-place lock, should one exist. Note that the lock will still be cleaned up if the run ends successfully
- It is now possible to specify the Snowplow collector’s port and protocol for EmrEtlRunner observability, through `monitoring:snowplow:{port, protocol}` (see the sketch after this list)
- Under the hood, EmrEtlRunner now uses the official AWS Ruby SDK instead of our now-retired Sluice library. This should greatly help with memory consumption
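As an illustration of the new monitoring options, the relevant part of config.yml might look like this (a sketch only; the collector hostname and values are placeholders):

```yaml
monitoring:
  snowplow:
    collector: collector.acme.com   # hypothetical collector endpoint
    app_id: snowplow
    protocol: https                 # new in R108
    port: 443                       # new in R108
```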
3. Fixing the Clojure Collector’s cookie path handling
Up until this release, the Clojure Collector defaulted to using the parent path of the requested collector endpoint as the path for the `network_userid` cookie being set. For example, if you were to use:
- the pixel endpoint (`my-collector.com/i`), the cookie path would be `/`
- the Iglu webhook endpoint (`my-collector.com/com.snowplowanalytics.iglu/v1`), the cookie path would be `/com.snowplowanalytics.iglu/`
This would lead to the `network_userid` being unintentionally different for the same user across the different event collection paths.
With R108, the cookie path will always default to `/`, no matter the endpoint hit. This can be overridden through the `SP_PATH` Elastic Beanstalk environment property.
Finally, we’ve updated a good number of dependencies in the Clojure Collector.
4. Upgrading
This release applies only to our AWS batch pipeline – if you are running any other flavor of Snowplow, there is no upgrade necessary.
4.1 Upgrading EmrEtlRunner
The latest version of EmrEtlRunner is available from our Bintray here.
To use the latest EmrEtlRunner features, you will need to make the following changes to your EmrEtlRunner configuration:
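Pulling together the options discussed above, the additions to config.yml would look roughly as follows (a sketch only, with placeholder values; omit anything you are not using):

```yaml
aws:
  s3:
    buckets:
      encrypted: true                # only if your buckets use SSE-S3 encryption
  emr:
    security_configuration: snowplow-security-config  # optional; placeholder name
monitoring:
  snowplow:
    protocol: https                  # new optional settings (see section 2)
    port: 443
```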
For a complete example, see our sample config.yml template.
4.2 Upgrading the Clojure Collector
The new Clojure Collector is available in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.0-standalone.war
To customize your cookie path so that it does not default to `/`, make sure to specify the `SP_PATH` Elastic Beanstalk environment property as described above.
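For example, one way to set this property is through an `.ebextensions` configuration file bundled with the application (a sketch; setting the property through the Elastic Beanstalk console works just as well):

```yaml
# .ebextensions/collector-options.config (hypothetical file name)
option_settings:
  aws:elasticbeanstalk:application:environment:
    SP_PATH: "/custom-path"   # value to use for the network_userid cookie path
```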
5. Roadmap
Upcoming Snowplow releases are:
- R109 Mileum, which will introduce various new features to our real-time pipeline, particularly the Scala Stream Collector
- R110 Vallei dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
6. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.