Huskimo 0.3.0 released: warehouse your Twilio telephony data in Redshift

We are pleased to announce the release of Huskimo 0.3.0, for companies who use Twilio and would like to analyze their telephony data in Amazon Redshift, alongside their Snowplow event data.
For readers who missed our Huskimo introductory post: Huskimo is a new open-source project which connects to third-party SaaS platforms (Singular and now Twilio), exports their data via API, and then uploads that data into your Redshift instance.
Huskimo is a complement to Snowplow’s built-in webhook support – Huskimo exists because not all SaaS services offer webhooks which expose their internal data as a stream of events. Note that you do not need to use Snowplow to use Huskimo.
Read on after the jump for:
1. Twilio support
Overview
Twilio is a cloud telephony service, used by many thousands of companies to develop and operate call, voicemail and texting systems.
Version 0.3.0 of Huskimo supports five resources made available through the Twilio API. These are set out in the following table:
Resource | Fetch | Table |
---|---|---|
Calls | New based on StartTime |
twilio_calls |
Messages | New based on DateSent |
twilio_recordings |
Recordings | New based on DateCreated |
twilio_messages |
IncomingPhoneNumbers | All | twilio_incoming_phone_numbers |
pricing.PhoneNumbers | All | twilio_pricing_phone_numbers |
For each resource type, Huskimo will retrieve records from the Twilio RESTful API, convert them into a simple TSV file format, and load them into Redshift. Note that we do not extract any sub-resources from these Twilio resources, so there are no child tables for any of these five resources in Redshift.
In the next section we will explain the algorithm used by Huskimo to extract this data from Twilio.
Extraction algorithm
Every time Huskimo runs to extract data from Twilio, it should:
- Connect to Twilio using the credentials in the configuration file
- For IncomingPhoneNumbers, fetch all data from that resource
- For Calls, Messages and Recordings, fetch all data for the given day
- For pricing.PhoneNumbers, fetch all data using a bespoke algorithm
- Upload the Twilio usage data into each Amazon Redshift database specified in the configuration file
Note that this behavior is different from how Huskimo extracts data for Singular: because marketing data is difficult to finalize, Huskimo fetches spend data from Singular for each of the past N days (the default is 30), every time Huskimo runs. By contrast we treat Twilio’s telephony data as “golden” as soon as it is available, and so there is no equivalent “lookback” approach for Huskimo’s treatment of Twilio.
Algorithm for pricing.PhoneNumbers
The algorithm for retrieving pricing.PhoneNumbers from Twilio is as follows:
- Do a
GET
topricing.twilio.com/v1/PhoneNumbers/Countries
to get a list of Twilio’s countries - Loop through each country returned and do a
GET
topricing.twilio.com/v1/PhoneNumbers/Countries/{Country}
- Flatten each entry in the
phone_number_prices
array returned into its own row in the output table
Thus the output table in Redshift table twilio_pricing_phone_numbers
looks like this:
Estonia | EE | mobile | 3.00 | 3.00 | usd |
Estonia | EE | national | 1.00 | 1.00 | usd |
Estonia | EE | local | 0.50 | 0.50 | usd |
2. Other updates
The following general fixes have been applied:
- We have added support for SSL-secured Redshift databases (#21)
- We have fixed a bug in
deleteFromS3
where only the first 1000 files were deleted (#18) - We split
redshift-ddl.sql
intocommon-redshift-ddl.sql
andsingular-redshift-ddl.sql
(#15)
We have also made some updates to Huskimo’s Singular support:
- Huskimo now allows
null
forcreative_name
insingular_creatives
(#19) - We fixed macros in fetchAndWrite’s Exception (#16)
- Singular now only fetches channels of type
singular
(#14) - We partially fixed an issue where Akka prevents clean exit on Exception (#1) – the remainder of the fix should come in 0.3.1 (#24)
3. Running Huskimo
Running Huskimo consists of four steps:
- Install Huskimo
- Write the Hukimo config file
- Deploy the Redshift tables
- Schedule Huskimo to run nightly
We’ll cover each of these steps briefly in the next section.
Install Huskimo
Huskimo is made available as an executable “fatjar” runnable on any Linux system. It is hosted on Bintray, download it like so:
$ wget http://dl.bintray.com/snowplow/snowplow-generic/huskimo_0.3.0.zip
Once downloaded, unzip it:
$ unzip huskimo_0.3.0.zip
Assuming you have a recent (Java 7 or 8) runtime on your system, running is as simple as:
$ ./huskimo --config my-config.yml
Write the Huskimo config file
Huskimo is configured using a YAML-format file which looks like this:
fetch: lookback: 30 # Number of days back in time from the start date to fetch for channels: - name: ADD HERE type: singular api_user: # Leave blank for Singular api_key: ADD HERE - name: ADD HERE type: twilio api_user: ADD HERE # Twilio account sid api_key: ADD HERE # Twilio auth token s3: access_key_id: ADD HERE secret_access_key: ADD HERE region: ADD HERE # Region bucket lives in bucket: ADD HERE # Must be s3:// not s3n:// for Redshift folder_path: ADD HERE targets: - name: ADD HERE type: redshift # Only Redshift support currently host: ADD HERE # The endpoint as shown in the Redshift console database: ADD HERE # Name of database port: 5439 # Default Redshift port ssl: false # SSL disabled by default ssl_factory: org.postgresql.ssl.NonValidatingFactory # Disable SSL certificate validation by default table: ADD HERE # TODO: move this out username: ADD HERE password: ADD HERE
Key things to note:
- You can configure Huskimo to extract from one or more Singular or Twilio accounts
- You can configure Huskimo to write the extracted data to one or more Redshift databases
- Huskimo requires Amazon S3 details to power the
COPY
into Redshift
If you are upgrading from version 0.2.0 note the new fields:
- Under
channels
,api_user
(leave blank for Singular) - Under
targets
, the new fieldsssl
andssl_factory
to support the SSL security setting on Redshift databases
Deploy the Redshift tables
Before starting Huskimo you must deploy the relevant tables into Redshift. You can find the shared database setup in the file:
If you are extracting data from Twilio, run this script:
If you are extracting data from Singular, run this script:
Make sure to deploy this file against each Redshift database you want to load Singular or Twilio data into.
Schedule Huskimo to run nightly
You are now ready to schedule Huskimo to run daily.
We typically run Huskimo in the early morning so that the data for yesterday is already available (even if rather incomplete). A cron entry for Huskimo might look something like this:
30 4 * * * /opt/huskimo/huskimo-0.3.0 --config /etc/huskimo.yml
4. Getting help
For more details on this release, please check out the Huskimo 0.3.0 on GitHub.
We will be building a dedicated wiki for Huskimo to support its usage; in the meantime, if you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
5. Huskimo roadmap
We will be adding support for further SaaS platforms to Huskimo on a case-by-case basis.
We are particularly interested in adding support for more marketing channels, such as Google AdWords or Facebook. Having these datasets available in Redshift alongside your event data should enable some very powerful marketing attribution and return-on-spend analytics.
If you are interested in sponsoring a new integration for Huskimo, do get in touch!