Setting additionalProperties to true in Iglu JSON Schemas
When defining event and context schemas in Iglu, developers often face a trade-off between schema flexibility and data integrity. The additionalProperties attribute in JSON Schema is one such lever that can significantly impact data processing in Snowplow pipelines. In this post, we examine the implications of setting additionalProperties to true in Iglu JSON Schemas.
Q: What is the purpose of the additionalProperties attribute in JSON Schema?
The additionalProperties attribute controls whether an object may contain properties that the schema does not explicitly define. It defaults to true, so extra properties do not cause validation to fail. When it is set to false, any undeclared property causes schema validation to fail.
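To make the difference concrete, here is a minimal Python sketch. It is not a real JSON Schema validator: the `undeclared_properties` and `is_valid` helpers and the `button_click` schema are hypothetical, and they model only the `properties` keyword and a boolean `additionalProperties` (no `patternProperties`, types, or `required`).

```python
def undeclared_properties(schema: dict, instance: dict) -> list:
    """Return the instance keys that the schema's `properties` does not declare."""
    declared = schema.get("properties", {})
    return sorted(key for key in instance if key not in declared)


def is_valid(schema: dict, instance: dict) -> bool:
    """Model only the additionalProperties check (hypothetical, simplified)."""
    # JSON Schema treats a missing additionalProperties as true.
    allow_extra = schema.get("additionalProperties", True)
    return bool(allow_extra) or not undeclared_properties(schema, instance)


# Hypothetical schema body; a real Iglu schema also carries a `self` block.
button_click = {
    "type": "object",
    "properties": {"button_id": {"type": "string"}},
}
strict_button_click = {**button_click, "additionalProperties": False}

event = {"button_id": "signup", "colour": "green"}  # `colour` is undeclared

print(is_valid(button_click, event))         # True
print(is_valid(strict_button_click, event))  # False
```

The same event passes the open schema and fails the closed one; the only difference is the undeclared `colour` property.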
Q: What are the pros and cons of setting additionalProperties to true?
Pros:
- Increased Flexibility: Developers can add new properties to events or contexts without updating the schema, ensuring uninterrupted data collection.
- Reduced Overhead: Eliminates the need for frequent schema versioning, especially in rapidly evolving data environments.
Cons:
- Data Accessibility: Extra properties are not available in downstream data models, as they are not explicitly defined in the schema.
- Data Governance: Potentially unstructured data can enter the pipeline, complicating data quality management.
Q: How does setting additionalProperties to true impact Snowplow data processing?
If additionalProperties is set to true, extra properties pass validation and are retained in the raw JSON, but they are not shredded into columns when the data is loaded into Redshift, because only the properties defined in the schema are loaded. Behavior in other data warehouses depends on the loader. This means that while the data is technically ingested, it is not easily accessible for analysis without custom processing.
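As a rough illustration of why the extra properties go missing, the sketch below derives columns from the properties a schema declares and projects an event onto them. This is not the Snowplow loader's actual code: `columns_for` and `to_row` are hypothetical names, the schema is invented for the example, and a real loader also handles types, nesting, and column naming.

```python
def columns_for(schema: dict) -> list:
    """Columns come from the declared properties only (simplifying assumption)."""
    return sorted(schema.get("properties", {}))


def to_row(schema: dict, instance: dict) -> dict:
    """Project an instance onto the schema-derived columns.

    Undeclared keys have no column to land in, so they are left out;
    declared keys missing from the instance become None (NULL).
    """
    return {column: instance.get(column) for column in columns_for(schema)}


# Hypothetical A/B testing context schema body.
schema = {
    "type": "object",
    "properties": {
        "experiment_id": {"type": "string"},
        "variant": {"type": "string"},
    },
    "additionalProperties": True,
}

# `cohort` passes validation but is not declared in the schema.
instance = {"experiment_id": "exp-7", "variant": "B", "cohort": "new_users"}

row = to_row(schema, instance)
print(row)  # {'experiment_id': 'exp-7', 'variant': 'B'}
```

The `cohort` value is still present in `instance` (the raw JSON), but it never reaches `row`, which is the part an analyst would query.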
Q: Are there best practices for using additionalProperties in Iglu schemas?
Yes, consider these best practices:
- Selective Flexibility: Only set additionalProperties to true for specific contexts where dynamic data collection is essential, such as A/B testing contexts.
- Structured Metadata: Instead of allowing arbitrary properties, consider using nested structures or arrays to capture dynamic data while maintaining schema integrity.
- Schema Versioning Strategy: Even with additionalProperties enabled, establish a versioning strategy to formalize newly introduced properties.
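The Structured Metadata practice above can be sketched as follows. This is one possible shape, not a prescribed Iglu pattern: the `experiment_context` schema body, the `attributes` array of key/value pairs, and the `to_structured` helper are all hypothetical names chosen for the example.

```python
# Hypothetical schema body: dynamic data goes into a declared `attributes`
# array of key/value pairs, so the top level can stay closed.
experiment_context = {
    "type": "object",
    "properties": {
        "experiment_id": {"type": "string"},
        "attributes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "key": {"type": "string"},
                    "value": {"type": "string"},
                },
                "required": ["key", "value"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["experiment_id"],
    "additionalProperties": False,
}


def to_structured(experiment_id: str, dynamic: dict) -> dict:
    """Move arbitrary key/value data into the declared `attributes` array."""
    attributes = [
        {"key": key, "value": str(value)}  # values stored as strings here
        for key, value in sorted(dynamic.items())
    ]
    return {"experiment_id": experiment_id, "attributes": attributes}


payload = to_structured("exp-7", {"cohort": "new_users", "bucket": 3})
print(payload)

# Every top-level key is declared, so the closed schema has nothing to reject.
print(set(payload) <= set(experiment_context["properties"]))
```

The trade-off in this sketch is that every dynamic value is stored as a string, so consumers have to cast values back to their intended types.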
Final Thoughts
While setting additionalProperties to true provides flexibility for rapid data collection, it can also lead to data quality issues if not managed properly. Snowplow users should weigh the benefits of schema flexibility against the need for a consistent data structure, especially in production-grade pipelines.