
Chapter 7 - Building Trust in Your Data
VALIDATION
Once your data dictionary is agreed upon, set, and implemented, you’ll want
to make sure dirty data stays out of the picture. To do so, you’ll need a method for
validating how new user actions make their way to your codebase.
Even with rigorous naming conventions and instrumentation instructions, errors
will be introduced if engineers don’t receive automated feedback to help them
identify and resolve issues during implementation. And when you’re responsible
for reviewing thousands of lines of code across dozens of events, it’s inevitable
that mistakes will happen.
A single tracking error on a business-critical event, like ‘Lead Captured’,
can cost your business hundreds of thousands of dollars. The problem is that
these bugs are typically detected weeks or months later, and by that time,
the damage has been done.
Time is of the essence, so it’s important to detect mistakes before they make their
way to your production environment. Rather than manually trying to compare event
payloads against your data dictionary, you’ll want a way to automatically confirm
when data matches your spec and alert you when it doesn’t. There are many ways
to go about this, but (naturally) we prefer using our Protocols product to either
1) send a daily digest of current and new violations or 2) enable violation event
forwarding, which sends violations as .track() calls to a Segment Source.
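To make the idea of automatically checking payloads against a spec concrete, here is a minimal sketch in Python that is independent of any particular product. The tracking plan, the property names, and the find_violations helper are all hypothetical and exist only for illustration; a real data dictionary would carry far more detail.

```python
from typing import Any

# Hypothetical tracking plan: event name -> required properties and their types.
# The event and property names are illustrative, not taken from a real spec.
TRACKING_PLAN: dict[str, dict[str, type]] = {
    "Lead Captured": {"email": str, "source": str, "score": int},
}


def find_violations(event: dict[str, Any]) -> list[str]:
    """Return human-readable violations for one track-style event payload."""
    name = event.get("event")
    spec = TRACKING_PLAN.get(name)
    if spec is None:
        return [f"unplanned event: {name!r}"]

    props = event.get("properties", {})
    violations = []
    # Every planned property must be present and have the planned type.
    for key, expected in spec.items():
        if key not in props:
            violations.append(f"{name}: missing required property {key!r}")
        elif not isinstance(props[key], expected):
            violations.append(
                f"{name}: property {key!r} should be {expected.__name__}, "
                f"got {type(props[key]).__name__}"
            )
    # Properties that are not in the plan are reported as well.
    for key in props:
        if key not in spec:
            violations.append(f"{name}: unplanned property {key!r}")
    return violations
```

Run in a test suite or a pre-release check, a helper like this gives engineers feedback during implementation rather than weeks later. An empty list means the payload matches the plan.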
ENFORCEMENT
To really take data quality to the next level, you can implement a system and
standards for data enforcement.
There’s a wide range of variables to consider when it comes to enforcement.
On the lighter side, you may want to block only PII from reaching tools where
it can be seen by anyone with access to the tool. Or (in more developed use cases)
you may want to completely block all data that does not match your spec or
schema from reaching any downstream tools. At first, it’s probably best to start
at the lighter end of the enforcement spectrum and slowly make your way toward the
stricter end.
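The two ends of the enforcement spectrum can be sketched as a single function with a configurable level. This is an illustration only: the PII_KEYS set, the Enforcement enum, and the enforce function are hypothetical names, and a real setup would usually configure these rules per destination.

```python
from enum import Enum
from typing import Any, Optional

# Hypothetical property names treated as PII for this sketch.
PII_KEYS = {"email", "phone", "ssn"}


class Enforcement(Enum):
    STRIP_PII = "strip_pii"                # lighter end: forward the event minus PII
    BLOCK_VIOLATIONS = "block_violations"  # stricter end: drop anything off-spec


def enforce(
    event: dict[str, Any],
    allowed_props: set[str],
    level: Enforcement,
) -> Optional[dict[str, Any]]:
    """Return the event to forward downstream, or None if it is blocked."""
    props = event.get("properties", {})
    if level is Enforcement.STRIP_PII:
        # Build a new event so the original payload is left untouched.
        cleaned = {k: v for k, v in props.items() if k not in PII_KEYS}
        return {**event, "properties": cleaned}
    # BLOCK_VIOLATIONS: any property outside the spec blocks the whole event.
    if set(props) - allowed_props:
        return None
    return event
```

Keeping the level as an explicit setting makes it straightforward to start with STRIP_PII and tighten to BLOCK_VIOLATIONS once the team trusts the spec.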
If discarding data from blocked events sounds scary, there are precautions
you can take to ensure no data is lost while still enforcing standards. For example,
you could configure an isolated data warehouse to receive data that doesn’t meet
your enforcement standards. Doing so ensures that no data is lost, and you
could retroactively load the discarded data into the necessary tools with a bit of analytics
and data engineering help.
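The quarantine idea can be sketched as a simple router that splits events into two streams instead of discarding anything. The route function and the is_valid callback are hypothetical; in practice the quarantine list would be written to the isolated warehouse and the forward list sent on to downstream tools.

```python
from typing import Any, Callable

Event = dict[str, Any]


def route(
    events: list[Event],
    is_valid: Callable[[Event], bool],
) -> tuple[list[Event], list[Event]]:
    """Split events into (forward, quarantine) without discarding any."""
    forward: list[Event] = []
    quarantine: list[Event] = []
    for event in events:
        # Invalid events are held for later review and replay, never dropped.
        (forward if is_valid(event) else quarantine).append(event)
    return forward, quarantine
```

Because every event lands in exactly one of the two lists, the quarantined data can be corrected and replayed later.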