Episode 13 – Data Integrity and Baby Bunnies

rules of normalization

In the modern world, more and more, it is the data that is important, not any particular bit of hardware. (Oh, and yes, a bunny did die in the creation of the stories in this podcast.) It’s certainly true for me that data is what’s important. In this episode we talk with Teresa Burger about the importance of referential integrity and data architecture: why you should care, why some people should just step away from the database before they hurt themselves or their data, and generally why people and businesses should be more OCD about their sacred and holy data…

Thou Shalt Normalize Your Data or the hounds of h3ll will come to corrupt your integrity…
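To make the commandment a little more concrete, here is a minimal sketch of what normalization with enforced referential integrity looks like, using SQLite from Python’s standard library. The guest/episode schema is purely hypothetical and only here to illustrate the idea:

```python
# A minimal sketch of normalization with referential integrity enforced,
# using SQLite via Python's standard library. The guest/episode schema is
# hypothetical -- it only illustrates the idea, not anything from the episode.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Guests live in exactly one place -- no repeated names per episode row.
conn.execute("""
    CREATE TABLE guest (
        guest_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        twitter  TEXT
    )
""")

# Episodes reference guests by key instead of duplicating guest data.
conn.execute("""
    CREATE TABLE episode (
        episode_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        guest_id   INTEGER NOT NULL REFERENCES guest(guest_id)
    )
""")

conn.execute("INSERT INTO guest VALUES (1, 'Teresa Burger', 't_burger')")
conn.execute("INSERT INTO episode VALUES (13, 'Data Integrity and Baby Bunnies', 1)")

# Referential integrity at work: this row points at a guest that does not
# exist, so the database refuses it instead of silently corrupting the data.
try:
    conn.execute("INSERT INTO episode VALUES (14, 'Mystery Guest', 999)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # FOREIGN KEY constraint failed
```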

Check out Teresa on twitter at: https://twitter.com/t_burger

or her blog here: http://geekswithblogs.net/tburger/Default.aspx

One thought on “Episode 13 – Data Integrity and Baby Bunnies”

  1. Hi David,

     I stumbled upon your blog while looking at Vishal’s LinkedIn profile and started listening to your podcast. Data integrity (normalized vs. denormalized) is such an interesting topic that I can’t help but post this comment.

     I think referential integrity has very little to do with scalability. Perhaps it impacts “complexity” scalability – the ability of a data structure to support new, unintended relationships in the future – but I don’t think it has much to do with scalability in the traditional sense.

     On the contrary, as Teresa pointed out at the end of the episode (which negates the earlier point), the mechanisms used to maintain data integrity (ACID) can actually hurt scalability. The CAP theorem suggests that a distributed system can only guarantee two of these three characteristics: consistency (loosely corresponding to data integrity), availability, and partition tolerance (loosely corresponding to scalability).

     Denormalizing also does not necessarily introduce concurrent locking issues. It is actually the opposite: a normalized structure increases the likelihood that a transaction will halt other requests. This is why document databases are much faster than SQL Server at writing data. Read/write collisions, or collisions between multiple writes, are less likely to occur across the disparate tables, and when they do occur they are easily resolvable.

     Data integrity also does not address the “master data” issue that Teresa depicts in her query scenario. Having a perfectly normalized database does not guarantee the hygiene of the data. There could still be duplicate entries, e.g. John Doe registered twice in the system. There could also be incorrect roll-ups, e.g. John Doe from Canada is different from John Doe from the US and therefore needs to be a separate record. These scenarios can be addressed in either a normalized or a denormalized data structure, depending on the use of the data.

     In invoicing/record-management scenarios, a denormalized structure is the right thing to do. You never want to accidentally change the address on an invoice just because the invoice table does not contain the address value as it was at the time of the order, but instead holds a “live” reference to the customer profile entity. There are many other scenarios where denormalization makes sense: summary tables in BI, high-velocity/variety/volume (Big Data) scenarios, and many more.

     Congrats on the good start. Keep it up. Can’t wait for the next episode!

    Ron
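Ron’s “master data” point – that a perfectly normalized database can still hold two rows for the same real-world person – is easy to see in a short sketch. The schema and data below are made up purely for illustration:

```python
# Normalization alone does not stop duplicate entries: two rows can describe
# the same real-world John Doe. Hypothetical schema and data for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        country     TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO customer (name, country) VALUES (?, ?)",
    [("John Doe", "US"),   # registered once...
     ("John Doe", "US"),   # ...and again: the schema happily accepts this
     ("John Doe", "CA")],  # a *different* John Doe, so not a duplicate
)

# Candidate duplicates: the same name appearing in the same country twice.
rows = conn.execute("""
    SELECT name, country, COUNT(*) AS copies
    FROM customer
    GROUP BY name, country
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('John Doe', 'US', 2)]
```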
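The invoicing example works the same way in miniature: the invoice deliberately stores a copy of the billing address as it was at order time, rather than a live reference to the customer profile, so later profile edits cannot rewrite history. Again, the schema here is hypothetical:

```python
# Deliberate denormalization: snapshot the address onto the invoice at order
# time so later changes to the customer profile cannot alter old invoices.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT NOT NULL
    );
    CREATE TABLE invoice (
        invoice_id      INTEGER PRIMARY KEY,
        customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
        billing_address TEXT NOT NULL   -- denormalized snapshot, on purpose
    );
    INSERT INTO customer VALUES (1, 'John Doe', '1 Old Street');
""")

# At order time, copy the current address into the invoice row.
conn.execute("""
    INSERT INTO invoice (customer_id, billing_address)
    SELECT customer_id, address FROM customer WHERE customer_id = 1
""")

# The customer later moves; the historical invoice is unaffected.
conn.execute("UPDATE customer SET address = '2 New Avenue' WHERE customer_id = 1")
print(conn.execute("SELECT billing_address FROM invoice").fetchone())  # ('1 Old Street',)
```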
