Schemas often turn out to be significant barriers to innovation – adding a feature that requires a schema change brings with it the difficult and time consuming task of actually changing the schema in your live database and migrating historical data. I know from first-hand experience with large systems that this frequently turns out to be the most time consuming and complex aspect of launching new versions of software.
Lately I’ve been more and more tempted by data stores with relaxed schemas. In an attempt to better understand the pros and cons of schemas I’m enumerating my thoughts, mostly from a user’s perspective, and hoping that some of my DB expert readers (I know you’re reading this, if you don’t respond here I’ll bug you in person!) will chime in with their thoughts on the pros of strict schemas.
The Aesthetics. From a developer’s perspective I find strict-schema systems (general RDBMSes) confining – some rows simply want to have more columns than others. I can model it with relationships and foreign keys, but the multi-way joins are somehow ugly to me, and I know they make my performance minded DBA buddies upset. It’s also just not pretty, splitting up this field that naturally wants to sit in a row with the rest of his friends into his own separate table.
The Exceptions. I’ve also seen, in literally every large scale data analysis I’ve done, that even in strict-schema systems junk gets into the system. Just last week some analysis blew up because a field that could not be NULL was in fact NULL in some odd cases. The DBA can’t explain why, but it’s there in the data.
The Discontinuity. Changes do in fact have to happen even in strict schema systems. When they do, there’s an ugly breakage – tools and code deal only with before-change or after-change versions of the schema, not both, because they’re built assuming the schema is well defined and static. I’m looking at the effects of this right now – in one project the current analysis tools can’t be applied before a certain date, because the schema changed on that date.
The Static In A Dynamic System. We’ve found freedom in dynamic languages, allowing us to define our data structures in fluid, natural ways. Strict schemas feel like an injection of Java in a Python system: over-defined, confining, and unnatural. The data store should be as flexible as the programming language – data is code and code is data, or something like that.
…
To be fair, schema-less sytems have plenty of cons as well. Truth be told I don’t have very much real-world experience with them, so most of the cons are probably hidden from me, but I do know that schema-less systems are generally a bear to deal with after they’ve existed for a few years. Zombie objects and data infest your data store and crumb up the works. Tooling, general knowledge, and best practices are also severly lacking compared to static-schema systems, at least at this point in time.
…
So I’m very tempted to go and try a system with less strict schema requirements. Perhaps something like the FriendFeed MySQL setup. If you can enlighten me on the benefits I’d be foregoing by going schema-less, please leave a comment.