Careful Trading Complexity for 'Improvements'

Nov 2021

Often I’ve come across technical proposals along the lines of:

In 6 months we will outgrow our MySQL/Postgres instance. We will need to move our biggest table to a different horizontally scalable datastore.
If we have a database outage in a region, we will have a complete outage. We should consider moving to a data-store that’s natively multi-region.
This would be much faster if it was stored in a specialized database. Should we consider moving to it?
If we move to an event-based architecture, our system will be much more reliable.

What these proposals have in common is that they attempt to improve the system by increasing complexity. Whenever you find yourself arguing for improving infrastructure by yanking up complexity, you need to be very careful.

“Simplicity is prerequisite for reliability.” (Edsger W. Dijkstra)

Theoretically, yes: if you move your massive, quickly-growing products table to a key-value store to take load off a default-configured relational database instance, it will probably be faster, cost less, and be easier to scale.

However, in reality the added complexity will most likely lead to more downtime (even if in theory you get less), slower performance because it’s hard to debug (even if in theory it’s much faster), and worse scalability (because you don’t know the system well).

More theoretical 9s + increase in complexity => fewer 9s + more work.
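To put rough numbers on that, here is a napkin-math sketch. It assumes the new data-store is added alongside the existing database, so a request needs both to be up and their availabilities multiply. All three availability figures are made-up assumptions for illustration, not measurements.

```python
import math

def nines(availability: float) -> float:
    """Express availability as a count of nines, e.g. 0.999 -> ~3.0."""
    return -math.log10(1.0 - availability)

def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up."""
    result = 1.0
    for a in components:
        result *= a
    return result

def downtime_hours_per_year(availability: float) -> float:
    return (1.0 - availability) * 365 * 24

# Assumptions, purely illustrative:
existing_db = 0.999            # the database you already know how to operate
new_store_on_paper = 0.9999    # the theoretical figure that motivated the move
new_store_in_practice = 0.995  # what you get while you're still learning it

before = existing_db
after = serial_availability(existing_db, new_store_in_practice)

print(f"before: {nines(before):.2f} nines, {downtime_hours_per_year(before):.1f} h/year down")
print(f"after:  {nines(after):.2f} nines, {downtime_hours_per_year(after):.1f} h/year down")
```

Even if the new store hit its on-paper number, the combined path could never be more available than the existing database it still depends on.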

This is all because you’re about to trade known risks for theoretical improvements, accompanied by a slew of unknown risks. Adopting the new tech would increase complexity by introducing a whole new system: the operational burden of learning a new data-store, the developer overhead of using another system for a subset of the data, a more complex development environment, skills that don’t transfer between the two, and a myriad of other unknown-unknowns. That’s a massive cost.

I’m a proponent of mastering and abusing existing tools, rather than chasing greener pastures. The more facility you gain with first-principle reasoning and napkin math, the closer I’d wager you’ll inch towards this conclusion as well. A new system theoretically having better guarantees is not enough of an argument. Adding a new system to your stack is a huge deal and difficult to undo.

So what do we do with that pesky products table?

Stop thinking about technologies, and start thinking in first-principle requirements:

You need faster inserts/updates
You need terabytes of storage to have runway for the next ~5 years
You need more read capacity

The way that the shiny key-value store you’re eyeing achieves this is by not syncing every write to disk immediately. Well, you can do that in MySQL too (and Postgres). You could put your table on a new database server with that setting on. I wrote about this in detail.
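A napkin-math sketch of why relaxing the sync matters. The ~1 ms fsync latency, the one-second flush interval, and the target write rate are all assumptions for illustration, and the model ignores group commit across connections. As far as I know, the relevant knobs are `innodb_flush_log_at_trx_commit` in MySQL (InnoDB) and `synchronous_commit` in Postgres; check the documentation for your version for the exact durability semantics.

```python
from typing import Optional

FSYNC_LATENCY_S = 0.001  # assumption: ~1 ms per fsync on an SSD

def max_serial_commits_per_s(fsync_latency_s: float) -> float:
    """Ceiling for one connection when each commit waits for its own fsync."""
    return 1.0 / fsync_latency_s

def fsyncs_per_s(commits_per_s: float, flush_interval_s: Optional[float]) -> float:
    """fsync calls issued per second; None means 'sync on every commit'.

    Simplification: ignores group commit, which batches concurrent commits.
    """
    if flush_interval_s is None:
        return commits_per_s
    return min(commits_per_s, 1.0 / flush_interval_s)

def commits_at_risk(commits_per_s: float, flush_interval_s: Optional[float]) -> float:
    """Commits a crash could lose: everything written since the last flush."""
    if flush_interval_s is None:
        return 0.0
    return commits_per_s * flush_interval_s

target = 5_000.0  # assumption: commits per second you'd like to sustain

print("one connection, sync every commit:", max_serial_commits_per_s(FSYNC_LATENCY_S), "commits/s max")
print("fsyncs/s, sync every commit:", fsyncs_per_s(target, None))
print("fsyncs/s, flush once a second:", fsyncs_per_s(target, 1.0))
print("commits at risk in a crash:", commits_at_risk(target, 1.0))
```

The trade is explicit: far fewer syncs in exchange for possibly losing roughly the last flush interval of commits in a crash, which is the same trade the key-value store makes for you by default.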

There’s no reason your relational database can’t handle terabytes. Do the napkin math: a log(n) lookup over that many keys isn’t much slower than over the table you have today. Most likely you can keep it all on one server.
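That napkin math can be sketched as below. It assumes the table is indexed by a B-tree with roughly 500 keys per page and rows of about 1 KiB; both numbers are illustrative assumptions, and real fanout depends on key size and page size.

```python
def btree_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels such that fanout**levels >= num_keys."""
    levels = 1
    capacity = fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

FANOUT = 500      # assumption: keys per index page
ROW_BYTES = 1024  # assumption: average row size

for rows in (1_000_000, 100_000_000, 10_000_000_000):
    size_tb = rows * ROW_BYTES / 1e12
    print(f"{rows:>14,} rows (~{size_tb:g} TB): {btree_levels(rows, FANOUT)} levels")
```

Under these assumptions, going from a million rows to ten billion rows (about 10 TB) adds a single level to each lookup, and the upper levels of the tree are small enough to stay cached in memory.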

Why do you think reads would be faster in the other database than your relational database? It probably caches in memory. Well, relational databases do that too. You need to spread reads among more databases? Relational databases can do that too with read-replicas…

Yes, MySQL/Postgres might be 25-50% worse at all those things than a new system. But it still comes out 10,000% ahead, by not being a new system with all its associated costs and unknown-unknowns. There’s an underlying rule from evolution that the more specialized a system is, the less adaptable to change it is. Whether it’s a bird over-fit to its ecosystem, or a database you’re only using for one thing.

We could go through a similar line of reasoning for the other examples. Adopting a new multi-regional database for a subset of your data will likely lead to more downtime than sticking with what you’ve got, because of the complexity it introduces.

Don’t adopt a new system unless you can make the first-principle argument for why your current stack fundamentally can’t handle it. For example, you will likely hit fundamental limitations doing full-text search in a relational datastore or running analytics queries on your production database, by the nature of the data structures used. If you’re unsure, reach out, and I might be able to help you!
