In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology. This description immediately felt right; I've never been comfortable talking about NoSQL, which, taken literally, extends from the minimalist Berkeley DB (commercialized by Sleepycat Software, now owned by Oracle) to the big iron HBase, with detours into software as fundamentally different as Neo4J (a graph database) and FluidDB (which defies description).
But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don't see picketers outside Oracle's headquarters. Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement.
Think of the last 15 years of software development. We've gotten very good at building large, database-backed applications. Many of them are web applications, but even more of them aren't. "Software architect" is a valid job description; it's a position to which many aspire. But what do software architects do? They specify the high-level design of applications: the front end, the APIs, the middleware, the business logic. But the back end? Well, maybe not.
Since the '80s, the dominant back end of business systems has been a relational database, whether Oracle, SQL Server, or DB2. That's not much of an architectural choice. Those are all great products, but they're essentially similar, as are all the other relational databases. And it's remarkable that we've explored many architectural variations in the design of clients, front ends, and middleware, on a multitude of platforms and frameworks, but haven't until recently questioned the architecture of the back end. Relational databases have been a given.
Many things have changed since the advent of relational databases:
- We're dealing with much more data. Although advances in storage capacity and CPU speed have allowed databases to keep pace, we're in a new era where size itself is an important part of the problem, and any significant database needs to be distributed.
- We require sub-second responses to queries. In the '80s, most database queries could run overnight as batch jobs. That's no longer acceptable. Some analytic work can still run overnight, but the web has evolved from static files to complex database-backed sites, and that requires sub-second response times for most queries.
- We want applications to be up 24/7. Serving static HTML files from redundant servers is one thing; keeping the database behind a complex application replicated and available is quite another.
- We're seeing many applications in which the database has to soak up data as fast as (or even much faster than) it processes queries: in a logging application or a distributed sensor application, writes can be much more frequent than reads. Batch-oriented ETL (extract, transform, and load) hasn't disappeared, and won't, but capturing high-speed data flows is increasingly important.
- We're frequently dealing with changing data or with unstructured data. The data we collect, and how we use it, grows over time in unpredictable ways. Unstructured data isn't new; it has always existed. But we're increasingly unwilling to force a structure on data a priori.
- We're willing to sacrifice our sacred cows. We know that consistency, isolation, and the other ACID properties are valuable, of course. But so are latency, availability, and not losing data when a primary server goes down. The challenges of modern applications make us realize that sometimes we need to weaken one of these guarantees to achieve another (see the sketch after this list).
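To make that last tradeoff concrete, here's a minimal sketch in Python of the quorum arithmetic that many Dynamo-style distributed stores expose. The names and numbers are hypothetical, not any particular database's API: with N replicas of each key, requiring W acknowledgments per write and R responses per read guarantees that reads see the latest write whenever R + W > N, and relaxing either number trades consistency for latency and availability.

```python
# A toy illustration of Dynamo-style quorum tuning. All names are
# hypothetical; real stores expose similar knobs under other names.

N = 3  # replicas stored for each key

def is_strongly_consistent(r: int, w: int, n: int = N) -> bool:
    """R + W > N means every read quorum overlaps every write quorum,
    so a read always sees at least one replica with the latest write."""
    return r + w > n

# Strict: reads always overlap writes, at the cost of waiting for
# more replicas (higher latency, lower availability under failure).
print(is_strongly_consistent(r=2, w=2))  # True

# Relaxed: acknowledge a write after one replica responds. Writes are
# fast and keep working when nodes fail, but a read that hits a stale
# replica may miss the latest write (eventual consistency).
print(is_strongly_consistent(r=1, w=1))  # False
```

The point isn't the arithmetic itself but that R, W, and N become knobs the application developer turns, rather than guarantees baked irrevocably into the database.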
These changing requirements lead us to different tradeoffs and compromises when designing software. They require us to rethink what we require of a database, and to come up with answers aside from the relational databases that have served us well over the years. So let's look at these requirements in somewhat more detail.
Size, response, availability
It's a given that any modern application is going to be distributed. The size of modern datasets is only one reason for distribution, and not the most important. Modern applications (particularly web applications) have many concurrent users who demand reasonably snappy response. In their 2009 Velocity Conference talk, "Performance Related Changes and their User Impact," Eric Schurman and Jake Brutlag showed results from independent research projects at Google and Microsoft. Both projects demonstrated that even imperceptibly small increases in response time cause users to move to another site; if response time is over a second, you're losing a very measurable percentage of your traffic.
If you're not building a web application — say you're doing business analytics, with complex, time-consuming queries — the world has changed, and users now expect business analytics to run in something like real time. Maybe not the sub-second latency required for web users, but queries that run overnight are no longer acceptable. Queries that run while you go out for coffee are marginal. It's not just a matter of convenience; the ability to run dozens or hundreds of queries per day changes the nature of the work you do. You can be more experimental: you can follow through on hunches and hints based on earlier queries. That kind of spontaneity was impossible when research went through the DBA at the data warehouse.
Whether you're building a customer-facing application or doing internal analytics, scalability is a big issue. Vertical scalability (buy a bigger, faster machine) always runs into limits. Now that the laws of physics have stalled Intel-architecture clock speeds in the 3.5GHz range, those limits are more apparent than ever. Horizontal scalability (build a distributed system with more nodes) is the only way to scale indefinitely. You're scaling horizontally even if you're only buying single boxes: it's been a long time since I've seen a server (or even a high-end desktop) that doesn't sport at least four cores. Horizontal scalability is tougher when you're scaling across racks of servers at a colocation facility, but don't be deceived: that's how scalability works in the 21st century, even on your laptop. Even in your cell phone. We need database technologies that aren't just fast on single servers: they must also scale across multiple servers.
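As a sketch of what horizontal scaling means at the data layer, here is a deliberately minimal hash-based partitioner in Python. The node names and function are invented for illustration, not any particular database's scheme: each key is deterministically assigned to one node, so adding nodes spreads both data and query load. (Production systems typically use consistent hashing or range partitioning to limit data movement when nodes join or leave; this sketch ignores that.)

```python
import hashlib

# Hypothetical cluster membership; in a real system this would come
# from configuration or a coordination service.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for_key(key: str) -> str:
    """Map a key deterministically onto one node in the cluster."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every client computes the same placement, so reads and writes for a
# given key always reach the same node, and keys spread evenly across
# the cluster without any central lookup table.
for key in ("user:1001", "user:1002", "order:77"):
    print(key, "->", node_for_key(key))
```

The essential shift is that no single machine ever holds, or answers for, the whole dataset; capacity grows by adding nodes rather than by buying a bigger box.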