Introduction to NoSQL
Lecture 12 — Data Engineering — Spring 2015
February 19, 2015
Credit Where Credit is Due
- For this lecture, I've drawn on material from two books
Scaling with Traditional Databases
Marz and Warren's book starts with an example that illustrates how an organization can be driven to Big Data
by trying to scale a system with traditional databases, i.e. relational database technology
Web Analytics Application (1)
- Imagine you've been asked to build a system that keeps track of page views on particular URLs
- You might build a relational table that looks like this
| Column    | Type          |
|-----------|---------------|
| id        | integer       |
| user_id   | integer       |
| url       | varchar(1024) |
| pageviews | bigint        |
Web Analytics Application (2)
- When someone loads a page on your client's website, it pings your service and you increment the count associated with that URL
- Your client can then query that count when needed for reports or to display the count at the bottom of a web page.
- You deploy the system and it's a big success
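The per-request counting scheme above can be sketched against any relational database. Here is a minimal version using Python's built-in sqlite3 module (the table layout follows the slide; the function names are illustrative, and the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

# In-memory database standing in for the production relational store.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pageviews (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    url VARCHAR(1024) UNIQUE,
    pageviews BIGINT DEFAULT 0)""")

def record_pageview(url, user_id):
    # One write per request: insert the row if it's new, else increment it.
    conn.execute(
        """INSERT INTO pageviews (user_id, url, pageviews) VALUES (?, ?, 1)
           ON CONFLICT(url) DO UPDATE SET pageviews = pageviews + 1""",
        (user_id, url))

def get_count(url):
    row = conn.execute(
        "SELECT pageviews FROM pageviews WHERE url = ?", (url,)).fetchone()
    return row[0] if row else 0

record_pageview("http://example.com/a", user_id=1)
record_pageview("http://example.com/a", user_id=2)
print(get_count("http://example.com/a"))  # → 2
```

Note that every ping costs one database round trip, which is exactly what breaks down under load in the next slides.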
Web Analytics Application (3)
- Indeed, it's too big of a success. You get multiple clients tracking multiple pages and they have lots of customers
- You start to get errors that indicate that clients are timing out when attempting to notify your system of a page view
- The problem is that so many requests are coming in at once that the relational database cannot mutate rows (i.e. perform writes) fast enough to keep up with demand.
- In addition, it's spending so much time on writes that it can't keep up with the requests for the current count of a particular page (i.e. reads).
- Something must be done!
Web Analytics Application (4)
- You decide that the problem is that it is inefficient to have one update to the database per request. You should instead batch updates to the database, asking it to update multiple rows with each request
- You insert a queue between your web server and the code that will update the database
- You attach a single worker to the queue; that worker reads 1000 updates off the queue and then inserts them all into the database at once.
- This change drastically reduces the number of write requests
- The database can handle read requests as normal without the use of a queue.
- Happy Days! Your system is working once again!
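The queue-and-worker fix above can be sketched as follows. This is a simplified simulation, not a production design: `queue.Queue` stands in for a real message queue, a dict stands in for the database, and the batch size of 1000 comes from the slide:

```python
import queue
import threading

BATCH_SIZE = 1000  # updates per database write, as in the slide

pending = queue.Queue()   # sits between the web server and the database code
db_counts = {}            # stand-in for the relational database
db_write_calls = 0        # how many batched round trips the database received

def worker(total_expected):
    # Single worker: drain the queue in batches, write each batch at once.
    global db_write_calls
    processed = 0
    while processed < total_expected:
        batch = []
        while len(batch) < BATCH_SIZE and processed + len(batch) < total_expected:
            batch.append(pending.get())
        for url in batch:
            db_counts[url] = db_counts.get(url, 0) + 1
        db_write_calls += 1   # one database round trip for the whole batch
        processed += len(batch)

# Simulate 2500 incoming pageview pings.
for i in range(2500):
    pending.put("http://example.com/page%d" % (i % 3))

t = threading.Thread(target=worker, args=(2500,))
t.start()
t.join()
print(db_write_calls)  # → 3 (batched writes instead of 2500 individual ones)
```

2500 individual writes collapse into 3 batched ones, which is why this change buys time before the single database becomes the bottleneck again.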
Web Analytics Application (5)
- But, you discover this solution only works temporarily
- Your system gets even more popular, flooding your queue with write requests. The database again starts to struggle to keep up with all of the writes that it must do and this, in turn, impacts all of the requests to your system, causing many of them to time out
- You briefly try to add more workers to the queue in an attempt to speed things up with concurrency but it doesn't work
- Why?
Web Analytics Application (6)
- It doesn't work because all of the workers still have to write to a single database
- The database is the BOTTLENECK
- What is the answer to this problem in the relational world?
Web Analytics Application (7)
- You have to SHARD the database
- What does that mean?
- It means that you need multiple copies of the database
- You then PARTITION your data across those databases
- To do that, you have to develop a partitioning strategy
- often, you will take an MD5 hash of some aspect of the input data and then mod that value by the number of shards
- You then write the data to the indicated shard
- You do the same thing for reads to locate the data needed to fulfill a request
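The hash-and-mod partitioning strategy described above can be sketched in a few lines (the shard count and function name are illustrative):

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(url):
    # Hash some aspect of the input data (here the URL) and mod the
    # resulting integer by the number of shards.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Writes and reads both use the same function to locate the shard,
# so the same URL always routes to the same database instance.
print(shard_for("http://example.com/a") == shard_for("http://example.com/a"))  # → True
print(0 <= shard_for("http://example.com/b") < NUM_SHARDS)                     # → True
```

Note that `NUM_SHARDS` is baked into the routing function, which is precisely why changing the number of shards forces the painful remapping described on the next slide.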
Web Analytics Application (8)
- Problems?
- Sharding is an application-level concern: you have to manage the number of shards.
- If the number of shards changes, you have to remap your entire database across all of those shards; while you're doing that, you have to turn your application off; you can't update the data while you're resharding
- If you make a mistake when doing a re-shard, guess who has to fix it? You. That translates into MORE work and more downtime.
Web Analytics Application (9)
- You are now in a situation where you have lots of database instances to manage on potentially lots of machines.
- All of the machines are needed; if one goes down, the whole system goes down!
- And this brings you to the realm of fault tolerance
- It's not if your machine goes down, it's WHEN it goes down
Web Analytics Application (10)
- Now you have to add complexity to your application architecture:
- If a shard goes down, and you want to keep your application running, perhaps you add an in-memory queue that holds updates to that shard as pending, which will get written to the shard once the machine comes back up
- What if the machine went down, however, because the disk went bad? Are you backing this information up? How?
- If you replicate your master nodes (i.e. the shards), you might allow clients to read information from the replicated server but you might not want to write to that server; because, at that point, which one is the real database and which one is the copy?
Web Analytics Application (11)
- The final problem: Humans!
- Imagine you're working in this environment and you make a mistake and deploy a worker script that has a bug in it
- It's not a crashing bug but a bug that writes the wrong data in some way
- information goes to the wrong shard when written by one worker and then can't be found by another worker when it wants to read it
- Or the worker accidentally increments the count in the wrong way: it's writing the WRONG information into the database
- And, recall, we're mutating the database; we are not storing previous values of the columns in our rows
- So the RIGHT information cannot be recovered
Analysis of the Problems with the Traditional Approach
- Marz and Warren end their example with the following analysis
- Scaling your application added significant complexity to it and to its service and persistence tiers (i.e. the back-end)
- You started with a web server and ended with queues, workers, shards, and replicas, and all the glue code required to make it all work
- With these changes came significant operational complexity as well
Analysis of the Problems with the Traditional Approach
- The challenges encountered are:
- Fault Tolerance is Hard: Keeping the system running in the presence of hardware failure required queues and replica servers all managed by hand
- Complexity is pushed to the application layer: the distributed nature of your persistence tier is not hidden from the application code in your service tier. Your application knows the number of shards, it has to compute where information is stored, and, for queries that span shards, it has to manage the entire query process
Analysis of the Problems with the Traditional Approach
- The challenges encountered are:
- Lack of human fault-tolerance: nothing protects you from a human making a mistake that causes data to go missing, or that stores incorrect data into the system, overwriting correct data
- Maintenance is an enormous amount of work: you have to manage all of the complexity of the back-end yourself. Are the machines running? Properly configured? On the network? Is all the software up and running? Is it time for a re-shard?
NoSQL to the Rescue!
- NoSQL databases are ones which are aware of their distributed nature
- They manage sharding and replication FOR you!
- They are horizontally scalable
- If you need more disk space, add a server
- If you need computation to go faster, add a server
- When you add a server, the NoSQL database will re-shard for you automatically!
NoSQL to the Rescue!
- NoSQL databases tend to avoid mutable data
- You can't lose correct data because...
- ... once it is written it is immutable and can't be updated
- Instead, if a value changes, you write a new immutable copy of the updated data and...
- ... when you read that value, you adopt a strategy of returning the most recently written value
- and provide an option to go back in time to previous values if you need it
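The immutable-write strategy above can be sketched with an append-only log per key. This is a toy illustration, not any particular database's implementation; a counter stands in for real timestamps, and the function names are made up:

```python
import itertools

_tick = itertools.count()   # stand-in for a timestamp/version counter
store = {}                  # key -> append-only list of (version, value)

def write(key, value):
    # Never mutate in place: append a new immutable version of the value.
    store.setdefault(key, []).append((next(_tick), value))

def read(key):
    # Strategy from the slide: return the most recently written value.
    return max(store[key])[1]

def read_at(key, version):
    # "Go back in time": latest value written at or before the given version.
    candidates = [(v, val) for v, val in store[key] if v <= version]
    return max(candidates)[1]

write("pageviews:/a", 1)
write("pageviews:/a", 2)
print(read("pageviews:/a"))        # → 2
print(read_at("pageviews:/a", 0))  # → 1
```

Because old versions are never overwritten, the human-error scenario from the sharding example becomes recoverable: the correct earlier value is still there.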
NoSQL to the Rescue!
- Finally, NoSQL databases are fault tolerant
- If a disk error takes a machine down, the NoSQL database switches to its replica automatically
- It reshards the database automatically to handle new writes and ensures those writes are also replicated
- When the old machine comes back, it reshards again and adjusts the replicas as needed
- All of these things happen in the persistence tier...
- ... your service tier / application code can be completely unaware that any of that is going on behind the scenes!
Types of NoSQL Databases
- Key-Value
- Graph
- Columnar
- Documents
Key-Value Stores
- Does what it says on the tin
- A key-value store is a simple database that when presented with a string (i.e. a key) returns an arbitrarily large set of data (i.e. the value)
- Key-value stores have no query language. They act just like hash tables from programming languages
- Values are untyped; you can store any type of data in these databases
- Benefits: Simplicity!
- Examples: Amazon SimpleDB, S3, Redis, Voldemort, Riak
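A toy sketch of the key-value interface (the class and method names are invented for illustration; real stores like Redis expose essentially this API over the network):

```python
class KeyValueStore:
    """Toy key-value store: string keys map to arbitrary, untyped values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value   # value can be a dict, bytes, anything

    def get(self, key, default=None):
        return self._data.get(key, default)

kv = KeyValueStore()
kv.put("user:42", {"name": "Ada"})   # a structured value
kv.put("logo", b"\x89PNG...")        # raw bytes work just as well
print(kv.get("user:42")["name"])     # → Ada
```

The interface really is just `put` and `get`; everything else (queries, joins, schemas) is left to the application.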
Graph Stores
- Databases that are optimized to store graph structures rather than table/row/column structures
- Provide structural query languages so you can locate information based on the structure of your data
- Example: find all pairs of Person nodes who have at least three children together, live in Colorado, and have been married for more than 15 years
- Provide the ability to do graph traversals efficiently
- Provide the ability to calculate shortest paths between two given nodes, locate a node's neighbors across n hops, calculate graph-related metrics, etc.
- Examples: Neo4J, Titan, Infinite Graph, InfoGrid
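The kind of traversal a graph store optimizes can be illustrated with a plain breadth-first search over an adjacency map (the graph data here is invented; a real graph database would run this server-side with a query language):

```python
from collections import deque

# A tiny friendship graph: node -> set of neighbor nodes.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol", "erin"},
    "erin":  {"dave"},
}

def shortest_path(start, goal):
    # Breadth-first search: the first path to reach the goal is shortest.
    seen = {start}
    frontier = deque([[start]])
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

print(shortest_path("alice", "erin"))  # e.g. ['alice', 'bob', 'dave', 'erin']
```

A relational database would need repeated self-joins to answer this; a graph store makes such hop-by-hop traversals a native, efficient operation.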
Columnar Stores (1)
- Also known as Column Family Stores
- Able to scale to enormous amounts of data
- Often able to achieve very fast writes (milliseconds)...
- ... while also maintaining reasonable read performance
- This is sometimes complicated by the fact that what you might read from one of these stores is very large
- Consider: Netflix uses Cassandra to store and serve its movies
- In this case, what you are reading from the database is an entire movie that is then streamed across the Internet!
Columnar Stores (2)
- Basic Data Model
- Column Family: think of this as a table of related data
- Column families consist of rows that have unique row keys
- Rows consist of columns (potentially millions of them)
- Columns consist of a key and a value
- The value itself might be a JSON map that in turn has keys and values
- In other words: Hash tables all the way down
- More accurately: a distributed hash table which is easy to partition across the nodes of a cluster
- Examples: Cassandra, HBase
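The "hash tables all the way down" model above can be sketched with nested dicts (one column family; the keys and helper names are illustrative, and a real store would distribute the rows across a cluster by row key):

```python
# Column family -> row key -> column key -> value.
column_family = {}   # think of this as one table of related data

def put(row_key, column, value):
    # Each row is itself a hash table of columns.
    column_family.setdefault(row_key, {})[column] = value

def get(row_key, column):
    return column_family[row_key][column]

# Rows can hold arbitrarily many columns, and a value can itself be a map.
put("user:42", "name", "Ada")
put("user:42", "prefs", {"theme": "dark", "lang": "en"})  # JSON-map value
put("user:99", "name", "Lin")   # a different row with different columns

print(get("user:42", "prefs")["theme"])  # → dark
```

Because lookup always starts from the row key, partitioning the outer hash table across cluster nodes is straightforward, which is what makes these stores easy to scale.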
Document Stores
- A document store is like a key-value store but with more structure
- You insert documents (a bag of key-value pairs)
- Each document gets indexed in a variety of ways
- Documents can then be found via queries on any attribute
- Documents can be grouped into collections
- Collections can be grouped into databases
- Each database is then used by a particular application to get its work done
- Examples: MongoDB, CouchDB, Solr/Lucene
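A toy sketch of the document-store interface described above (the class and method names are invented; a real store like MongoDB would also index each attribute rather than scanning the collection):

```python
class DocumentStore:
    """Toy document collection: insert docs, query by any attribute."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(doc)   # a document is just a bag of key-value pairs

    def find(self, **query):
        # Return documents whose fields equal every attribute in the query.
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]

articles = DocumentStore()                                  # one collection
articles.insert({"title": "NoSQL", "author": "Marz", "year": 2015})
articles.insert({"title": "Graphs", "author": "Lin"})       # different fields: fine

print(len(articles.find(author="Marz")))  # → 1
```

Note that the two documents have different sets of fields, which previews the schemaless point made in the next section.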
Why NoSQL?
- What is it about these databases that gives them the moniker NoSQL?
- Implicit in what we have been saying is that there is no schema
- There is nothing that says: in column 5 of table 2, you will find an INT; in column 6, you'll find a VARCHAR, etc.
- You are often free to store ANYTHING in one of these databases
- Examples:
- Document Stores: Each document in a collection can have a different set of key value pairs
- Columnar Store: Each row in a columnar store can have different columns
- A graph database is just a collection of nodes and edges. You can add any amount of metadata to each concept to suit your needs
What's Next?
- Let's look at these technologies in more depth
- Given that students in this class have experience with MongoDB, I'd like to start with Document data stores
- I need volunteers for next week!