Introduction to NoSQL
Lecture 12 — Data Engineering — Spring 2015
February 19, 2015
Credit Where Credit is Due
- For this lecture, I've drawn on material from two books
Scaling with Traditional Databases
Marz and Warren's book starts with an example that illustrates how an organization can be driven to Big Data
by trying to scale a system with traditional databases, i.e. relational database technology
Web Analytics Application (1)
- Imagine you've been asked to build a system that keeps track of page views on particular URLs
- You might build a relational table that looks like this
| Column    | Type          |
|-----------|---------------|
| id        | integer       |
| user_id   | integer       |
| url       | varchar(1024) |
| pageviews | bigint        |
Web Analytics Application (2)
- When someone loads a page on your client's website, it pings your service and you increment the count associated with that URL
- Your client can then query that count when needed for reports or to display the count at the bottom of a web page.
- You deploy the system and it's a big success
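The per-request counting scheme above can be sketched against any relational database. Here is a minimal version using Python's built-in sqlite3 module (the table layout follows the slide; the function names are illustrative, and the upsert syntax requires SQLite 3.24 or later):

```python
import sqlite3

# In-memory database standing in for the production relational store.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pageviews (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    url VARCHAR(1024) UNIQUE,
    pageviews BIGINT DEFAULT 0)""")

def record_pageview(url, user_id):
    # One write per request: insert the row if it's new, else increment it.
    conn.execute(
        """INSERT INTO pageviews (user_id, url, pageviews) VALUES (?, ?, 1)
           ON CONFLICT(url) DO UPDATE SET pageviews = pageviews + 1""",
        (user_id, url))

def get_count(url):
    row = conn.execute(
        "SELECT pageviews FROM pageviews WHERE url = ?", (url,)).fetchone()
    return row[0] if row else 0

record_pageview("http://example.com/a", user_id=1)
record_pageview("http://example.com/a", user_id=2)
print(get_count("http://example.com/a"))  # → 2
```

Note that every ping costs one database round trip, which is exactly what breaks down under load in the next slides.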
Web Analytics Application (3)
- Indeed, it's too big of a success. You get multiple clients tracking multiple pages and they have lots of customers
- You start to get errors that indicate that clients are timing out when attempting to notify your system of a page view
- The problem is that so many requests are coming in at once that the relational database cannot mutate rows (i.e. perform writes) fast enough to keep up with demand.
- In addition, it's spending so much time on writes that it can't keep up with the requests for the current count of a particular page (i.e. reads).
- Something must be done!
Web Analytics Application (4)
- You decide that the problem is that it is inefficient to have one update to the database per request. You should instead batch updates to the database, asking it to update multiple rows with each request
- You insert a queue between your web server and the code that will update the database
- You attach a single worker to the queue; that worker reads 1000 updates off the queue and then inserts them all into the database at once.
- This change drastically reduces the number of write requests
- The database can handle read requests as normal without the use of a queue.
- Happy Days! Your system is working once again!
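The queue-and-worker fix above can be sketched as follows. This is a simplified simulation, not a production design: `queue.Queue` stands in for a real message queue, a dict stands in for the database, and the batch size of 1000 comes from the slide:

```python
import queue
import threading

BATCH_SIZE = 1000  # updates per database write, as in the slide

pending = queue.Queue()   # sits between the web server and the database code
db_counts = {}            # stand-in for the relational database
db_write_calls = 0        # how many batched round trips the database received

def worker(total_expected):
    # Single worker: drain the queue in batches, write each batch at once.
    global db_write_calls
    processed = 0
    while processed < total_expected:
        batch = []
        while len(batch) < BATCH_SIZE and processed + len(batch) < total_expected:
            batch.append(pending.get())
        for url in batch:
            db_counts[url] = db_counts.get(url, 0) + 1
        db_write_calls += 1   # one database round trip for the whole batch
        processed += len(batch)

# Simulate 2500 incoming pageview pings.
for i in range(2500):
    pending.put("http://example.com/page%d" % (i % 3))

t = threading.Thread(target=worker, args=(2500,))
t.start()
t.join()
print(db_write_calls)  # → 3 (batched writes instead of 2500 individual ones)
```

2500 individual writes collapse into 3 batched ones, which is why this change buys time before the single database becomes the bottleneck again.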
Web Analytics Application (5)
- But, you discover this solution only works temporarily
- Your system gets even more popular, flooding your queue with write requests. The database again starts to struggle to keep up with all of the writes that it must do and this, in turn, impacts all of the requests to your system, causing many of them to time out
- You briefly try to add more workers to the queue in an attempt to speed things up with concurrency but it doesn't work
- Why?
Web Analytics Application (6)
- It doesn't work because all of the workers still have to write to a single database
- The database is the BOTTLENECK
- What is the answer to this problem in the relational world?
Web Analytics Application (7)
- You have to SHARD the database
- What does that mean?
- It means that you need multiple copies of the database
- You then PARTITION your data across those databases
- To do that, you have to develop a partitioning strategy
- often, you will take an MD5 hash of some aspect of the input data and then mod that value by the number of shards
- You then write the data to the indicated shard
- You do the same thing for reads to locate the data needed to fulfill a request
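The hash-and-mod partitioning strategy described above can be sketched in a few lines (the shard count and function name are illustrative):

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(url):
    # Hash some aspect of the input data (here the URL) and mod the
    # resulting integer by the number of shards.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Writes and reads both use the same function to locate the shard,
# so the same URL always routes to the same database instance.
print(shard_for("http://example.com/a") == shard_for("http://example.com/a"))  # → True
print(0 <= shard_for("http://example.com/b") < NUM_SHARDS)                     # → True
```

Note that `NUM_SHARDS` is baked into the routing function, which is precisely why changing the number of shards forces the painful remapping described on the next slide.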
Web Analytics Application (8)
- Problems?
- Sharding is an application-level concern: you have to manage the number of shards.
- If the number of shards changes, you have to remap your entire database across all of those shards; while you're doing that, you have to turn your application off; you can't update the data while you're resharding
- If you make a mistake when doing a re-shard, guess who has to fix it? You. That translates into MORE work and more downtime.
Web Analytics Application (9)
- You are now in a situation where you have lots of database instances to manage on potentially lots of machines.
- All of the machines are needed; if one goes down, the whole system goes down!
- And this brings you to the realm of fault tolerance
- It's not if your machine goes down, it's WHEN it goes down
Web Analytics Application (10)
- Now you have to add complexity to your application architecture:
- If a shard goes down, and you want to keep your application running, perhaps you add an in-memory queue that holds updates to that shard as pending, which will get written to the shard once the machine comes back up
- What if the machine went down, however, because the disk went bad? Are you backing this information up? How?
- If you replicate your master nodes (i.e. the shards), you might allow clients to read information from the replicated server but you might not want to write to that server; because, at that point, which one is the real database and which one is the copy?
Web Analytics Application (11)
- The final problem: Humans!
- Imagine you're working in this environment and you make a mistake and deploy a worker script that has a bug in it
- It's not a crashing bug but a bug that writes the wrong data in some way
- information goes to the wrong shard when written by one worker and then can't be found by another worker when it wants to read it
- Or the worker accidentally increments the count in the wrong way: it's writing the WRONG information into the database
- And, recall, we're mutating the database; we are not storing previous values of the columns in our rows
- So the RIGHT information cannot be recovered
Analysis of the Problems with the Traditional Approach
- Marz and Warren end their example with the following analysis
- Scaling your application added significant complexity to it and to its service and persistence tiers (i.e. the back-end)
- You started with a web server and ended with queues, workers, shards, and replicas, and all the glue code required to make it all work
- With these changes came significant operational complexity as well
Analysis of the Problems with the Traditional Approach
- The challenges encountered are:
- Fault Tolerance is Hard: Keeping the system running in the presence of hardware failure required queues and replica servers all managed by hand
- Complexity is pushed to the application layer: the distributed nature of your persistence tier is not hidden from the application code in your service tier. Your application knows the number of shards, it has to compute where information is stored, and, for queries that span shards, it has to manage the entire query process
Analysis of the Problems with the Traditional Approach
- The challenges encountered are:
- Lack of human fault-tolerance: nothing protects you from a human making a mistake that causes data to go missing, or that stores incorrect data into the system, overwriting correct data
- Maintenance is an enormous amount of work: you have to manage all of the complexity of the back-end yourself. Are the machines running? Properly configured? On the network? Is all the software up and running? Is it time for a re-shard?
NoSQL to the Rescue!
- NoSQL databases are ones which are aware of their distributed nature
- They manage sharding and replication FOR you!
- They are horizontally scalable
- If you need more disk space, add a server
- If you need computation to go faster, add a server
- When you add a server, the NoSQL database will re-shard for you automatically!
NoSQL to the Rescue!
- NoSQL databases tend to avoid mutable data
- You can't lose correct data because...
- ... once it is written it is immutable and can't be updated
- Instead, if a value changes, you write a new immutable copy of the updated data and...
- ... when you read that value, you adopt a strategy of returning the most recently written value
- and provide an option to go back in time to previous values if you need it
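The immutable-write strategy above can be sketched with an append-only log per key. This is a toy illustration, not any particular database's implementation; a counter stands in for real timestamps, and the function names are made up:

```python
import itertools

_tick = itertools.count()   # stand-in for a timestamp/version counter
store = {}                  # key -> append-only list of (version, value)

def write(key, value):
    # Never mutate in place: append a new immutable version of the value.
    store.setdefault(key, []).append((next(_tick), value))

def read(key):
    # Strategy from the slide: return the most recently written value.
    return max(store[key])[1]

def read_at(key, version):
    # "Go back in time": latest value written at or before the given version.
    candidates = [(v, val) for v, val in store[key] if v <= version]
    return max(candidates)[1]

write("pageviews:/a", 1)
write("pageviews:/a", 2)
print(read("pageviews:/a"))        # → 2
print(read_at("pageviews:/a", 0))  # → 1
```

Because old versions are never overwritten, the human-error scenario from the sharding example becomes recoverable: the correct earlier value is still there.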
NoSQL to the Rescue!
- Finally, NoSQL databases are fault tolerant
- If a disk error takes a machine down, the NoSQL database switches to its replica automatically
- It reshards the database automatically to handle new writes and ensures those writes are also replicated
- When the old machine comes back, it reshards again and adjusts the replicas as needed
- All of these things happen in the persistence tier...
- ... your service tier / application code can be completely unaware that any of that is going on behind the scenes!
Types of NoSQL Databases
- Key-Value
- Graph
- Columnar
- Documents
Key-Value Stores
- Does what it says on the tin
- A key-value store is a simple database that when presented with a string (i.e. a key) returns an arbitrarily large set of data (i.e. the value)
- Key-value stores have no query language. They act just like hash tables from programming languages
- Values are untyped; you can store any type of data in these databases
- Benefits: Simplicity!
- Examples: Amazon SimpleDB, S3, Redis, Voldemort, Riak
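A toy sketch of the key-value interface (the class and method names are invented for illustration; real stores like Redis expose essentially this API over the network):

```python
class KeyValueStore:
    """Toy key-value store: string keys map to arbitrary, untyped values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value   # value can be a dict, bytes, anything

    def get(self, key, default=None):
        return self._data.get(key, default)

kv = KeyValueStore()
kv.put("user:42", {"name": "Ada"})   # a structured value
kv.put("logo", b"\x89PNG...")        # raw bytes work just as well
print(kv.get("user:42")["name"])     # → Ada
```

The interface really is just `put` and `get`; everything else (queries, joins, schemas) is left to the application.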
Graph Stores
- Databases that are optimized to store graph structures rather than table/row/column structures
- Provide structural query languages so you can locate information based on the structure of your data
- Example: find all pairs of Person nodes who have at least three children together, live in Colorado, and have been married for more than 15 years
- Provide the ability to do graph traversals efficiently
- Provide the ability to calculate shortest paths between two given nodes, locate a node's neighbors across n hops, calculate graph-related metrics, etc.
- Examples: Neo4J, Titan, Infinite Graph, InfoGrid
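The kind of traversal a graph store optimizes can be illustrated with a plain breadth-first search over an adjacency map (the graph data here is invented; a real graph database would run this server-side with a query language):

```python
from collections import deque

# A tiny friendship graph: node -> set of neighbor nodes.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol", "erin"},
    "erin":  {"dave"},
}

def shortest_path(start, goal):
    # Breadth-first search: the first path to reach the goal is shortest.
    seen = {start}
    frontier = deque([[start]])
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

print(shortest_path("alice", "erin"))  # e.g. ['alice', 'bob', 'dave', 'erin']
```

A relational database would need repeated self-joins to answer this; a graph store makes such hop-by-hop traversals a native, efficient operation.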
Columnar Stores (1)
- Also known as Column Family Stores
- Able to scale to enormous amounts of data
- Often able to achieve very fast writes (milliseconds)...
- ... while also maintaining reasonable read performance
- This is sometimes complicated by the fact that what you might read from one of these stores is very large
- Consider: Netflix uses Cassandra to store and serve its movies
- In this case, what you are reading from the database is an entire movie that is then streamed across the Internet!
Columnar Stores (2)
- Basic Data Model
- Column Family: think of this as a table of related data
- Column families consist of rows that have unique row keys
- Rows consist of columns (potentially millions of them)
- Columns consist of a key and a value
- The value itself might be a JSON map that in turn has keys and values
- In other words: Hash tables all the way down
- More accurately: a distributed hash table which is easy to partition across the nodes of a cluster
- Examples: Cassandra, HBase
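The "hash tables all the way down" model above can be sketched with nested dicts (one column family; the keys and helper names are illustrative, and a real store would distribute the rows across a cluster by row key):

```python
# Column family -> row key -> column key -> value.
column_family = {}   # think of this as one table of related data

def put(row_key, column, value):
    # Each row is itself a hash table of columns.
    column_family.setdefault(row_key, {})[column] = value

def get(row_key, column):
    return column_family[row_key][column]

# Rows can hold arbitrarily many columns, and a value can itself be a map.
put("user:42", "name", "Ada")
put("user:42", "prefs", {"theme": "dark", "lang": "en"})  # JSON-map value
put("user:99", "name", "Lin")   # a different row with different columns

print(get("user:42", "prefs")["theme"])  # → dark
```

Because lookup always starts from the row key, partitioning the outer hash table across cluster nodes is straightforward, which is what makes these stores easy to scale.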
Document Stores
- A document store is like a key-value store but with more structure
- You insert documents (a bag of key-value pairs)
- Each document gets indexed in a variety of ways
- Documents can then be found via queries on any attribute
- Documents can be grouped into collections
- Collections can be grouped into databases
- Each database is then used by a particular application to get its work done
- Examples: MongoDB, CouchDB, Solr/Lucene
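A toy sketch of the document-store interface described above (the class and method names are invented; a real store like MongoDB would also index each attribute rather than scanning the collection):

```python
class DocumentStore:
    """Toy document collection: insert docs, query by any attribute."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(doc)   # a document is just a bag of key-value pairs

    def find(self, **query):
        # Return documents whose fields equal every attribute in the query.
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]

articles = DocumentStore()                                  # one collection
articles.insert({"title": "NoSQL", "author": "Marz", "year": 2015})
articles.insert({"title": "Graphs", "author": "Lin"})       # different fields: fine

print(len(articles.find(author="Marz")))  # → 1
```

Note that the two documents have different sets of fields, which previews the schemaless point made in the next section.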
Why NoSQL?
- What is it about these databases that gives them the moniker NoSQL?
- Implicit in what we have been saying is that there is no schema
- There is nothing that says: in column 5 of table 2, you will find an INT; in column 6, you'll find a VARCHAR, etc.
- You are often free to store ANYTHING in one of these databases
- Examples:
- Document Stores: Each document in a collection can have a different set of key value pairs
- Columnar Store: Each row in a columnar store can have different columns
- A graph database is just a collection of nodes and edges. You can add any amount of metadata to each concept to suit your needs
What's Next?
- Let's look at these technologies in more depth
- Given that students in this class have experience with MongoDB, I'd like to start with Document data stores
- I need volunteers for next week!