CouchDB
Lecture 13 — Data Engineering — Spring 2015
February 24, 2015
Credit Where Credit is Due
Document Database
- CouchDB is a Document NoSQL Database
- It is implemented in Erlang, a language that has seen a lot of use in telecommunications
- Erlang has built-in support for massive concurrency, fault tolerance, and distributed systems
- All of these features are on display in the design of CouchDB
- CouchDB's design embraces the web
- RESTful API
- Built for Distribution
- High availability, achieved by trading strong consistency for eventual consistency
Document Model
- Document databases: self-contained data
- CouchDB stores documents
- Each document contains everything that might be needed by an application that makes use of it
- Rather than containing a few data elements and then pointers (i.e. foreign keys) to other related data elements
- No schema is enforced: each document can have a different set of attributes
- Allows for natural modeling of domains
- Attributes can contain embedded documents
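As a small illustration of the points above, here is a sketch of two documents that could sit side by side in one database. The documents, their field names, and the `renderPost` helper are all hypothetical and exist only for this example.

```javascript
// Two hypothetical documents that could live in the same database.
// No schema forces them to share attributes.
const post = {
  _id: "post-1",
  type: "post",
  title: "Hello CouchDB",
  author: { name: "Alice", email: "alice@example.com" }, // embedded document
  comments: [                                            // embedded list of documents
    { by: "Bob", text: "Nice post" }
  ]
};

const photo = {
  _id: "photo-1",
  type: "photo",
  caption: "Sunset",
  width: 1024,
  height: 768
};

// Everything needed to render the post is in one object: no joins, no foreign keys.
function renderPost(doc) {
  const lines = [doc.title + " by " + doc.author.name];
  for (const c of doc.comments) {
    lines.push("  " + c.by + ": " + c.text);
  }
  return lines.join("\n");
}

console.log(renderPost(post));
console.log(Object.keys(photo));
```

In a relational design the author and the comments would typically live in separate tables and be joined at query time; here they travel with the post.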
CAP Theorem
- When designing a distributed data store, there are issues you must confront as soon as your system has more than one server running
- The issues are:
- Consistency: All clients see the same data even in the presence of concurrent updates
- Availability: All clients are able to read or write the data store when they want
- Partition Tolerance: The database can be split across multiple servers and keeps operating even when those servers temporarily cannot reach each other
- The CAP Theorem states: PICK ANY TWO
Choices, Choices
- Picking two of the characteristics for your system provides very different capabilities
- Consistency and Availability: What relational databases provide; they have low partition tolerance, as we discussed in Lecture 12
- Availability and Partition Tolerance: Provides the ability to scale horizontally and always be available for requests but can only guarantee eventual consistency
- Consistency and Partition Tolerance: Able to provide consistency across multiple databases at the price of not always being available for client requests
- CouchDB (and many NoSQL systems) choose the second option: availability and partition tolerance!
Where the Magic Happens (1)
- Internally, CouchDB makes use of a B-tree storage engine
- Automatic sorting; Allows searches, insertions, and deletions to occur in logarithmic time
- Employs MapReduce over this B-tree to compute views of the data (more on this later), allowing for parallel and incremental computation
- No Locking
- In the same way that Git does not require locks to allow multiple people to edit a repo, CouchDB does not require locks for concurrent access to its documents
- Each read of a document returns the version of the document that was the latest when the read started
- A new version of the document can be written while a read occurs; the next read will return the new document
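The lock-free reads described above can be pictured with a toy in-memory store. This is only a sketch of the idea (append a new version, never overwrite); the `VersionedStore` class and its methods are hypothetical and say nothing about how CouchDB's storage engine is actually written.

```javascript
// Toy multi-version store: writes append a new version instead of
// overwriting, so a reader never has to wait for a writer.
class VersionedStore {
  constructor() {
    this.versions = new Map(); // id -> array of immutable versions
  }

  write(id, body) {
    const history = this.versions.get(id) || [];
    history.push(Object.freeze({ ...body }));
    this.versions.set(id, history);
  }

  // A read pins whatever version is the latest at the moment it starts.
  beginRead(id) {
    const history = this.versions.get(id) || [];
    const snapshot = history[history.length - 1];
    return () => snapshot;
  }
}

const store = new VersionedStore();
store.write("doc", { years: 17 });

const finishRead = store.beginRead("doc");   // a read starts
store.write("doc", { years: 18 });           // a write lands mid-read, no lock needed

console.log(finishRead().years);             // the in-flight read still sees 17
console.log(store.beginRead("doc")().years); // the next read sees 18
```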
Where the Magic Happens (2)
- Validation: Validation functions can be written in JavaScript for a particular class of document
- Each time an update for a document is submitted, the proposed change is passed to the validation function
- The validation function can then choose to approve or deny the update
- This can reduce the amount of work on the client and can ensure that a bad client cannot maliciously insert bad data into the system
- Incremental Replication
- You can choose to synchronize data between two servers whenever you like
- Once replication is done, each copy is independent (like git)
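Returning to the validation functions described above, a sketch may help. In CouchDB such a function is stored in a design document under `validate_doc_update`; it receives the proposed document, the currently stored one, and the user context, and rejects an update by throwing an object with a `forbidden` field. The grocery rules and the `tryUpdate` harness below are assumptions made up for illustration.

```javascript
// Validation function in the shape CouchDB expects: newDoc is the proposed
// document, oldDoc the stored one (null on creation), userCtx the caller.
// Throwing { forbidden: ... } rejects the update; returning accepts it.
function validateGroceryItem(newDoc, oldDoc, userCtx) {
  if (newDoc._deleted) return; // always allow deletions
  if (typeof newDoc.item !== "string" || newDoc.item === "") {
    throw { forbidden: "item must be a non-empty string" };
  }
  var prices = newDoc.prices || {};
  for (var shop in prices) {
    if (typeof prices[shop] !== "number" || prices[shop] < 0) {
      throw { forbidden: "price for " + shop + " must be a non-negative number" };
    }
  }
}

// Hypothetical local harness that mimics how the outcome would be reported.
function tryUpdate(newDoc, oldDoc) {
  try {
    validateGroceryItem(newDoc, oldDoc, { name: null, roles: [] });
    return { ok: true };
  } catch (e) {
    return { error: "forbidden", reason: e.forbidden };
  }
}

console.log(tryUpdate({ item: "orange", prices: { "Fresh Mart": 1.99 } }, null));
console.log(tryUpdate({ item: "orange", prices: { "Fresh Mart": "cheap" } }, null));
```

Because the check runs on the server, it applies to every client, including ones that skip their own validation.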
Where the Magic Happens (3)
- Merge Conflicts?
- Like git, you can encounter a merge conflict
- Each document in CouchDB has an id (_id) and a revision id (_rev).
- When you update a document on two different servers and then synchronize, CouchDB automatically detects the conflict and deterministically picks one revision as the winner
- The losing revision is kept rather than discarded, so the application can inspect the conflict and decide what to do (again, just like git)
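The following toy sketch shows one way two servers can agree on a winner without talking to each other: compare the numeric prefix of `_rev` first, then the rest of the string. It is a simplification under stated assumptions. Real CouchDB compares full revision histories, and `parseRev`, `pickWinner`, and `sync` are hypothetical helpers, not CouchDB functions.

```javascript
// Split "2-abc" into its generation number and hash part.
function parseRev(rev) {
  const dash = rev.indexOf("-");
  return { gen: Number(rev.slice(0, dash)), hash: rev.slice(dash + 1) };
}

// Deterministic winner: higher generation first, then the larger hash string.
// Every server applying the same rule arrives at the same winner.
function pickWinner(docA, docB) {
  const a = parseRev(docA._rev);
  const b = parseRev(docB._rev);
  if (a.gen !== b.gen) return a.gen > b.gen ? docA : docB;
  return a.hash >= b.hash ? docA : docB;
}

// Toy one-way sync from source to target (both are Maps of id -> document).
function sync(source, target) {
  const conflicts = [];
  for (const [id, incoming] of source) {
    const existing = target.get(id);
    if (!existing) { target.set(id, incoming); continue; }
    if (existing._rev === incoming._rev) continue; // already in sync
    const winner = pickWinner(existing, incoming);
    target.set(id, winner);
    if (parseRev(existing._rev).gen === parseRev(incoming._rev).gen) {
      // Same generation, different content: both servers edited independently.
      const loser = winner === existing ? incoming : existing;
      conflicts.push({ id: id, winner: winner._rev, loser: loser._rev });
    }
  }
  return conflicts;
}

const serverA = new Map([["doc", { _id: "doc", _rev: "2-aaa", years_at_CU: 18 }]]);
const serverB = new Map([["doc", { _id: "doc", _rev: "2-bbb", years_at_CU: 19 }]]);

const found = sync(serverA, serverB);
console.log(found);
console.log(serverB.get("doc").years_at_CU);
```

The list returned by `sync` is what an application would examine if it wanted to merge the two edits itself instead of accepting the automatic winner.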
Installing CouchDB
- You can download CouchDB from the official website
- I recommend using a package manager
- On MacOS X with Homebrew:
Trying It Out
- In one window:
$ couchdb
- In another window:
$ curl http://127.0.0.1:5984/
{
"couchdb":"Welcome",
"uuid":"c4d4f27dc0623da471f82050a8b35f55",
"version":"1.6.1",
"vendor": {
"version":"1.6.1-1",
"name":"Homebrew"}
}
Ask for a List of Databases
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users"]
- These are internal databases used by CouchDB
Create a New Database
$ curl -X PUT http://127.0.0.1:5984/tweets
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","tweets"]
- The tweets database was created and now appears in the list
Delete a Database
$ curl -X PUT http://127.0.0.1:5984/delete_me
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","delete_me","tweets"]
$ curl -X DELETE http://127.0.0.1:5984/delete_me
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","tweets"]
- CRUD operations are supported
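The curl commands above all follow one pattern: an HTTP method applied to a URL that names the database. A minimal sketch of that mapping follows; the `dbRequest` helper and its action names are hypothetical, and it only describes the request without sending it.

```javascript
const BASE = "http://127.0.0.1:5984"; // server address used in the commands above

// Database-level actions and the HTTP method each one maps to.
const METHODS = new Map([
  ["create", "PUT"],
  ["info", "GET"],
  ["delete", "DELETE"]
]);

// Hypothetical helper: describe the request for an action on a database.
function dbRequest(action, name) {
  if (!METHODS.has(action)) throw new Error("unknown action: " + action);
  return { method: METHODS.get(action), url: BASE + "/" + encodeURIComponent(name) };
}

console.log(dbRequest("create", "tweets"));
console.log(dbRequest("delete", "delete_me"));
// With Node 18 or later, a request could be sent with the built-in fetch, e.g.
//   fetch(req.url, { method: req.method }).then(r => r.json())
```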
Web-Based Interface
- Visit <http://127.0.0.1:5984/_utils/> to access CouchDB's built-in web administration interface
- You can use this interface to create documents, view documents, query databases, create/delete databases, etc.
DEMO
- List Databases
- Validate Installation
- View documents
- Delete Databases
- Create a New Database
- Create a New Document
- Auto-creates _id and _rev
- View Document as JSON Object
Creating Documents: REST API
- To create a document via the REST API, you have to PUT a JSON document to a URL of the form:
- http://127.0.0.1:5984/<database>/<UUID>
- How do you get a UUID?
- You can ask CouchDB for as many as you need!
$ curl -X GET http://127.0.0.1:5984/_uuids
{"uuids":["8cf97aa5465a652dd49157457b000910"]}
$ curl -X GET 'http://127.0.0.1:5984/_uuids?count=10'
{"uuids":[
"8cf97aa5465a652dd49157457b000a5e",
"8cf97aa5465a652dd49157457b0011c2",
"8cf97aa5465a652dd49157457b001264",
"8cf97aa5465a652dd49157457b001a65",
"8cf97aa5465a652dd49157457b002119",
"8cf97aa5465a652dd49157457b0025d6",
"8cf97aa5465a652dd49157457b0026f7",
"8cf97aa5465a652dd49157457b00298b",
"8cf97aa5465a652dd49157457b002e2f",
"8cf97aa5465a652dd49157457b00323d"]}
Creating Documents: REST API
- Once you have a UUID, you're ready to create a document
- Let's add a new document to our hello_world database
$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
    -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 17}'
{ "ok":true,
"id":"8cf97aa5465a652dd49157457b000910",
"rev":"1-bf4d017af7025ecf0de2f9bd94b3c14f"}
- As you can see, CouchDB uses the id that we supplied and generates a revision id.
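The same request can be described from a program. The sketch below builds the PUT without sending it. As an assumption for illustration, it generates the id locally as 32 random hex characters (the same shape as the values returned by `/_uuids`) rather than asking the server; `newId` and `createDocRequest` are hypothetical names.

```javascript
const { randomBytes } = require("crypto");

// 16 random bytes rendered as 32 hex characters. This imitates the shape of
// the ids shown above; how CouchDB generates its own ids is not modeled here.
function newId() {
  return randomBytes(16).toString("hex");
}

// Describe the PUT that creates `doc` in database `db`; nothing is sent.
function createDocRequest(db, doc, id = newId()) {
  return {
    method: "PUT",
    url: "http://127.0.0.1:5984/" + db + "/" + id,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc)
  };
}

const req = createDocRequest("hello_world", {
  name: "Ken Anderson",
  twitter: "@kenbod",
  years_at_CU: 17
});
console.log(req.method, req.url);
```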
Updating a Document (1)
- When you want to update a document, you have to follow this workflow:
- Read the entire document
- Provides you with the _rev id
- Change it
- Store the updated document
- Sending back the _rev id you received
- I discussed this approach for updating resources in the context of distributed systems at the beginning of the semester
- Each time a document is updated, a new _rev id is generated
- If you send an update with the wrong _rev id, CouchDB detects that you were operating on stale data and refuses the update
Updating a Document (2)
$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
    -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 18}'
{"error":"conflict",
"reason":"Document update conflict."}
- Here, I tried to update the number of years I've been at CU without including the _rev id
Updating a Document (3)
- To make this work, I need to include the _rev id
$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
    -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 18, "_rev": "1-bf4d017af7025ecf0de2f9bd94b3c14f"}'
{"ok":true,"id":"8cf97aa5465a652dd49157457b000910",
"rev":"2-14e8d90d996430b6a7a1e8eda76eeefe"}
- Now the update is applied and a new _rev id is generated.
- Note the prefix of the _rev id
- 1-... and 2-...: the prefix counts how many times the document has been written (the initial creation is 1)
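The read, change, write-back workflow and the `_rev` check can be sketched with a toy in-memory store. `ToyDb` is hypothetical, and the way it derives the hash part of `_rev` (a truncated SHA-256 of the body) is an assumption chosen for the example, not CouchDB's actual algorithm.

```javascript
const { createHash } = require("crypto");

// Toy document store that enforces the _rev check described above.
class ToyDb {
  constructor() {
    this.docs = new Map();
  }

  get(id) {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : null;
  }

  put(id, doc) {
    const current = this.docs.get(id);
    const currentRev = current ? current._rev : undefined;
    // The caller must present the revision it read; anything else is stale.
    if (doc._rev !== currentRev) {
      return { error: "conflict", reason: "Document update conflict." };
    }
    const generation = current ? Number(current._rev.split("-")[0]) + 1 : 1;
    // Assumption: a truncated SHA-256 stands in for the real revision hash.
    const hash = createHash("sha256").update(JSON.stringify(doc)).digest("hex").slice(0, 32);
    const rev = generation + "-" + hash;
    this.docs.set(id, { ...doc, _id: id, _rev: rev });
    return { ok: true, id: id, rev: rev };
  }
}

const db = new ToyDb();
const created = db.put("ken", { name: "Ken Anderson", years_at_CU: 17 });

// Update without _rev: rejected, like the first curl attempt.
const rejected = db.put("ken", { name: "Ken Anderson", years_at_CU: 18 });
console.log(rejected.error);

// Read, change, write back with the _rev we were given: accepted.
const doc = db.get("ken");
doc.years_at_CU = 18;
const updated = db.put("ken", doc);
console.log(created.rev.split("-")[0], updated.rev.split("-")[0]);
```

If a second client had written in between the `get` and the `put`, the stored `_rev` would have moved on and this client's write would be refused instead of silently overwriting the other change.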
Views in CouchDB
- Let's create our first view
- Views are calculated via MapReduce
- CouchDB's GUI allows for our Map and Reduce functions to be specified in the browser
- Let's create a few documents about grocery items with price information and then sort by price
Example Document
{
  "_id": "00a271787f89c0ef2e10e88a0c0003f0",
  "_rev": "1-e9680c5d9a688b4ff8dd68549e8e072c",
  "item": "orange",
  "prices": {
    "Fresh Mart": 1.99,
    "Price Max": 3.19,
    "Citrus Circus": 1.09
  }
}
Example Map Function
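function(doc) {
  var shop, price, value;
  // Skip documents that are not grocery items
  if (doc.item && doc.prices) {
    // Emit one row per shop, keyed by the price at that shop
    for (shop in doc.prices) {
      price = doc.prices[shop];
      value = [doc.item, shop];
      emit(price, value);
    }
  }
}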
Results
- When the Map function is applied to our documents, it produces (i.e. emits) a bunch of small documents with price as the key. CouchDB automatically sorts the view by this key
- Example:
| Key | Value |
| --- | --- |
| 0.79 | ["apple", "Apples Express"] |
| 0.79 | ["banana", "Price Max"] |
| 1.09 | ["orange", "Citrus Circus"] |
| ... | ... |
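To see where these rows come from, the map step can be imitated outside CouchDB. The sketch below is an assumption-laden stand-in: `buildView` is a hypothetical helper, `emit` is passed in as a parameter instead of being a global, and the sample data contains only the prices that appear on these slides.

```javascript
// Sample data limited to prices shown on the slides; the _id values are made up.
const groceries = [
  { _id: "1", item: "apple", prices: { "Fresh Mart": 1.59, "Price Max": 5.99, "Apples Express": 0.79 } },
  { _id: "2", item: "banana", prices: { "Price Max": 0.79 } },
  { _id: "3", item: "orange", prices: { "Fresh Mart": 1.99, "Price Max": 3.19, "Citrus Circus": 1.09 } }
];

// Run a map function over every document, collect what it emits,
// then order the rows by key the way a view is ordered.
function buildView(docs, mapFn, compareKeys) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ key: key, value: value, id: doc._id });
    mapFn(doc, emit);
  }
  return rows.sort((a, b) => compareKeys(a.key, b.key));
}

// Same logic as the map function above, with emit passed in explicitly.
function byPrice(doc, emit) {
  if (doc.item && doc.prices) {
    for (const shop in doc.prices) {
      emit(doc.prices[shop], [doc.item, shop]);
    }
  }
}

const view = buildView(groceries, byPrice, (a, b) => a - b);
for (const row of view.slice(0, 3)) {
  console.log(row.key, JSON.stringify(row.value));
}
```

Each document can emit several rows, which is why three documents produce seven rows here.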
Different Map Function
Change the emitted key, get different results
function(doc) {
  var shop, price, key;
  if (doc.item && doc.prices) {
    for (shop in doc.prices) {
      price = doc.prices[shop];
      // Compound key: rows sort by item first, then by price
      key = [doc.item, price];
      emit(key, shop);
    }
  }
}
Results
| Key | Value |
| --- | --- |
| ["apple", 0.79] | "Apples Express" |
| ["apple", 1.59] | "Fresh Mart" |
| ["apple", 5.99] | "Price Max" |
| ["banana", 0.79] | "Price Max" |
| ... | ... |
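Array keys are ordered element by element, which is what groups the rows by item and then by price. A rough sketch of that ordering follows. The comparator only handles keys shaped like `[string, number]`, which is far simpler than CouchDB's real collation rules, and `compareKeys` and `mapItemPrice` are hypothetical names.

```javascript
// Sample data limited to prices shown on the slides; the _id values are made up.
const groceries = [
  { _id: "1", item: "apple", prices: { "Fresh Mart": 1.59, "Price Max": 5.99, "Apples Express": 0.79 } },
  { _id: "2", item: "banana", prices: { "Price Max": 0.79 } },
  { _id: "3", item: "orange", prices: { "Fresh Mart": 1.99, "Price Max": 3.19, "Citrus Circus": 1.09 } }
];

// Compare [item, price] keys element by element: item first, then price.
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// Emit [item, price] as the key and the shop as the value.
function mapItemPrice(doc, emit) {
  if (doc.item && doc.prices) {
    for (const shop in doc.prices) {
      emit([doc.item, doc.prices[shop]], shop);
    }
  }
}

const rows = [];
for (const doc of groceries) {
  mapItemPrice(doc, (key, value) => rows.push({ key: key, value: value }));
}
rows.sort((x, y) => compareKeys(x.key, y.key));

// Rows for one item are adjacent and price-ordered, so the cheapest shop
// for each item is simply the first row seen for that item.
const cheapest = new Map();
for (const row of rows) {
  if (!cheapest.has(row.key[0])) cheapest.set(row.key[0], row.value);
}

console.log(rows.map(r => JSON.stringify(r.key) + " " + r.value));
console.log(cheapest);
```

Choosing the key therefore decides which questions the view answers cheaply: this one makes "cheapest shop per item" a matter of reading the first row of each group.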
Replication
- We can demonstrate CouchDB's support for replication in the browser as well
- First, we create an empty database: food_copy
- Second, click Replicator in the side bar
- Specify groceries as the source and food_copy as the destination
- Click Replicate
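The same replication can also be requested over the REST API by POSTing a JSON body that names the source and target to `/_replicate`. The sketch below only builds a description of that request and does not send it; `replicationRequest` is a hypothetical helper.

```javascript
// Describe the HTTP request that asks CouchDB to replicate one database
// into another. Nothing is sent; hand the result to an HTTP client to run it.
function replicationRequest(source, target) {
  return {
    method: "POST",
    url: "http://127.0.0.1:5984/_replicate",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ source: source, target: target })
  };
}

const req = replicationRequest("groceries", "food_copy");
console.log(req.method, req.url, req.body);
```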
Creating Web Apps in CouchDB
- CouchDB has a powerful concept called Design Documents
- Design Documents allow you to create web apps that are served directly out of CouchDB
- I'm not going to cover Design Documents today
Twitter Example
- Instead, let's see how we can use CouchDB to store and search tweet objects from the Twitter API
- Recall that I downloaded ~300MB of tweet objects last week
- We'll use those objects to set up Thursday's lecture, where I will go into detail on CouchDB views and its MapReduce capabilities.
- Straightforward Ruby Code
- Since CouchDB has a REST interface, I make use of Typhoeus
- Input file is a "JSON file": one JSON object per line
- get_uuid: asks CouchDB for a uuid
- insert_tweet: inserts a tweet with the given uuid
- Program loops through the input file and inserts each tweet into a "tweets" database (previously created)
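The lecture's program is written in Ruby with Typhoeus and is not reproduced here. As a rough stand-in, the following JavaScript sketch mirrors the structure described above under several assumptions: `getUuid` generates ids locally instead of asking CouchDB, `insertTweetRequest` only describes the PUT rather than sending it, and the two sample tweets are made up.

```javascript
const { randomBytes } = require("crypto");

// Stand-in for get_uuid: generated locally here instead of asking CouchDB.
function getUuid() {
  return randomBytes(16).toString("hex");
}

// Stand-in for insert_tweet: describe the PUT for one tweet; nothing is sent.
function insertTweetRequest(tweet, uuid) {
  return {
    method: "PUT",
    url: "http://127.0.0.1:5984/tweets/" + uuid,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(tweet)
  };
}

// Main loop: one JSON object per line, blank lines skipped.
function requestsForFile(text) {
  return text
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => insertTweetRequest(JSON.parse(line), getUuid()));
}

// Two made-up tweet objects standing in for the real input file.
const sample = '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n';
const requests = requestsForFile(sample);
console.log(requests.length, requests[0].method);
```

Reading the file line by line is what makes the one-object-per-line format convenient: each line parses on its own, so the whole 300MB never has to be parsed as a single JSON value.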
Coming Up Next
- More info on CouchDB's features
- Views (includes MapReduce)
- Design Documents
- Introduction to MongoDB