CouchDB

Lecture 13 — Data Engineering — Spring 2015

February 24, 2015

Credit Where Credit is Due

Document Database

  • CouchDB is a Document NoSQL Database
  • It is implemented in Erlang, a language that has seen a lot of use in telecommunications
    • Erlang has built-in support for massive concurrency, fault tolerance, and distributed systems
    • All of these features are on display in the design of CouchDB
  • CouchDB's design embraces the web
    • RESTful API
    • Built for Distribution
    • High availability, achieved by trading strong consistency for eventual consistency

Document Model

  • Document databases: self-contained data
  • CouchDB stores documents
  • Each document contains everything that might be needed by an application that makes use of it
    • Rather than containing a few data elements and then pointers (i.e. foreign keys) to other related data elements
  • No schema is enforced: each document can have a different set of attributes
    • Allows for natural modeling of domains
    • Attributes can contain embedded documents
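
As a small illustration of the points above, here are two documents that could live in the same database. The field names and values are invented for this sketch; they do not come from a real dataset.

```javascript
// Two hypothetical documents that could sit side by side in one database.
// No schema is enforced, so they are free to carry different attributes.
var person = {
  _id: "person-1",            // made-up id, for illustration only
  name: "Ada",
  // An embedded document: the address travels with the person
  // instead of living elsewhere behind a foreign key.
  address: { city: "Boulder", state: "CO" }
};

var book = {
  _id: "book-1",              // made-up id
  title: "An Example Book",
  authors: ["A. Writer", "B. Writer"]   // attributes can also hold arrays
};

// Everything the application needs is reachable from the document itself.
console.log(person.address.city);
console.log(Object.keys(book).join(","));
```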

CAP Theorem

  • When designing a distributed data store, there are issues you must confront as soon as your system has more than one server running
  • The issues are:
    • Consistency: All clients see the same data even in the presence of concurrent updates
    • Availability: All clients are able to read or write the data store when they want
    • Partition Tolerance: The database can be split across multiple servers and keeps operating when network failures cut those servers off from one another
  • The CAP Theorem states: PICK ANY TWO

Choices, Choices

  • Picking two of the characteristics for your system provides very different capabilities
    • Consistency and Availability: What relational databases provide; they have low partition tolerance, as we discussed in Lecture 12
    • Availability and Partition Tolerance: Provides the ability to scale horizontally and always be available for requests but can only guarantee eventual consistency
    • Consistency and Partition Tolerance: Able to provide consistency across multiple databases at the price of not always being available for client requests
  • CouchDB, like many NoSQL systems, chooses the second option: availability and partition tolerance

Where the Magic Happens (1)

  • Internally, CouchDB makes use of a B-tree storage engine
    • Keys are kept sorted automatically; searches, insertions, and deletions run in logarithmic time
  • Employs MapReduce over this B-tree to compute views of the data (more on this later) allowing for parallel and incremental computation
  • No Locking
    • In the same way that Git does not require locks to allow multiple people to edit a repo, CouchDB does not require locks for concurrent access to its documents
    • Each read of a document returns the version that was the latest when the read started
    • A new version of the document can be written while a read occurs; the next read will return the new document
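
The lock-free behavior described above can be pictured with a tiny in-memory model. This is only a sketch under simplifying assumptions: `VersionedStore` and its methods are hypothetical names, and nothing here reflects how CouchDB's storage engine is actually written.

```javascript
// Minimal in-memory sketch of lock-free, versioned reads.
// Hypothetical helper names; this is not CouchDB's storage engine.
function VersionedStore() {
  this.versions = {};   // doc id -> array of immutable versions, oldest first
}

// A write never modifies an existing version; it appends a new one.
VersionedStore.prototype.write = function (id, body) {
  var list = this.versions[id] || (this.versions[id] = []);
  list.push(Object.freeze(JSON.parse(JSON.stringify(body))));
  return list.length;   // version number of the new entry
};

// A read captures whichever version is the latest when it starts.
VersionedStore.prototype.read = function (id) {
  var list = this.versions[id] || [];
  return list[list.length - 1];
};

var store = new VersionedStore();
store.write("doc", { count: 1 });

var inProgressRead = store.read("doc");   // a reader starts here
store.write("doc", { count: 2 });         // a writer proceeds without waiting

console.log(inProgressRead.count);        // prints 1: the reader keeps its version
console.log(store.read("doc").count);     // prints 2: the next read sees the new one
```

Because old versions are never changed in place, the reader and the writer never need to wait for each other.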

Where the Magic Happens (2)

  • Validation: Validation functions can be written in JavaScript for a particular class of document
    • Each time an update for a document is submitted, the proposed change is passed to the validation function
    • The validation function can then choose to approve or deny the update
    • This can reduce the amount of work on the client and can ensure that a bad client cannot maliciously insert bad data into the system
  • Incremental Replication
    • You can choose to synchronize data between two servers whenever you like
    • Once replication is done, each copy is independent (like Git)
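
The validation bullets earlier in this list can be made concrete with a short sketch. In CouchDB, a function of this shape is stored under the `validate_doc_update` field of a design document and denies an update by throwing an object with a `forbidden` key. The `type` and `item` fields and the `check` harness are assumptions made for this example.

```javascript
// Sketch of a validation function in the shape CouchDB expects
// (stored under "validate_doc_update" in a design document).
// The "type" and "item" fields are assumptions made for this example.
function validate(newDoc, oldDoc, userCtx) {
  if (newDoc._deleted) {
    return;                       // allow deletions through unchanged
  }
  if (newDoc.type === "grocery" && typeof newDoc.item !== "string") {
    // Throwing an object with a "forbidden" key denies the update.
    throw { forbidden: "grocery documents need an item name" };
  }
  // Returning normally approves the update.
}

// Local harness (hypothetical) so the sketch runs outside CouchDB.
function check(doc) {
  try {
    validate(doc, null, { name: null, roles: [] });
    return "approved";
  } catch (e) {
    return "denied: " + e.forbidden;
  }
}

console.log(check({ type: "grocery", item: "orange" }));   // approved
console.log(check({ type: "grocery" }));                   // denied: ...
```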

Where the Magic Happens (3)

  • Merge Conflicts?
    • Like Git, you can encounter a merge conflict
    • Each document in CouchDB has an id (_id) and a revision id (_rev)
    • When you update a document on two different servers and then synchronize, CouchDB can automatically detect and resolve conflicts
    • If CouchDB cannot automatically resolve the conflict, then it allows the application to decide what to do (again, just like Git)
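
One way to picture automatic resolution is a deterministic rule that every server applies identically, so all replicas agree on a winner without talking to each other. The sketch below approximates the rule given in CouchDB's documentation (longest revision history wins, ties broken by comparing the _rev strings). It makes a simplifying assumption: the numeric prefix of _rev stands in for the history length. The function names and revision ids are made up.

```javascript
// Sketch: choose a winner among conflicting revisions of one document.
// Simplification (assumption): the numeric prefix of _rev stands in for
// the length of the revision history; ties fall back to string order.
function revNumber(rev) {
  return parseInt(rev.split("-")[0], 10);
}

function pickWinner(revs) {
  var sorted = revs.slice().sort(function (a, b) {
    var byNumber = revNumber(b) - revNumber(a);      // higher number first
    if (byNumber !== 0) return byNumber;
    return a < b ? 1 : a > b ? -1 : 0;               // then higher string first
  });
  return { winner: sorted[0], conflicts: sorted.slice(1) };
}

// Made-up revision ids from two servers that edited the same document.
var result = pickWinner(["2-aaa", "2-bbb"]);
console.log(result.winner);     // 2-bbb
console.log(result.conflicts);  // the losing revision stays available to the app
```

The losing revisions are kept rather than discarded, which is what lets the application step in and merge them when the automatic choice is not the right one.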

Installing CouchDB

  • You can download CouchDB from the official website
  • I recommend using a package manager
    • On Mac OS X with Homebrew:
      • brew install couchdb

Trying It Out

  • In one window: couchdb
  • In another window:

$ curl http://127.0.0.1:5984/
{
  "couchdb":"Welcome",
  "uuid":"c4d4f27dc0623da471f82050a8b35f55",
  "version":"1.6.1",
  "vendor": {
    "version":"1.6.1-1",
    "name":"Homebrew"}
}

Ask for a List of Databases


$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users"]
          
  • These are internal databases used by CouchDB

Create a New Database


$ curl -X PUT http://127.0.0.1:5984/tweets
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","tweets"]
          
  • The tweets database was created and now appears in the list

Delete a Database


$ curl -X PUT http://127.0.0.1:5984/delete_me
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","delete_me","tweets"]
$ curl -X DELETE http://127.0.0.1:5984/delete_me
{"ok": true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_replicator","_users","tweets"]
          
  • CRUD operations are supported

Web-Based Interface

  • Visit <http://127.0.0.1:5984/_utils/> to access CouchDB's built-in web administration interface
  • You can use this interface to create documents, view documents, query databases, create/delete databases, etc.

DEMO

  • List Databases
  • Validate Installation
  • View documents
  • Delete Databases
  • Create a New Database
  • Create a New Document
    • Auto-creates _id and _rev
  • View Document as JSON Object

Creating Documents: REST API

  • To create a document via the REST API, you have to PUT a JSON document to a URL of the form:
    • http://127.0.0.1:5984/<database>/<UUID>
  • How do you get a UUID?
    • You can ask CouchDB for as many as you need!

$ curl -X GET http://127.0.0.1:5984/_uuids
{"uuids":["8cf97aa5465a652dd49157457b000910"]}
$ curl -X GET 'http://127.0.0.1:5984/_uuids?count=10'
{"uuids":[
  "8cf97aa5465a652dd49157457b000a5e",
  "8cf97aa5465a652dd49157457b0011c2",
  "8cf97aa5465a652dd49157457b001264",
  "8cf97aa5465a652dd49157457b001a65",
  "8cf97aa5465a652dd49157457b002119",
  "8cf97aa5465a652dd49157457b0025d6",
  "8cf97aa5465a652dd49157457b0026f7",
  "8cf97aa5465a652dd49157457b00298b",
  "8cf97aa5465a652dd49157457b002e2f",
  "8cf97aa5465a652dd49157457b00323d"]}
          

Creating Documents: REST API

  • Once you have a UUID, you're ready to create a document
  • Let's add a new document to our hello_world database

$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
  -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 17}'

{ "ok":true,
  "id":"8cf97aa5465a652dd49157457b000910",
  "rev":"1-bf4d017af7025ecf0de2f9bd94b3c14f"}
          
  • As you can see, CouchDB uses the id that we supplied and generates a revision id.

Updating a Document (1)

  • When you want to update a document, you have to follow this workflow:
    • Read the entire document
      • Provides you with the _rev id
    • Change it
    • Store the updated document
      • Sending back the _rev id you received
  • I discussed this approach for updating resources in the context of distributed systems at the beginning of the semester
    • Each time a document is updated, a new _rev id is generated
    • If you send an update with the wrong _rev id, CouchDB detects that you were operating on stale data and refuses the update
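
The read, change, write-back workflow above can be simulated without a server. The sketch below uses a hypothetical in-memory store; `put` and `get` are stand-ins for the HTTP requests, and the revision ids are simplified (real ones also contain a hash).

```javascript
// In-memory sketch of the read / change / write-back workflow.
// Hypothetical store; real revision ids also contain a hash, omitted here.
var docs = {};

function put(id, doc) {
  var current = docs[id];
  // An update must carry the _rev of the version it was based on.
  if (current && doc._rev !== current._rev) {
    return { error: "conflict", reason: "Document update conflict." };
  }
  var n = current ? parseInt(current._rev, 10) + 1 : 1;
  var stored = JSON.parse(JSON.stringify(doc));
  stored._rev = n + "-rev";            // simplified revision id
  docs[id] = stored;
  return { ok: true, id: id, rev: stored._rev };
}

function get(id) {
  return JSON.parse(JSON.stringify(docs[id]));
}

put("ken", { years_at_CU: 17 });       // create: becomes revision 1-rev

var mine = get("ken");                 // 1. read (includes _rev)
var theirs = get("ken");               // another client reads the same version

mine.years_at_CU = 18;                 // 2. change
var accepted = put("ken", mine);       // 3. store, sending back the _rev
console.log(accepted);                 // accepted; new revision is 2-rev

theirs.years_at_CU = 19;               // this edit is based on stale data
var refused = put("ken", theirs);
console.log(refused);                  // refused with a conflict error
```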

Updating a Document (2)

  • Let's see this in action

$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
  -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 18}'

{"error":"conflict",
 "reason":"Document update conflict."}
          
  • Here, I tried to update the number of years I've been at CU without including the _rev id

Updating a Document (3)

  • To make this work, I need to include the _rev id

$ curl -X PUT http://127.0.0.1:5984/hello_world/8cf97aa5465a652dd49157457b000910 \
  -d '{"name": "Ken Anderson", "twitter": "@kenbod", "years_at_CU": 18, "_rev": "1-bf4d017af7025ecf0de2f9bd94b3c14f"}'

{"ok":true,"id":"8cf97aa5465a652dd49157457b000910",
"rev":"2-14e8d90d996430b6a7a1e8eda76eeefe"}
          
  • Now the update is applied and a new _rev id is generated.
  • Note the numeric prefix of the _rev id
    • 1-... and 2-...: the prefix is a revision counter that goes up by one with each successful update!

Views in CouchDB

  • Let's create our first view
  • Views are calculated via MapReduce
  • CouchDB's GUI allows for our Map and Reduce functions to be specified in the browser
  • Let's create a few documents about grocery items with price information and then sort by price

Example Document


{
   "_id": "00a271787f89c0ef2e10e88a0c0003f0",
   "_rev": "1-e9680c5d9a688b4ff8dd68549e8e072c",
   "item": "orange",
   "prices": {
       "Fresh Mart": 1.99,
       "Price Max": 3.19,
       "Citrus Circus": 1.09
   }
}          

Example Map Function


function(doc) {
  var shop, price, value;
  if (doc.item && doc.prices) {
    for (shop in doc.prices) {
      price = doc.prices[shop];
      value = [doc.item, shop];
      emit(price, value);
    }
  }
}
          

Results

  • When the Map function is applied to our documents, it produces (i.e. emits) a bunch of small documents with price as the key. CouchDB automatically sorts the view by this key
  • Example:
Key     Value
0.79    ["apple", "Apples Express"]
0.79    ["banana", "Price Max"]
1.09    ["orange", "Citrus Circus"]
...     ...
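
To see where these rows come from, the map function can be run by hand outside CouchDB. The sketch below is a rough imitation, not CouchDB's view engine: `emit` and `rows` are local stand-ins, the map function is given a name so it can be called, and only the orange document from the example is used as input.

```javascript
// Sketch: run the map function over documents and sort the emitted rows,
// imitating (very roughly) how a view is built. Not CouchDB's view engine.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The example map function, given a name so we can call it.
function map(doc) {
  var shop, price, value;
  if (doc.item && doc.prices) {
    for (shop in doc.prices) {
      price = doc.prices[shop];
      value = [doc.item, shop];
      emit(price, value);
    }
  }
}

// The example document shown earlier (without _id and _rev).
var orange = {
  item: "orange",
  prices: { "Fresh Mart": 1.99, "Price Max": 3.19, "Citrus Circus": 1.09 }
};

[orange].forEach(function (doc) { map(doc); });
rows.sort(function (a, b) { return a.key - b.key; });   // order by price

rows.forEach(function (row) {
  console.log(row.key, JSON.stringify(row.value));
});
```

One document with three prices yields three rows, one per shop, ordered from cheapest to most expensive.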

Different Map Function

Change the emitted key, get different results


function(doc) {
  var shop, price, key;
  if (doc.item && doc.prices) {
    for (shop in doc.prices) {
      price = doc.prices[shop];
      // Compound key: rows sort by item first, then by price
      key = [doc.item, price];
      emit(key, shop);
    }
  }
}
          

Results

Key                 Value
["apple", 0.79]     "Apples Express"
["apple", 1.59]     "Fresh Mart"
["apple", 5.99]     "Price Max"
["banana", 0.79]    "Price Max"
...                 ...
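
The ordering in this table follows from comparing the array keys element by element: first the item name, then the price. The sketch below shows that idea with a hand-written comparator. `compareKeys` is a hypothetical helper, and CouchDB's real collation rules cover more cases (mixed types, Unicode-aware string order) than this does. The apple rows are read off the table above and the orange rows come from the example document.

```javascript
// Sketch: ordering rows whose keys are [item, price] arrays.
// Assumption: comparing element by element with < and > is enough for
// these keys; CouchDB's full collation rules are more involved.
function compareKeys(a, b) {
  for (var i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;   // a shorter key sorts before its extensions
}

// Rows as the map function would emit them, in no particular order.
var emitted = [
  { key: ["orange", 1.99], value: "Fresh Mart" },
  { key: ["apple", 5.99], value: "Price Max" },
  { key: ["orange", 1.09], value: "Citrus Circus" },
  { key: ["apple", 0.79], value: "Apples Express" },
  { key: ["orange", 3.19], value: "Price Max" },
  { key: ["apple", 1.59], value: "Fresh Mart" }
];

emitted.sort(function (a, b) { return compareKeys(a.key, b.key); });
emitted.forEach(function (row) {
  console.log(JSON.stringify(row.key), row.value);
});
```

With this key, all rows for one item sit next to each other and the cheapest shop for that item comes first.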

Replication

  • We can demonstrate CouchDB's support for replication in the browser as well
  • First, we create an empty database: food_copy
  • Second, click Replicator in the sidebar
  • Specify groceries as the source and food_copy as the destination
  • Click Replicate

Creating Web Apps in CouchDB

  • CouchDB has a powerful concept called Design Documents
  • Design Documents allow you to create web apps that are served directly out of CouchDB
  • I'm not going to cover Design Documents today

Twitter Example

  • Instead, let's see how we can use CouchDB to store and search tweet objects from the Twitter API
  • Recall that I downloaded ~300MB of tweet objects last week
  • We'll use those objects to set up Thursday's lecture, where I will go into detail on CouchDB views and its MapReduce capabilities.

Import Tweets

  • Straightforward Ruby Code
  • Since CouchDB has a REST interface, I make use of Typhoeus
  • Input file is a "JSON file": one JSON object per line
  • get_uuid: asks CouchDB for a uuid
  • insert_tweet: inserts a tweet with the given uuid
  • Program loops through the input file and inserts each tweet into a "tweets" database (previously created)
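
The overall shape of such an import loop can be sketched as follows. This is JavaScript rather than the Ruby and Typhoeus code used in the lecture, and it sends nothing over the network: each line of input is turned into a description of the PUT request that would be made. `buildRequests`, the sample input, and the placeholder uuids are all assumptions for illustration.

```javascript
// Sketch of the import loop's shape, in JavaScript rather than the
// lecture's Ruby/Typhoeus code. Nothing is sent over the network here:
// each tweet is turned into a description of the PUT request to make.
var BASE = "http://127.0.0.1:5984/tweets";   // the "tweets" database

function buildRequests(fileContents, uuids) {
  return fileContents
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })   // skip blanks
    .map(function (line, i) {
      var tweet = JSON.parse(line);          // one JSON object per line
      return {
        method: "PUT",
        url: BASE + "/" + uuids[i],          // a uuid obtained from /_uuids
        body: JSON.stringify(tweet)
      };
    });
}

// Two made-up lines standing in for the real input file.
var sample = '{"text": "hello"}\n{"text": "world"}\n';
var requests = buildRequests(sample, ["uuid-1", "uuid-2"]);
console.log(requests.length);
console.log(requests[0].url);
```

A real importer would fetch uuids in batches (the count parameter shown earlier) and issue each request with an HTTP client.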

Coming Up Next

  • More info on CouchDB's features
    • Views (includes MapReduce)
    • Design Documents
  • Introduction to MongoDB