Lecture 17 — Data Engineering

MongoDB, Part Two

Lecture 17 — Data Engineering — Spring 2015

March 10, 2015

Credit Where Credit Is Due

Some of the material in this lecture is adapted from:
- MongoDB: The Definitive Guide, 2nd Edition
- by Kristina Chodorow
Excellent Resource!

Overview

More on Compound Indexes
Full-Text Indexes
Geospatial Indexes
MapReduce

Indexes

Last Lecture: Motivated the Use of Indexes
- Can greatly reduce the number of documents that need to be examined to satisfy a query
- Indeed, is required in order to sort large collections due to the 32MB limit on in-memory sorting
Lot to Learn: We ended with an example where a compound index could not be used to sort its collection in reverse

Terminology

Index Cardinality
- Refers to the number of possible values for an indexed field
- A field like employment status has low cardinality since it has two values: yes/no.
- A field like name has high cardinality since each document might have a different value for that field
- In general, you want indexes on high cardinality fields
  - They have the greatest potential for reducing a query down to a small number of documents

Sort Order (1)

Last time, we had created a compound index that looked like this:
- db.tweets.ensureIndex({'user.created_at': 1, 'created_at': -1});
This indexes a set of tweets
- first by the date each user joined Twitter (ascending)
- and then each user's tweets are sorted from most recent to least recent (descending)

Sort Order (2)

Given this index: {'user.created_at': 1, 'created_at': -1}
We can sort like this:
- db.tweets.find().sort({'user.created_at': 1, 'created_at': -1})
But we can't sort like this:
- db.tweets.find().sort({'user.created_at': -1, 'created_at': -1})
We get an out-of-memory error

Sort Order (3)

It looks like we reversed the sort, but we didn't
- To truly reverse it, we have to imagine multiplying each part of the compound index by -1
So this is the reverse sort of the original:
- db.tweets.find().sort({'user.created_at': -1, 'created_at': 1})
This gives us the oldest tweets of the youngest Twitter user
- Rather than the youngest tweets of the oldest Twitter user

Example (1)

Let's look at an example
We will dynamically generate a MongoDB database/collection with the following program:


for (i=0; i<10000000; i++) {
  db.users.insert(
    {
      "i"        : i,
      "username" : "user"+i,
      "age"      : Math.floor(Math.random()*120),
      "created"  : new Date()
    }
  );
}

This creates a collection with 10M documents
Note: this code took a LONG time to run!

Example (2)

A collection with ten million documents must be handled with care
- Without indexes, all queries are slow
- The collection cannot be sorted except by _id
- Each new index may be expensive in terms of disk space

Example (3)

Without an index, all documents are scanned
db.users.find({username: "user101"}).explain("executionStats")


"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 3150,
  "totalKeysExamined" : 0,
  "totalDocsExamined" : 10000000,
  "executionStages" : {
    "stage" : "COLLSCAN",
    "filter" : {
      "username" : {
        "$eq" : "user101"
      }
    },
    "nReturned" : 1,

Example (4)

We can optimize this particular query by telling Mongo to stop looking as soon as it finds a match
db.users.find({username: "user101"}).limit(1).explain("executionStats")


"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 0,
  "totalDocsExamined" : 102,
  "executionStages" : {
    "stage" : "LIMIT",
    "nReturned" : 1,

But this strategy won't work for all situations

Example (5)

As a result, we use an index
db.users.ensureIndex({"username" : 1})
db.users.find({username: "user101"}).explain("executionStats")


"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 1,
  "totalDocsExamined" : 1,
  "executionStages" : {
    "stage" : "FETCH",
    "nReturned" : 1,

With the index created on username
- queries just on that attribute are now instantaneous
A search on any user is just as fast.

Back To Sorting (1)

If we try to sort this collection like this:
- db.users.find().sort({"age" : 1, "username" : 1})
We fail. The existing index cannot be used and we can't sort 10M documents in memory
So, we create the compound index:
- db.users.ensureIndex({"age" : 1, "username" : 1})

Back to Sorting (2)

We can now work with those two attributes efficiently


> db.users.find({}, {"age": 1, "username": 1, "_id": 0}).sort({"age" : 1, "username" : 1})
{ "age" : 0, "username" : "user1000088" }
{ "age" : 0, "username" : "user100009" }
{ "age" : 0, "username" : "user1000146" }
{ "age" : 0, "username" : "user1000159" }
{ "age" : 0, "username" : "user100022" }
{ "age" : 0, "username" : "user1000305" }
{ "age" : 0, "username" : "user1000351" }
{ "age" : 0, "username" : "user1000365" }
{ "age" : 0, "username" : "user1000935" }
{ "age" : 0, "username" : "user1001269" }
{ "age" : 0, "username" : "user100139" }
{ "age" : 0, "username" : "user1001551" }
{ "age" : 0, "username" : "user1001582" }
{ "age" : 0, "username" : "user1001665" }
{ "age" : 0, "username" : "user1001801" }
{ "age" : 0, "username" : "user1001895" }
{ "age" : 0, "username" : "user1001946" }
{ "age" : 0, "username" : "user1001964" }
{ "age" : 0, "username" : "user1001972" }
{ "age" : 0, "username" : "user100199" }

Back to Sorting (3)

And backwards (but no other sort combinations!)


> db.users.find({}, {"age": 1, "username": 1, "_id": 0}).sort({"age" : -1, "username" : -1})
{ "age" : 119, "username" : "user9999865" }
{ "age" : 119, "username" : "user9999844" }
{ "age" : 119, "username" : "user9999671" }
{ "age" : 119, "username" : "user9999618" }
{ "age" : 119, "username" : "user9999317" }
{ "age" : 119, "username" : "user9999151" }
{ "age" : 119, "username" : "user9999061" }
{ "age" : 119, "username" : "user9998918" }
{ "age" : 119, "username" : "user9998884" }
{ "age" : 119, "username" : "user9998757" }
{ "age" : 119, "username" : "user9998700" }
{ "age" : 119, "username" : "user9998691" }
{ "age" : 119, "username" : "user9998613" }
{ "age" : 119, "username" : "user9998520" }
{ "age" : 119, "username" : "user9998509" }
{ "age" : 119, "username" : "user9998470" }
{ "age" : 119, "username" : "user9998333" }
{ "age" : 119, "username" : "user9998119" }
{ "age" : 119, "username" : "user9998026" }
{ "age" : 119, "username" : "user9997947" }

Indexes Enable Queries (1)

Indexes enable three types of queries
- Point Queries: search for a single value, then traverse index (either direction)
  - db.users.find({"age" : 21}, {"age": 1, "username": 1, "_id": 0}).sort({"username" : 1})
  - db.users.find({"age" : 21}, {"age": 1, "username": 1, "_id": 0}).sort({"username" : -1})

Indexes Enable Queries (2)

Indexes enable three types of queries
- Multi-Value Query: search for a range of values
  - db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}, {"age": 1, "username": 1, "_id": 0})
- This is efficient because age is a prefix of the index.
- Mongo can use that index to narrow down the range of documents

Indexes Enable Queries (3)

Indexes enable three types of queries
- Multi-Value Query With Sort: search for a range of values and then sort
  - db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}, {"age": 1, "username": 1, "_id": 0}).sort({"username": 1})
- This query is less efficient because it can no longer return documents in index order. The index is ordered by age first, then by username.
- As a result, Mongo falls back to the original index on username and goes through that index, picking out documents whose age parameter is in the desired range

Other Index Types

Full-Text Indexes
Geospatial Indexes
There are others, but they are out-of-scope

Full-Text Indexes

MongoDB has support for full text search
To see why it's needed, let's switch back to our snow data set and try to search for a text phrase without an index
We can use a regular expression to find, for instance, all tweets that contain the phrase proper shoes.
- db.tweets.find({"text": /proper shoes/i}, {"_id": 0, "text": 1})
Same old story: MongoDB has to look at every document, taking 700 ms; in this case it finds six tweets with the phrase

Using Full-Text Indexes

Create the full-text index:
- db.tweets.ensureIndex( { "text": "text" } );
- On my machine, this took several minutes!
Use the $text operator to query the index
- db.tweets.find({$text: { $search: "\"proper shoes\""}}, {"_id": 0, "text": 1})
In this case, it finds the six documents in 15 ms having reviewed just 2124 of the 625800 tweets (0.3%)!
Note: we cannot use regular expressions with the $search operator
We had to search for the phrase "proper shoes"
- Otherwise it would have searched for "proper" or "shoes" and returned 2118 documents!

Not a Panacea

In this particular data set, we created it by looking for all tweets that contained the word 'snow'
So, if we search the text index for the word 'snow', it is going to return all of the documents!
Likewise, if we include snow in any search with other terms, we might reduce the number of documents we look at via the other terms but most likely we will still end up looking at most of the index/documents

The $text operator

The $text operator in a find command specifies that you want to use the full-text index
There can only be one full-text index per collection
If you want to have more than one document attribute included in the index, you specify that up front
- db.tweets.ensureIndex({"text": "text", "user.description": "text"})

Query Syntax

The $search operator expects a string that contains the full-text search you would like to perform
A string with individual words is considered an OR query
Words surrounded by quotes is an exact phrase search
Words that have a hyphen in front of them are considered negations
- {$search: "snow -day"}: find all tweets that had the word snow in it but did not have the word day in it.

Compound Text Indexes

You can create a compound index that can help limit the amount of search you do on the text index
db.tweets.ensureIndex({ "user.screen_name": 1, "text": "text"})
Now, when you do a search like this...
- db.tweets.find({"user.screen_name": "kenbod", $text: { $search: "apple ipad iphone watch"}})
... MongoDB will first limit the search to Prof. Anderson's tweets and then look for tweets containing the specified words.

Geospatial Indexes

MongoDB supports two types of geospatial indexes
- We will be looking at just one called the 2dsphere index
This index requires that the data in the attribute that is being indexed is in GeoJSON format.
- Mongo supports GeoJSON points, lines, and polygons

GeoJSON Examples

GeoJSON Point
- { "type": "Point", "coordinates": [100.0, 0.0] }
- Coordinates are specified [longitude, latitude]
GeoJSON LineString
- { "type": "LineString", "coordinates": [ [100.0, 0.0], [101.0, 1.0] ] }
GeoJSON Polygon
- { "type": "Polygon", "coordinates": [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]]}
- Coordinates is an array of arrays of points. The first subarray specifies the exterior of the polygon. Any other arrays specify interior holes.

Creating a 2dsphere Index

Creating a 2dsphere index is simple (once you know that your values are properly formatted GeoJSON)
In our tweets collection, the "coordinates" attribute is formatted in this way
The entire collection has 625,800 tweets
- Of these, 17,242 are geo-tagged
- I used this query to discover that number: db.tweets.find({'coordinates': {$exists: true}}).count()
To create the index:
- db.tweets.ensureIndex({"coordinates": "2dsphere"})

Creating a Query (1)

In most cases, you want to specify a bounding box for the queries that you will make against this index
The available query operators are:
- $geoWithin: Find objects that are completely surrounded by a given bounding box
- $geoIntersects: Find objects that intersect a given bounding box
- $near and $nearSphere: Find objects that are near a given point. The two operators differ by how the calculation is performed. They return objects from nearest to farthest
  - $minDistance and $maxDistance (both specified in meters) can be used to constrain the search

Creating a Query (2)

How do we go about creating a bounding box?
Use <http://geojson.io/>!

Creating a Query (3)

Using that website, I created this bounding box for the City of Boulder


"geometry": {
  "type": "Polygon",
  "coordinates": [
    [
      [
        -105.2548599243164,
        39.957911932580835
      ],
      [
        -105.29022216796875,
        39.989221102071994
      ],
      [
        -105.29193878173828,
        40.063621419947914
      ],
      [
        -105.23975372314453,
        40.05337319344778
      ],
      [
        -105.22361755371092,
        39.98027708862263
      ],
      [
        -105.2548599243164,
        39.957911932580835
      ]
    ]
  ]
}

Creating a Query (4)

We can now create a query
Assume that the bounding box is stored in a variable called cityOfBoulder
To find the tweets that are within the City of Boulder, we can use this query:
- db.tweets.find({"coordinates": { $geoWithin: { $geometry: cityOfBoulder} } })
- It turns out that 30 of the 600K tweets are from our own city
The execution stats show that the index is making the query effecient


"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 30,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 32,
  "totalDocsExamined" : 30,

Final Query

Let's see which of those tweets was created closest to our classroom
I used GeoJSON.io to create a point almost on top of our classroom:


"geometry": {
  "type": "Point",
  "coordinates": [
    -105.26360392570496,
    40.00682620659043
  ]
}

Assume this value is stored in a variable called ClassRoom
I can now ask MongoDB which of the 30 tweets was produced closest to our classroom:
db.tweets.find({"coordinates": {$nearSphere: { $geometry: ClassRoom }}})

Map Reduce

Mongo offers the ability to create new collections from old collections using a technique called Map Reduce
Map Reduce takes two functions: a map function and a reduce function
It can also take a query, if so, it first finds the results of the query and then applies MapReduce to the result set

Map Functions

A map function in MongoDB is a javascript function
This function takes no parameters but it is called such that the this variable points at a document in the input collection
The function can perform calculations if it wants and then emit zero or more documents based on the information it finds
Here's a function that can be used to count up the number of times a screen name appears in a collection


function () {
  emit(this.user.screen_name, {count: 1});
}

Reduce Functions

A reduce function takes a key and a set of produced documents that were previously emited
It must return a document that can be sent back into the reduce function in a subsequent phase
For our example, that means it must generate documents of the form {count: X} where x is a number


function (key, docs) {
  var total = 0;
  for (var i = 0; i < docs.length; i++) {
    total += docs[i].count;
  }
  return {count : total};
}

The Result

Assuming the previous functions are in variables map and reduce, you can run this command like this:
- db.tweets.mapReduce(map, reduce, {out: "users"});
At the end of the operation, you will have a set of documents in a new collection
Each document will have one key (in this case a screen_name) and a value that is equal to the number of times it appeared in the collection

Wrap Up

Hopefully, you now have a good solid introduction to MongoDB
You can use this understanding to now get up to speed with other NoSQL document databases that you encounter

MongoDB, Part Two

Lecture 17 — Data Engineering — Spring 2015

Credit Where Credit Is Due

Overview

Indexes

Terminology

More on Compound Indexes

Sort Order (1)

Sort Order (2)

Sort Order (3)

Example (1)

Example (2)

Example (3)

Example (4)

Example (5)

Back To Sorting (1)

Back to Sorting (2)

Back to Sorting (3)

Indexes Enable Queries (1)

Indexes Enable Queries (2)

Indexes Enable Queries (3)

Other Index Types

Full-Text Indexes

Using Full-Text Indexes

Not a Panacea

The $text operator

Query Syntax

Compound Text Indexes

Geospatial Indexes

GeoJSON Examples

Creating a 2dsphere Index

Creating a Query (1)

Creating a Query (2)

Creating a Query (3)

Creating a Query (4)

Final Query

Map Reduce

Map Functions

Reduce Functions

The Result

Wrap Up