MongoDB, Part Two

Lecture 17 — Data Engineering — Spring 2015

March 10, 2015

Credit Where Credit Is Due

Overview

  • More on Compound Indexes
  • Full-Text Indexes
  • Geospatial Indexes
  • MapReduce

Indexes

  • Last Lecture: Motivated the Use of Indexes
    • Can greatly reduce the number of documents that need to be examined to satisfy a query
    • Indeed, is required in order to sort large collections due to the 32MB limit on in-memory sorting
  • Lot to Learn: We ended with an example where a compound index could not be used to sort its collection in reverse

Terminology

  • Index Cardinality
    • Refers to the number of possible values for an indexed field
    • A field like employment status has low cardinality since it has two values: yes/no.
    • A field like name has high cardinality since each document might have a different value for that field
    • In general, you want indexes on high cardinality fields
      • They have the greatest potential for reducing a query down to a small number of documents

More on Compound Indexes

  • Compound Indexes can be difficult to get right
    • But they are absolutely crucial for performance reasons
  • The key concern is understanding what queries your application makes all the time and then building indexes that match the structure of those queries

Sort Order (1)

  • Last time, we had created a compound index that looked like this:
    • db.tweets.ensureIndex({'user.created_at': 1, 'created_at': -1});
  • This indexes a set of tweets
    • first by the date each user joined Twitter (ascending)
    • and then each user's tweets are sorted from most recent to least recent (descending)

Sort Order (2)

  • Given this index: {'user.created_at': 1, 'created_at': -1}
  • We can sort like this:
    • db.tweets.find().sort({'user.created_at': 1, 'created_at': -1})
  • But we can't sort like this:
    • db.tweets.find().sort({'user.created_at': -1, 'created_at': -1})
  • We get an out-of-memory error

Sort Order (3)

  • It looks like we reversed the sort, but we didn't
    • To truly reverse it, we have to imagine multiplying each part of the compound index by -1
  • So this is the reverse sort of the original:
    • db.tweets.find().sort({'user.created_at': -1, 'created_at': 1})
  • This gives us the oldest tweets of the youngest Twitter user
    • Rather than the youngest tweets of the oldest Twitter user

Example (1)

  • Let's look at an example
  • We will dynamically generate a MongoDB database/collection with the following program:

for (i=0; i<10000000; i++) {
  db.users.insert(
    {
      "i"        : i,
      "username" : "user"+i,
      "age"      : Math.floor(Math.random()*120),
      "created"  : new Date()
    }
  );
}
          
  • This creates a collection with 10M documents
  • Note: this code took a LONG time to run!

Example (2)

  • A collection with ten million documents must be handled with care
    • Without indexes, all queries are slow
    • The collection cannot be sorted except by _id
    • Each new index may be expensive in terms of disk space

Example (3)

  • Without an index, all documents are scanned
  • db.users.find({username: "user101"}).explain("executionStats")

"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 3150,
  "totalKeysExamined" : 0,
  "totalDocsExamined" : 10000000,
  "executionStages" : {
    "stage" : "COLLSCAN",
    "filter" : {
      "username" : {
        "$eq" : "user101"
      }
    },
    "nReturned" : 1,
          

Example (4)

  • We can optimize this particular query by telling Mongo to stop looking as soon as it finds a match
  • db.users.find({username: "user101"}).limit(1).explain("executionStats")

"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 0,
  "totalDocsExamined" : 102,
  "executionStages" : {
    "stage" : "LIMIT",
    "nReturned" : 1,
          

But this strategy won't work for all situations

Example (5)

  • As a result, we use an index
  • db.users.ensureIndex({"username" : 1})
  • db.users.find({username: "user101"}).explain("executionStats")

"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 1,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 1,
  "totalDocsExamined" : 1,
  "executionStages" : {
    "stage" : "FETCH",
    "nReturned" : 1,
          
  • With the index created on username
    • queries just on that attribute are now instantaneous
  • A search on any user is just as fast.

Back To Sorting (1)

  • If we try to sort this collection like this:
    • db.users.find().sort({"age" : 1, "username" : 1})
  • We fail. The existing index cannot be used and we can't sort 10M documents in memory
  • So, we create the compound index:
    • db.users.ensureIndex({"age" : 1, "username" : 1})

Back to Sorting (2)

  • We can now work with those two attributes efficiently

> db.users.find({}, {"age": 1, "username": 1, "_id": 0}).sort({"age" : 1, "username" : 1})
{ "age" : 0, "username" : "user1000088" }
{ "age" : 0, "username" : "user100009" }
{ "age" : 0, "username" : "user1000146" }
{ "age" : 0, "username" : "user1000159" }
{ "age" : 0, "username" : "user100022" }
{ "age" : 0, "username" : "user1000305" }
{ "age" : 0, "username" : "user1000351" }
{ "age" : 0, "username" : "user1000365" }
{ "age" : 0, "username" : "user1000935" }
{ "age" : 0, "username" : "user1001269" }
{ "age" : 0, "username" : "user100139" }
{ "age" : 0, "username" : "user1001551" }
{ "age" : 0, "username" : "user1001582" }
{ "age" : 0, "username" : "user1001665" }
{ "age" : 0, "username" : "user1001801" }
{ "age" : 0, "username" : "user1001895" }
{ "age" : 0, "username" : "user1001946" }
{ "age" : 0, "username" : "user1001964" }
{ "age" : 0, "username" : "user1001972" }
{ "age" : 0, "username" : "user100199" }
          

Back to Sorting (3)

  • And backwards (but no other sort combinations!)

> db.users.find({}, {"age": 1, "username": 1, "_id": 0}).sort({"age" : -1, "username" : -1})
{ "age" : 119, "username" : "user9999865" }
{ "age" : 119, "username" : "user9999844" }
{ "age" : 119, "username" : "user9999671" }
{ "age" : 119, "username" : "user9999618" }
{ "age" : 119, "username" : "user9999317" }
{ "age" : 119, "username" : "user9999151" }
{ "age" : 119, "username" : "user9999061" }
{ "age" : 119, "username" : "user9998918" }
{ "age" : 119, "username" : "user9998884" }
{ "age" : 119, "username" : "user9998757" }
{ "age" : 119, "username" : "user9998700" }
{ "age" : 119, "username" : "user9998691" }
{ "age" : 119, "username" : "user9998613" }
{ "age" : 119, "username" : "user9998520" }
{ "age" : 119, "username" : "user9998509" }
{ "age" : 119, "username" : "user9998470" }
{ "age" : 119, "username" : "user9998333" }
{ "age" : 119, "username" : "user9998119" }
{ "age" : 119, "username" : "user9998026" }
{ "age" : 119, "username" : "user9997947" }
          

Indexes Enable Queries (1)

  • Indexes enable three types of queries
    • Point Queries: search for a single value, then traverse index (either direction)
      • db.users.find({"age" : 21}, {"age": 1, "username": 1, "_id": 0}).sort({"username" : 1})
      • db.users.find({"age" : 21}, {"age": 1, "username": 1, "_id": 0}).sort({"username" : -1})

Indexes Enable Queries (2)

  • Indexes enable three types of queries
    • Multi-Value Query: search for a range of values
      • db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}, {"age": 1, "username": 1, "_id": 0})
    • This is efficient because age is a prefix of the index.
    • Mongo can use that index to narrow down the range of documents

Indexes Enable Queries (3)

  • Indexes enable three types of queries
    • Multi-Value Query With Sort: search for a range of values and then sort
      • db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}, {"age": 1, "username": 1, "_id": 0}).sort({"username": 1})
    • This query is less efficient because it can no longer return documents in index order. The index is ordered by age first, then by username.
    • As a result, Mongo falls back to the original index on username and goes through that index, picking out documents whose age parameter is in the desired range

Other Index Types

  • Full-Text Indexes
  • Geospatial Indexes
  • There are others, but they are out-of-scope

Full-Text Indexes

  • MongoDB has support for full text search
  • To see why it's needed, let's switch back to our snow data set and try to search for a text phrase without an index
  • We can use a regular expression to find, for instance, all tweets that contain the phrase proper shoes.
    • db.tweets.find({"text": /proper shoes/i}, {"_id": 0, "text": 1})
  • Same old story: MongoDB has to look at every document, taking 700 ms; in this case it finds six tweets with the phrase

Using Full-Text Indexes

  • Create the full-text index:
    • db.tweets.ensureIndex( { "text": "text" } );
    • On my machine, this took several minutes!
  • Use the $text operator to query the index
    • db.tweets.find({$text: { $search: "\"proper shoes\""}}, {"_id": 0, "text": 1})
  • In this case, it finds the six documents in 15 ms having reviewed just 2124 of the 625800 tweets (0.3%)!
  • Note: we cannot use regular expressions with the $search operator
  • We had to search for the phrase "proper shoes"
    • Otherwise it would have searched for "proper" or "shoes" and returned 2118 documents!

Not a Panacea

  • In this particular data set, we created it by looking for all tweets that contained the word 'snow'
  • So, if we search the text index for the word 'snow', it is going to return all of the documents!
  • Likewise, if we include snow in any search with other terms, we might reduce the number of documents we look at via the other terms but most likely we will still end up looking at most of the index/documents

The $text operator

  • The $text operator in a find command specifies that you want to use the full-text index
  • There can only be one full-text index per collection
  • If you want to have more than one document attribute included in the index, you specify that up front
    • db.tweets.ensureIndex({"text": "text", "user.description": "text"})

Query Syntax

  • The $search operator expects a string that contains the full-text search you would like to perform
  • A string with individual words is considered an OR query
  • Words surrounded by quotes is an exact phrase search
  • Words that have a hyphen in front of them are considered negations
    • {$search: "snow -day"}: find all tweets that had the word snow in it but did not have the word day in it.

Compound Text Indexes

  • You can create a compound index that can help limit the amount of search you do on the text index
  • db.tweets.ensureIndex({ "user.screen_name": 1, "text": "text"})
  • Now, when you do a search like this...
    • db.tweets.find({"user.screen_name": "kenbod", $text: { $search: "apple ipad iphone watch"}})
  • ... MongoDB will first limit the search to Prof. Anderson's tweets and then look for tweets containing the specified words.

Geospatial Indexes

  • MongoDB supports two types of geospatial indexes
    • We will be looking at just one called the 2dsphere index
  • This index requires that the data in the attribute that is being indexed is in GeoJSON format.
    • Mongo supports GeoJSON points, lines, and polygons

GeoJSON Examples

  • GeoJSON Point
    • { "type": "Point", "coordinates": [100.0, 0.0] }
    • Coordinates are specified [longitude, latitude]
  • GeoJSON LineString
    • { "type": "LineString", "coordinates": [ [100.0, 0.0], [101.0, 1.0] ] }
  • GeoJSON Polygon
    • { "type": "Polygon", "coordinates": [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]]}
    • Coordinates is an array of arrays of points. The first subarray specifies the exterior of the polygon. Any other arrays specify interior holes.

Creating a 2dsphere Index

  • Creating a 2dsphere index is simple (once you know that your values are properly formatted GeoJSON)
  • In our tweets collection, the "coordinates" attribute is formatted in this way
  • The entire collection has 625,800 tweets
    • Of these, 17,242 are geo-tagged
    • I used this query to discover that number: db.tweets.find({'coordinates': {$exists: true}}).count()
  • To create the index:
    • db.tweets.ensureIndex({"coordinates": "2dsphere"})

Creating a Query (1)

  • In most cases, you want to specify a bounding box for the queries that you will make against this index
  • The available query operators are:
    • $geoWithin: Find objects that are completely surrounded by a given bounding box
    • $geoIntersects: Find objects that intersect a given bounding box
    • $near and $nearSphere: Find objects that are near a given point. The two operators differ by how the calculation is performed. They return objects from nearest to farthest
      • $minDistance and $maxDistance (both specified in meters) can be used to constrain the search

Creating a Query (2)

Creating a Query (3)

  • Using that website, I created this bounding box for the City of Boulder

"geometry": {
  "type": "Polygon",
  "coordinates": [
    [
      [
        -105.2548599243164,
        39.957911932580835
      ],
      [
        -105.29022216796875,
        39.989221102071994
      ],
      [
        -105.29193878173828,
        40.063621419947914
      ],
      [
        -105.23975372314453,
        40.05337319344778
      ],
      [
        -105.22361755371092,
        39.98027708862263
      ],
      [
        -105.2548599243164,
        39.957911932580835
      ]
    ]
  ]
}
          

Creating a Query (4)

  • We can now create a query
  • Assume that the bounding box is stored in a variable called cityOfBoulder
  • To find the tweets that are within the City of Boulder, we can use this query:
    • db.tweets.find({"coordinates": { $geoWithin: { $geometry: cityOfBoulder} } })
    • It turns out that 30 of the 600K tweets are from our own city
  • The execution stats show that the index is making the query effecient

"executionStats" : {
  "executionSuccess" : true,
  "nReturned" : 30,
  "executionTimeMillis" : 0,
  "totalKeysExamined" : 32,
  "totalDocsExamined" : 30,
          

Final Query

  • Let's see which of those tweets was created closest to our classroom
  • I used GeoJSON.io to create a point almost on top of our classroom:

"geometry": {
  "type": "Point",
  "coordinates": [
    -105.26360392570496,
    40.00682620659043
  ]
}
          
  • Assume this value is stored in a variable called ClassRoom
  • I can now ask MongoDB which of the 30 tweets was produced closest to our classroom:
  • db.tweets.find({"coordinates": {$nearSphere: { $geometry: ClassRoom }}})

Map Reduce

  • Mongo offers the ability to create new collections from old collections using a technique called Map Reduce
  • Map Reduce takes two functions: a map function and a reduce function
  • It can also take a query, if so, it first finds the results of the query and then applies MapReduce to the result set

Map Functions

  • A map function in MongoDB is a javascript function
  • This function takes no parameters but it is called such that the this variable points at a document in the input collection
  • The function can perform calculations if it wants and then emit zero or more documents based on the information it finds
  • Here's a function that can be used to count up the number of times a screen name appears in a collection
  • 
    function () {
      emit(this.user.screen_name, {count: 1});
    }
                

Reduce Functions

  • A reduce function takes a key and a set of produced documents that were previously emited
  • It must return a document that can be sent back into the reduce function in a subsequent phase
  • For our example, that means it must generate documents of the form {count: X} where x is a number

function (key, docs) {
  var total = 0;
  for (var i = 0; i < docs.length; i++) {
    total += docs[i].count;
  }
  return {count : total};
}
          

The Result

  • Assuming the previous functions are in variables map and reduce, you can run this command like this:
    • db.tweets.mapReduce(map, reduce, {out: "users"});
  • At the end of the operation, you will have a set of documents in a new collection
  • Each document will have one key (in this case a screen_name) and a value that is equal to the number of times it appeared in the collection

Wrap Up

  • Hopefully, you now have a good solid introduction to MongoDB
  • You can use this understanding to now get up to speed with other NoSQL document databases that you encounter