MongoDB Indexes
Lecture 16 — Data Engineering — Spring 2015
March 5, 2015
Overview
- Import tweets into MongoDB
- Perform queries with no indexes
- Create indexes
- Perform queries with indexes to compare
- Advanced indexes: compound indexes
Importing Tweets
- Let's construct a program for importing tweets into MongoDB
- Assumptions:
- Input file: tweet objects in JSON format, one per line
- Name of Database: data
- Name of Collection: tweets
- We will use the mongo Ruby gem for accessing MongoDB
Getting Started
mkdir import_tweets
cd import_tweets
vi Gemfile
The contents of Gemfile should look like this:
source 'https://rubygems.org'
gem 'mongo'
Then type the following:
bundle install
vi import.rb
Initial Version
The first version of import.rb simply reads its input file.
require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
  end
end
Note: The program doesn't do anything yet!
Connecting to Mongo
- Using the 'mongo' Ruby Gem
| Step      | Command                 | Description                             |
|-----------|-------------------------|-----------------------------------------|
| First     | include Mongo           | Import the Mongo module                 |
| Second    | mongo = MongoClient.new | Connect to MongoDB                      |
| Third     | db = mongo.db('data')   | Get a handle to the 'data' database     |
| Fourth    | tweets = db['tweets']   | Get a handle to the 'tweets' collection |
| When done | mongo.close             | Close the connection to MongoDB         |
Back to Import.rb
require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'
include Mongo
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo = MongoClient.new
  db = mongo.db('data')
  tweets = db['tweets']
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
  end
  mongo.close
end
Still does not do anything, except crash if you run it!
Why? It needs MongoDB to be running now.
Running/Checking MongoDB
- In order for our program to run, MongoDB needs to be running
- mongod --config /usr/local/etc/mongod.conf
- OR: mongod --dbpath <path to data directory>
- Then launch the mongo client with: mongo
- Type: show dbs to see that the 'data' db does not exist
Before We Import (1)
- Before we import tweets into MongoDB
- We should clean up our data
- In our JSON objects, most of the values are strings
- That works fine for most attributes, but not for created_at
- created_at is a time stamp and looks like this:
- "Sun Feb 15 02:41:32 +0000 2015"
- In order to do queries by time, we need to store this value not as a string but as a time object
Before We Import (2)
- Ruby makes the conversion easy
- t = DateTime.parse("Sun Feb 15 02:41:32 +0000 2015")
- return t.to_time.utc
- Note: to_time converts the timestamp to your local time zone; the call to utc converts it back to UTC (see the snippet below)
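Run on its own, the conversion looks like this (using the sample timestamp from the previous slide):

require 'date'
require 'time'

t = DateTime.parse("Sun Feb 15 02:41:32 +0000 2015")
puts t.to_time.utc   # prints: 2015-02-15 02:41:32 UTC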
Before We Import (3)
- There are four places where created_at can be found on a tweet
- tweet['created_at']
- tweet['user']['created_at']
- tweet['retweeted_status']['created_at']
- tweet['retweeted_status']['user']['created_at']
- Not all of these locations exist for each tweet
Before We Import (4)
- The full conversion code is out of scope for these slides (a sketch appears below)
- The key thing is that before the conversion, the date looks like this:
- "Sun Feb 15 02:41:32 +0000 2015"
- After the conversion, the date looks like this:
- 2015-02-15 02:41:32 UTC
- That is the printed form of Ruby's built-in Time object, which is what we want
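For the curious, here is a minimal sketch of what a convert_created_at helper might look like. This is a hypothetical reconstruction that handles the four locations listed on the previous slide, not necessarily the version in the final source code:

def convert_created_at(tweet)
  # Hypothetical sketch: replace created_at strings with UTC Time objects
  # in each of the four places the field can appear on a tweet
  fix = lambda do |hash|
    return unless hash && hash['created_at'].is_a?(String)
    hash['created_at'] = DateTime.parse(hash['created_at']).to_time.utc
  end
  fix.call(tweet)
  fix.call(tweet['user'])
  fix.call(tweet['retweeted_status'])
  fix.call(tweet['retweeted_status'] && tweet['retweeted_status']['user'])
end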
Before We Import (5)
- The other aspect of the data that we need to clean is the coordinates field
- We will use this field to do geospatial queries in Mongo
Before We Import (6)
The coordinates field takes one of three forms:
- null: true for the vast majority of tweets
- Empty:
{
  "type": "Point",
  "coordinates": [ 0, 0 ]
}
- Valid:
{
  "type": "Point",
  "coordinates": [ 2.35247594, 48.86454076 ]
}
Before We Import (7)
- The problem: Mongo is finicky.
- It can't handle the null values in its geospatial index, so we have to get rid of them
- The empty values don't make sense and would just confuse our queries, so we need to get rid of them too
- Reflection: This type of cleanup and manipulation happens ALL THE TIME when working with data sets (big or small)
Before We Import (8)
Fortunately, the code to do this is straightforward
def remove_coordinates_if_needed(tweet)
  # Remove the field entirely when it is null
  if tweet['coordinates'].nil?
    tweet.delete('coordinates')
    return
  end
  # Also remove the meaningless (0, 0) points
  value = tweet['coordinates']['coordinates'][0]
  if value == 0
    tweet.delete('coordinates')
  end
end
Back to Import.rb
- Our main routine now looks like this
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo = MongoClient.new
  db = mongo.db('data')
  tweets = db['tweets']
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
    convert_created_at(tweet)
    remove_coordinates_if_needed(tweet)
  end
  mongo.close
end
The Actual Import...
- All we need to do now is actually put the tweet into the database
- The code to do that is underwhelming:
tweets.insert( tweet )
- The code is so simple because Ruby Hashes convert easily to BSON (MongoDB's binary form of JSON), which MongoDB stores natively
- The Hash is essentially MongoDB's native document format
Last Minute Additions
- I also added a function to the import script to print out our progress through the input file
- It's a more advanced version of the function I included in the CouchDB import script
- Details are available in the final source code
- Note: I haven't discussed indexes yet, that's next
- Let's import the big_data_tweets.json file that I created for the CouchDB lecture
- ruby import.rb big_data_tweets.json
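Putting the pieces together, the assembled script looks roughly like this. It assumes the two helper functions defined earlier, and the progress output is a simplified stand-in for the version in the final source:

require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'

include Mongo

if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo  = MongoClient.new
  db     = mongo.db('data')
  tweets = db['tweets']
  count  = 0
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
    convert_created_at(tweet)
    remove_coordinates_if_needed(tweet)
    tweets.insert(tweet)                                      # store the cleaned tweet
    count += 1
    puts "#{count} tweets imported" if count % 10_000 == 0    # simplified progress output
  end
  mongo.close
end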
After the Import
- After the import, we can learn a few things using the mongo command line tool
$ mongo
> show dbs
data 0.953GB
> use data
switched to db data
> show collections
system.indexes
tweets
> db.tweets.count()
100681
> db.tweets.findOne()
<displays a tweet object>
Why are indexes helpful? (1)
- Let's see if we can demonstrate the value of indexes
- Currently, we have no user-defined indexes created for this collection
- Let's say we want to see the tweets with the highest retweet_count
db.tweets.find().sort({'retweet_count':-1})
error: {
  "$err" : "Runner error: Overflow sort stage buffered data usage of 33557570 bytes exceeds internal limit of 33554432 bytes",
  "code" : 17144
}
Hmm. Not helpful.
Why are indexes helpful? (2)
- Let's see if we can find the highest retweeted tweets by specifying a query
db.tweets.find({'retweet_count': {'$gt': 99}}).sort({'retweet_count':-1})
<short pause>
<tweet object with "retweet_count" : 2121>
... more tweets ...
- That worked but let's see what's happening behind the scenes.
- We can ask Mongo to explain the query
Why are indexes helpful? (3)
- The explain() function will return information about how MongoDB will process any given query
> db.tweets.find({'retweet_count': {'$gt': 99}}).explain()
{
  "cursor" : "BasicCursor",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 100681,
  "nscanned" : 100681,
  "nscannedObjectsAllPlans" : 100681,
  "nscannedAllPlans" : 100681,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 786,
  "nChunkSkips" : 0,
  "millis" : 143,
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
What did that mean?
- The key fields in that output are cursor, n, and nscannedObjects
- Mongo is telling us the following things:
- Cursor == "BasicCursor"
- n == 827
- nscannedObjects == 100681
- In English: "There are 827 tweets that have a retweet_count greater than 99. I had to look at every tweet in the database to discover this!"
- The short pause after we submitted this query was the result of waiting for MongoDB to scan every object in the database
Indexes to the Rescue!
- To solve these problems, we need an index
- An index will allow us to sort tweets and to make our queries much faster (especially as we get more data)
- The cost? Disk Space and Memory: indexes take more disk space to store and memory to access
- Additional cost? Indexes can be tricky to set up so that your queries run as fast as possible.
- We will work with indexes in the Mongo client first
- We will add index creation code to import.rb later
Creating an Index
- To create an index, we use the ensureIndex command on our collection object. (Indexes are stored on collections, not databases.)
- It takes a Javascript object that specifies the keys (attributes) to index on and the direction (ascending or descending) to sort the values
- In our case, we want to index the retweet_count field
db.tweets.ensureIndex({'retweet_count': 1})
- The '1' means ascending. If we wanted descending, we would supply '-1'
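The mongo gem exposes the same operation. Assuming the 1.x driver API used by import.rb (this is where the planned index creation code would go), the Ruby equivalent would look something like this:

# Sketch: the same ascending index, created via the Ruby driver (1.x API)
tweets.ensure_index([['retweet_count', Mongo::ASCENDING]])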
Disclaimer
- My slides are showing the output for MongoDB 2.X
- I inadvertently upgraded to MongoDB 3.0 while writing this lecture
- To my horror, they changed the behavior of the explain() command!
- To get similar output as to what I show in these slides, you must invoke explain with the following parameter: "executionStats"
- I tell you, when creating lectures, something like this always comes up!
The Results? (1)
> db.tweets.find({'retweet_count': {'$gt': 99}}).explain()
{
  "cursor" : "BtreeCursor retweet_count_1",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 827,
  "nscanned" : 827,
  "nscannedObjectsAllPlans" : 827,
  "nscannedAllPlans" : 827,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 6,
  "nChunkSkips" : 0,
  "millis" : 4,
  "indexBounds" : {
    "retweet_count" : [ [ 99, Infinity ] ]
  },
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
Nice!
The Results? (2)
- Our query now uses a BtreeCursor on the newly created index
- The answer to the query is still 827 tweets
- But this time, it scanned only 827 index entries (nscanned) and only 827 documents (nscannedObjects)
- It had to scan the documents because our query implicitly asked for the entire document to be returned.
The Results? (3)
- Look what happens if we only ask for the retweet_count of each tweet
- Note: we have to tell MongoDB to not return the _id attribute for this to work
> db.tweets.find({'retweet_count': {'$gt': 99}}, {'retweet_count': 1, '_id': 0}).explain()
{
  "cursor" : "BtreeCursor retweet_count_1",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 0,
  "nscanned" : 827,
  "nscannedObjectsAllPlans" : 0,
  "nscannedAllPlans" : 827,
  "scanAndOrder" : false,
  "indexOnly" : true,
  "nYields" : 6,
  "nChunkSkips" : 0,
  "millis" : 1,
  "indexBounds" : {
    "retweet_count" : [ [ 99, Infinity ] ]
  },
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
The Results? (4)
- In this case, it scanned just the index (827 entries) but no documents (nscannedObjects == 0)
- That's because all of the information it needed to return was located in the index.
- No need to even look at the associated documents!
The Results? (5)
- Can we do our sort now?
- The one that failed previously?
db.tweets.find().sort({'retweet_count':-1})
<All 100,681 tweets go here, sorted by descending retweet_count>
db.tweets.find().sort({'retweet_count':1})
<The same tweets go here in ascending order>
Yep!
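The same sorted query can be issued through the Ruby driver as well. Assuming the 1.x gem's Cursor#sort, a sketch:

# Sketch: iterate tweets by descending retweet_count via the Ruby driver
tweets.find.sort([['retweet_count', Mongo::DESCENDING]]).each do |tweet|
  puts "#{tweet['retweet_count']}: #{tweet['text']}"
end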
The Results? (6)
- The reason for this is that the index has already done the work sorting the documents by retweet_count
- MongoDB can just use the index to return documents in sorted order
- Why did it fail before?
- Without the index, MongoDB would need to read all of the documents into memory, then sort them by 'retweet_count' and then present the documents in sorted order
- The problem? MongoDB imposes a limit of 32 MB for doing in-memory sorts
- Our 100K tweets take up way more space than that!
Compound Indexes
- Let's say we wanted to sort and query on two attributes
- Example: I want to see the tweets first sorted by the user's created_at date and then by reverse order of the tweet's created_at date
- This would show me the most recent tweets from the longest-standing Twitter accounts in my data set first.
- If I created an index on just one of these attributes, I wouldn't get what I want
- An index on tweet['created_at'] would just let me sort and query tweets based on when they were created, independent of user
- An index on tweet['user']['created_at'] would provide me with the tweets of the oldest (or youngest) Twitter user in my data set but with no guarantee on the order of their individual tweets
Quick Test Before Index Creation
- db.tweets.find().sort({'user.created_at': 1, 'created_at': -1}): FAIL (same in-memory sort limit as before)
> before2012 = new Date()
> before2012.setTime(Date.parse("Jan 1, 2012"))
> db.tweets.find({'user.created_at': {'$lt': before2012}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 44845,
"executionTimeMillis" : 280,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100681,
...
> afterFeb10 = new Date()
> afterFeb10.setTime(Date.parse("Feb 10, 2015"))
> db.tweets.find({'user.created_at': {'$lt': before2012}, 'created_at': {'$gt': afterFeb10}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 30885,
"executionTimeMillis" : 122,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100681,
FAIL
Creating the Compound Index
- db.tweets.ensureIndex({'user.created_at': 1, 'created_at': -1});
- The Results?
> db.tweets.find({'user.created_at': {'$lt': before2012}, 'created_at': {'$gt': afterFeb10}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 30885,
"executionTimeMillis" : 63,
"totalKeysExamined" : 39933,
"totalDocsExamined" : 30885,
BOOM!
The sort command on the previous slide works as well, since it simply serves documents through the index.
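From Ruby, and again assuming the 1.x gem API, the same compound index could be created with something like:

# Sketch: compound index on user.created_at (ascending), created_at (descending)
tweets.ensure_index([['user.created_at', Mongo::ASCENDING],
                     ['created_at',      Mongo::DESCENDING]])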
More to Learn
- We have now covered the basics of MongoDB indexes
- There is more to learn.
- As you determine what queries you will run on your MongoDB collection, use the explain() command to check whether they will use an index
- If they don't, you then have to figure out how to create an index that will make the query fast
- The trade-off, as before, is disk space and memory
Coming Up Next
- Next week, I will cover additional MongoDB indexes and capabilities
- Geospatial indexes
- Full-text indexes
- Map-reduce transformations on collections