MongoDB Indexes
Lecture 16 — Data Engineering — Spring 2015
March 5, 2015
Overview
- Import tweets into MongoDB
- Perform queries with no indexes
- Create indexes
- Perform queries with indexes to compare
- Advanced indexes: compound indexes
Importing Tweets
- Let's construct a program for importing tweets into MongoDB
- Assumptions:
- Input file: tweet objects in JSON format, one per line
- Name of Database: data
- Name of Collection: tweets
- We will use the mongo Ruby gem for accessing MongoDB
Getting Started
mkdir import_tweets
cd import_tweets
vi Gemfile
The contents of Gemfile should look like this:
source 'https://rubygems.org'
gem 'mongo'
Then type the following:
bundle install
vi import.rb
Initial Version
The first version of import.rb simply reads its input file.
require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
  end
end
Note: The program doesn't do anything yet!
Connecting to Mongo
- Using the 'mongo' Ruby Gem
| Step      | Command                 | Description                             |
|-----------|-------------------------|-----------------------------------------|
| First     | include Mongo           | Import the Mongo module                 |
| Second    | mongo = MongoClient.new | Connect to MongoDB                      |
| Third     | db = mongo.db('data')   | Get a handle to the 'data' database     |
| Fourth    | tweets = db['tweets']   | Get a handle to the 'tweets' collection |
| When done | mongo.close             | Close the connection to MongoDB         |
Back to Import.rb
require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'
include Mongo
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo = MongoClient.new
  db = mongo.db('data')
  tweets = db['tweets']
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
  end
  mongo.close
end
Still does not do anything, except crash if you run it!
Why? It needs MongoDB to be running now.
Running/Checking MongoDB
- In order for our program to run, MongoDB needs to be running
- mongod --config /usr/local/etc/mongod.conf
- OR: mongod --dbpath <path to data directory>
- Then launch the mongo client with: mongo
- Type: show dbs to see that the 'data' db does not exist
Before We Import (1)
- Before we import tweets into MongoDB
- We should clean up our data
- In our JSON objects, most of the values are strings
- That works fine for most attributes, but not for created_at
- created_at is a time stamp and looks like this:
- "Sun Feb 15 02:41:32 +0000 2015"
- In order to do queries by time, we need to store this value not as a string but as a time object
Before We Import (2)
- Ruby makes the conversion easy
- t = DateTime.parse("Sun Feb 15 02:41:32 +0000 2015")
- return t.to_time.utc
- Note: to_time converts the timestamp to your local time zone; the call to utc converts it back to UTC (see the snippet below)
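Run on its own, the conversion looks like this (using the sample timestamp from the previous slide):

require 'date'
require 'time'

t = DateTime.parse("Sun Feb 15 02:41:32 +0000 2015")
puts t.to_time.utc   # prints: 2015-02-15 02:41:32 UTC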
Before We Import (3)
- There are four places where created_at can be found on a tweet
- tweet['created_at']
- tweet['user']['created_at']
- tweet['retweeted_status']['created_at']
- tweet['retweeted_status']['user']['created_at']
- Not all of these locations exist for each tweet
Before We Import (4)
- The full conversion code is out of scope for these slides (a sketch appears below)
- The key thing is that before the conversion, the date looks like this:
- "Sun Feb 15 02:41:32 +0000 2015"
- After the conversion, the date looks like this:
- 2015-02-15 02:41:32 UTC
- That is the printed form of Ruby's built-in Time object, which is what we want
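For the curious, here is a minimal sketch of what a convert_created_at helper might look like. This is a hypothetical reconstruction that handles the four locations listed on the previous slide, not necessarily the version in the final source code:

def convert_created_at(tweet)
  # Hypothetical sketch: replace created_at strings with UTC Time objects
  # in each of the four places the field can appear on a tweet
  fix = lambda do |hash|
    return unless hash && hash['created_at'].is_a?(String)
    hash['created_at'] = DateTime.parse(hash['created_at']).to_time.utc
  end
  fix.call(tweet)
  fix.call(tweet['user'])
  fix.call(tweet['retweeted_status'])
  fix.call(tweet['retweeted_status'] && tweet['retweeted_status']['user'])
end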
Before We Import (5)
- The other aspect of the data that we need to clean is the coordinates field
- We will use this field to do geospatial queries in Mongo
Before We Import (6)
The coordinates field takes one of three forms:
- null: true for the vast majority of tweets
- Empty:
{
  "type": "Point",
  "coordinates": [ 0, 0 ]
}
- Valid:
{
  "type": "Point",
  "coordinates": [ 2.35247594, 48.86454076 ]
}
Before We Import (7)
- The problem: Mongo is finicky.
- It can't handle the null values in its geospatial index, so we have to get rid of them
- The empty values don't make sense and would just confuse our queries, so we need to get rid of them too
- Reflection: This type of cleanup and manipulation happens ALL THE TIME when working with data sets (big or small)
Before We Import (8)
Fortunately, the code to do this is straightforward
def remove_coordinates_if_needed(tweet)
  # Remove the field entirely when it is null
  if tweet['coordinates'].nil?
    tweet.delete('coordinates')
    return
  end
  # Also remove the meaningless (0, 0) points
  value = tweet['coordinates']['coordinates'][0]
  if value == 0
    tweet.delete('coordinates')
  end
end
Back to Import.rb
- Our main routine now looks like this
if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo = MongoClient.new
  db = mongo.db('data')
  tweets = db['tweets']
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
    convert_created_at(tweet)
    remove_coordinates_if_needed(tweet)
  end
  mongo.close
end
The Actual Import...
- All we need to do now is actually put the tweet into the database
- The code to do that is underwhelming:
tweets.insert( tweet )
- The code is so simple because Ruby Hashes convert easily to BSON (MongoDB's binary form of JSON), which MongoDB stores natively
- The Hash is essentially MongoDB's native document format
Last Minute Additions
- I also added a function to the import script to print out our progress through the input file
- It's a more advanced version of the function I included in the CouchDB import script
- Details are available in the final source code
- Note: I haven't discussed indexes yet, that's next
- Let's import the big_data_tweets.json file that I created for the CouchDB lecture
- ruby import.rb big_data_tweets.json
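Putting the pieces together, the assembled script looks roughly like this. It assumes the two helper functions defined earlier, and the progress output is a simplified stand-in for the version in the final source:

require 'bundler/setup'
require 'date'
require 'json'
require 'mongo'
require 'time'

include Mongo

if __FILE__ == $0
  STDOUT.sync = true
  input_file = ARGV[0]
  mongo  = MongoClient.new
  db     = mongo.db('data')
  tweets = db['tweets']
  count  = 0
  IO.foreach(input_file) do |line|
    tweet = JSON.parse(line.chomp)
    convert_created_at(tweet)
    remove_coordinates_if_needed(tweet)
    tweets.insert(tweet)                                      # store the cleaned tweet
    count += 1
    puts "#{count} tweets imported" if count % 10_000 == 0    # simplified progress output
  end
  mongo.close
end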
After the Import
- After the import, we can learn a few things using the mongo command line tool
$ mongo
> show dbs
data 0.953GB
> use data
switched to db data
> show collections
system.indexes
tweets
> db.tweets.count()
100681
> db.tweets.findOne()
<displays a tweet object>
Why are indexes helpful? (1)
- Let's see if we can demonstrate the value of indexes
- Currently, we have no user-defined indexes created for this collection
- Let's say we want to see the tweets with the highest retweet_count
db.tweets.find().sort({'retweet_count':-1})
error: {
  "$err" : "Runner error: Overflow sort stage buffered data usage of 33557570 bytes exceeds internal limit of 33554432 bytes",
  "code" : 17144
}
Hmm. Not helpful.
Why are indexes helpful? (2)
- Let's see if we can find the highest retweeted tweets by specifying a query
db.tweets.find({'retweet_count': {'$gt': 99}}).sort({'retweet_count':-1})
<short pause>
<tweet object with "retweet_count" : 2121>
... more tweets ...
- That worked but let's see what's happening behind the scenes.
- We can ask Mongo to explain the query
Why are indexes helpful? (3)
- The explain() function will return information about how MongoDB will process any given query
> db.tweets.find({'retweet_count': {'$gt': 99}}).explain()
{
  "cursor" : "BasicCursor",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 100681,
  "nscanned" : 100681,
  "nscannedObjectsAllPlans" : 100681,
  "nscannedAllPlans" : 100681,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 786,
  "nChunkSkips" : 0,
  "millis" : 143,
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
What did that mean?
- The key fields in that output are cursor, n, and nscannedObjects
- Mongo is telling us the following things:
- Cursor == "BasicCursor"
- n == 827
- nscannedObjects == 100681
- In English: "There are 827 tweets that have a retweet_count greater than 99. I had to look at every tweet in the database to discover this!"
- The short pause after we submitted this query was the result of waiting for MongoDB to scan every object in the database
Indexes to the Rescue!
- To solve these problems, we need an index
- An index will allow us to sort tweets and to make our queries much faster (especially as we get more data)
- The cost? Disk Space and Memory: indexes take more disk space to store and memory to access
- Additional cost? Indexes can be tricky to set up so that your queries run as fast as possible.
- We will work with indexes in the Mongo client first
- We will add index creation code to import.rb later
Creating an Index
- To create an index, we use the ensureIndex command on our collection object. (Indexes are stored on collections, not databases.)
- It takes a Javascript object that specifies the keys (attributes) to index on and the direction (ascending or descending) to sort the values
- In our case, we want to index the retweet_count field
db.tweets.ensureIndex({'retweet_count': 1})
- The '1' means ascending. If we wanted descending, we would supply '-1'
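The mongo gem exposes the same operation. Assuming the 1.x driver API used by import.rb (this is where the planned index creation code would go), the Ruby equivalent would look something like this:

# Sketch: the same ascending index, created via the Ruby driver (1.x API)
tweets.ensure_index([['retweet_count', Mongo::ASCENDING]])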
Disclaimer
- My slides are showing the output for MongoDB 2.X
- I inadvertently upgraded to MongoDB 3.0 while writing this lecture
- To my horror, they changed the behavior of the explain() command!
- To get similar output as to what I show in these slides, you must invoke explain with the following parameter: "executionStats"
- I tell you, when creating lectures, something like this always comes up!
The Results? (1)
> db.tweets.find({'retweet_count': {'$gt': 99}}).explain()
{
  "cursor" : "BtreeCursor retweet_count_1",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 827,
  "nscanned" : 827,
  "nscannedObjectsAllPlans" : 827,
  "nscannedAllPlans" : 827,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 6,
  "nChunkSkips" : 0,
  "millis" : 4,
  "indexBounds" : {
    "retweet_count" : [ [ 99, Infinity ] ]
  },
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
Nice!
The Results? (2)
- Our query now uses a BtreeCursor on the newly created index
- The answer to the query is still 827 tweets
- But this time, it scanned only 827 index entries (nscanned) and only 827 documents (nscannedObjects)
- It had to scan the documents because our query implicitly asked for the entire document to be returned.
The Results? (3)
- Look what happens if we only ask for the retweet_count of each tweet
- Note: we have to tell MongoDB to not return the _id attribute for this to work
> db.tweets.find({'retweet_count': {'$gt': 99}}, {'retweet_count': 1, '_id': 0}).explain()
{
  "cursor" : "BtreeCursor retweet_count_1",
  "isMultiKey" : false,
  "n" : 827,
  "nscannedObjects" : 0,
  "nscanned" : 827,
  "nscannedObjectsAllPlans" : 0,
  "nscannedAllPlans" : 827,
  "scanAndOrder" : false,
  "indexOnly" : true,
  "nYields" : 6,
  "nChunkSkips" : 0,
  "millis" : 1,
  "indexBounds" : {
    "retweet_count" : [ [ 99, Infinity ] ]
  },
  "server" : "Resolution.local:27017",
  "filterSet" : false
}
The Results? (4)
- In this case, it scanned just the index (827 entries) but no documents (nscannedObjects == 0)
- That's because all of the information it needed to return was located in the index.
- No need to even look at the associated documents!
The Results? (5)
- Can we do our sort now?
- The one that failed previously?
db.tweets.find().sort({'retweet_count':-1})
<All 100,681 tweets go here, sorted by descending retweet_count>
db.tweets.find().sort({'retweet_count':1})
<The same tweets go here in ascending order>
Yep!
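The same sorted query can be issued through the Ruby driver as well. Assuming the 1.x gem's Cursor#sort, a sketch:

# Sketch: iterate tweets by descending retweet_count via the Ruby driver
tweets.find.sort([['retweet_count', Mongo::DESCENDING]]).each do |tweet|
  puts "#{tweet['retweet_count']}: #{tweet['text']}"
end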
The Results? (6)
- The reason for this is that the index has already done the work sorting the documents by retweet_count
- MongoDB can just use the index to return documents in sorted order
- Why did it fail before?
- Without the index, MongoDB would need to read all of the documents into memory, then sort them by 'retweet_count' and then present the documents in sorted order
- The problem? MongoDB imposes a limit of 32 MB for doing in-memory sorts
- Our 100K tweets take up way more space than that!
Compound Indexes
- Let's say we wanted to sort and query on two attributes
- Example: I want to see the tweets first sorted by the user's created_at date and then by reverse order of the tweet's created_at date
- This would show me the most recent tweets from the longest-standing Twitter accounts in my data set first.
- If I created an index on just one of these attributes, I wouldn't get what I want
- An index on tweet['created_at'] would just let me sort and query tweets based on when they were created, independent of user
- An index on tweet['user']['created_at'] would provide me with the tweets of the oldest (or youngest) Twitter user in my data set but with no guarantee on the order of their individual tweets
Quick Test Before Index Creation
- db.tweets.find().sort({'user.created_at': 1, 'created_at': -1}): FAIL (same in-memory sort limit as before)
> before2012 = new Date()
> before2012.setTime(Date.parse("Jan 1, 2012"))
> db.tweets.find({'user.created_at': {'$lt': before2012}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 44845,
"executionTimeMillis" : 280,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100681,
...
> afterFeb10 = new Date()
> afterFeb10.setTime(Date.parse("Feb 10, 2015"))
> db.tweets.find({'user.created_at': {'$lt': before2012}, 'created_at': {'$gt': afterFeb10}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 30885,
"executionTimeMillis" : 122,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100681,
FAIL
Creating the Compound Index
- db.tweets.ensureIndex({'user.created_at': 1, 'created_at': -1});
- The Results?
> db.tweets.find({'user.created_at': {'$lt': before2012}, 'created_at': {'$gt': afterFeb10}}).explain("executionStats")
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 30885,
"executionTimeMillis" : 63,
"totalKeysExamined" : 39933,
"totalDocsExamined" : 30885,
BOOM!
The sort command on the previous slide works as well, since it simply serves documents through the index.
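From Ruby, and again assuming the 1.x gem API, the same compound index could be created with something like:

# Sketch: compound index on user.created_at (ascending), created_at (descending)
tweets.ensure_index([['user.created_at', Mongo::ASCENDING],
                     ['created_at',      Mongo::DESCENDING]])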
More to Learn
- We have now covered the basics of MongoDB indexes
- There is more to learn.
- As you determine what queries you will run on your MongoDB collection, use the explain() command to check whether they will use an index
- If they don't, you then have to figure out how to create an index that will make the query fast
- The trade-off, as before, is disk space and memory
Coming Up Next
- Next week, I will cover additional MongoDB indexes and capabilities
- Geospatial indexes
- Full-text indexes
- Map-reduce transformations on collections