Getting Data from Twitter
Take Two
Lecture 10 — Data Engineering — Spring 2015
February 12, 2015
Consumer Keys, Access Tokens
- Access to Twitter's APIs is secured by OAuth
- OAuth is a Web-based security protocol that allows a web service to associate service invocations with users and applications.
- With Twitter, a consumer key identifies a particular application (and developer).
- An access token identifies a particular user.
Consumer Keys, Access Tokens
- OAuth allows an application to act on behalf of many users.
- You might ship your application with its consumer keys but with no access tokens.
- When a user launches your app for the first time, you send them to Twitter to log in and grant the app access to their account.
- Twitter will then send an access token/secret for your application to store on behalf of the user.
Consumer Keys, Access Tokens
- With this information, you can do things like "post this tweet on behalf of user kenbod", even though you otherwise are not connected with that user in any way.
- When you create an app on Twitter and generate an access token for yourself, you short-circuit this Web-based authentication process.
- This allows you to quickly create apps that focus on data collection.
oauth.properties
- The first step in accessing Twitter data via the REST, Search, and Streaming APIs is to create a file called oauth.properties
- It should have a single JSON object inside with these properties:

- Fill in the fields with your own values.
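The property list itself is not reproduced in these notes. As a sketch only, assuming the file uses the four option names that the simple_oauth gem reads (consumer_key, consumer_secret, token, token_secret), it might look like the JSON embedded below. The values are placeholders, and missing_keys is a hypothetical helper for checking a file before use.

```ruby
require 'json'

# Assumed layout of oauth.properties: one JSON object with four string fields.
# The key names mirror simple_oauth's option names; the values are placeholders.
SAMPLE_PROPERTIES = <<JSON
{
  "consumer_key": "YOUR_CONSUMER_KEY",
  "consumer_secret": "YOUR_CONSUMER_SECRET",
  "token": "YOUR_ACCESS_TOKEN",
  "token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
JSON

REQUIRED_KEYS = %w[consumer_key consumer_secret token token_secret].freeze

# Hypothetical helper: returns the required keys that are absent or empty.
def missing_keys(json_text)
  props = JSON.parse(json_text)
  REQUIRED_KEYS.reject { |key| props[key].is_a?(String) && !props[key].empty? }
end

p missing_keys(SAMPLE_PROPERTIES)          # an empty list means nothing is missing
p missing_keys('{"consumer_key": "abc"}')  # lists the keys still to fill in
```

Because the file holds secrets, it is usually kept out of version control.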
Get Tweets
- Let's start with a program that retrieves a single page of tweets for a given user (identified by screen name).
- We'll expand that program to get multiple pages later.
- The maximum Twitter will let you get is the last 3200 tweets of a given user.
- (That's 16 pages of 200 tweets each.)
- Create a directory called get_tweets
- Place a copy of oauth.properties in that directory.
Prerequisites
- You need to have access to these commands
- My recommendation:
- Use rbenv to manage ruby on your machine!
rbenv set-up (optional)
- Once you have rbenv and ruby-build installed:
- rbenv install 2.2.0
- rbenv global 2.2.0
- gem install bundler
- rbenv rehash
Verify Installation (optional)
| Command | Output |
| rbenv version | 2.2.0 (set by $HOME/.rbenv/version) |
| ruby --version | ruby 2.2.0p0 ... |
| which bundler | $HOME/.rbenv/shims/bundler |
Back to get_tweets
- Head back to the get_tweets directory
- If you're using rbenv, type rbenv local 2.2.0
- This ensures that this particular app will always use 2.2.0...
- ... even if you switch to a new version globally.
- The idea is that rbenv allows you to accurately recreate the environment under which a ruby app was developed.
- Bundler enables the same thing. We'll use it next.
Gemfile
- We will use Bundler to keep track of the gems this application uses
- Create a file called Gemfile in your directory. Edit it to contain:
source 'https://rubygems.org'
gem 'simple_oauth'
gem 'typhoeus'
- Now install these gems by typing:
bundle install
- This will create a file called Gemfile.lock that records the specific versions of the gems that you installed.
Gemfile.lock
- On my machine, Gemfile.lock contains:
GEM
  remote: https://rubygems.org/
  specs:
    ethon (0.7.2)
      ffi (>= 1.3.0)
    ffi (1.9.6)
    simple_oauth (0.3.1)
    typhoeus (0.7.1)
      ethon (>= 0.7.1)

PLATFORMS
  ruby

DEPENDENCIES
  simple_oauth
  typhoeus
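The Gemfile above names gems without version constraints, so Gemfile.lock is what pins the exact versions. As an illustration of how a constraint such as "ethon (>= 0.7.1)" or a pessimistic "~> 0.7.1" is evaluated, the sketch below uses RubyGems' own Gem::Requirement class. The "~> 0.7.1" constraint is a hypothetical example and does not appear in the lecture's Gemfile.

```ruby
require 'rubygems'

# Constraint copied from the lockfile: typhoeus needs ethon >= 0.7.1.
ethon_requirement = Gem::Requirement.new('>= 0.7.1')
puts ethon_requirement.satisfied_by?(Gem::Version.new('0.7.2'))

# Hypothetical pessimistic constraint: any 0.7.x release from 0.7.1 upward.
pessimistic = Gem::Requirement.new('~> 0.7.1')

%w[0.7.1 0.7.2 0.8.0].each do |version|
  allowed = pessimistic.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

In a Gemfile, such a constraint would be written as a second argument to gem, for example gem 'typhoeus', '~> 0.7.1'.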
Timeline.rb
- Create a file called Timeline.rb
- This file will define a class called Timeline
- Timeline will be responsible for:
- making a single API request on a user's timeline
- handling an error if it should occur
- on success, it will yield tweets for processing
- We will develop this class in stages
Initial Class
require 'bundler/setup'
require 'json'
require 'simple_oauth'
require 'typhoeus'

class Timeline

  def initialize(args)
    @parser      = URI::Parser.new
    @props       = args.fetch(:props)
    @screen_name = args.fetch(:screen_name)
    @url         = url
  end

  def url
    'https://api.twitter.com/1.1/statuses/user_timeline.json'
  end

  private :url

end
get_tweets.rb
- Let's also create a file called get_tweets.rb
- This file will be our main program
- It will do the following:
- Load up our oauth properties file
- Read the desired screen_name from the command line
- Instantiate the Timeline class
- Invoke the request for the first page of tweets
- Store any returned tweets in a file called tweets.json
Initial get_tweets.rb
require_relative 'Timeline'

if __FILE__ == $0

  STDOUT.sync = true

  if ARGV.length != 1
    puts "Usage: ruby get_tweets.rb <screen_name>"
    exit(1)
  end

  screen_name = ARGV[0]

  # load oauth.properties
  # instantiate Timeline
  # make the request and store tweets in tweets.json

end
Discussion
- require is how Ruby loads libraries, including gems
- We require bundler/setup to ensure that we're using the versions of the gems recorded in Gemfile.lock
- require_relative is how we load other files, with the path resolved relative to the current file
- if __FILE__ == $0 is true only when this file is the script being run, not when it is loaded by another file
- STDOUT.sync = true asks for non-buffered output
- ARGV contains the command line arguments that appear after ruby get_tweets.rb
Load the Tokens
- Let's handle oauth.properties first
- We need the tokens in a Hash
def load_props
  # File.read opens, reads, and closes the file in one call
  JSON.parse(File.read('oauth.properties'))
end

def convert_props(input)
  # Convert the string keys from JSON into symbols (e.g. :consumer_key)
  props = {}
  input.keys.each do |key|
    props[key.to_sym] = input[key]
  end
  props
end

props = convert_props(load_props)
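As a side note, Ruby's JSON parser can produce symbol keys directly through its symbolize_names option. The sketch below compares that with the hand-written conversion, using a made-up two-field string in place of the real file. One difference to keep in mind: symbolize_names also converts keys of nested objects, while convert_props only touches the top level. For a flat object like oauth.properties the results are the same.

```ruby
require 'json'

# Stand-in for the contents of oauth.properties (placeholder values).
text = '{"consumer_key": "ck", "token": "tk"}'

# Two-step version, as in the lecture: parse, then convert keys by hand.
by_hand = {}
JSON.parse(text).each { |key, value| by_hand[key.to_sym] = value }

# One-step version: ask the parser for symbol keys directly.
direct = JSON.parse(text, symbolize_names: true)

p direct
puts by_hand == direct
```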
Instantiate Timeline
- We now have enough information to instantiate the Timeline class
args = {props: props, screen_name: screen_name}
twitter = Timeline.new(args)
- Of course, Timeline doesn't do anything yet! Let's fix that!
Utilities (1)
- Timeline needs a number of utility functions
- Since we will need to pass a screen_name as part of the URL, we need to make sure it is properly escaped.
- Ruby's standard URI library (which the simple_oauth gem loads) provides a URI::Parser that will do the trick
require 'simple_oauth'
parser = URI::Parser.new
parser.escape("kenbod,twitter", /[^a-z0-9\-\.\_\~]/i)
# outputs 'kenbod%2Ctwitter'
prepare()
- Given this info, we add a prepare method to Timeline
def prepare(param)
  @parser.escape(param.to_s, /[^a-z0-9\-\.\_\~]/i)
end
- We will use this function to prepare any params that get passed with our URL
- For example:
- https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=kenbod
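To see what prepare does to different inputs, here is a standalone sketch that needs only Ruby's standard library. It spells the parser class as URI::RFC2396_Parser, which is the class that URI::Parser referred to in Ruby 2.2, so the sketch does not depend on that alias. The sample inputs are made up.

```ruby
require 'uri'

# Same pattern as in prepare: every character outside the unreserved set
# (letters, digits, "-", ".", "_", "~") counts as unsafe.
UNSAFE = /[^a-z0-9\-\.\_\~]/i

PARSER = URI::RFC2396_Parser.new

# Standalone version of Timeline#prepare.
def prepare(param)
  PARSER.escape(param.to_s, UNSAFE)
end

puts prepare('kenbod')          # only unreserved characters, so unchanged
puts prepare('kenbod,twitter')  # the comma becomes %2C
puts prepare('hello world')     # the space becomes %20
puts prepare(200)               # non-strings are converted with to_s first

# Building the example URL from an escaped parameter.
base = 'https://api.twitter.com/1.1/statuses/user_timeline.json'
puts "#{base}?screen_name=#{prepare('kenbod')}"
```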
Utilities (2)
- Twitter requires that we sign every request with an Authorization header
- This is really hard to do; fortunately simple_oauth makes it... simple!
- To sign the request, we have to pass the method, url, params, and our tokens
- We will create a method called authorization to generate this header
authorization
def authorization
  params = {
    'screen_name' => prepare(@screen_name),
    'count'       => prepare('200') }
  header = SimpleOAuth::Header.new("GET", @url, params, @props)
  { 'Authorization' => header.to_s }
end

private :authorization, :prepare, :url
- Note: all utility methods are marked as private
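For the curious, the sketch below outlines the kind of work that is being delegated to simple_oauth. It follows the general OAuth 1.0a HMAC-SHA1 recipe (build a signature base string, then sign it with a key made from both secrets) and is not simple_oauth's actual code. The helper names, nonce, timestamp, keys, and secrets are all made up for illustration.

```ruby
require 'openssl'
require 'uri'

UNRESERVED = /[^a-z0-9\-\.\_\~]/i

def percent_encode(value)
  URI::RFC2396_Parser.new.escape(value.to_s, UNRESERVED)
end

# Signature base string: METHOD&URL&PARAMS, where PARAMS is the sorted list
# of encoded key=value pairs joined by "&", then encoded once more.
def signature_base(method, url, params)
  normalized = params.map { |k, v| [percent_encode(k), percent_encode(v)] }
                     .sort
                     .map { |pair| pair.join('=') }
                     .join('&')
  [method.upcase, percent_encode(url), percent_encode(normalized)].join('&')
end

# HMAC-SHA1 over the base string, keyed by "consumer_secret&token_secret".
def sign(base, consumer_secret, token_secret)
  key = "#{percent_encode(consumer_secret)}&#{percent_encode(token_secret)}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), key, base)
  [digest].pack('m0') # Base64 without line breaks
end

# Request params plus the oauth_* params; every value here is a placeholder.
params = {
  'screen_name' => 'kenbod', 'count' => 200,
  'oauth_consumer_key' => 'ck', 'oauth_token' => 'tk',
  'oauth_nonce' => 'abc123', 'oauth_timestamp' => 1423719723,
  'oauth_signature_method' => 'HMAC-SHA1', 'oauth_version' => '1.0'
}
url = 'https://api.twitter.com/1.1/statuses/user_timeline.json'

base = signature_base('get', url, params)
puts base
puts sign(base, 'consumer-secret', 'token-secret')
```

The resulting signature is sent as the oauth_signature field of the Authorization header. Because the nonce and timestamp change on every request, so does the signature.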
Utilities (3)
- Our second-to-last utility helps us configure the Typhoeus request
- We will call it options; it will set the HTTP method, our headers, and our params
def options
  options = {}
  options[:method]  = :get
  options[:headers] = authorization
  options[:params]  = { screen_name: @screen_name, count: 200 }
  options
end
Utilities (4)
- Our last utility makes a request and returns the response
- We will call it make_request
def make_request
  request = Typhoeus::Request.new(@url, options)
  request.run
end
Collecting the Tweets
- Now, we just have to go and collect the tweets
def collect
  puts "REQUESTING : #{Time.now}"
  response = make_request
  if response.code == 200
    puts "SUCCESS    : #{Time.now}"
    tweets = JSON.parse(response.body)
    puts "#{tweets.size} tweet(s) received."
    yield tweets
  else
    puts "FAILURE    : #{Time.now}"
    puts "Response Code: #{response.code}"
    puts "Response Info: #{response.status_message}"
    exit(1)
  end
end
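The success/failure branching in collect can be tried out without touching the network. The sketch below is one possible variation, not part of the lecture's Timeline class: it raises an exception instead of calling exit, so the caller decides how to react. FakeResponse and handle are hypothetical names, and the struct only mimics the three response fields that collect reads.

```ruby
require 'json'

# Hypothetical stand-in for an HTTP response: code, status message, body.
FakeResponse = Struct.new(:code, :status_message, :body)

# Same branching as collect, but failure raises instead of exiting.
def handle(response)
  unless response.code == 200
    raise "Request failed: #{response.code} #{response.status_message}"
  end
  yield JSON.parse(response.body)
end

# Success: the block receives the parsed array of tweets.
ok = FakeResponse.new(200, 'OK', '[{"id": 1, "text": "hello"}]')
handle(ok) { |tweets| puts "#{tweets.size} tweet(s) received." }

# Failure: the block is never called and the caller sees an exception.
begin
  handle(FakeResponse.new(429, 'Too Many Requests', '')) { |tweets| p tweets }
rescue RuntimeError => e
  puts e.message
end
```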
Updating get_tweets
- Now that Timeline can actually do something, we have to update get_tweets to take advantage of it!
puts "Collecting 200 most recent tweets for '#{screen_name}'"

twitter.collect do |tweets|
  File.open('tweets.json', 'w') do |f|
    tweets.each do |tweet|
      f.puts tweet.to_json
    end
  end
end

puts "DONE."
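Note that tweets.json ends up holding one JSON object per line rather than a single JSON array, so it has to be read back line by line. A minimal round-trip sketch, using made-up sample data, a temporary directory, and two hypothetical helpers (write_tweets, read_tweets):

```ruby
require 'json'
require 'tmpdir'

# Writes one JSON object per line, the same layout get_tweets.rb produces.
def write_tweets(path, tweets)
  File.open(path, 'w') do |f|
    tweets.each { |tweet| f.puts tweet.to_json }
  end
end

# Reads the file back, parsing each line as its own JSON document.
def read_tweets(path)
  File.foreach(path).map { |line| JSON.parse(line) }
end

# Made-up sample data; real tweets carry many more fields.
sample = [{ 'id' => 1, 'text' => 'first' }, { 'id' => 2, 'text' => 'second' }]

Dir.mktmpdir do |dir|
  path = File.join(dir, 'tweets.json')
  write_tweets(path, sample)
  puts File.readlines(path).size    # one line per tweet
  puts read_tweets(path) == sample  # true when the round trip is lossless
end
```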
Run it!
- Let's get 200 of the Bad Astronomer's tweets
$ ruby get_tweets.rb badastronomer
Collecting 200 most recent tweets for 'badastronomer'
REQUESTING : 2015-02-11 22:42:03 -0700
SUCCESS : 2015-02-11 22:42:05 -0700
200 tweet(s) received.
DONE.
- The output is pretty dense: big JSON objects, one per line!
jq to the rescue!
- To work with the tweets.json file, make use of jq
- To pretty print the file:
- jq is more than just a pretty printer. Try:
- Fun!
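The jq commands themselves are not reproduced in these notes. As a rough stand-in written in Ruby rather than jq, the sketch below shows the two kinds of operations involved: pretty printing a tweet and pulling selected fields out of every tweet. The sample lines are made up and keep only a few fields of a real tweet.

```ruby
require 'json'

# Two made-up lines in the same one-object-per-line layout as tweets.json.
lines = [
  '{"id": 1, "text": "first", "user": {"screen_name": "kenbod"}}',
  '{"id": 2, "text": "second", "user": {"screen_name": "kenbod"}}'
]
tweets = lines.map { |line| JSON.parse(line) }

# Pretty print: indent nested structure instead of one dense line.
puts JSON.pretty_generate(tweets.first)

# Field extraction: keep only the text of each tweet.
texts = tweets.map { |tweet| tweet['text'] }
puts texts

# Nested field extraction: the author's screen name.
names = tweets.map { |tweet| tweet['user']['screen_name'] }.uniq
puts names
```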
Wrapping Up
- We've now got code to get one page of tweets
- Next week, we will update this code to get all 3200 tweets
- Also next week, working with the Streaming API
- After that: NoSQL!