Getting Data from Twitter
Take Two

Lecture 10 — Data Engineering — Spring 2015

February 12, 2015

Consumer Keys, Access Tokens

  • Access to Twitter's APIs is secured by OAuth
  • OAuth is a Web-based security protocol that allows a web service to associate service invocations with users and applications.
  • With Twitter, a consumer key identifies a particular application (and developer).
  • An access token identifies a particular user.

Consumer Keys, Access Tokens

  • OAuth allows an application to act on behalf of many users.
  • You might ship your application with its consumer keys but with no access tokens.
  • When a user launches your app for the first time, you send them to Twitter to log in and grant the app access to their account.
  • Twitter will then send an access token/secret for your application to store on behalf of the user.

Consumer Keys, Access Tokens

  • With this information, you can do things like "post this tweet on behalf of user kenbod," even though you otherwise are not connected with that user in any way.
  • When you create an app on Twitter and generate an access token yourself, you short-circuit this Web-based authentication process.
  • This allows you to quickly create apps that focus on data collection.

oauth.properties

  • The first step in accessing Twitter data via the REST, Search, and Streaming APIs is to create a file called oauth.properties
  • It should have a single JSON object inside with these properties:

  • Fill in the fields with your own values.
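  • The property listing is not reproduced here. As a sketch only, the Ruby snippet below prints a template, assuming the four key names that simple_oauth reads from its options hash (consumer_key, consumer_secret, token, token_secret). The helper name oauth_template and the placeholder values are illustrative, not part of the lecture's code.

```ruby
require 'json'

# Assumed key names: they mirror the options simple_oauth reads.
# The values are placeholders, not real credentials.
def oauth_template
  {
    'consumer_key'    => 'YOUR_CONSUMER_KEY',
    'consumer_secret' => 'YOUR_CONSUMER_SECRET',
    'token'           => 'YOUR_ACCESS_TOKEN',
    'token_secret'    => 'YOUR_ACCESS_TOKEN_SECRET'
  }
end

# Print the template; save the output as oauth.properties and edit the values.
puts JSON.pretty_generate(oauth_template)
```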

Get Tweets

  • Let's start with a program that retrieves a single page of tweets for a given user id.
  • We'll expand that program to get multiple pages later.
  • The maximum Twitter will let you get is the last 3200 tweets of a given user.
    • (That's 16 pages of 200 tweets each.)
  • Create a directory called get_tweets
  • Place a copy of oauth.properties in that directory.
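  • The page count quoted above is a ceiling division. A tiny sketch, with the constant and method names chosen here for illustration:

```ruby
# Limits as stated on the slide above.
MAX_TWEETS = 3200
PAGE_SIZE  = 200

# Integer ceiling division: pages needed to cover max_tweets.
def pages_needed(max_tweets = MAX_TWEETS, page_size = PAGE_SIZE)
  (max_tweets + page_size - 1) / page_size
end

puts pages_needed  # 16
```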

Prerequisites

  1. You need to have access to these commands
    • ruby
    • gem
  2. My recommendation:
    • Use rbenv to manage ruby on your machine!

rbenv set-up (optional)

  • Once you have rbenv and ruby-build installed:
    1. rbenv install 2.2.0
    2. rbenv global 2.2.0
    3. gem install bundler
    4. rbenv rehash

Verify Installation (optional)

Command           Output
rbenv version     2.2.0 (set by $HOME/.rbenv/version)
ruby --version    ruby 2.2.0p0 ...
which bundler     $HOME/.rbenv/shims/bundler

Back to get_tweets

  • Head back to the get_tweets directory
  • If you're using rbenv, type rbenv local 2.2.0
    • This ensures that this particular app will always use 2.2.0...
    • ... even if you switch to a new version globally.
    • The idea is that rbenv allows you to accurately recreate the environment under which a ruby app was developed.
    • Bundler enables the same thing. We'll use it next.
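  • rbenv local records its choice in a file named .ruby-version in the current directory. A minimal sketch of a sanity check a script could run at start-up, assuming that file exists; the helper name ruby_version_matches? is hypothetical:

```ruby
# Compare the running interpreter's version with the one pinned by rbenv.
# `pinned` is the raw contents of .ruby-version, e.g. "2.2.0\n".
def ruby_version_matches?(pinned, running = RUBY_VERSION)
  pinned.strip == running
end

# Only warn when a pin is present and differs from the running version.
if File.exist?('.ruby-version')
  pinned = File.read('.ruby-version')
  unless ruby_version_matches?(pinned)
    warn "Expected ruby #{pinned.strip}, running #{RUBY_VERSION}"
  end
end
```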

Gemfile

  • We will use Bundler to keep track of the gems this application uses
  • Create a file called Gemfile in your directory. Edit it to contain:

source 'https://rubygems.org'
gem 'simple_oauth'
gem 'typhoeus'
          
  • Now install these gems by typing: bundle install
  • This will create a file called Gemfile.lock that records the specific versions of the gems that you installed.

Gemfile.lock

  • On my machine, Gemfile.lock contains:

GEM
  remote: https://rubygems.org/
  specs:
    ethon (0.7.2)
      ffi (>= 1.3.0)
    ffi (1.9.6)
    simple_oauth (0.3.1)
    typhoeus (0.7.1)
      ethon (>= 0.7.1)

PLATFORMS
  ruby

DEPENDENCIES
  simple_oauth
  typhoeus
          

Timeline.rb

  • Create a file called Timeline.rb
  • This file will define a class called Timeline
  • Timeline will be responsible for:
    • making a single API request on a user's timeline
    • handling an error if it should occur
    • on success, it will yield tweets for processing
  • We will develop this class in stages

Initial Class


require 'bundler/setup'

require 'json'
require 'simple_oauth'
require 'typhoeus'

class Timeline

  def initialize(args)
    @parser      = URI::Parser.new
    @props       = args.fetch(:props)
    @screen_name = args.fetch(:screen_name)
    @url         = url
  end

  def url
    'https://api.twitter.com/1.1/statuses/user_timeline.json'
  end

  private :url

end
          

get_tweets.rb

  • Let's also create a file called get_tweets.rb
  • This file will be our main program
  • It will do the following:
    • Load up our oauth properties file
    • Read the desired screen_name from the command line
    • Instantiate the Timeline class
    • Invoke the request for the first page of tweets
    • Store any returned tweets in a file called tweets.json

Initial get_tweets.rb


require_relative 'Timeline'

if __FILE__ == $0

  STDOUT.sync = true

  if ARGV.length != 1
    puts "Usage: ruby get_tweets.rb <screen_name>"
    exit(1)
  end

  screen_name = ARGV[0]

  # load oauth.properties
  # instantiate Timeline
  # make the request and store tweets in tweets.json

end
          

Discussion

  • require is how Ruby loads gems and standard libraries
  • We require bundler/setup so that the gem versions recorded in Gemfile.lock are the ones that get loaded
  • require_relative loads other files, relative to the current file
  • if __FILE__ == $0 checks whether this file is the script being run (rather than being required by another file)
  • STDOUT.sync = true asks for unbuffered output
  • ARGV contains the command-line arguments that appear after ruby get_tweets.rb

Load the Tokens

  • Let's handle oauth.properties first
  • We need the tokens in a Hash

def load_props
  # File.read opens, reads, and closes the file in one call
  JSON.parse(File.read('oauth.properties'))
end

def convert_props(input)
  props = {}
  input.keys.each do |key|
    props[key.to_sym] = input[key]
  end
  props
end
          
  • In the main body

props = convert_props(load_props)
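  • As an aside, JSON.parse can symbolize keys itself via its symbolize_names option. The sketch below compares the two approaches on a made-up JSON string standing in for oauth.properties:

```ruby
require 'json'

# Hand-rolled conversion, equivalent to convert_props above.
def convert_props(input)
  props = {}
  input.each_key { |key| props[key.to_sym] = input[key] }
  props
end

# Placeholder JSON standing in for the contents of oauth.properties.
raw = '{"consumer_key": "abc", "token": "xyz"}'

by_hand  = convert_props(JSON.parse(raw))
built_in = JSON.parse(raw, symbolize_names: true)

puts by_hand == built_in  # true
```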
          

Instantiate Timeline

  • We now have enough information to instantiate the Timeline class

args = {props: props, screen_name: screen_name}
twitter = Timeline.new(args)
          
  • Of course, Timeline doesn't do anything yet! Let's fix that!

Utilities (1)

  • Timeline needs a number of utility functions
  • Since we will need to pass a screen_name as part of the URL, we need to make sure it is properly escaped.
  • Ruby's standard uri library (loaded along with simple_oauth) provides a URI parser that will do the trick

require 'simple_oauth'
parser = URI::Parser.new
parser.escape("kenbod,twitter", /[^a-z0-9\-\.\_\~]/i)
# returns 'kenbod%2Ctwitter'
          

prepare()

  • Given this info, we add a prepare method to Timeline

def prepare(param)
  @parser.escape(param.to_s, /[^a-z0-9\-\.\_\~]/i)
end
          
  • We will use this function to prepare any params that get passed with our URL
  • For example:
    • https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=kenbod
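  • To see how such a URL is assembled, here is a small stand-alone sketch using URI.encode_www_form from the standard library. In the Timeline class itself Typhoeus builds the query string from the params we hand it, so the timeline_url helper below is purely illustrative:

```ruby
require 'uri'

BASE_URL = 'https://api.twitter.com/1.1/statuses/user_timeline.json'

# Build the full request URL; encode_www_form escapes each key and value.
# Note: it uses form encoding, so a space becomes "+" rather than "%20".
def timeline_url(screen_name, count = 200)
  "#{BASE_URL}?#{URI.encode_www_form(screen_name: screen_name, count: count)}"
end

puts timeline_url('kenbod')
```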

Utilities (2)

  • Twitter requires that we sign every request with an Authorization header
  • This is really hard to do; fortunately simple_oauth makes it... simple!
  • To sign the request, we have to pass the method, url, params, and our tokens
  • We will create a method called authorization to generate this header

authorization


def authorization
  params = {
    'screen_name' => prepare(@screen_name),
    'count' => prepare('200') }
  header = SimpleOAuth::Header.new("GET", @url, params, @props)
  { 'Authorization' => header.to_s }
end

private :authorization, :prepare, :url
          
  • Note: all utility methods are marked as private

Utilities (3)

  • Our second-to-last utility helps us configure the Typhoeus request
  • We will call it options; it will set the HTTP method, our headers, and our params

def options
  options = {}
  options[:method]  = :get
  options[:headers] = authorization
  options[:params]  = { screen_name: @screen_name, count: 200 }
  options
end
          

Utilities (4)

  • Our last utility makes a request and returns the response
  • We will call it make_request

def make_request
  request = Typhoeus::Request.new(@url, options)
  request.run
end
          

Collecting the Tweets

  • Now, we just have to go and collect the tweets

def collect
  puts "REQUESTING   : #{Time.now}"
  response = make_request
  if response.code == 200
    puts "SUCCESS      : #{Time.now}"
    tweets = JSON.parse(response.body)
    puts "#{tweets.size} tweet(s) received."
    yield tweets
  else
    puts "FAILURE      : #{Time.now}"
    puts "Response Code: #{response.code}"
    puts "Response Info: #{response.status_message}"
    exit(1)
  end
end
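  • The block-based design of collect can be tried out without network access. The sketch below uses a hypothetical FakeTimeline class (not part of the lecture's code) whose collect yields canned tweets the same way the success path above does:

```ruby
require 'json'

# Hypothetical stand-in for Timeline: no HTTP request, just canned JSON.
class FakeTimeline
  def initialize(body)
    @body = body
  end

  # Mirrors the success path of Timeline#collect: parse, then yield.
  def collect
    tweets = JSON.parse(@body)
    yield tweets
  end
end

body = '[{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]'
FakeTimeline.new(body).collect do |tweets|
  tweets.each { |tweet| puts tweet['text'] }
end
```

  • Because the caller supplies the block, the same collect method can write to a file, print to the screen, or feed a test.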
          

Updating get_tweets

  • Now that Timeline can actually do something, we have to update get_tweets to take advantage of it!

puts "Collecting 200 most recent tweets for '#{screen_name}'"

twitter.collect do |tweets|
  File.open('tweets.json', 'w') do |f|
    tweets.each do |tweet|
      f.puts tweet.to_json
    end
  end
end

puts "DONE."
          

Run it!

  • Let's get 200 of the Bad Astronomer's tweets

$ ruby get_tweets.rb badastronomer
Collecting 200 most recent tweets for 'badastronomer'
REQUESTING   : 2015-02-11 22:42:03 -0700
SUCCESS      : 2015-02-11 22:42:05 -0700
200 tweet(s) received.
DONE.
          
  • The output is pretty dense: big JSON objects, one per line!

jq to the rescue!

  • To work with the tweets.json file, make use of jq
  • To pretty print the file:
    • jq '.' tweets.json
  • jq is more than just a pretty printer. Try:
    • jq '.text' tweets.json
  • Fun!
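  • The same file can be read back in Ruby. A minimal sketch, assuming tweets.json holds one JSON object per line as written by get_tweets.rb; it does roughly what jq '.text' does, minus the JSON quoting. The helper name tweet_texts is illustrative:

```ruby
require 'json'

# Parse line-delimited JSON and pull out each tweet's text field.
# `lines` is any enumerable of strings, one JSON object per string.
def tweet_texts(lines)
  lines.map { |line| JSON.parse(line)['text'] }
end

# Only runs when the file from the previous step is present.
if File.exist?('tweets.json')
  puts tweet_texts(File.foreach('tweets.json'))
end
```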

Wrapping Up

  • We've now got code to get one page of tweets
  • Next week, we will update this code to get all 3200 tweets
  • Also next week, working with the Streaming API
  • After that: NoSQL!