Lecture 10 — Data Engineering

Getting Data from Twitter
Take Two

Lecture 10 — Data Engineering — Spring 2015

February 12, 2015

Consumer Keys, Access Tokens

Access to Twitter's APIs is secured by OAuth
OAuth is a Web-based security protocol that allows a web service to associate service invocations with users and applications.
With Twitter, a consumer key identifies a particular application (and developer).
An access token identifies a particular user.

Consumer Keys, Access Tokens

OAuth allows an application to act on behalf of many users.
You might ship your application with its consumer keys but with no access tokens.
When a user launches your app for the first time, you send them to Twitter to log in and grant the app access to their account.
Twitter will then send an access token/secret for your application to store on behalf of the user.

Consumer Keys, Access Tokens

With this information, you can do things like "post this tweet on behalf of user kenbod," even though you are otherwise not connected with that user in any way.
When you create an app on Twitter and generate an access token, you short-circuit this Web-based authentication process.
This allows you to quickly create apps that focus on data collection.

oauth.properties

The first step in accessing Twitter data via the REST, Search, and Streaming APIs is to create a file called oauth.properties
It should contain a single JSON object holding your OAuth credentials: the consumer key and secret, and the access token and secret.
Fill in the fields with your own values.
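These notes do not list the property names themselves. As a sketch only, assuming the file uses the option names that the simple_oauth gem accepts (consumer_key, consumer_secret, token, token_secret), the file's shape could be checked like this. The placeholder values and the missing_keys helper are hypothetical.

```ruby
require 'json'

# Assumption: these are the four credential names, chosen to match the
# option names the simple_oauth gem accepts. The values are placeholders.
REQUIRED_KEYS = %w[consumer_key consumer_secret token token_secret].freeze

SAMPLE_PROPERTIES = <<JSON
{
  "consumer_key": "YOUR_CONSUMER_KEY",
  "consumer_secret": "YOUR_CONSUMER_SECRET",
  "token": "YOUR_ACCESS_TOKEN",
  "token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
JSON

# Hypothetical helper: returns the required keys missing from the JSON text.
def missing_keys(json_text)
  REQUIRED_KEYS - JSON.parse(json_text).keys
end

puts "Missing keys: #{missing_keys(SAMPLE_PROPERTIES).inspect}"
```

An empty list means all four credentials are present; it says nothing about whether Twitter will accept them.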

Get Tweets

Let's start with a program that retrieves a single page of tweets for a given user id.
We'll expand that program to get multiple pages later.
The maximum Twitter will let you get is the last 3200 tweets of a given user.
- (That's 16 pages of 200 tweets each.)
Create a directory called get_tweets
Place a copy of oauth.properties in that directory.

Prerequisites

You need to have access to these commands
- ruby
- gem
My recommendation:
- Use rbenv to manage ruby on your machine!

rbenv set-up (optional)

Once you have rbenv and ruby-build installed:
1. rbenv install 2.2.0
2. rbenv global 2.2.0
3. gem install bundler
4. rbenv rehash

Verify Installation (optional)

Command	Output
rbenv version	2.2.0 (set by $HOME/.rbenv/version)
ruby --version	ruby 2.2.0p0 ...
which bundler	$HOME/.rbenv/shims/bundler

Back to get_tweets

Head back to the get_tweets directory
If you're using rbenv, type rbenv local 2.2.0
- This ensures that this particular app will always use 2.2.0...
- ... even if you switch to a new version globally.
- The idea is that rbenv allows you to accurately recreate the environment under which a ruby app was developed.
- Bundler enables the same thing. We'll use it next.

Gemfile

We will use Bundler to keep track of the gems this application uses
Create a file called Gemfile in your directory. Edit it to contain:


source 'https://rubygems.org'
gem 'simple_oauth'
gem 'typhoeus'

Now install these gems by typing: bundle install
This will create a file called Gemfile.lock that records the specific versions of the gems that you installed.

Gemfile.lock

On my machine, Gemfile.lock contains:


GEM
  remote: https://rubygems.org/
  specs:
    ethon (0.7.2)
      ffi (>= 1.3.0)
    ffi (1.9.6)
    simple_oauth (0.3.1)
    typhoeus (0.7.1)
      ethon (>= 0.7.1)

PLATFORMS
  ruby

DEPENDENCIES
  simple_oauth
  typhoeus

Timeline.rb

Create a file called Timeline.rb
This file will define a class called Timeline
Timeline will be responsible for:
- making a single API request on a user's timeline
- handling an error if it should occur
- on success, it will yield tweets for processing
We will develop this class in stages

Initial Class


require 'bundler/setup'

require 'json'
require 'simple_oauth'
require 'typhoeus'

class Timeline

  def initialize(args)
    @parser      = URI::Parser.new
    @props       = args.fetch(:props)
    @screen_name = args.fetch(:screen_name)
    @url         = url
  end

  def url
    'https://api.twitter.com/1.1/statuses/user_timeline.json'
  end

  private :url

end

get_tweets.rb

Let's also create a file called get_tweets.rb
This file will be our main program
It will do the following:
- Load up our oauth properties file
- Read the desired screen_name from the command line
- Instantiate the Timeline class
- Invoke the request for the first page of tweets
- Store any returned tweets in a file called tweets.json

Initial get_tweets.rb


require_relative 'Timeline'

if __FILE__ == $0

  STDOUT.sync = true

  if ARGV.length != 1
    puts "Usage: ruby get_tweets.rb <screen_name>"
    exit(1)
  end

  screen_name = ARGV[0]

  # load oauth.properties
  # instantiate Timeline
  # make the request and store tweets in tweets.json

end

Discussion

require is how Ruby loads libraries, including gems
We require bundler/setup to ensure that we're using the correct version of the gems we depend on
require_relative is how we import other files
if __FILE__ == $0 determines if we've been launched from the command line
STDOUT.sync = true asks for non-buffered output
ARGV contains the command line arguments that appear after ruby get_tweets.rb

Load the Tokens

Let's handle oauth.properties first
We need the tokens in a Hash


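# Read oauth.properties and parse its JSON into a Hash with String keys.
def load_props
  JSON.parse(File.read('oauth.properties'))
end

# Convert the String keys to Symbols for use as simple_oauth options.
def convert_props(input)
  props = {}
  input.each do |key, value|
    props[key.to_sym] = value
  end
  props
end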

In the main body


props = convert_props(load_props)
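As an aside, Ruby's JSON library can produce Symbol keys directly through the symbolize_names option of JSON.parse, which would make a separate conversion step unnecessary. A minimal sketch, using an inline string with placeholder values in place of oauth.properties:

```ruby
require 'json'

# Placeholder JSON standing in for the contents of oauth.properties.
SAMPLE_TEXT = '{"consumer_key": "abc", "token": "xyz"}'.freeze

# The same conversion convert_props performs: String keys to Symbol keys.
def convert_props(input)
  props = {}
  input.each do |key, value|
    props[key.to_sym] = value
  end
  props
end

# One-step alternative: ask the parser for Symbol keys.
def load_symbolized(text)
  JSON.parse(text, symbolize_names: true)
end

puts convert_props(JSON.parse(SAMPLE_TEXT)) == load_symbolized(SAMPLE_TEXT)
```

Both approaches give the same Hash for a flat object like this one. Note that symbolize_names also converts keys in nested objects, which the hand-written loop does not.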

Instantiate Timeline

We now have enough information to instantiate the Timeline class


args = {props: props, screen_name: screen_name}
twitter = Timeline.new(args)

Of course, Timeline doesn't do anything yet! Let's fix that!

Utilities (1)

Timeline needs a number of utility functions
Since we will need to pass a screen_name as part of the URL, we need to make sure it is properly escaped.
Ruby's standard uri library, which the simple_oauth gem loads for us, provides a URI parser that will do the trick


require 'simple_oauth'
parser = URI::Parser.new
parser.escape("kenbod,twitter", /[^a-z0-9\-\.\_\~]/i)
# outputs 'kenbod%2Ctwitter'

prepare()

Given this info, we add a prepare method to Timeline


def prepare(param)
  @parser.escape(param.to_s, /[^a-z0-9\-\.\_\~]/i)
end

We will use this function to prepare any params that get passed with our URL
For example:
- https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=kenbod
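To see what the escaping does to different inputs, here is a standalone sketch of the same idea. It names URI::RFC2396_Parser explicitly, on the assumption that this is the class behind URI::Parser in the Ruby version used in this lecture; naming it directly keeps the sketch working on newer Rubies too.

```ruby
require 'uri'

# Everything except letters, digits, '-', '.', '_' and '~' is encoded.
UNSAFE = /[^a-z0-9\-\.\_\~]/i

# Assumption: RFC2396_Parser is the parser that provides escape here.
PARSER = URI::RFC2396_Parser.new

# Standalone version of Timeline#prepare.
def prepare(param)
  PARSER.escape(param.to_s, UNSAFE)
end

puts prepare('kenbod')          # kenbod (nothing to encode)
puts prepare('kenbod,twitter')  # kenbod%2Ctwitter
puts prepare(200)               # 200 (to_s handles non-strings)
puts prepare('a b&c')           # a%20b%26c
```

A plain screen name passes through unchanged, so the escaping only matters for values that contain reserved characters.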

Utilities (2)

Twitter requires that we sign every request with an Authorization header
This is really hard to do; fortunately simple_oauth makes it... simple!
To sign the request, we have to pass the method, url, params, and our tokens
We will create a method called authorization to generate this header

authorization


def authorization
  params = {
    'screen_name' => prepare(@screen_name),
    'count' => prepare('200') }
  header = SimpleOAuth::Header.new("GET", @url, params, @props)
  { 'Authorization' => header.to_s }
end

private :authorization, :prepare, :url

Note: all utility methods are marked as private
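For intuition about what the gem is doing, here is a rough sketch of OAuth 1.0a HMAC-SHA1 signing using only the standard library. It is not simple_oauth's actual code; the helper names, the nonce, the timestamp, and all credential values are made up for illustration.

```ruby
require 'openssl'

# Percent-encode everything except the RFC 3986 unreserved characters.
def percent_encode(value)
  value.to_s.gsub(/[^A-Za-z0-9\-._~]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

# Signature base string: METHOD&URL&PARAMS, where PARAMS is the sorted,
# encoded list of all query and oauth_* parameters.
def signature_base(method, url, params)
  normalized = params.map { |k, v| [percent_encode(k), percent_encode(v)] }
                     .sort
                     .map { |pair| pair.join('=') }
                     .join('&')
  [method.upcase, percent_encode(url), percent_encode(normalized)].join('&')
end

# HMAC-SHA1 over the base string, keyed by the two secrets, then Base64.
def sign(base, consumer_secret, token_secret)
  key = "#{percent_encode(consumer_secret)}&#{percent_encode(token_secret)}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), key, base)
  [digest].pack('m0')
end

TIMELINE_URL = 'https://api.twitter.com/1.1/statuses/user_timeline.json'

# Placeholder values only; a real request needs a fresh nonce and timestamp.
EXAMPLE_PARAMS = {
  'screen_name'            => 'kenbod',
  'count'                  => '200',
  'oauth_consumer_key'     => 'CK',
  'oauth_nonce'            => 'nonce',
  'oauth_signature_method' => 'HMAC-SHA1',
  'oauth_timestamp'        => '1423700000',
  'oauth_token'            => 'TK',
  'oauth_version'          => '1.0'
}.freeze

base = signature_base('GET', TIMELINE_URL, EXAMPLE_PARAMS)
puts base
puts sign(base, 'consumer-secret', 'token-secret')
```

The signature covers the method, the URL, and every parameter, which is why the same params must be passed both to the header and to the request itself.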

Utilities (3)

Our second-to-last utility helps us configure the Typhoeus request
We will call it options; it will set the HTTP method, our headers, and our params


def options
  {
    method:  :get,
    headers: authorization,
    params:  { screen_name: @screen_name, count: 200 }
  }
end
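For a GET request, the params end up in the query string of the URL. The sketch below approximates the resulting URL with the standard library; Typhoeus performs its own encoding, so treat this as an illustration only. The request_url helper is hypothetical.

```ruby
require 'uri'

BASE_URL = 'https://api.twitter.com/1.1/statuses/user_timeline.json'

# Hypothetical helper: append a params Hash to the URL as a query string.
def request_url(params)
  "#{BASE_URL}?#{URI.encode_www_form(params)}"
end

puts request_url(screen_name: 'kenbod', count: 200)
```

This prints the timeline URL followed by ?screen_name=kenbod&count=200.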

Utilities (4)

Our last utility makes a request and returns the response
We will call it make_request


def make_request
  request = Typhoeus::Request.new(@url, options)
  request.run
end

Collecting the Tweets

Now, we just have to go and collect the tweets


def collect
  puts "REQUESTING   : #{Time.now}"
  response = make_request
  if response.code == 200
    puts "SUCCESS      : #{Time.now}"
    tweets = JSON.parse(response.body)
    puts "#{tweets.size} tweet(s) received."
    yield tweets
  else
    puts "FAILURE      : #{Time.now}"
    puts "Response Code: #{response.code}"
    puts "Response Info: #{response.status_message}"
    exit(1)
  end
end
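The yield in collect hands the parsed tweets to whatever block the caller supplies. The sketch below shows that pattern without touching the network, using a stand-in for the Typhoeus response. FakeResponse and collect_from are hypothetical names, and unlike collect this version does not exit on failure.

```ruby
require 'json'

# Stand-in for a Typhoeus response; only the fields collect reads.
FakeResponse = Struct.new(:code, :body, :status_message)

# Same shape as Timeline#collect, with the response passed in.
def collect_from(response)
  if response.code == 200
    yield JSON.parse(response.body)
  else
    warn "Response Code: #{response.code}"
    warn "Response Info: #{response.status_message}"
  end
end

ok = FakeResponse.new(200, '[{"text": "hello"}, {"text": "world"}]', 'OK')
collect_from(ok) do |tweets|
  puts "#{tweets.size} tweet(s) received."
end
```

The block runs only on a 200 response, so the caller never has to check the status code itself.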

Updating get_tweets

Now that Timeline can actually do something, we have to update get_tweets to take advantage of it!


puts "Collecting 200 most recent tweets for '#{screen_name}'"

twitter.collect do |tweets|
  File.open('tweets.json', 'w') do |f|
    tweets.each do |tweet|
      f.puts tweet.to_json
    end
  end
end

puts "DONE."

Run it!

Let's get 200 of the Bad Astronomer's tweets


$ ruby get_tweets.rb badastronomer
Collecting 200 most recent tweets for 'badastronomer'
REQUESTING   : 2015-02-11 22:42:03 -0700
SUCCESS      : 2015-02-11 22:42:05 -0700
200 tweet(s) received.
DONE.

The output is pretty dense: big JSON objects, one per line!

jq to the rescue!

To work with the tweets.json file, make use of jq
To pretty print the file:
- jq '.' tweets.json
jq is more than just a pretty printer. Try:
- jq '.text' tweets.json
Fun!
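Because tweets.json holds one JSON object per line, it is also easy to read back in Ruby. A minimal sketch of the equivalent of jq '.text', where tweet_texts is a hypothetical helper:

```ruby
require 'json'

# Return the "text" field of every tweet in a one-object-per-line file.
def tweet_texts(path)
  File.foreach(path).map { |line| JSON.parse(line)['text'] }
end

# Only runs if get_tweets.rb has already produced the file.
puts tweet_texts('tweets.json') if File.exist?('tweets.json')
```

File.foreach reads one line at a time, so the whole file never has to be parsed as a single document.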

Wrapping Up

We've now got code to get one page of tweets
Next week, we will update this code to get all 3200 tweets
- To do that, we will need to incorporate information from "Working With Timelines" into the code
Also next week, working with the Streaming API
After that: NoSQL!
