Twitter Data Collection Framework

Lecture 11 — Data Engineering — Spring 2015

February 17, 2015

Time to Collect Some Data!

  • Last week, we reviewed the start of a framework for collecting Twitter data
  • This weekend, I fleshed out the framework and it is now able to collect (lots of) data
    • For example, while testing my framework's ability to work with the Search API, I collected 100,681 tweets consuming 364MB of disk space.

Highlights of the Framework

  • Example implementations of five REST API endpoints and two Streaming API endpoints
  • Automatically handles Twitter Rate Limits
    • If you call an endpoint too many times, it automatically sleeps and waits until you can call it again
  • Provides implementations of the techniques described in Working with Timelines and Working with Cursors

Understanding the Framework

  • In lecture 10, we reviewed the code that was needed to make a single GET request on a single REST API endpoint
    • Property File with Consumer Key and Access Token
    • URI Escape HTTP Request Parameters
    • Use of SimpleOAuth to create an Authorization Header
    • Use of Typhoeus to create a request and send it

Building on that Knowledge

  • Those elements are still present
    • but have moved around a bit (in some cases)
    • subclasses have the ability to override the default behaviors in various ways

High-Level View of Framework

Class Hierarchy

TwitterRequest

  • Standardized Constructor
    • Expects an args hash with up to three entries
      • params: the parameters that go on the request
      • data: the data needed by the request
        • data MUST contain a :props key that points at the location of the properties file
      • log: an instance of a Ruby logger
        • If not supplied, a default one is used
  • Sets the contract required by all sub-classes
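
A minimal sketch of what such a constructor might look like, assuming the three entries are simply stored in instance variables. The class name SketchRequest, the attribute names, the sample property file name, and the ArgumentError check are assumptions made for illustration; the framework's actual constructor may differ.

```ruby
require 'logger'

# Hypothetical stand-in for TwitterRequest; only the constructor is sketched.
class SketchRequest
  attr_reader :params, :data, :log

  def initialize(args)
    @params = args[:params] || {}
    @data   = args[:data]   || {}
    # The slide says data MUST contain :props; failing fast here is an assumption.
    raise ArgumentError, "data must contain a :props key" unless @data.key?(:props)
    # Fall back to a plain logger when none is supplied.
    @log = args[:log] || Logger.new($stdout)
  end
end

request = SketchRequest.new(params: { screen_name: "example" },
                            data:   { props: "oauth.properties" })
puts request.params.inspect
```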

Contract?

  • How are contracts enforced?
  • In a statically-typed language like Java, we would use interfaces
  • In a dynamically-typed language like Ruby, we improvise!
    
    def url
      raise NotImplementedError, "No URL Specified for the Request"
    end
                
  • If a subclass fails to implement one of these methods: BOOM!
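
To make the "BOOM" concrete, here is a small self-contained sketch. The class names BaseRequest, GoodRequest, and ForgetfulRequest and the describe helper are hypothetical. One detail worth knowing: NotImplementedError descends from ScriptError rather than StandardError, so a bare rescue will not catch it and it has to be named explicitly.

```ruby
# Hypothetical base class that sets the contract.
class BaseRequest
  def url
    raise NotImplementedError, "No URL Specified for the Request"
  end
end

# Honors the contract.
class GoodRequest < BaseRequest
  def url
    "https://api.twitter.com/1.1/search/tweets.json"
  end
end

# Forgets to implement url.
class ForgetfulRequest < BaseRequest
end

def describe(request)
  request.url
rescue NotImplementedError => e
  # Must be rescued by name; it is not a StandardError.
  "BOOM: #{e.message}"
end

puts describe(GoodRequest.new)
puts describe(ForgetfulRequest.new)
```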

The Contract (1)

  • A TwitterRequest has a public collect method that yields collected data back to its caller
  • Subclasses must provide implementations of:
    • url: the URL of the Twitter endpoint they access
    • request_name: the name used in the log for this request
    • twitter_endpoint: the endpoint used to look up rate limits for this request
    • success: a method that handles the response.body of a successful request

The Contract (2)

  • Subclasses may provide implementations of:
    • error: a method that handles failed requests; the default prints out information about the failure and then raises a runtime exception
    • authorization, options, make_request, and collect can also be overridden but they must preserve the semantics of the default operations
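
The two halves of the contract can be sketched together as a template method: the base class owns collect, and subclasses fill in success (required) and optionally error. Everything below is a simplified, hypothetical stand-in (SketchBase, CannedRequest, FakeResponse); in particular, make_request just returns a canned response instead of calling Twitter.

```ruby
require 'json'

# Hypothetical response object with the two fields the sketch needs.
FakeResponse = Struct.new(:code, :body)

class SketchBase
  # Default collect: make one request, then dispatch on the status code.
  def collect
    response = make_request
    if response.code == 200
      success(response) { |data| yield data }
    else
      error(response)
    end
  end

  # Required: subclasses must override this.
  def success(response)
    raise NotImplementedError, "No success handler specified for the request"
  end

  # Optional: the default raises a runtime exception.
  def error(response)
    raise "Request failed with code #{response.code}"
  end
end

class CannedRequest < SketchBase
  def initialize(response)
    @response = response
  end

  # Stand-in for the real network call.
  def make_request
    @response
  end

  def success(response)
    yield JSON.parse(response.body)
  end
end

collected = []
CannedRequest.new(FakeResponse.new(200, '[{"id": 1}]')).collect do |tweets|
  collected.concat(tweets)
end
puts collected.inspect
```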

Helpers

  • Params and Props
  • Logging
  • Rates

Params and Props

  • These helpers are largely unchanged from last week's example
  • Only two new features in the Params helper
    • The ability to control if a parameter is included in a request
    • The ability to display the parameters that are being sent with a request

def include_param?(param)
  return true
end

def display_params
  result = []
  escaped_params.keys.each do |key|
    result << "#{key}=#{escaped_params[key]}" if include_param?(key)
  end
  result.join("&")
end
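
Here is a self-contained sketch of how the two new features work together. The ParamsDemo class is hypothetical, and escaping with URI.encode_www_form_component is an assumption made so the example runs on its own; the framework's escaped_params may escape differently.

```ruby
require 'uri'

# Hypothetical host class for the two helper methods.
class ParamsDemo
  attr_reader :params

  def initialize(params)
    @params = params
  end

  # Assumption: form-style escaping, used only to keep the sketch runnable.
  def escaped_params
    params.each_with_object({}) do |(key, value), result|
      result[key] = URI.encode_www_form_component(value.to_s)
    end
  end

  # Leave max_id out of the request until it has a non-zero value.
  def include_param?(param)
    !(param == :max_id && params[param] == 0)
  end

  def display_params
    escaped_params.select { |key, _| include_param?(key) }
                  .map { |key, value| "#{key}=#{value}" }
                  .join("&")
  end
end

demo = ParamsDemo.new(q: "big data", count: 100, max_id: 0)
puts demo.display_params   # max_id is omitted because it is still 0
```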
          

Logging (1)

  • The logging helper consists of a custom class definition and a method to create a default logger

require 'logger'

module TwitterLog

  class CustomFormatter < Logger::Formatter
    def call(severity, time, progname, msg)
      "#{time}: #{progname}: #{msg2str(msg)}\n"
    end
  end

  def default_logger
    logger = Logger.new($stdout)
    logger.progname = request_name
    logger.formatter = CustomFormatter.new
    logger
  end

  private :default_logger

end
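
To see what the custom formatter produces, the sketch below writes to a StringIO instead of $stdout so the output can be inspected. The progname "Timeline" is just a sample value; the formatter is the same shape as the one above, repeated here so the example runs on its own.

```ruby
require 'logger'
require 'stringio'

# Same shape as the formatter above: timestamp, program name, message.
# The severity is deliberately left out of the output.
class CustomFormatter < Logger::Formatter
  def call(severity, time, progname, msg)
    "#{time}: #{progname}: #{msg2str(msg)}\n"
  end
end

output = StringIO.new
logger = Logger.new(output)
logger.progname  = "Timeline"
logger.formatter = CustomFormatter.new
logger.info("SUCCESS")

line = output.string
puts line
```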
          

Logging (2)

  • The logger is created in TwitterRequest's constructor
    
    @log = args[:log] || default_logger
                
  • The logger is then accessed via the log attribute
    
    log.info("REQUESTING: #{request.base_url}?#{display_params}")
                

Rates (1)

  • The Rates helper allows the framework to automatically keep track of rate limits for a given Twitter endpoint
  • If you run out of invocations, it blocks your next call and sleeps until the current rate-limit window ends
  • The default make_request ensures that rates are checked on each request
    
    def make_request
      check_rates
      request = Typhoeus::Request.new(url, options)
      log.info("REQUESTING: #{request.base_url}?#{display_params}")
      response = request.run
      @rate_count -= 1
      response
    end
                

Rates (2)

  • The Rates helper invokes a Twitter endpoint to get the application's current set of rate limits
  • These rates are stored in a class variable so that they are shared across all TwitterRequest instances created by an application
  • These global rates are only refreshed when needed
    
    def check_rates
      refresh_rates if @@rates.size == 0
      refresh_rates if Time.now > twitter_window
      return if @rate_count > 0
      delta = twitter_window - Time.now
      log.info "Sleeping for #{delta} seconds"
      sleep delta
      refresh_rates
    end
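
The "sleep until the window is over" idea can be tried out without waiting fifteen minutes by injecting the clock and the sleep call. This is a hypothetical sketch (RateWindow and its method names are not part of the framework), and clamping the delay at zero is an assumption of the sketch rather than something the code above does.

```ruby
# Hypothetical model of one endpoint's rate-limit window.
class RateWindow
  attr_reader :remaining

  def initialize(remaining, reset_at,
                 clock:   -> { Time.now },
                 sleeper: ->(seconds) { sleep(seconds) })
    @remaining = remaining
    @reset_at  = reset_at
    @clock     = clock
    @sleeper   = sleeper
  end

  # Call after each request to use up one invocation.
  def record_call
    @remaining -= 1
  end

  # Returns the number of seconds waited (0 if calls remain).
  def wait_if_exhausted
    return 0 if @remaining > 0
    delta = [@reset_at - @clock.call, 0].max   # never sleep a negative amount
    @sleeper.call(delta)
    delta
  end
end

# Fake clock and fake sleep so the example finishes instantly.
slept  = []
window = RateWindow.new(1, Time.at(1_060),
                        clock:   -> { Time.at(1_000) },
                        sleeper: ->(seconds) { slept << seconds })

window.wait_if_exhausted   # one call left, so no waiting
window.record_call
window.wait_if_exhausted   # none left, so "sleep" until the window resets
puts slept.inspect
```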
                

Types of Requests

  • The framework features three Request sub-classes
    • MaxIdRequest
    • CursorRequest
    • StreamingTwitterRequest
  • One request, RateLimits, is also a direct subclass of TwitterRequest, but it is not a new type of request; it just customizes TwitterRequest to hit the rate_limit_status endpoint.

MaxIdRequest (1)

  • A subclass for endpoints that need to traverse timelines with the max_id parameter
  • It defines a new contract consisting of:
    • init_condition: set up a variable to track our progress traversing the timeline
    • condition: check the variable to see if we should continue traversing the timeline
    • update_condition: update the variable after a request has been made and progress on the timeline has occurred

MaxIdRequest (2)

  • This subclass then overrides collect to enforce the new contract:
    
    def collect
      init_condition
      while condition
        response = make_request
        if response.code == 200
          success(response) do |tweets|
            yield tweets
            params[:max_id] = (tweets[-1]['id'] - 1) if tweets.size > 0
            update_condition(tweets)
          end
        else
          error(response)
        end
      end
    end
                

MaxIdRequest (3)

  • MaxIdRequest takes advantage of the include_param? contract
    • It only includes max_id in the request if it has a non-zero value

def include_param?(param)
  if param == :max_id
    if params[param] == 0
      return false
    end
  end
  return true
end
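
The max_id traversal used by collect can be watched in miniature against an in-memory timeline. In this sketch fetch_page and collect_all are hypothetical helpers, and the fake endpoint assumes max_id is inclusive, which is why the loop subtracts one from the oldest id it has seen, as the collect method above does.

```ruby
# Fake endpoint: returns up to `count` tweets whose id is <= max_id
# (or the newest tweets when max_id is nil). Tweets are newest first.
def fetch_page(timeline, max_id, count)
  timeline.select { |tweet| max_id.nil? || tweet['id'] <= max_id }.first(count)
end

# Walk backwards through the timeline one page at a time.
def collect_all(timeline, count)
  max_id = nil
  loop do
    tweets = fetch_page(timeline, max_id, count)
    break if tweets.empty?
    yield tweets
    # max_id is inclusive, so step just below the oldest tweet seen.
    max_id = tweets[-1]['id'] - 1
  end
end

timeline = (1..10).to_a.reverse.map { |id| { 'id' => id } }
pages = []
collect_all(timeline, 4) { |tweets| pages << tweets.map { |t| t['id'] } }
puts pages.inspect
```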
          

Using MaxIdRequest

  • Timeline uses MaxIdRequest to get tweets from a user's timeline
  • Search uses MaxIdRequest to get tweets from the Twitter Search API

Timeline


class Timeline < MaxIdRequest

  def initialize(args)
    super args
    params[:count] = 200
    @count = 0
  end

  def request_name
    "Timeline"
  end

  def twitter_endpoint
    "/statuses/user_timeline"
  end

  def url
    'https://api.twitter.com/1.1/statuses/user_timeline.json'
  end

  def success(response)
    log.info("SUCCESS")
    tweets = JSON.parse(response.body)
    @count += tweets.size
    log.info("#{tweets.size} tweet(s) received.")
    log.info("#{@count} total tweet(s) received.")
    yield tweets
  end

  def init_condition
    @num_success = 0
  end

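  def condition
    # 16 requests x 200 tweets = 3,200 tweets, the documented cap on
    # how far back user_timeline can go
    @num_success < 16
  end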

  def update_condition(tweets)
    if tweets.size > 0
      @num_success += 1
    else
      @num_success = 16
    end
  end

end
          

Search


class Search < MaxIdRequest

  def initialize(args)
    super args
    params[:count] = 100
    @count = 0
  end

  def request_name
    "Search"
  end

  def twitter_endpoint
    "/search/tweets"
  end

  def url
    'https://api.twitter.com/1.1/search/tweets.json'
  end

  def success(response)
    log.info("SUCCESS")
    tweets = JSON.parse(response.body)['statuses']
    @count += tweets.size
    log.info("#{tweets.size} tweet(s) received.")
    log.info("#{@count} total tweet(s) received.")
    yield tweets
  end

  def init_condition
    @last_count = 1
  end

  def condition
    @last_count > 0
  end

  def update_condition(tweets)
    @last_count = tweets.size
  end

end
          

CursorRequest

  • CursorRequest is similar to MaxIdRequest
  • However, it does not need to define a contract for subclasses.
  • It can implement all of the required functionality directly

class CursorRequest < TwitterRequest

  def initialize(args)
    super args
    params[:cursor] = -1
  end

  def include_param?(param)
    if param == :cursor
      if params[param] == -1
        return false
      end
    end
    return true
  end

  def collect
    while params[:cursor] != 0
      response = make_request
      if response.code == 200
        success(response) do |tweets|
          yield tweets
          params[:cursor] = JSON.parse(response.body)['next_cursor']
        end
      else
        error(response)
      end
    end
  end

end
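
The cursor loop can be exercised the same way with canned responses. The fake_response helper, its page contents, and the cursor values are made up for illustration; the only part taken from the code above is the convention that a cursor of -1 means "first page" and a next_cursor of 0 means "no more pages".

```ruby
require 'json'

# Hypothetical canned responses, keyed by the cursor that requests them.
def fake_response(cursor)
  pages = {
    -1   => { 'ids' => [1, 2], 'next_cursor' => 1001 },
    1001 => { 'ids' => [3, 4], 'next_cursor' => 1002 },
    1002 => { 'ids' => [5],    'next_cursor' => 0 }
  }
  JSON.generate(pages.fetch(cursor))
end

# Keep requesting pages until the server reports a next_cursor of 0.
def collect_ids
  cursor = -1
  while cursor != 0
    body = JSON.parse(fake_response(cursor))
    yield body['ids']
    cursor = body['next_cursor']
  end
end

ids = []
collect_ids { |page| ids.concat(page) }
puts ids.inspect
```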
          

StreamingTwitterRequest (1)

  • StreamingTwitterRequest is different in that its collect method is designed to run forever
  • We use Typhoeus differently to do a streaming request
    • We create a request and then define a series of event handlers on the request.
    • These handlers get called when appropriate as data streams in.
  • The handlers are:
    • on_headers: The response has started to stream back to us; we can check the headers to make sure everything is okay
    • on_body: Some data has been received from the server; we need to process it
    • on_complete: The response has finished; this can happen if the server decides to terminate the connection

StreamingTwitterRequest (2)

  • The client can also request a shutdown by calling (strangely enough) the request_shutdown method on the request object.
  • We use this method to terminate the connection cleanly if the user decides to Ctrl-C the client.
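
One way a client might wire Ctrl-C to request_shutdown is with Signal.trap. This is only a sketch: FakeStream is a hypothetical stand-in for the streaming request, and the actual client code may handle the interrupt differently.

```ruby
# Hypothetical stand-in exposing the same shutdown flag as the request.
class FakeStream
  attr_reader :continue

  def initialize
    @continue = true
  end

  def request_shutdown
    @continue = false
  end
end

stream = FakeStream.new

# Ctrl-C sends SIGINT; the handler only flips the flag, and the streaming
# loop is expected to notice it the next time data arrives.
on_interrupt = proc { stream.request_shutdown }
Signal.trap("INT", &on_interrupt)
```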

StreamingTwitterRequest (3)

  • The tricky part of this class is implementing the on_body event handler
  • Each time it gets called, we have received some data from the server; that data may have one or more JSON objects hiding in it
    • The underlying networking layers buffer input from the server and then hand us data when the buffer gets full
    • There is no guarantee that the data from the buffer ends directly on a message boundary
  • So, when we receive data, we append it to the current buffer
    • and then we process the current buffer, looking for message boundaries
    • With Twitter's Streaming API, the message boundary is determined by \r\n (the HTTP line terminator)

StreamingTwitterRequest (4)


class StreamingTwitterRequest < TwitterRequest

  def initialize(args)
    super(args)
    @buffer   = ""
    @continue = true
  end

  def request_shutdown
    @continue = false
  end

  def process
    index = @buffer.index("\r\n")
    while index
      yield @buffer[0..index-1]
      @buffer = @buffer[index+2..@buffer.length]
      index = @buffer.index("\r\n")
    end
  end

  def collect
    @request = Typhoeus::Request.new(url, options)
    @request.on_headers do |response|
      if response.code != 200
        error(response)
      end
    end
    @request.on_body do |data|
      @buffer = @buffer + data
      process do |msg|
        begin
          yield JSON.parse(msg)
        rescue JSON::ParserError
          # ignore empty message
        end
      end
      Thread.current.exit unless @continue
    end
    @request.on_complete do |response|
      request_shutdown
    end
    @request.run
  end

end
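
The buffering logic is easier to follow in isolation. The sketch below feeds a hypothetical MessageBuffer class three chunks that deliberately split messages in awkward places; plain words stand in for JSON objects. It skips blank lines instead of relying on a JSON parse error, which is a design choice of the sketch, not of the class above.

```ruby
# Hypothetical helper: accumulates chunks and yields complete messages.
class MessageBuffer
  def initialize
    @buffer = String.new
  end

  def feed(data)
    @buffer << data
    # Peel off every complete message currently in the buffer.
    while (index = @buffer.index("\r\n"))
      message = @buffer[0...index]
      @buffer = @buffer[(index + 2)..-1]
      yield message unless message.empty?   # skip blank lines
    end
  end
end

messages = []
buffer   = MessageBuffer.new
# Chunk boundaries do not line up with message boundaries.
["alpha\r\nbe", "ta\r\n\r\ngam", "ma\r\n"].each do |chunk|
  buffer.feed(chunk) { |msg| messages << msg }
end
puts messages.inspect
```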
          

Examples

    Let's see the code in action

Homework 3 (1)

  • For Homework 3, you will add a new request to the framework
  • You will do this by
    • forking the repo
    • working to integrate a new request into the framework
    • testing your request by creating a command-line app that uses it
  • Once you are done, you will submit a pull request on my original repo
  • I will review your pull request and decide whether to integrate it into the official repository

Homework 3 (2)

Coming Up Next

An introduction to NoSQL!