Lecture 11 — Data Engineering

Twitter
Data Collection Framework

Lecture 11 — Data Engineering — Spring 2015

February 17, 2015

Time to Collect Some Data!

Last week, we reviewed the start of a framework for collecting Twitter data
This weekend, I fleshed out the framework, and it can now collect (lots of) data
- For example, while testing my framework's ability to work with the Search API, I collected 100,681 tweets that consume 364 MB of disk space.

Highlights of the Framework

Example implementations of five REST API endpoints and two Streaming API endpoints
Automatically handles Twitter Rate Limits
- If you call an endpoint too many times, it automatically sleeps and waits until you can call it again
Provides implementations of the techniques described in Working with Timelines and Working with Cursors

Understanding the Framework

In lecture 10, we reviewed the code that was needed to make a single GET request on a single REST API endpoint
- Property File with Consumer Key and Access Token
- URI Escape HTTP Request Parameters
- Use of SimpleOAuth to create an Authorization Header
- Use of Typhoeus to create a request and send it
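
As a reminder of the escaping step, here is a minimal sketch that uses only Ruby's standard library. The build_query method and the sample parameters are hypothetical and are not part of the framework; the OAuth header and the Typhoeus request are left out because they need the gems and real credentials.

```ruby
require 'erb'

# Hypothetical helper: percent-encode each value (a space becomes %20)
# and join the key=value pairs into a query string.
def build_query(params)
  params.map { |key, value| "#{key}=#{ERB::Util.url_encode(value.to_s)}" }.join('&')
end

# Hypothetical parameters for a Search API request.
puts build_query({ q: 'data engineering #bigdata', count: 100 })
# prints: q=data%20engineering%20%23bigdata&count=100
```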

Building on that Knowledge

Those elements are still present
- but have moved around a bit (in some cases)
- subclasses have the ability to override the default behaviors in various ways

High-Level View of Framework

Class Hierarchy

TwitterRequest

Standardized Constructor
- Expects an args hash with up to three entries
  - params: the parameters that go on the request
  - data: the data needed by the request
    - data MUST contain a :props key that points at the location of the properties file
  - log: an instance of a Ruby logger
    - If not supplied, a default one is used
Sets the contract required by all subclasses
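
A minimal sketch of a constructor that follows these conventions. SketchRequest is a hypothetical stand-in, not the framework's TwitterRequest, and raising ArgumentError when :props is missing is an assumption about how the "MUST" could be enforced.

```ruby
require 'logger'

# Hypothetical stand-in for the base class; only the constructor
# conventions described above are modeled.
class SketchRequest
  attr_reader :params, :data, :log

  def initialize(args)
    @params = args[:params] || {}
    @data   = args[:data]   || {}
    # Assumption: fail early when the mandatory :props entry is missing.
    unless @data.key?(:props)
      raise ArgumentError, 'data must contain a :props entry'
    end
    # Fall back to a default logger when the caller does not supply one.
    @log = args[:log] || Logger.new($stdout)
  end
end
```

A caller would then write something like SketchRequest.new({ params: { q: 'ruby' }, data: { props: 'oauth.properties' } }), where the file name is only a placeholder.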

Contract?

How are contracts enforced?
In a statically-typed language like Java, we would use interfaces

In a dynamically-typed language like Ruby, we improvise!


def url
  raise NotImplementedError, "No URL Specified for the Request"
end

If a subclass fails to implement one of these methods: BOOM!
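
A small sketch of the BOOM, using hypothetical class names (BaseRequest, GoodRequest, ForgetfulRequest) rather than the framework's own. One detail worth knowing: NotImplementedError descends from ScriptError, not StandardError, so a bare rescue will not swallow it.

```ruby
# Hypothetical base class that declares one contract method.
class BaseRequest
  def url
    raise NotImplementedError, 'No URL Specified for the Request'
  end

  # Shared behavior that relies on the contract method.
  def describe
    "GET #{url}"
  end
end

# Honors the contract.
class GoodRequest < BaseRequest
  def url
    'https://api.twitter.com/1.1/search/tweets.json'
  end
end

# Forgets to implement url.
class ForgetfulRequest < BaseRequest
end

puts GoodRequest.new.describe

begin
  ForgetfulRequest.new.describe
rescue NotImplementedError => e
  puts "BOOM: #{e.message}"
end
```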

The Contract (1)

A TwitterRequest has a public collect method that yields collected data back to its caller
Subclasses must provide implementations of:
- url: the URL of the Twitter endpoint they access
- request_name: the name used in the log for this request
- twitter_endpoint: the endpoint used to look up rate limits for this request
- success: a method that handles the response.body of a successful request

The Contract (2)

Subclasses may provide implementations of:
- error: a method that handles failed requests; the default prints out information about the failure and then raises a runtime exception
- authorization, options, make_request, and collect can also be overridden but they must preserve the semantics of the default operations
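
To make the division of labor concrete, here is a sketch of how a default collect could tie the contract together. It is an assumption about the shape of the base class, not the framework's actual code: TemplateRequest, EchoRequest, and FakeResponse are hypothetical, and make_request returns a canned response instead of calling Twitter.

```ruby
# Hypothetical response object with the two fields used below.
FakeResponse = Struct.new(:code, :body)

class TemplateRequest
  def initialize(response)
    @response = response
  end

  # Public entry point: yields whatever success hands back.
  def collect
    response = make_request
    if response.code == 200
      success(response) { |data| yield data }
    else
      error(response)
    end
  end

  # Canned response; a real request would go over the network.
  def make_request
    @response
  end

  # Required by the contract.
  def success(_response)
    raise NotImplementedError, 'No success handler specified'
  end

  # Optional override; the default raises a RuntimeError.
  def error(response)
    raise "Request failed with code #{response.code}: #{response.body}"
  end
end

class EchoRequest < TemplateRequest
  def success(response)
    yield response.body.split(',')
  end
end

EchoRequest.new(FakeResponse.new(200, 'a,b')).collect { |items| p items }
```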

Helpers

Params and Props
Logging
Rates

Params and Props

These helpers are largely unchanged from last week's example
Only two new features in the Params helper
- The ability to control if a parameter is included in a request
- The ability to display the parameters that are being sent with a request


def include_param?(param)
  return true
end

def display_params
  result = []
  escaped_params.keys.each do |key|
    result << "#{key}=#{escaped_params[key]}" if include_param?(key)
  end
  result.join("&")
end

Logging (1)

The logging helper consists of a custom class definition and a method to create a default logger


require 'logger'

module TwitterLog

  class CustomFormatter < Logger::Formatter
    def call(severity, time, progname, msg)
      "#{time}: #{progname}: #{msg2str(msg)}\n"
    end
  end

  def default_logger
    logger = Logger.new($stdout)
    logger.progname = request_name
    logger.formatter = CustomFormatter.new
    logger
  end

  private :default_logger

end

Logging (2)

The logger is created in TwitterRequest's constructor


@log = args[:log] || default_logger

The logger is then accessed via the log attribute


log.info("REQUESTING: #{request.base_url}?#{display_params}")
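
To see what a formatted line looks like, this sketch points a logger at a StringIO instead of $stdout. The log_line helper and the 'Search' name are only for illustration.

```ruby
require 'logger'
require 'stringio'

# Same formatter as on the slide: timestamp, program name, message.
class CustomFormatter < Logger::Formatter
  def call(severity, time, progname, msg)
    "#{time}: #{progname}: #{msg2str(msg)}\n"
  end
end

# Hypothetical helper: log one message and return the text that was written.
def log_line(name, message)
  out = StringIO.new
  logger = Logger.new(out)
  logger.progname = name
  logger.formatter = CustomFormatter.new
  logger.info(message)
  out.string
end

puts log_line('Search', 'SUCCESS')
```

The printed line starts with the current timestamp and ends with ": Search: SUCCESS".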

Rates (1)

The Rates helper allows the framework to automatically keep track of rate limits for a given Twitter endpoint
If you run out of invocations, it will block your next call and automatically sleep until the current Twitter window is over

The default make_request ensures that rates are checked on each request


def make_request
  check_rates
  request = Typhoeus::Request.new(url, options)
  log.info("REQUESTING: #{request.base_url}?#{display_params}")
  response = request.run
  @rate_count -= 1
  response
end

Rates (2)

The Rates helper invokes a Twitter endpoint to get the application's current set of rate limits
These rates are stored in a class variable so that they are shared across all TwitterRequest instances created by an application

These global rates are only refreshed when needed


def check_rates
  refresh_rates if @@rates.size == 0
  refresh_rates if Time.now > twitter_window
  return if @rate_count > 0
  delta = twitter_window - Time.now
  log.info "Sleeping for #{delta} seconds"
  sleep delta
  refresh_rates
end
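
The sleep-until-the-window-ends idea can be tried without touching Twitter. The sketch below is a simplified, hypothetical model (RateWindowSketch): it tracks one endpoint, resets its own counter instead of asking the rate_limit_status endpoint, and takes the clock and the sleep call as parameters so it can run instantly.

```ruby
# Hypothetical model of one rate-limit window for one endpoint.
class RateWindowSketch
  attr_reader :remaining, :slept

  def initialize(limit:, window_seconds:, clock: -> { Time.now }, sleeper: ->(s) { sleep s })
    @limit = limit
    @window_seconds = window_seconds
    @clock = clock
    @sleeper = sleeper
    @slept = 0
    refresh
  end

  # Call before every request, in the spirit of check_rates.
  def check
    refresh if @clock.call > @window_end
    return if @remaining > 0
    delta = @window_end - @clock.call
    @sleeper.call(delta)   # block until the current window is over
    @slept += delta
    refresh
  end

  # Call after every request.
  def record_call
    @remaining -= 1
  end

  private

  # Assumption: a fresh window restores the full limit.
  def refresh
    @remaining = @limit
    @window_end = @clock.call + @window_seconds
  end
end
```

With a limit of 2 calls per 900-second window, the third check in a row is the one that sleeps; passing a fake clock and a sleeper that advances it lets you verify this without waiting.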

Types of Requests

The framework features three request subclasses
- MaxIdRequest
- CursorRequest
- StreamingTwitterRequest
One request, RateLimits, is also a direct subclass of TwitterRequest, but it is not a new type of request; it just customizes TwitterRequest to hit the rate_limit_status endpoint.

MaxIdRequest (1)

A subclass for endpoints that need to traverse timelines with the max_id parameter
It defines a new contract consisting of:
- init_condition: set up a variable to track our progress traversing the timeline
- condition: check the variable to see if we should continue traversing the timeline
- update_condition: update the variable after a request has been made and progress on the timeline has occurred

MaxIdRequest (2)

This subclass then overrides collect to enforce the new contract:


def collect
  init_condition
  while condition
    response = make_request
    if response.code == 200
      success(response) do |tweets|
        yield tweets
        params[:max_id] = (tweets[-1]['id'] - 1) if tweets.size > 0
        update_condition(tweets)
      end
    else
      error(response)
    end
  end
end
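
The max_id arithmetic is easy to check against a fake timeline. Everything in this sketch is hypothetical (fake_timeline, fetch_page, collect_pages); it only imitates an endpoint that returns tweets newest first with ids less than or equal to max_id.

```ruby
# Fake timeline: tweet ids 10 down to 1, newest first.
def fake_timeline
  (1..10).to_a.reverse.map { |id| { 'id' => id } }
end

# Stand-in for one request: up to `count` tweets with id <= max_id,
# or the newest tweets when max_id is nil.
def fetch_page(max_id, count)
  eligible = max_id ? fake_timeline.select { |t| t['id'] <= max_id } : fake_timeline
  eligible.first(count)
end

# Walk backwards through the timeline one page at a time.
def collect_pages(count)
  max_id = nil
  pages = []
  loop do
    tweets = fetch_page(max_id, count)
    break if tweets.empty?
    pages << tweets.map { |t| t['id'] }
    # Subtract 1 so the oldest tweet on this page is not returned again.
    max_id = tweets[-1]['id'] - 1
  end
  pages
end

p collect_pages(4)
# prints: [[10, 9, 8, 7], [6, 5, 4, 3], [2, 1]]
```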

MaxIdRequest (3)

MaxIdRequest takes advantage of the include_param? contract
- It only includes max_id in the request if it has a non-zero value


def include_param?(param)
  if param == :max_id
    if params[param] == 0
      return false
    end
  end
  return true
end
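
Here is how that override plays with display_params, as a self-contained sketch. ParamsSketch and MaxIdSketch are hypothetical, and this escaped_params is an assumption about what the Params helper provides.

```ruby
require 'erb'

# Hypothetical stand-in for the Params helper.
class ParamsSketch
  attr_reader :params

  def initialize(params)
    @params = params
  end

  # Assumption: returns the parameters with percent-encoded values.
  def escaped_params
    params.each_with_object({}) do |(key, value), result|
      result[key] = ERB::Util.url_encode(value.to_s)
    end
  end

  # Default: every parameter is sent.
  def include_param?(_param)
    true
  end

  def display_params
    escaped = escaped_params
    escaped.keys
           .select { |key| include_param?(key) }
           .map { |key| "#{key}=#{escaped[key]}" }
           .join('&')
  end
end

# Leaves out max_id until it has a non-zero value.
class MaxIdSketch < ParamsSketch
  def include_param?(param)
    !(param == :max_id && params[param] == 0)
  end
end

puts MaxIdSketch.new({ q: 'big data', max_id: 0 }).display_params
# prints: q=big%20data
puts MaxIdSketch.new({ q: 'big data', max_id: 41 }).display_params
# prints: q=big%20data&max_id=41
```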

Using MaxIdRequest

Timeline uses MaxIdRequest to get tweets from a user's timeline
Search uses MaxIdRequest to get tweets from the Twitter Search API

Timeline


class Timeline < MaxIdRequest

  def initialize(args)
    super args
    params[:count] = 200
    @count = 0
  end

  def request_name
    "Timeline"
  end

  def twitter_endpoint
    "/statuses/user_timeline"
  end

  def url
    'https://api.twitter.com/1.1/statuses/user_timeline.json'
  end

  def success(response)
    log.info("SUCCESS")
    tweets = JSON.parse(response.body)
    @count += tweets.size
    log.info("#{tweets.size} tweet(s) received.")
    log.info("#{@count} total tweet(s) received.")
    yield tweets
  end

  def init_condition
    @num_success = 0
  end

  def condition
    @num_success < 16
  end

  def update_condition(tweets)
    if tweets.size > 0
      @num_success += 1
    else
      @num_success = 16
    end
  end

end

Search


class Search < MaxIdRequest

  def initialize(args)
    super args
    params[:count] = 100
    @count = 0
  end

  def request_name
    "Search"
  end

  def twitter_endpoint
    "/search/tweets"
  end

  def url
    'https://api.twitter.com/1.1/search/tweets.json'
  end

  def success(response)
    log.info("SUCCESS")
    tweets = JSON.parse(response.body)['statuses']
    @count += tweets.size
    log.info("#{tweets.size} tweet(s) received.")
    log.info("#{@count} total tweet(s) received.")
    yield tweets
  end

  def init_condition
    @last_count = 1
  end

  def condition
    @last_count > 0
  end

  def update_condition(tweets)
    @last_count = tweets.size
  end

end

CursorRequest

CursorRequest is similar to MaxIdRequest
However, it does not need to define a contract for subclasses.
It can implement all of the required functionality directly


class CursorRequest < TwitterRequest

  def initialize(args)
    super args
    params[:cursor] = -1
  end

  def include_param?(param)
    if param == :cursor
      if params[param] == -1
        return false
      end
    end
    return true
  end

  def collect
    while params[:cursor] != 0
      response = make_request
      if response.code == 200
        success(response) do |tweets|
          yield tweets
          params[:cursor] = JSON.parse(response.body)['next_cursor']
        end
      else
        error(response)
      end
    end
  end

end
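
The cursor loop can be exercised the same way with canned pages. The page contents and cursor values below are made up; only the convention (start at -1, stop when next_cursor is 0) comes from the code above.

```ruby
# Hypothetical pages keyed by the cursor that retrieves them.
def fake_pages
  {
    -1  => { 'ids' => [1, 2], 'next_cursor' => 100 },
    100 => { 'ids' => [3, 4], 'next_cursor' => 200 },
    200 => { 'ids' => [5],    'next_cursor' => 0 }
  }
end

# Follow next_cursor until the server signals the last page with 0.
def collect_ids
  cursor = -1
  ids = []
  while cursor != 0
    page = fake_pages.fetch(cursor)
    ids.concat(page['ids'])
    cursor = page['next_cursor']
  end
  ids
end

p collect_ids
# prints: [1, 2, 3, 4, 5]
```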

StreamingTwitterRequest (1)

StreamingTwitterRequest is different in that its collect method is designed to run forever
We use Typhoeus differently to do a streaming request
- We create a request and then define a series of event handlers on the request.
- These handlers get called when appropriate as data streams in.
The handlers are:
- on_headers: The response has started to stream back to us; we can check the headers to make sure everything is okay
- on_body: Some data has been received from the server; we need to process it
- on_complete: The response has finished; this can happen if the server decides to terminate the connection

StreamingTwitterRequest (2)

The client can also request a shutdown by calling (strangely enough) the request_shutdown method on the request object.
We use this method to terminate the connection cleanly if the user decides to Ctrl-C the client.
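
One way a command-line client could wire this up is with Signal.trap. This is a sketch under assumptions: FakeStream only imitates the request_shutdown flag, and install_shutdown_handler is a hypothetical helper, not part of the framework.

```ruby
# Hypothetical stand-in for a streaming request.
class FakeStream
  attr_reader :continue

  def initialize
    @continue = true
  end

  def request_shutdown
    @continue = false
  end
end

# Ask the stream to stop when the user presses Ctrl-C (SIGINT).
# The handler only flips a flag; the stream notices it the next
# time it processes data.
def install_shutdown_handler(stream)
  Signal.trap('INT') { stream.request_shutdown }
end
```

Keeping the handler this small is deliberate: code that runs inside a signal handler is restricted, so setting a flag and letting the main loop react is the safer pattern.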

StreamingTwitterRequest (3)

The tricky part of this class is implementing the on_body event handler
Each time it gets called, we have received some data from the server; that data may have one or more JSON objects hiding in it
- The underlying networking layers buffer input from the server and then hand us data when the buffer gets full
- There is no guarantee that the data from the buffer ends directly on a message boundary
So, when we receive data, we append it to the current buffer
- and then we process the current buffer, looking for message boundaries
- With Twitter's Streaming API, the message boundary is determined by \r\n (the HTTP line terminator)

StreamingTwitterRequest (4)


class StreamingTwitterRequest < TwitterRequest

  def initialize(args)
    super(args)
    @buffer   = ""
    @continue = true
  end

  def request_shutdown
    @continue = false
  end

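  def process
    index = @buffer.index("\r\n")
    while index
      # Exclusive range: when the buffer starts with a blank line
      # (index == 0), 0..index-1 would be 0..-1 and yield the whole buffer.
      yield @buffer[0...index]
      @buffer = @buffer[(index + 2)..-1]
      index = @buffer.index("\r\n")
    end
  end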

  def collect
    @request = Typhoeus::Request.new(url, options)
    @request.on_headers do |response|
      if response.code != 200
        error(response)
      end
    end
    @request.on_body do |data|
      @buffer = @buffer + data
      process do |msg|
        begin
          yield JSON.parse(msg)
        rescue JSON::ParserError
          # ignore empty message
        end
      end
      Thread.current.exit unless @continue
    end
    @request.on_complete do |response|
      request_shutdown
    end
    @request.run
  end

end
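
The buffering logic is the part most worth testing in isolation. This sketch pulls it into a hypothetical MessageSplitter class that returns complete messages instead of yielding them, and feeds it chunks that end in the middle of a message.

```ruby
# Hypothetical, standalone version of the buffer handling.
class MessageSplitter
  def initialize
    @buffer = +''
  end

  # Append a chunk and return every complete message found so far.
  def feed(chunk)
    @buffer << chunk
    messages = []
    while (index = @buffer.index("\r\n"))
      message = @buffer[0...index]
      @buffer = @buffer[(index + 2)..-1]
      # Blank lines carry no JSON, so skip them.
      messages << message unless message.empty?
    end
    messages
  end
end

splitter = MessageSplitter.new
p splitter.feed("{\"id\":1}\r\n{\"id\"")
# prints: ["{\"id\":1}"]
p splitter.feed(":2}\r\n\r\n")
# prints: ["{\"id\":2}"]
```

The second call shows both cases at once: the leftover half-message from the first chunk is completed, and the trailing blank line is dropped.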

Examples

Let's see the code in action

Homework 3 (1)

For Homework 3, you will add a new request to the framework
You will do this by
- forking the repo
- working to integrate a new request into the framework
- testing your request by creating a command-line app that uses it
- Once you are done, you will submit a pull request on my original repo
- I will review your pull request and determine if I will integrate it into the official repository

Homework 3 (2)

What new requests?
Any request that is currently unimplemented from the list that is shown on the left side of this page:
- https://dev.twitter.com/rest/public

Coming Up Next

An introduction to NoSQL!
