Gowalla Engineering

Scraping Wikipedia

Written by Brad Fults

Here at Gowalla we use Campfire extensively for internal communication. We’ve always had some sort of chat robot hanging out in our rooms with us, providing simple abilities, but recently GitHub released Hubot and we moved our bot functionality over to Hubot scripts.

Can We Query Wikipedia?

Just the other day, our own Andy Ellwood asked if we could have a script that would query Wikipedia and return article summaries. Already looking for an opportunity to become more comfortable with CoffeeScript and Hubot, I said “Sure.”

So, I asked myself, does Wikipedia have an API? Well, it turns out that they do, but it’s not terribly helpful. For the task at hand—pulling a summary from any given article—the best their API can do is return the entire article as a huge chunk of malformed HTML. Not exactly ideal, but this is what we have tools for.

Scraping Article Pages

So I looked through GitHub’s repository of existing Hubot scripts from the community for ideas on how to proceed. I found web.coffee, which uses HTMLParser and Soupselect to parse arbitrary HTML into a workable DOM. This makes dealing with Wikipedia’s malformed documents somewhat easier.

Since we’re going to parse all of the HTML in an article anyway, we may as well load the full article page instead of dealing with the complicated API:

makeArticleURL = (title) ->
  "http://en.wikipedia.org/wiki/#{encodeURIComponent(title)}"

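As a quick check of what this produces, here is a minimal sketch that calls the function with two arbitrary example titles. encodeURIComponent percent-encodes spaces and other reserved characters, so the title is safe to use as a path segment:

```coffeescript
makeArticleURL = (title) ->
  "http://en.wikipedia.org/wiki/#{encodeURIComponent(title)}"

# Spaces become %20 and & becomes %26; letters and digits pass through.
console.log makeArticleURL "Albert Einstein"
# http://en.wikipedia.org/wiki/Albert%20Einstein
console.log makeArticleURL "AT&T"
# http://en.wikipedia.org/wiki/AT%26T
```
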
To get a summary for a given article, I noticed that each article page has a series of <p> elements, the first of which is usually the summary, so long as it contains enough text. First I made a utility function that parses the HTML with HTMLParser and returns the DOM nodes matching a selector:

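# HTMLParser builds the DOM; Soupselect queries it with CSS-style selectors
HTMLParser = require "htmlparser"
Select     = require("soupselect").select

parseHTML = (html, selector) ->
  # the callback is a no-op because we read handler.dom after parsing
  handler = new HTMLParser.DefaultHandler((() ->),
    ignoreWhitespace: true
  )
  parser  = new HTMLParser.Parser handler
  parser.parseComplete html

  Select handler.dom, selector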

Then, pulling all paragraphs on the page was simple:

paragraphs = parseHTML body, "p"

Finding a Summary

In order to find the right paragraph—a summary—I tried playing with data from many articles and came up with a simple heuristic that does quite well:

findBestParagraph = (paragraphs) ->
  return null if paragraphs.length is 0

  childs = _.flatten childrenOfType(paragraphs[0], 'text')
  text = (textNode.data for textNode in childs).join ''

  # remove parentheticals (the second pass catches one level of nesting)
  text = text.replace(/\s*\([^()]*?\)/g, '').replace(/\s*\([^()]*?\)/g, '')
  text = text.replace /\s{2,}/g, ' '                # squash whitespace
  text = text.replace /\[[\d\s]+\]/g, ''            # remove citations
  text = _s.unescapeHTML text                       # decode HTML entities

  # if fewer than 35 letters remain, skip to the next paragraph
  if text.replace(/[^a-zA-Z]/g, '').length < 35
    findBestParagraph paragraphs.slice(1)
  else
    text

Essentially, I collect all of the text under the element and strip out some undesirable bits, then look to see if the remainder has at least 35 letters in it. I can’t give a compelling explanation of why this works well, other than that I tried several methods and limits before landing on this one. Sometimes there are deceptively simple heuristics that can accomplish tasks that may seem difficult at first. It can’t hurt to just dive in with some real data and see what happens.
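To make the cleanup steps concrete, here is a minimal sketch that runs the same regular expressions on two invented strings. The helper names cleanText and letterCount are hypothetical and not part of the original script, and the _s.unescapeHTML step is left out so the sketch has no dependencies:

```coffeescript
# Same regex steps as findBestParagraph, minus _s.unescapeHTML.
cleanText = (text) ->
  # two passes: the first strips innermost parentheticals, the second their parents
  text = text.replace(/\s*\([^()]*?\)/g, '').replace(/\s*\([^()]*?\)/g, '')
  text = text.replace /\s{2,}/g, ' '      # squash whitespace
  text.replace /\[[\d\s]+\]/g, ''         # remove citations

# Count only ASCII letters, as the 35-letter check does.
letterCount = (text) ->
  text.replace(/[^a-zA-Z]/g, '').length

# Both strings are invented for illustration.
sample = "The widget (also called a gadget (or gizmo))  is a small mechanical device used in examples.[1][23]"
coords = "Coordinates: 30.27 N 97.74 W"

console.log cleanText sample
# The widget is a small mechanical device used in examples.
console.log letterCount(cleanText(sample)) >= 35   # true, so this paragraph is kept
console.log letterCount(cleanText(coords)) >= 35   # false, so this one is skipped
```

The second string shows the kind of paragraph the letter count filters out: short, mostly digits and punctuation, and not a summary.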

Find All Text Nodes in a DOM Tree

The only other bit of difficulty came when trying to extract the text from the DOM tree generated by HTMLParser. For that, I wrote a recursive method:

childrenOfType = (root, nodeType) ->
  return [root] if root?.type is nodeType

  if root?.children?.length > 0
    return (childrenOfType(child, nodeType) for child in root.children)

  []

When dealing with trees, recursion often works quite well. In this case, I pick out the text nodes from the DOM subtree rooted at root and return an empty array as a placeholder for everything else. The result is a nested array, which _.flatten later collapses into a flat list, discarding the empty placeholders.
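Here is a small sketch of that recursion on a hand-built tree. The node shape (type, name, data, children) is assumed to resemble what HTMLParser produces, and the flatten helper is a hypothetical stand-in for _.flatten so the example runs without Underscore:

```coffeescript
childrenOfType = (root, nodeType) ->
  return [root] if root?.type is nodeType

  if root?.children?.length > 0
    return (childrenOfType(child, nodeType) for child in root.children)

  []

# Stand-in for _.flatten: recursively collapse nested arrays into one list.
flatten = (list) ->
  result = []
  for item in list
    if Array.isArray item
      result = result.concat flatten(item)
    else
      result.push item
  result

# Hand-built tree for <p>Hello <b>big <i>wide</i></b><br> world</p>
paragraph = {
  type: 'tag', name: 'p', children: [
    { type: 'text', data: 'Hello ' },
    { type: 'tag', name: 'b', children: [
      { type: 'text', data: 'big ' },
      { type: 'tag', name: 'i', children: [{ type: 'text', data: 'wide' }] }
    ] },
    { type: 'tag', name: 'br' },
    { type: 'text', data: ' world' }
  ]
}

nested = childrenOfType paragraph, 'text'   # nested arrays, with [] for <br>
nodes  = flatten nested                     # four text nodes in document order
text   = (node.data for node in nodes).join ''

console.log text
# Hello big wide world
```

The <br> element has no children and is not a text node, so it contributes an empty array that disappears when the result is flattened.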

The Final Product

[Image: The script in action]

The final script can be found in GitHub’s official hubot-scripts repo. Hopefully the simple tactics in this script will help you put together your own interesting Hubot script, or just help improve your CoffeeScript skills. Please also let us know if you find ways to make the script better.

About Brad Fults

Platform Developer. Background in JavaScript and front-end web development. Prefers simplicity, directness, red wine, fast cars and is healthily obsessed with quality.
