@jico
Created August 2, 2012 18:38
# http://pad.squareup.com/457865
# Jico Baligod
# curl "http://pad.squareup.com/ep/pad/export/457865/latest?format=txt" > crawler.rb; ruby crawler.rb
# <a href='http://www.rubyinside.com/feed/'>RSS feed</a>
require 'net/http'
require 'uri'
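class Crawler
  MAX_URLS = 500 # stop crawling once at least this many URLs are collected

  attr_reader :url_list

  def initialize
    @url_list = []
  end

  # Breadth-first crawl starting at +url+, recording each discovered link once.
  def crawl(url)
    queue = extract_urls(url)
    until queue.empty?
      # Record everything queued so far, including the page about to be fetched.
      queue.each { |u| @url_list << u unless @url_list.include?(u) }
      break if @url_list.count >= MAX_URLS

      children = extract_urls(queue.shift)
      # Skip links already recorded or pending so no page is fetched twice.
      children.each { |u| queue.push(u) unless @url_list.include?(u) || queue.include?(u) }
    end
  end

  # Fetch +url+ and return the absolute http:// links found in its anchor tags.
  def extract_urls(url)
    response = Net::HTTP.get_response(URI.parse(url))
    # The lazy [^>]*? keeps each match inside a single <a> tag.
    # Only http:// links are matched; https:// links are skipped.
    response.body.to_s.scan(/<a\s[^>]*?href=["'](http:[^"']*)["']/i).flatten
  rescue StandardError
    [] # treat unreachable or malformed URLs as pages with no links
  end
end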
crawler = Crawler.new
crawler.crawl('http://www.yahoo.com')
puts crawler.url_list
# a = []
# puts 'hello' if a