
Web Crawling

I've been on several projects that involved web crawling and data processing. After rebuilding crawlers from scratch a few times, here's my current solution. Cloud computing turns out to be an extremely effective way to crawl and scrape data at scale.

YAPC::Asia 2008 » Gungho & Cloud Computing, a Scalable Crawling & Processing Framework

I presented a talk at YAPC::Asia 2008 in Tokyo, Japan about my most recent crawling and data processing project. Using CPAN modules, Amazon's EC2 + S3, and MySQL, I crawled millions of pages in a short time frame. For the details, check out the code below:

The input is a large list of starter URLs, which the provider module feeds to the crawler (see the sketch after the script).

#!/usr/bin/perl

use strict;
use warnings;
use GunghoX::FollowLinks;
use Sample::Crawler;

crawl();

sub crawl {
    Sample::Crawler->run({
        # provider: where requests come from (the starter URL list)
        provider => {
            module => '+Sample::Crawler::Provider',
            config => { started => 1 },
        },
        # handler: what to do with each fetched response
        handler => { module => '+Sample::Crawler::Handler' },
        # engine: POE-based asynchronous fetching
        engine => {
            module => 'POE',
            config => { follow_redirects => 2 },
        },
        user_agent => 'SampleBot/1.0 foo@contactme.com',
        components => [
            '+GunghoX::FollowLinks',    # extract and follow links from responses
            'RobotRules',               # respect robots.txt
        ],
        follow_links => {
            parsers => [
                {
                    module => 'HTML',   # parse HTML responses for links
                    config => {
                        merge_rule => 'ALL',    # how the rules below are combined
                        rules      => [
                            # only extract links from these tags
                            {
                                module => 'HTML::SelectedTags',
                                config => { tags => [qw(a link area)] },
                            },
                            # custom rule: follow text/html, allow unknown types
                            {
                                module => '+Sample::FollowLinks::Rule::MIME',
                                config => {
                                    types   => [qw(text/html)],
                                    action  => 'FOLLOW_ALLOW',
                                    unknown => 'FOLLOW_ALLOW',
                                },
                            },
                            # custom rule: cap the number of requests per host
                            {
                                module => '+Sample::FollowLinks::Rule::SameHost',
                                config => { max_reqs_per_host => 20 },
                            },
                        ],
                    },
                },
            ],
        },
    });
}
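
The two Sample:: modules referenced above are the project-specific pieces: the provider hands the starter URLs to the engine, and the handler decides what to do with each fetched page. Here's a rough, stripped-down sketch of what they might look like, assuming Gungho's usual provider/handler conventions (a dispatch() method that calls $c->send_request() on the provider side, and a handle_response() method plus the follow_links() call added by GunghoX::FollowLinks on the handler side). The real modules also wrote results out to S3 and MySQL; that part is omitted, and the URL source (STDIN) is just a placeholder.

package Sample::Crawler::Provider;
use strict;
use warnings;
use base qw(Gungho::Provider);
use Gungho::Request;

# Sketch: read the starter URL list (one URL per line on STDIN) and hand
# each one to the engine as a GET request.
sub dispatch {
    my ($self, $c) = @_;
    while (my $url = <STDIN>) {
        chomp $url;
        next unless length $url;
        $c->send_request( Gungho::Request->new( GET => $url ) );
    }
}

package Sample::Crawler::Handler;
use strict;
use warnings;
use base qw(Gungho::Handler);

# Sketch: log each successful fetch, then let GunghoX::FollowLinks queue
# whatever links pass the rules configured in the script above.
sub handle_response {
    my ($self, $c, $request, $response) = @_;
    return unless $response->is_success;
    printf "fetched %s (%d bytes)\n", $request->uri, length($response->content);
    $c->follow_links($response);    # added by the GunghoX::FollowLinks component
}

1;

In a setup like the EC2 one described above, the URL source would presumably be shared storage (MySQL or S3) rather than STDIN, so the same script can run on many instances at once.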