I've worked on several projects that involved web crawling and data processing. After rebuilding crawlers from scratch a few times, here's my current solution: run the crawl in the cloud. Cloud computing has been an extremely effective way for me to crawl and scrape data.
I presented a talk at YAPC::Asia 2008 in Tokyo, Japan about my last crawling and data processing project. Using CPAN modules, Amazon's EC2 + S3, and MySQL, I crawled millions of pages over a short time frame. For the details, check out the docs below:
The input is a large list of starter URLs.
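To make the input concrete, here is a minimal sketch of loading a starter list. It assumes one URL per line in a plain text file. The `load_seed_urls` helper is hypothetical and is not part of Gungho or of the `Sample::Crawler::Provider` module used below.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): reads one URL per line,
# skips blank lines, comments, and non-HTTP entries, and drops
# duplicates while keeping the original order.
sub load_seed_urls {
    my ($fh) = @_;
    my (%seen, @urls);
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;                  # trim whitespace and newline
        next if $line eq '' || $line =~ /^#/;     # blank line or comment
        next unless $line =~ m{^https?://}i;      # keep http/https only
        push @urls, $line unless $seen{$line}++;  # first occurrence wins
    }
    return @urls;
}

# Usage with an in-memory file handle standing in for a real seed file.
my $list = join "\n",
    'http://example.com/',
    '',
    '# a comment',
    'http://example.org/a',
    'http://example.com/',
    'ftp://example.net/file',
    '';
open my $fh, '<', \$list or die "Cannot open in-memory list: $!";
my @seeds = load_seed_urls($fh);
close $fh;
print "$_\n" for @seeds;
```

Deduplicating up front keeps the crawler from spending its first requests on repeated starter URLs.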
#!/usr/bin/perl
use strict;
use warnings;
use GunghoX::FollowLinks;
use Sample::Crawler;

crawl();

sub crawl {
    Sample::Crawler->run({
        provider => {
            module => '+Sample::Crawler::Provider',
            config => { started => 1 }
        },
        handler => { module => '+Sample::Crawler::Handler' },
        engine => {
            module => 'POE',
            config => { follow_redirects => 2 }
        },
        user_agent => 'SampleBot/1.0 foo@contactme.com',
        components => [
            '+GunghoX::FollowLinks',
            'RobotRules',
        ],
        follow_links => {
            parsers => [
                { module => 'HTML',
                  config => {
                      merge_rule => 'ALL',
                      rules => [
                          { module => 'HTML::SelectedTags',
                            config => { tags => [ qw(a link area) ] }
                          },
                          { module => '+Sample::FollowLinks::Rule::MIME',
                            config => {
                                types   => [qw(text/html)],
                                action  => 'FOLLOW_ALLOW',
                                unknown => 'FOLLOW_ALLOW',
                            }
                          },
                          { module => '+Sample::FollowLinks::Rule::SameHost',
                            config => { max_reqs_per_host => 20, }
                          },
                      ]
                  }
                }
            ]
        }
    });
}
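The `Sample::FollowLinks::Rule::SameHost` rule is not shown here, so the following is only a sketch of what a per-host cap like `max_reqs_per_host => 20` might do. It assumes the setting limits how many requests are issued to any one host. The `host_of` and `allow_request` helpers are hypothetical and do not use the Gungho rule API.

```perl
use strict;
use warnings;

# Hypothetical per-host request counter (not the Gungho rule API).
my %requests_per_host;

# Simplified host extraction for http/https URLs. A production
# crawler would more likely use the URI module from CPAN.
sub host_of {
    my ($url) = @_;
    if ($url =~ m{^https?://(?:[^/@]*@)?([^/:?#]+)}i) {
        return lc $1;    # host names are case-insensitive
    }
    return;
}

# Returns 1 and counts the request if the host is still under the
# cap; returns 0 if the cap is reached or the URL has no usable host.
sub allow_request {
    my ($url, $max) = @_;
    my $host = host_of($url);
    return 0 unless defined $host;
    return 0 if ($requests_per_host{$host} // 0) >= $max;
    $requests_per_host{$host}++;
    return 1;
}

# Usage: with a cap of 2, the third request to the same host is refused.
my @decisions = map { allow_request($_, 2) } (
    'http://example.com/a',
    'http://EXAMPLE.com:8080/b',
    'http://example.com/c',
    'http://example.org/',
);
print join(',', @decisions), "\n";
```

Keying the counter on the lowercased host means `example.com` and `EXAMPLE.com:8080` share one budget; whether ports should count separately is a design choice this sketch leaves open.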