Returning a list of anonymous proxies
Tags: Perl
Back in October of 2007, I started writing a little tool to build MFA 2.0 sites on the fly.
This tool (in Perl, of course) lets me create a new WordPress blog targeted at a very specific niche, populate the WordPress database with hundreds or thousands of articles that target that niche, and do some other fancy things with lots of trickery under the hood. My Diabetes Information and Acne Skin Treatment websites are two examples of sites I created in about 30 minutes with this tool back in October.
The article sites that I point to for content are attempting to drive traffic to their own pages, and they implement all sorts of server-side tricks to try to thwart spiders and bots. They want “real humans” to read their content.
So I came up with the idea of using a random proxy server for each request. It slows down how quickly I can spider articles, but it also keeps me off the automatic block/ban lists.
The problem with public proxy lists is that they become stale very quickly, so I needed a way to make sure every proxy I use is alive, valid and accepting connections to the remote site I’m querying for article content.
Enter my return_proxies() function in Perl, which does just this:
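    use LWP::UserAgent;
    use HTML::TreeBuilder;

    sub return_proxies {
        my $link = 'http://proxy-site/list.txt';

        # Fetch the proxy list, posing as a random browser.
        # random_browser() is another helper in the tool (not shown here).
        my $ua = LWP::UserAgent->new;
        $ua->agent(random_browser());
        my $res = $ua->get($link);

        # Return an empty list if the proxy list could not be fetched.
        unless ($res->is_success) {
            warn "Could not fetch $link: ", $res->status_line, "\n";
            return;
        }

        # Collect the table cells that hold the proxy entries.
        my $t = HTML::TreeBuilder->new_from_content($res->content);
        my @output = map { $_->as_HTML } $t->look_down(_tag => 'td', class => qr/dt-tb?/);
        $t->delete;    # free the parse tree

        # Keep only the cells that contain an ip:port pair.
        my @proxies;
        foreach my $ip (@output) {
            my ($address) = $ip =~ /((?:\d+\.){3}\d+:\d+)/;
            push @proxies, $address if $address;
        }

        # print Dumper(@proxies);
        return @proxies;
    }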
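To see what the regular expression in that loop does in isolation, here is a small sketch that runs the same pattern over a few made-up table cells. The sample strings and the helper name extract_proxies() are assumptions for illustration only; they are not part of the tool, and the real proxy list may be shaped differently.

```perl
use strict;
use warnings;

# Hypothetical sample cells, roughly what as_HTML might hand back.
my @cells = (
    '<td class="dt-tb">192.0.2.10:8080</td>',
    '<td class="dt-tb">not a proxy</td>',
    '<td class="dt-tb2">198.51.100.7:3128</td>',
);

# Pull the first ip:port pair out of each cell; skip cells without one.
sub extract_proxies {
    my @found;
    for my $cell (@_) {
        my ($address) = $cell =~ /((?:\d+\.){3}\d+:\d+)/;
        push @found, $address if defined $address;
    }
    return @found;
}

print "$_\n" for extract_proxies(@cells);
# prints:
# 192.0.2.10:8080
# 198.51.100.7:3128
```

The list-context match returns an empty list when a cell has no ip:port pair, so $address stays undefined and the cell is silently dropped.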
I call return_proxies() in my fetch_page() function like this:
    my @proxies = return_proxies();
    my $rand_proxy = "http://$proxies[rand @proxies]";
    $ua->proxy(['http', 'ftp'], $rand_proxy);
So far it works very well, no issues at all that I’ve seen.
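The snippets above pick a proxy at random but do not show the "is it alive?" check itself. Here is a minimal sketch of one way that could be done, assuming a cheap HEAD request through the proxy is a good enough test. The names proxy_is_alive() and pick_live_proxy(), the 10-second default timeout, and the idea of passing the check in as a code reference are all assumptions made for this example, not the tool's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a HEAD request for $test_url succeeds
# when routed through $proxy (an "ip:port" string).
sub proxy_is_alive {
    my ($proxy, $test_url, $timeout) = @_;
    require LWP::UserAgent;    # loaded lazily; only needed for real checks
    my $ua = LWP::UserAgent->new(timeout => $timeout // 10);
    $ua->proxy(['http'], "http://$proxy");
    return $ua->head($test_url)->is_success;
}

# Hypothetical helper: try the candidates in random order and return the
# first one that passes $check (a code ref); returns nothing if none do.
sub pick_live_proxy {
    my ($check, @candidates) = @_;
    while (@candidates) {
        my $i = int rand @candidates;
        my ($proxy) = splice @candidates, $i, 1;    # remove so it is not retried
        return $proxy if $check->($proxy);
    }
    return;
}
```

It could be wired up with something like pick_live_proxy(sub { proxy_is_alive($_[0], $url) }, return_proxies()), where $url is the article page about to be fetched. Checking each proxy costs an extra request, so it trades even more speed for fewer failed fetches.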
Obviously there’s a lot more to it than just this… but I can’t give away all of the secrets to my code, can I?