HOWTO: Configure Tor + SASL + irc to connect to Freenode

I fought this problem on the train into the city today, because my MiFi’s hostname was not correctly reversing to its given IP (verified by dig) and Freenode was denying the connection; it looked like this:

Mar 22 06:51:41 *       Looking up irc.freenode.net
Mar 22 06:51:41 *       Connecting to chat.freenode.net (86.65.39.15) port 6667...
Mar 22 06:51:42 *       Connected. Now logging in...
Mar 22 06:51:42 *       *** Looking up your hostname...
Mar 22 06:51:42 *       *** Checking Ident
Mar 22 06:51:42 *       *** Your forward and reverse DNS do not match, ignoring hostname
Mar 22 06:51:55 *       *** No Ident response
Mar 22 06:51:55 *       *** Notice -- You need to identify via SASL to use this server
Mar 22 06:51:55 *       Closing Link: 166.199.4.113 (SASL access only)
Mar 22 06:51:55 *       Disconnected (Remote host closed socket).
Mar 22 06:52:05 Cycling to next server in Freenode...
Mar 22 06:52:05 *       Disconnected ().

I wanted to connect so I could talk to the folks in #linux and ask them another question I had (see the newer blog post about a fullscreen VMware session for that). This was yet another example of the kind of Yak Shaving I deal with on a daily basis.

At first, I tried installing a few identd daemons, then some of the spoofing identd daemons, then purged them all and decided to try identifying using SASL like it suggested.

I did a few seconds of Google’ing and found a helpful website with a SASL plugin in C. I compiled that, installed it into /usr/lib/xchat/plugins, restarted XChat, and attempted to authenticate and identify using this plugin and the instructions.

If the site goes down, I have local copies of the files you need, just email me.

You’ll need to create a file called cap_sasl.conf in ~/.xchat2/, containing a line with the following syntax:

/sasl [nickname] [password] FreeNode

So if your nickname (username on Freenode) was ‘foobar’ and your password was “MyS3cretPas5word”, you’d put the following in that file:

/sasl foobar MyS3cretPas5word FreeNode

If you compiled this correctly and put it in the right place, you can also just issue a simple /help sasl command to get the syntax:

Usage: SASL <login> <password> <network>, enable SASL authentication for given network

When you load up XChat, you should see something like this in the main window (if the plugin works):

 Python interface loaded
 Display amarok loaded, type "/disrok help" for a command list
 Perl interface loaded
 Tcl plugin for XChat - Version 1.63 
 Copyright 2002-2005 Daniel P. Stasinski
 http://www.scriptkitties.com/tclplugin/
 Tcl interface loaded
 Loading cap_sasl.conf
 Enabled SASL authentication for FreeNode
 cap_sasl plugin 0.0.4 loaded

The last two lines are what you’re looking for. Now typing “/sasl” will show you the following:

 foobar:MyS3cretPas5word at FreeNode

This, too, failed to authenticate me or get me past my (incorrect) reverse DNS problem. What I saw was this:

Mar 22 20:24:02 *       Looking up irc.freenode.net
Mar 22 20:24:05 *       Connecting to chat.freenode.net (140.211.167.98) port 6667...
Mar 22 20:24:05 *       Connected. Now logging in...
Mar 22 20:24:05 *       *** Looking up your hostname...
Mar 22 20:24:05 *       *** Checking Ident
Mar 22 20:24:06 *       *** Couldn't look up your hostname
Mar 22 20:24:19 *       *** No Ident response
Mar 22 20:24:52 *       Closing Link: 32.138.186.102 (Connection timed out)
Mar 22 20:24:52 *       Disconnected (Remote host closed socket).
Mar 22 20:25:02 Cycling to next server in Freenode...

I decided to investigate a different solution: Tor!

Quick-n-Dirty Math from the Shell

I find myself needing to whip out a calculator a dozen times a day, but my calculator is never at my fingertips when I need it and the “Calc” application on my BlackBerry is too clunky to use with the trackball. I needed a faster way to do some quick math when I’m around the laptops I use every day.

Did I mention I’m horrible at math? Yes, it’s true. I’ve been working with computers for the last 2 decades, and I still struggle with some intermediate and complex math concepts. But that’s why we have computers and calculators, right?

I’m used to using bc(1) for all of those quick and dirty needs, but it requires that I load it up in an interactive fashion. It looks like this:

bc in Interactive Mode

$ bc
bc 1.06.94
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
scale=3
19931424 / 1024
19464.281
quit

I have to type each line one at a time. It’s effective, but clunky because it is manual and interactive. So I started looking around for other ways to do this, that weren’t so complex and could be done non-interactively.
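As a rough illustration of the non-interactive idea, here is a minimal sketch in Perl. It is my own stand-in, not necessarily the approach the full post settles on. The `calc` helper and its three-decimal default are hypothetical names and choices. Note that sprintf rounds the last digit, where bc with a scale truncates it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# calc() is a hypothetical helper: evaluate a simple arithmetic expression
# and format the result to $scale decimal places (3 by default).
sub calc {
    my ($expr, $scale) = @_;
    $scale = 3 unless defined $scale;

    # Only digits, whitespace, parentheses, '.', and + - * / % are allowed,
    # so nothing but arithmetic ever reaches eval.
    die "unsupported characters in expression: $expr\n"
        unless $expr =~ m{\A[\d\s.+\-*/%()]+\z};

    my $result = eval $expr;
    die "could not evaluate '$expr': $@" if $@;
    die "could not evaluate '$expr'\n" unless defined $result;

    return sprintf("%.${scale}f", $result);
}

# Treat everything on the command line as one expression.
print calc("@ARGV"), "\n" if @ARGV;
```

Saved as calc.pl (a made-up filename), running `perl calc.pl '19931424 / 1024'` prints 19464.281, the same answer as the bc session above, without the interactive prompt.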

Snapshot backups of EVERYTHING using rsync (including Windows!)

Let me just start by saying that I have a lot of data. In multiple places. Some on laptops, some on servers, some on removable drives and mirrored hard disks sitting in a bank vault (yes, really). Lots of data on lots of systems in different states and locations: client data, personal data, work data, community data and lots more.

Over the years, I’ve tried my best to unify where that data is sourced from, for example by relocating the standard “My Documents” location on all of my Windows machines (physical and virtual), to point to a Samba share that is served up by a GELI-encrypted volume on my FreeBSD or Linux servers. That part works well, so far, but that’s only a small piece of the larger puzzle.

Over the last decade, the amount of data I’m holding and responsible for managing has grown significantly, and I needed a better way to manage it all.

There are plenty of backup solutions for Linux, including the popular Amanda and Bacula, but I needed something more portable, leaner and much more efficient. That quest led me to Unison, mostly due to its cross-platform support, but it was still a bit more complicated than I needed.

So I kept looking and eventually found rsnapshot, a Perl-based tool wrapped around the standard rsync utility written by Andrew Tridgell.

Since I’d already been using rsync quite a bit over the last 10 years or so to copy data around as I needed it and to perform nightly full backups of my remote servers, I decided to look into using rsync to manage a new backup solution based around incremental backups as well as full backups.

I’m already using rsync to pull a couple of terabytes of mirrored data to my servers on a nightly basis. I’m mirroring CPAN, FreeBSD, Project Gutenberg, Cygwin, Wikipedia and several other key projects, so this was a natural graft onto my existing environment.

Converting FLOSS Manuals to Plucker format

I stumbled across a site called “FLOSS Manuals” recently, and thought that it would be a great place to create some new Plucker documents for our users, and distribute them. I’ve created hundreds of other Plucker documents for users in years past, so this was a natural progression of that.

You can (and should!) read all about their mission and see what they’re up to. Credit goes to Adam Hyde (the founder), Lotte Meijer (for the design), Aleksandar Erkalovic (developer) and “Mr Snow” for keeping the servers running, among many other contributors.

The FLOSS Manuals Foundation (Stichting FLOSS Manuals) creates free, open source documentation for free, open source software. FLOSS Manuals is a community of free documentation writers that publish free manuals about free software across multiple languages.

Free software can be freely run, studied, redistributed and improved without the restrictive and often expensive licensing systems attached to commercial proprietary software. Developers can adapt free software to their own needs, and can contribute to its ongoing communal development. FLOSS Manuals specifically document software that is free in this development sense and also in price. Free software projects are developed using established methodologies and tools, and sites like Savannah and Sourceforge support established social production models for developing free software. FLOSS Manuals provides the methodologies, tools, and social production models for developing documentation of free software.

By supporting quality, user-friendly documentation of Free, Libre, Open Source Software, FLOSS Manuals aims to encourage the use of this software, to support the technical and social revolution it enables.

If you want to support their cause (and I strongly recommend you do), you can visit their bookstore directly. (Note: I get absolutely nothing for hosting a link to their bookstore here; no affiliate links or commissions whatsoever).

When I quickly Googled around, I found someone was already doing exactly that, albeit in a one-off shell script.

I decided to take his work and build upon it, making it self-healing, and created what I call the “Plucker FLOSS Manuals Autobuilder v1.0” :)

This is all in Perl, clean, and is self-healing. When FLOSS Manuals updates their site with more content, this script will continue to be able to download, convert and build that new content for you… no twiddling necessary.

The code is well-commented, and should be clear and concise enough for you to be able to use it straight away.

Have fun!

#!/usr/bin/perl -w

use strict;
use warnings;
use diagnostics;
use LWP::UserAgent;
use HTML::SimpleLinkExtor;

my $flossurl    = 'http://en.flossmanuals.net';
my $ua          = 'Plucker FLOSS Manuals Autobuilder v1.0 [desrod@gnu-designs.com]';
my $top_extor   = HTML::SimpleLinkExtor->new();

# fetch the top-level page and extract the child pages
$top_extor->parse_url($flossurl, $ua);
my @links       = grep(m:^/:, $top_extor->a);
pop @links;     # get rid of '/news' item from @links; fragile

# Get the print-only page of each child page
get_printpages($flossurl . $_) for @links;

############################################################################# 
#
# Get the pages themselves, and return their content to the caller
#
#############################################################################
sub get_content {
        my $url = shift;

        my $ua          = 'Mozilla/5.0 (en-US; rv:1.4b) Gecko/20030514';
        my $browser     = LWP::UserAgent->new();
        $browser->agent($ua);
        my $response    = $browser->get($url);
        my $decoded     = $response->decoded_content;
 
        # This was necessary, because of a bug in ::SimpleLinkExtor,
        # otherwise this code would be 10 lines shorter. Sigh.
        if ($response->is_success) {
                return $decoded;
        }
}


############################################################################# 
#
# Fetch the print links from the child pages snarfed from the top-level page
#
#############################################################################
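sub get_printpages {
        my $page = shift;

        # Skip this page if the fetch failed; get_content() only returns
        # the page body on a successful response
        my $content     = get_content($page);
        return unless $content;

        my $sub_extor   = HTML::SimpleLinkExtor->new();
        $sub_extor->parse($content);

        # Single out only the /print links on each sub-page
        my @printlinks  = grep(m:^/.*/print$:, $sub_extor->a);
        return unless @printlinks;      # no print-only version of this page

        my $url         = $flossurl . $printlinks[0];
        (my $title      = $printlinks[0]) =~ s,\/(\w+)\/print,$1,;

        # Build it with Plucker
        print "Building $title from $url\n";
        plucker_build($url, $title);
}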


############################################################################# 
#
# Build the content with Plucker, using a "safe" system() call in list-mode
#
#############################################################################
sub plucker_build {
        my ($url, $title) = @_;

        my $workpath            = "/tmp/";
        my $pl_url              = $url;
        my $pl_bpp              = "8";   
        my $pl_compression      = "zlib";
        my $pl_title            = $title;
        my $pl_copyprevention   = "0";
        my $pl_no_url_info      = "0";
        my $pdb                 = $title;

        my $systemcmd   = "/usr/bin/plucker-build";

        my @systemargs  = (
                        '-p', $workpath, 
                        '-P', $workpath,
                        '-H', $pl_url,
                        $pl_bpp ? "--bpp=$pl_bpp" : (),
                        $pl_compression ? "--${pl_compression}-compression" : (),
                        '-N', $pl_title,
                        $pl_copyprevention ? $pl_copyprevention : (),
                        $pl_no_url_info ? $pl_no_url_info : (),
                        '-V1',
                        "--staybelow=$flossurl/floss/pub/$title/",
                        '--stayonhost',
                           '-f', "$pdb");

        system($systemcmd, @systemargs);
}
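
The `pop @links` line in the script is marked as fragile, since it only works while '/news' happens to be the last link on the page. A slightly sturdier approach, sketched below, is to drop unwanted links by name and to treat a missing print link as "nothing to build". Both helper names and the sample links are hypothetical, and the title pattern is anchored more strictly than the substitution in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep site-relative links, dropping any that are
# named in @$skip, wherever they appear in the list.
sub manual_links {
    my ($links, $skip) = @_;
    my %skip = map { $_ => 1 } @$skip;
    return grep { m{^/} && !$skip{$_} } @$links;
}

# Hypothetical helper: pull the manual name out of a "/name/print" link,
# returning undef when the link doesn't have that shape.
sub title_from_printlink {
    my ($link) = @_;
    return undef unless defined $link;
    return $link =~ m{^/(\w+)/print$} ? $1 : undef;
}

# Made-up sample data, for illustration only
my @sample  = ('/bash', 'http://example.org/', '/news', '/inkscape');
my @manuals = manual_links(\@sample, ['/news']);
print "@manuals\n";                                     # /bash /inkscape
print title_from_printlink('/inkscape/print'), "\n";    # inkscape
```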

Yak-shaving with my Music and Media collection

This particular bit of yak-shaving all started because one of the Amtrak LSA staff asked me if I could write a tool to print out his MP3 collection by Artist, Album and/or Year. This LSA (Lead Service Attendant; they manage the café car) works as a DJ in his off-hours, doing various gigs for weddings and other parties.

So I took 15 minutes while traveling to the office to whip up something in Perl that did just that, and dumped it to a plain text file which I could then reformat in OpenOffice.org and then export as a pretty PDF he could print and hole-punch into his DJ binder. Problem solved, and he was impressed that it only took 15 minutes to cook that up.

And that’s when it started. The yak-shaving.

“yak shaving is what you are doing when you’re doing some stupid, fiddly little task that bears no obvious relationship to what you’re supposed to be working on, but yet a chain of twelve causal relations links what you’re doing to the original meta-task.”

Here’s how it began:

While building that list of Artist/Album/Title/Year, I realized that some of my mp3 files were missing some pieces of information. Some had the years missing, some had the genre mixed up, some were missing the data altogether.

So I went in and started fixing that.

Then I realized that the album art I was storing as “folder.jpg” was missing in some directories, and each time I rebuild my music library via amaroK or iTunes or anything else, I have to go re-fetch those missing album covers from Amazon or other online places.

So I went in and started fixing that too.

To do that, I had to use a Windows tool called TagTuner. I’m not a Windows person by any means, but there really is nothing as slick as TagTuner in Linux (yet?). There is kid3, but it lacks some pretty powerful features (but adds its own, like the ability to remove headers from the mp3 files).

I started adding in all of the missing cover art, storing the album art as an actual image file within the APIC field of the ID3v2 MP3 header. Some of the album art required that I scan in the actual covers from the CDs I have that aren’t available anymore, or aren’t online. Some of it was Google’d up, and others were found in other places on the ‘net.

It was (and still is) an enormous task to make sure every piece of the mp3 metadata is correct, album art is intact (including bootlegs, bonus albums, NFR [not for resale] albums and others).

Then I decided to try to “enhance” the Perl script I wrote, by slapping a web front-end on it, so I could sort by Artist, Album, Year, Genre and so on, and export that to a nicely-formatted PDF file for “Shaggy” (the Amtrak LSA/DJ) or myself.

I started down the path of looking into the Apache::MP3 Perl module on CPAN, which looked promising. When I Google’d up some example code, I saw a reference in an obscure Ubuntu forum post that mentioned using an Apache2 module called mod_musicindex, which supersedes Apache::MP3.

I installed and configured that, and found some discrepancies in the configuration: the default stanzas shown in several web references on setting up mod_musicindex were all incorrect. Here’s what was suggested:

Alias /music/ "/Media/Music/mp3/"
<Directory "/Media/Music/mp3/">
    AuthType Basic
    AuthName "music"
    Require group music

    Options Indexes MultiViews FollowSymlinks
    AllowOverride Indexes
    MusicIndex On +Stream +Download +Search -Rss -Tarball
    MusicFields title artist length bitrate
    MusicPageTitle Media Library
    MusicDefaultCss musicindex.css
    MusicIndexCache file://tmp/music
    MusicDirPerLine 4
    MusicIceServer [ice.gnu-designs.com]:8000
    MusicCookieLife 300
</Directory>

The problem was that restarting Apache resulted in errors with some of those options. I found a small clue buried in the README for musicindex:

“The MusicIndex Option replaces altogether MusicLister, MusicAllowDownload, MusicAllowStream, MusicAllowSearch, and MusicAllowRss.”

Removing those options and replacing them with their new equivalents solved that problem.

Alias /music/ "/Media/Music/mp3/"
<Directory "/Media/Music/mp3/">
    AuthType Basic
    AuthName "music"
    Require group music

    Options Indexes MultiViews FollowSymlinks
    AllowOverride Indexes
    MusicIndex On +Stream +Download +Search -Rss -Tarball
    MusicSortOrder album disc track artist title length bitrate freq filetype filename uri
    MusicFields title artist album length bitrate
    MusicPageTitle Media Library
    MusicDefaultCss musicindex.css
    MusicIndexCache file://tmp/music
    MusicDirPerLine 4
    MusicIceServer [ice.gnu-designs.com]:8000
    MusicCookieLife 300
</Directory>

And that worked. But it was deathly slow to render a single directory of only a handful of music files. I tried to eke out more performance, but it was just too slow to be useful.

Then I found a reference in another forum thread to a replacement for mod_musicindex called “edna”, so I decided to download that and give it a try.

edna is a standalone Python script which listens on a port and can present your music collection in a very similar way to mod_musicindex, but it is VERY fast, and has quite a few additional features that mod_musicindex does not provide.

But… it’s Python, and I have a genetic distaste for anything written in that language. I played with it for quite a while and walked around my music collection with it. One of the limitations of edna that I found (besides being written in Python) was that it required that album art be in a single, separate file stored in the same directory as the mp3 files. Since I painstakingly took the time to store each and every album cover in the mp3 files themselves, this was a no-go for me.

So I went back to mod_musicindex while I kept looking for alternatives. One of the quirks with mod_musicindex that I found, was its rendering of proper unicode characters. I jumped into the #apache IRC channel on Freenode to ask for some guidance with respect to “tricking” the right charset to be used (for example, Björk was showing up as B?ork) and one of the lurkers in #apache asked if I’d ever heard of “Ampache” before. I hadn’t, so I trundled over and installed a copy.

The installation was really clunky and challenging, and I had to go into the code at one point and gut out a check which was throwing an error, because it made assumptions about my Apache setup that were just not valid.

I installed that, configured it, added a “catalog” (what ampache calls a collection of your music) to begin navigating through the interface.

In doing so, I realized that there were still quite a few mp3 files with the wrong ID3v2 metadata or missing/incorrect album covers.

So that’s where I am now. I’m using a combination of Ampache + TagTuner to go through my entire MP3 collection and “normalize” all of the data in each file. It’s long, drawn-out work, but ultimately beneficial, since I only have to do it once.

And when I get back on the train to NY again and “Shaggy” is working, I can show him this system and see if it would be useful for his own DJ rig or parties.

THAT is yak shaving in the true sense and spirit of the term.

Submitting Tasks to Tracks via Email

I’ve recently installed Tracks on my server to help me manage my myriad of tasks, projects and timelines utilizing the GTD methodology created by David Allen.

After installing it on my server I almost immediately noticed that Ruby and Rails were going to have a problem coexisting with my very tweaked Apache instances (running roughly 300 separate websites).

I found a project called Passenger, and immediately got that working with Tracks running on top of it. No fuss, no muss.

But one problem with Tracks that I’ve found (well, one of several, none of them showstoppers) is that you can’t interact with it in any way other than through the web interface. The interface is nice and clean and slick, but I don’t always have access to the web where I am, and that limits the functionality of Tracks for me. I do, however, have access to some sort of email account everywhere I go.

So I whipped up a quick little script to allow me to send email to Tracks and have it post Tasks in the right contexts and on the right due dates for me.

Here’s the code for that:

#!/usr/bin/perl
#        _._   
#       /_ _`.      (c) 2008, David A. Desrosiers
#       (.(.)|      setuid at gmail.com
#       |\_/'|   
#       )____`\     Email to Tracks interface
#      //_V _\ \ 
#     ((  |  `(_)   If you find this useful, please drop me 
#    / \> '   / \   an email, or send me bug reports if you find
#    \  \.__./  /   problems with it. I accept PayPal too! =-)
#     `-'    `-' 
#
##############################################################################
# 
# License
#
##############################################################################
#
# This script is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 2 of the License, or (at your option)
# any later version.
# 
# This script is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
# 
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc., 59
# Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
##############################################################################
# 
# Email format can be any of the following:
#
#       Subject: The main description of the Task
#
#       @Context
#       01/02/2003
#       This is my note text
#
# or:
# 
#       Subject: Default Task description
#
#       c: @context
#       d: 2008/12/31
#       n: The note text to insert
#
# Most valid date formats are accepted, and this will do its best to
# "correct" and normalize them. You can also prefix your lines with the
# modifiers above, or read the regexes below for more. For example: 
# c:, context:, cxt:, con:, ct: are all valid prefixes for "context"
# n: and note: are all valid prefixes for "notes", and so on.
#
##############################################################################
use strict;
use XML::Simple;
use LWP::UserAgent;
use URI::Escape;
use Email::Abstract;
use Date::Manip;
use Date::Parse;
use Data::Dumper; 

my $url                 = "http://your.tracks.site/todos.xml";
my $contextUrl          = "http://your.tracks.site/contexts.xml";

# The default contextid where you want the todo added
# SELECT id,name FROM contexts;
my $contextid           = "8";

my $user                = "yourname";
my $password            = "yourpass";

# Leave these tokens alone.  They are valid as of Tracks 1.5 RESTful API.
my %todo = map { +($_ => "todo[$_]") } qw(notes context_id description due);

# Get the context legend in order to match by name
my $ua                  = LWP::UserAgent->new;
my $req                 = HTTP::Request->new('GET', $contextUrl);
$req->authorization_basic($user,$password);
my $res                 = $ua->request($req);
my $contexts            = XMLin($res->content);

# Split apart the email into Subject and Body
my $message             = do { local $/; <STDIN> };
my $email               = Email::Abstract->new($message);
my $subject             = $email->get_header("Subject");
my $body                = $email->get_body;

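# These can probably be cleaned up a bit.
# NOTE: $context_line is captured but not yet matched against $contexts;
# every task is posted with the default $contextid set above.
my ($context_line)      = $body =~ /^(?:c:|ct:|cxt:|con:|context:|@)\s*(.+)$/mi;
my ($date_line)         = $body =~ /^(?:d:|date:)\s*(\d.*)$/mi;
# No (or empty) date line: fall back to today's date
$date_line              = UnixDate(ParseDate("today"), "%g") unless (defined $date_line && length $date_line);
my $time                = str2time($date_line);
# localtime, not gmtime: str2time() parsed the date in the local timezone
my $due_date            = UnixDate(scalar localtime($time), "%m/%d/%Y");
my ($note_line)         = $body =~ /^(?:n:|note:)\s*(.*?)$/mi;
$note_line              = '' unless defined $note_line;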

# Concatenate the data here before we send POST to the Tracks server
my $post_data = 
        $todo{'context_id'} . "=" . $contextid . "&" . 
        $todo{'description'}. "=" . uri_escape($subject) . "&" . 
        $todo{'notes'} . "=" . uri_escape($note_line) . "&" . 
        $todo{'due'} . "=" . uri_escape($due_date);

# Use LWP to do the posting ($ua was created earlier)
$req = HTTP::Request->new('POST', $url);
$req->content_type('application/x-www-form-urlencoded');
$req->content($post_data);
$req->authorization_basic($user,$password);
$res = $ua->request($req);

You can send emails in the format in the comments above to your Tracks install and it will create Tasks for you. I could refactor those regexes a bit, and that’s a task on my list, but so far, this works…

Emails look like this:

Subject: Update regexes in Tracks perl email interface
From: "David A. Desrosiers" <desrod at gnu-designs.com>
To: 2e59561d76 at gnu-designs.com
Content-Type: text/plain
Message-Id: <1228182961.14783.123.camel at gnu-designs.com>
Mime-Version: 1.0
X-Mailer: Evolution 2.22.3.1 
Content-Transfer-Encoding: 7bit
Date: Mon, 01 Dec 2008 20:56:02 -0500
X-Evolution-Format: text/plain

c: @Internet
d: 12/15/2008
n: Clean up the regexes that processes email-based Tracks tasks
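
To make the prefix handling concrete, here is a small line-by-line sketch of the same idea, run against the body of the sample email above. It is a simplified stand-in, not the script’s own regexes: `parse_task_body` is a hypothetical helper, it strips the leading @ from the context name, and it only accepts dates made of digits and separators.

```perl
use strict;
use warnings;

# Hypothetical helper: pull context, due date and note out of a message
# body, one line at a time. The first match for each field wins.
sub parse_task_body {
    my ($body) = @_;
    my %task;

    for my $line (split /\r?\n/, $body) {
        if (   $line =~ /^(?:c|ct|cxt|con|context):\s*@?(.+)$/i
            || $line =~ /^@(.+)$/) {
            $task{context} //= $1;      # "c: @Internet" or a bare "@Internet"
        }
        elsif ($line =~ m{^(?:(?:d|date):\s*)?(\d[\d/.-]*\d)\s*$}i) {
            $task{due} //= $1;          # "d: 12/15/2008" or a bare date
        }
        elsif ($line =~ /^(?:n|note):\s*(.*)$/i) {
            $task{note} //= $1;         # "n: ..." or "note: ..."
        }
    }
    return \%task;
}

# Body of the sample email (single-quoted heredoc, so @Internet is literal)
my $body = <<'END';
c: @Internet
d: 12/15/2008
n: Clean up the regexes that processes email-based Tracks tasks
END

my $task = parse_task_body($body);
print "$_: $task->{$_}\n" for qw(context due note);
```

Walking the body line by line keeps each field’s pattern short, at the cost of not handling notes that span more than one line.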

Set it up in your MTA as an alias, as follows:

2e59561d76:                         "|/path/to/tracks-mail.pl"

Make sure you re-run newaliases(1) after adding this entry:

$ sudo /usr/sbin/newaliases
/etc/mail/aliases: 248 aliases, longest 82 bytes, 15141 bytes total

I tried to make it smart enough to figure out most common (valid please) date formats and if you omit that, it just sets it to today’s date so you can change it later.
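
The script leans on Date::Manip and Date::Parse for this. For anyone who just wants to see the fallback-to-today behavior in isolation, here is a much narrower sketch using only core Perl. The `normalize_due_date` helper is hypothetical, it handles just two date shapes, and unlike the script it also falls back to today when the date is unrecognizable.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: normalize a due date to MM/DD/YYYY, falling back to
# today's date when the input is missing or not in a recognized shape.
# Only YYYY/MM/DD and MM/DD/YYYY are handled ('/' or '-' as separator).
sub normalize_due_date {
    my ($raw) = @_;
    my $today = strftime('%m/%d/%Y', localtime);

    return $today unless defined $raw && $raw =~ /\S/;

    my ($y, $m, $d);
    if ($raw =~ m{^\s*(\d{4})[/-](\d{1,2})[/-](\d{1,2})\s*$}) {
        ($y, $m, $d) = ($1, $2, $3);
    }
    elsif ($raw =~ m{^\s*(\d{1,2})[/-](\d{1,2})[/-](\d{4})\s*$}) {
        ($m, $d, $y) = ($1, $2, $3);
    }
    else {
        return $today;
    }

    # Reject impossible months and days rather than guessing
    return $today if $m < 1 || $m > 12 || $d < 1 || $d > 31;

    return sprintf('%02d/%02d/%04d', $m, $d, $y);
}

print normalize_due_date('2008/12/31'), "\n";   # 12/31/2008
print normalize_due_date('12/15/2008'), "\n";   # 12/15/2008
```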

The next step is to figure out how to make this script multi-user aware, without forcing users to expose their username or password in email directly. I have some ideas for that and will test that in v1.1 of this script and release it later on.

The Tracks Forums are pretty busy and lots of people are beating up the code, testing it in many different ways. I’m just trying to contribute back in whatever way I can, in the hopes that the project continues to thrive and grow.

If you have any suggestions or ideas, let me know!

Validating Blog Pingback Sites with Perl

Over the last few months I’ve been wondering why the response time has been so slow when I post new entries to my blog. Granted, the hits to my blog have more than tripled in the last 2-3 months, but my servers can handle that load. The problem was clearly elsewhere.

Some more digging and I realized that the list of 116 ping sites I have in my blog’s “Update Services” list contains quite a few pingback sites that are no longer valid.

For those new to this, a “Pingback” is a specifically-designed XML-RPC request sent from one website (A, usually a blog site) to another site (B) in response to a new entry being posted on the site.

In order for pings (not to be confused with ICMP pings) to work properly, it requires a physical link in the form of a URL to validate. When (B) receives the notification signal from (A), it automatically goes back to (A) checking for the existence of a live incoming link to (B). If that link exists, the Pingback is recorded successfully.
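
To give a feel for what that request looks like on the wire, here is a minimal sketch that builds the XML-RPC body for a Pingback. It assumes the standard `pingback.ping` method name with its two parameters (the linking page on A, then the linked page on B). The helper names and the example.com/example.org URLs are placeholders of mine, and actually sending the request is left out.

```perl
use strict;
use warnings;

# Escape the characters that are significant in XML text
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: build the body of a pingback.ping call. The first
# parameter is the page on site (A) that contains the link, the second is
# the page on site (B) that it links to.
sub pingback_payload {
    my ($source_uri, $target_uri) = @_;

    my $params = '';
    for my $uri ($source_uri, $target_uri) {
        $params .= '<param><value><string>'
                 . xml_escape($uri)
                 . '</string></value></param>';
    }

    return '<?xml version="1.0"?>'
         . '<methodCall><methodName>pingback.ping</methodName>'
         . "<params>$params</params></methodCall>";
}

# Placeholder URLs, for illustration only
print pingback_payload(
    'http://example.com/blog/new-entry',
    'http://example.org/article-being-linked',
), "\n";
```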

This “validation” process makes it much less prone to spam attacks than something like Trackbacks. If you’re interested in reading more about how spammers are using Pingbacks and Trackbacks to their advantage, I suggest reading Blog trackback Spam analysis on the “From Information to Intelligence” blog site.

But I needed a way to test all of those ping sites and exclude the ones that are dead, down or throwing invalid HTTP responses… so I turned to Perl, of course!

My list of ping sites is a sorted, uniq’d plain-text list that has one ping site per-line. The list looks something like this:

http://api.moreover.com/ping
http://bitacoras.net/ping
http://blog.goo.ne.jp/XMLRPC
http://blogoole.com/ping
http://blogsearch.google.com/ping/RPC2
http://godesigngroup.com
http://godesigngroup.com/blog/feed
http://imblogs.net/ping
http://ping.bitacoras.com
http://ping.bloggers.jp/rpc
http://ping.blo.gs
http://pinger.blogflux.com/rpc
http://pinger.onejavastreet.com/
http://ping.myblog.jp
http://pingoat.com
http://pingomatic.com
http://rcs.datashed.net/RPC2
http://rpc.blogbuzzmachine.com/RPC2
http://rpc.blogrolling.com/pinger
http://rpc.pingomatic.com
http://rpc.weblogs.com/RPC2
http://rpc.wpkeys.com
...

I pass that list into my perl script and using one of my favorite modules (File::Slurp), I read that file and process each line with the following script:

use strict;
use URI;
use File::Slurp;
use HTTP::Request;
use LWP::UserAgent;

# chomp => 1 strips the trailing newline from each line as it is read
my @ping_sites          = read_file("pings", chomp => 1);
my @valid_ping_sites    = ();

# One user agent is enough; reuse it for every request
my $ua                  = LWP::UserAgent->new;
$ua->agent('blog.gnu Ping Spider, v0.1 [rss]');
$ua->timeout(10);

for my $untested (@ping_sites) {
        next unless length $untested;   # skip blank lines

        my $url         = URI->new($untested);

        # A HEAD request returns the status without fetching the body
        my $req         = HTTP::Request->new(HEAD => $url);
        my $resp        = $ua->request($req);
        my $status      = $resp->code;

        if ($status == 200) {
                push @valid_ping_sites, "$url\n";
        } else {
                print "[$status] for $url..\n";
        }
}

write_file("pings.valid", @valid_ping_sites);

The output is written to a file called “pings.valid”, which contains all of the sites which return a valid 200 HTTP response. The remainder are sent to STDOUT, resulting in the following output:

$ perl ./pings.pl 
[403] for http://1470.net/api/ping..
[403] for http://api.feedster.com/ping..
[404] for http://bblog.com/ping.php..
[500] for http://blogbot.dk/io/xml-rpc.php..
[403] for http://blogmatcher.com/u.php..
[500] for http://blogsnow.com/ping..
[404] for http://fgiasson.com/pings/ping.php..
[404] for http://pingoat.com/goat/RPC2..
[500] for http://ping.syndic8.com/xmlrpc.php..
[500] for http://ping.weblogalot.com/rpc.php..
[403] for http://popdex.com/addsite.php..
[404] for http://www.blogdigger.com/RPC2..
[500] for http://www.blogsnow.com/ping..
[500] for http://www.blogstreet.com/xrbin/xmlrpc.cgi..
[404] for http://www.catapings.com/ping.php..
[500] for http://www.focuslook.com/ping.php..
[500] for http://www.holycowdude.com/rpc/ping..
[403] for http://www.popdex.com/addsite.php..
[500] for http://xmlrpc.blogg.de..
...

Those failed entries are then excluded from my list, which I import back into WordPress under “Settings → Writing → Update Services”.
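
The keep-or-drop decision itself is easy to separate from the network calls, which makes it simpler to test. Here is a small sketch of that split. `partition_by_status` is a hypothetical helper, and the status map is illustrative sample data (two of the codes match the run above, the 200 is assumed).

```perl
use strict;
use warnings;

# Hypothetical helper: split a { url => status } map into the URLs that
# returned 200 and the ones that didn't.
sub partition_by_status {
    my ($status_for) = @_;
    my (@valid, @failed);

    for my $url (sort keys %$status_for) {
        if ($status_for->{$url} == 200) {
            push @valid, $url;
        }
        else {
            push @failed, $url;
        }
    }
    return (\@valid, \@failed);
}

# Illustrative sample data only
my %seen = (
    'http://bblog.com/ping.php' => 404,
    'http://blogsnow.com/ping'  => 500,
    'http://rpc.pingomatic.com' => 200,
);

my ($valid, $failed) = partition_by_status(\%seen);
printf "%d valid, %d failed (%.0f%% failed)\n",
    scalar @$valid, scalar @$failed,
    100 * @$failed / (@$valid + @$failed);
```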

The complete, VALID list of ping sites as of the date of this blog posting is the following 49 sites (marking 58% of the list I started with as invalid):

http://1470.net/api/ping
http://api.feedster.com/ping
http://api.moreover.com/ping
http://bblog.com/ping.php
http://bitacoras.net/ping
http://blogbot.dk/io/xml-rpc.php
http://blog.goo.ne.jp/XMLRPC
http://blogmatcher.com/u.php
http://blogoole.com/ping
http://blogsearch.google.com/ping/RPC2
http://blogsnow.com/ping
http://fgiasson.com/pings/ping.php
http://godesigngroup.com
http://godesigngroup.com/blog/feed
http://imblogs.net/ping
http://ping.bitacoras.com
http://ping.bloggers.jp/rpc
http://ping.blo.gs
http://pinger.blogflux.com/rpc
http://pinger.onejavastreet.com/
http://ping.myblog.jp
http://pingoat.com
http://pingoat.com/goat/RPC2
http://pingomatic.com
http://ping.syndic8.com/xmlrpc.php
http://ping.weblogalot.com/rpc.php
http://popdex.com/addsite.php
http://rcs.datashed.net/RPC2
http://rpc.blogbuzzmachine.com/RPC2
http://rpc.blogrolling.com/pinger
http://rpc.pingomatic.com
http://rpc.weblogs.com/RPC2
http://rpc.wpkeys.com
http://www.a2b.cc/setloc/bp.a2b
http://www.blogdigger.com/RPC2
http://www.blogsdominicanos.com/ping/
http://www.blogsnow.com/ping
http://www.blogstreet.com/xrbin/xmlrpc.cgi
http://www.catapings.com/ping.php
http://www.feedsky.com/api/RPC2
http://www.focuslook.com/ping.php
http://www.godesigngroup.com
http://www.holycowdude.com/rpc/ping
http://www.imblogs.net/ping
http://www.pingmyblog.com/
http://www.popdex.com/addsite.php
http://www.wasalive.com/ping/
http://www.xianguo.com/xmlrpc/ping.php
http://xmlrpc.blogg.de

Feel free to use this list in your own blog or pingback list.

If you have sites that aren’t on this list, add them to the comments and I’ll keep this list updated with any new ones that arrive.

Putting an END to WordPress Trackback, Comment and Registration Spam

I run quite a few WordPress blog sites for myself (you’re reading it), my company and for users who wish to have their own blog on the web.

I keep all of these up-to-date with all of the latest versions of WordPress, the latest plugins and any security fixes or updates. Here are a few examples of blog sites I’ve created with WordPress, using some automated tools I’ve written (in Perl of course):

Diabetes Information Resources
Articles, news, reviews and information for the diabetic or caregivers

Acne Treatment Resources and Living With Acne
Acne treatment, support and skin research for teens and adults

Cancer Treatment Information and Resources
A place for cancer patients and caregivers to go for support

(the latter one needs a better theme, I’ll work on that later)

I have already implemented reCAPTCHA for WordPress, Akismet and Bad Behavior. All three of them work very well together without any issues that I’ve seen.

Akismet takes a collaborative approach to combating spam-like comments in your blog. Any comments which have a high likelihood of being spam are flagged by Akismet and set aside in the quarantine. You can then go back in there and approve or purge those comments as you see fit. According to this blog’s statistics, Akismet has protected my blog from 17,124 spam comments already.

reCAPTCHA helps prevent automated abuse of your site (such as comment spam or bogus registrations) by using a CAPTCHA to ensure that only humans perform certain actions.

reCAPTCHA is very interesting because it actually benefits the community as a whole. When you enter the words presented, you’re actually helping to digitize printed books: the words come from scanned pages that OCR software could not read reliably, and your answer turns them into accurate digital text.

OCR isn’t a perfect technology and sometimes it makes mistakes. A blurry ‘e’ might be mistaken for a lowercase ‘s’ for example. Human eyes can discern the difference, and this is what reCAPTCHA does. If you want to learn more, you can read more detail about reCAPTCHA on their website.

But this isn’t enough. Spammers are getting smarter and the volume of spammers is increasing at exponential rates.

The nature of Open Source actually hurts us here, because the same tools we use to prevent and block spam, can be downloaded by the spammers, analyzed and their scripts can be modified to circumvent any of the blocking we attempt. These spammers can download the source for Akismet or reCAPTCHA or WordPress and find holes in it to exploit. And that is exactly what they’re doing.

But that only stops people who are using comment forms and are trying to post comments. What about trackback and registration spam?

First, what are these? How are spammers using these to abuse your system or blog?

Trackback Spam (TrackBack plugins at WordPress)

Trackback spam is a technique where individuals or companies abuse the TrackBack feature of a blog to insert spam links on some blogs. Allowing trackbacks allows spammers to actually add content to your pages (in the form of comments).

If you allow trackbacks on your blog, these links will appear on your blog, and direct spiders and other traffic FROM your popular blog site TO their spam or phishing site. Trackbacks do have a positive use, so you can enable them… if you take precautions to protect them accordingly.

One way to do this with WordPress is to rename wp-trackback.php to something else that these spammers’ automated scripts will not be able to “guess”.

You’ll also have to change the reference to wp-trackback.php in the following two files:

wp-includes/template-loader.php
wp-includes/comment-template.php

Most of the automated trackback spam tools will just hit several thousand websites at a time by attempting to send a POST request to wp-trackback.php directly. If you rename it, they won’t find that file on your server, and will get a 404 error. If someone uses the proper comment form on your website, they’ll get the right version of your renamed file.
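The reference update can be sketched in Perl as a plain string substitution. This is only an illustration: the function name rename_trackback_refs, the sample line and the replacement filename tb-7f3a9c.php are all made up for the example, and the real lines in your copy of WordPress may look different. It works on a string, so nothing on disk is touched.

```perl
use strict;
use warnings;

# Replace every reference to the stock trackback script with a new name.
# Both arguments are plain strings; nothing is read from or written to disk.
sub rename_trackback_refs {
    my ($source, $new_name) = @_;
    (my $updated = $source) =~ s/\bwp-trackback\.php\b/$new_name/g;
    return $updated;
}

# Hypothetical line of the kind you might find in a WordPress template file
my $line    = "include(ABSPATH . 'wp-trackback.php');";

# 'tb-7f3a9c.php' is an invented name; pick your own hard-to-guess one
my $patched = rename_trackback_refs($line, 'tb-7f3a9c.php');
print "$patched\n";
```

Remember that the file itself has to be renamed to the same name, otherwise the updated references point at nothing.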

The other option is to just disable trackbacks altogether. You can find this under Settings → Discussion. Simply uncheck “Allow link notifications from other blogs (pingbacks and trackbacks.)” This can also be accomplished within each post by unchecking the “Allow pings” checkbox when you compose or edit your posts.

Another option is to use a plugin to try to thwart or validate trackbacks. I use one called Simple Trackback Validation. It was a simple drop-in plugin, and appears to work very well.

When a trackback is received on your blog, Simple Trackback Validation will:

  1. Check to see if the IP of the trackback sender is the same as the IP address of the source the trackback URL is referring to.

    This reveals almost every spam trackback (more than 99%), since spammers use automated bots that do not run on the machine hosting the page they claim to link from.

  2. Retrieve the web page at the URL included in the trackback. If the webpage doesn’t contain a link to your blog, the trackback is considered to be spam. Since most trackback spammers do not set up custom web pages linking to the blogs they attack, this simple test will quickly reveal illegitimate trackbacks.

    This also stops bloggers from abusing trackbacks by sending them from their blog software or web services without actually linking to your post.
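The two checks above can be sketched in a few lines of Perl. This is not the plugin's actual code (the plugin is written in PHP), just an illustration of the logic. The function names, the hash keys and the sample addresses are assumptions made for the example, and the DNS lookup and page fetch are left out so that the decision logic stands on its own.

```perl
use strict;
use warnings;

# Check 1: does the sender's IP match any address the source host resolves to?
sub sender_matches_source {
    my ($sender_ip, @source_ips) = @_;
    my $matches = grep { $_ eq $sender_ip } @source_ips;
    return $matches > 0 ? 1 : 0;
}

# Check 2: does the source page contain a link pointing at our blog?
sub page_links_to_blog {
    my ($html, $blog_url) = @_;
    return $html =~ /href\s*=\s*["']\Q$blog_url\E/i ? 1 : 0;
}

# A trackback is treated as spam if either check fails
sub looks_like_spam {
    my (%tb) = @_;
    return 1 unless sender_matches_source($tb{sender_ip}, @{ $tb{source_ips} });
    return 1 unless page_links_to_blog($tb{html}, $tb{blog_url});
    return 0;
}

# Made-up sample data (documentation-only IP ranges and example.com hosts)
my %legit = (
    sender_ip  => '192.0.2.10',
    source_ips => ['192.0.2.10'],
    html       => '<a href="http://blog.example.com/post-1">nice post</a>',
    blog_url   => 'http://blog.example.com/',
);
my %bot = (%legit, sender_ip => '203.0.113.5');

print "legit: ", looks_like_spam(%legit), "\n";
print "bot:   ", looks_like_spam(%bot),   "\n";
```

In a real setup, the source_ips list would come from resolving the host in the trackback URL, and html from fetching that URL.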

The combination of these three techniques will stop almost every fake, false or fraudulent trackback your blog may receive.

Registration Spam (Registration plugins at WordPress)

The last one is the most challenging and, very recently, the most abused: Registration Spam.

Registration spam is relatively new. It allows someone to “bomb” your blog with thousands of fake usernames and registration requests. Your system will then dutifully attempt to send a confirmation email to each address specified, which in most cases will be fake, causing your machine to receive a bounce message in return.

Spammers are using GMail and Yahoo addresses to do this right now, so you might see hundreds or thousands of new users attempting to sign up for your blog every week, all of them fake.

I searched around for a while to try to figure out what tools or plugins I could use to try to stop this. I found something called WP-Ban, but it doesn’t actually seem to work at all.

WP-Ban claims to ban users by IP, IP range, host name and referer URL from visiting your WordPress blog. It will display a custom ban message when the banned IP, IP range, host name or referer URL tries to visit your blog. You can also exclude certain IPs from being banned.

In my experience with it, it does not work at all.

I looked for something that would make adding a “plain text” name to the signup field a mandatory item. This means that instead of jdoe@gmail.com being signed up, they would have to also enter “John Doe” in the Name field of the signup form. I found nothing that did this for me.

But I did stumble on something called ‘CapCC’ in my travels that HAS helped. CapCC is a small captcha plugin that works with either comments, registration or both. Since I already used reCAPTCHA and it was having a positive effect, I decided to use CapCC for just user registrations. Now the incoming users have to enter a small 5-character string before their registration can be processed.

As a result of this, I now allow anonymous people to post comments (moderated, of course). I don’t have to worry about fake users trying to join, abuses of my MTA or other garbage.

Hopefully others will find this useful.

Returning a list of anonymous proxies

Back in October of 2007, I started writing a little tool to build MFA 2.0 sites on the fly.

This tool (in Perl of course) allows me to create a new WordPress blog targeted to a very specific niche, populate the WordPress database with hundreds or thousands of articles that target that niche, and do some other fancy things with lots of trickery under the hood. My Diabetes Information and Acne Skin Treatment websites are two examples of works I created in about 30 minutes with this tool back in October.

The article sites that I point to for content are attempting to drive traffic to their site and they implement all sorts of tricks on the server-side to try to thwart spidering and bots. They want “real humans” to read their content.

So I came up with the idea of using a random proxy server for each request. It slows down the speed with which I can spider articles, but it also doesn’t put me on an automatic block/ban list.

The problem with public proxy lists is that they become stale very quickly, so I needed a way to make sure every proxy I use is alive, valid and accepting connections to the remote site I’m querying for article content.

Enter my return_proxies() function in Perl, which does just this:

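use LWP::UserAgent;
use HTTP::Request;
use HTML::TreeBuilder;

sub return_proxies {
        my $link        = 'http://proxy-site/list.txt';

        my $ua = LWP::UserAgent->new;
        # random_browser() is not shown here; it supplies the User-Agent string
        my $rand_browser = random_browser();
        $ua->agent($rand_browser);

        my $req         = HTTP::Request->new(GET => $link);
        my $res         = $ua->request($req);

        # Give up with an empty list if the proxy list could not be fetched
        unless ($res->is_success) {
                warn "Could not fetch $link: " . $res->status_line . "\n";
                return;
        }
        my $html        = $res->content;

        # Collect the table cells that hold the proxy entries
        my $t           = HTML::TreeBuilder->new_from_content($html);
        my @output      = map { $_->as_HTML } $t->look_down(_tag => 'td', class => qr/dt-tb?/);
        $t->delete;     # free the parse tree

        # Keep only the cells that contain an ip:port pair
        my @proxies;
        foreach my $ip (@output) {
                (my $address) = $ip =~ /((?:\d+\.){3}\d+:\d+)/;
                push @proxies, $address if $address;
        }

        # print Dumper(@proxies);
        return @proxies;
}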

I call this in my fetch_page() function like this:

        my @proxies     = return_proxies();
        my $rand_proxy  = 'http://' . $proxies[rand @proxies];
        $ua->proxy(['http', 'ftp'], $rand_proxy);

So far it works very well, no issues at all that I’ve seen.
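For the "alive and accepting connections" part, here is a minimal sketch of one way to weed out dead entries before picking one. It is an assumption on top of the code above, not part of it: the names proxy_is_alive and live_proxies are invented for the example, and a successful TCP connect only shows that the port is open, not that the host will actually relay your request.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Return 1 if a TCP connection to "ip:port" succeeds within $timeout seconds.
# This only proves the port is open; it does not prove the host proxies traffic.
sub proxy_is_alive {
    my ($proxy, $timeout) = @_;
    my ($host, $port) = split /:/, $proxy, 2;
    return 0 unless defined $host && defined $port && $port =~ /^\d+$/;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout // 5,   # default of 5 seconds is an arbitrary choice
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Keep only the candidates that accept a connection
sub live_proxies {
    my ($timeout, @candidates) = @_;
    return grep { proxy_is_alive($_, $timeout) } @candidates;
}

# Possible use together with the function above:
#   my @proxies = live_proxies(3, return_proxies());
```

Checking every proxy on every request would be slow, so it probably makes sense to run this once per batch and reuse the filtered list.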

Obviously there’s a lot more to it than just this… but I can’t give away all of the secrets to my code, can I?

A Busy Weekend to End a Busy Week

This weekend was just as busy as the week at work. It’s Sunday afternoon, and I’m still going…

Reconstructing Maildir from Backups

    Moments ago, I found that my archive of the Coldsync mailing list in Maildir format somehow became corrupt, so attempts to copy those messages to Gmail failed using my Thunderbird trick.

    I found an older copy that was in mbox format, and used the “Perfect” mbox to Maildir converter script to convert it to Maildir format.

    Now I’m back to populating Gmail with my email once again (8,427 in Gmail now, with about 112,000 left to go).

Calendaring Conundrum

    Also this past week, I realized that my calendar in Outlook had somehow duplicated over 1,700 of my events. I’m sure it was the result of using things like PocketMirror and other sync tools for Palm with it. I’m going to be cleaning that up next. That requires manual, visual inspection of each event, to make sure I’m deleting the dupe and not the original (thus leaving the dupe copy on the calendar). Very odd.

    Once that is done, I have to reinstall all of my Palm conduits on the Thinkpad X61s and get that all sync’d to my Treo. With everything on my Treo, I can then begin consolidating my various calendars and task lists into one clean interface.

Back to the Blog; 8 years of postings

    I also cleaned up 8 years of blog postings, reformatted them all and cleaned up the broken HTML that was the result of importing the diary entries from Advogato. That was 353 separate posts to go through by hand and clean everything up. Now it looks as it should.

    Going back and reading through those old diary posts was… interesting. I didn’t realize how much I’d done in those 8 years, all of the people I’d met, projects I’d completed, places I’d been. I might turn the whole blog into a set of memoirs for my daughter Seryn for when she’s old enough to understand all of the things her daddy did in his life.

Movies, movies, movies!

    I managed to pack in watching 4 movies while I worked this weekend, but the two best ones were “Maxed Out” and “The Man from Earth”. Both of them were equally good, and worth watching. I highly recommend both of them.

    “Maxed Out” was eye-opening, and depressing at points, because of the situation our country is in right now. People are literally killing themselves (3 cases are described in the movie) because of their debt. The industry specifically caters to those who can NOT pay their debts down, because those people are the cash-cow for them. These people pay their minimum payments for life and pass on their debts to their children. They don’t want people who pay their credit cards in-full every month to be their customers, there’s no profit in that. Watch the movie for the rest of the details.

    “The Man from Earth” brings new ideas to our concepts of religion, biology, archeology and many other fields of traditional study. It reminded me somewhat of the information that was in the first 1/3 of “Zeitgeist: The Movie” (freely downloadable, or viewable online). The Man from Earth is a low-budget movie, but packs a punch in the back story. I won’t spoil it here, but definitely go rent it if you can.

AT&T WWAN in VMware

    I managed to get my physical Windows machine (an HP machine I purchased at BestBuy a few years ago), virtualized and configured in VMware using VMware Converter. I had to hack into it to get the vm to recognize my legitimate Microsoft Product Key, but after that, it was a snap.

    Then I installed and configured the AT&T Communication Manager software to talk to the physical SIM card inside my laptop, so I can go online with the laptop wherever there is valid GSM signal, at 3G speeds.

    I didn’t think the vm would recognize the physical card in the machine, if Linux didn’t see it natively… but it does. It’s a bit slow, but at least I can function with one laptop connected to the WWAN on the train with a larger screen at 1920×1200 resolution, instead of the smaller laptop with the 12″ screen at 1024×768 resolution.

    The next step is to get the second laptop networking across the connection that the first laptop provides. That should be interesting to solve, since one of these is Windows, and the other one is Linux.

    One possible solution would be to take the SIM card that is physically inside the laptop, put it into an external 3G PC-Express card (as in the image here), and then put that into a WWAN router, and carry THAT with me on the train. It has to be portable, of course…

    But if I use that approach, not only can I share the connection with both of my laptops, I can also provide “free” wireless to anyone on the train who wants to get online. Maybe I can solicit free beer or donations in the cafe car to help offset those costs.

New Web SEO

    An old acquaintance from a Business Development Group in New London has recently contacted me asking me for help with his website. He pointed out that his site is losing customers to two other sites in his very narrow niche here in Connecticut.

    I looked at the competition, and noted that they’re not doing anything special that would merit that increase in customers for them. But I also noted about a dozen problems with this person’s company website that need immediate attention. His site is ranking at PR0, and the other two sites are pushing PR1/PR2, with no real traffic.

    So I think I might pick up a little side project to help him out, to bring his site up to where it should be, and up to the level of standards that I consider acceptable. It looks like it’ll be fun, and it should bring in more income to help fund some projects I’m working on in the background (mostly Seryn’s secret project :)

    I make a decent amount of money with my websites now, and I think with the proper care and attention, he can too.

Wrestling MySQL

    Speaking of websites, a few months back, I started making some MFA 2.0 websites of my own, based on WordPress that are populated with hundreds of public-domain articles on niche content.

    To do that, I wrote some tools (in Perl, of course) to spider the remote sites, pull the articles, and stick them into my WordPress databases, complete with proper author in the Author field, original posting date and so on.

    Here’s one example that took me less than an hour to create and populate with articles, using these tools.

    This particular site has 220 articles in it, and if you look at the articles, they’re all linked to external quality citations and resources. This site, with zero marketing, pulls in an average of about 6,000 hits a month. I have a handful of others, with between 200 and 1,000 articles in each, all of similar high quality and care behind their creation.

    The hard part is that in order to have an article attributed to an author in WordPress, that person has to be in the User table. That gets complicated to get right, but I managed to figure it out.

    But I needed a way to ensure that the site’s content didn’t remain “stale”, so I came up with another trick to do that:

    UPDATE wp_posts SET post_date='1999-01-01' + interval rand() * 3391 day;
    UPDATE wp_posts SET post_modified = post_date;
    

    I put this in /etc/cron.weekly/, and now my articles are randomized on a weekly basis, so visitors or search engines coming to the site will constantly see “new” content.
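    To see what date range that interval covers, here is a small Perl sketch of the same arithmetic. It is only an illustration: the function name random_post_date is made up, and it picks whole-day offsets with int(rand()), so the exact rounding MySQL applies to a fractional day count is not modeled.

```perl
use strict;
use warnings;
use Time::Piece;

# Same starting point and span (in days) as the UPDATE statement
my $start = Time::Piece->strptime('1999-01-01', '%Y-%m-%d');
my $span  = 3391;

# Pick a random whole-day offset inside the span and return it as YYYY-MM-DD.
# Adding a plain number to a Time::Piece object adds that many seconds.
sub random_post_date {
    my $offset = int(rand($span));
    return ($start + 86_400 * $offset)->ymd;
}

# The far end of the window: start date plus the full span
my $latest = ($start + 86_400 * $span)->ymd;

print "window: ", $start->ymd, " to $latest\n";
print "sample: ", random_post_date(), "\n";
```

    With these numbers the window runs from 1999-01-01 to 2008-04-14, so the span would need to be bumped over time if the dates are meant to keep reaching up to the present.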

    So far, they’re doing well, and bringing in more than enough to cover the monthly costs of running the servers, bandwidth and power.

The Day Job

    The day job continues to go well, and I’m picking up significant speed on those projects.

    The more I work within the system, the more I see where some “optimization” can be put into place, including some automation to make things much easier for myself and others in my group. I need to carve out some time to do exactly that, within the limits of my time allotted for the critical tasks that have established deadlines.

    The commute has worked itself out and my routine is fairly static now, so that is no longer an unknown. Now I just need to get my foot in the door and keep things moving forward at a cheetah’s pace.

Fun times, definitely fun times. The positive energy has come back in rushing waves, and the good luck has overwhelmed me.

Bad Behavior has blocked 513 access attempts in the last 7 days.