[BlueOnyx:14329] Re: php-cgi and web crawlers

Michael Stauber mstauber at blueonyx.it
Thu Jan 30 10:26:29 -05 2014


Hi Stephanie,

> As a suggestion for a feature improvement: It would be nice to be able to
> put a list (one per line like email aliases) of regex's for browser ident
> strings to block in the gui.

Yeah, I have something for that, but it needs some cleaning.
devel.blueonyx.it uses the Perl-based Trac frontend for the BlueOnyx
SVN. It never gets many visits, but due to the heavy Perl scripting and
the many file transactions involved, accesses there are a bit CPU
intensive when the site gets hit hard.

That VPS was getting heavily hit by crawlers that try to follow every
link. We have around 1500 SVN revisions and each page has many links, so
that's a lot of crawling. And the bloody Microsoft search engine "Bing"
of course doesn't honor the robots.txt, which explicitly prohibits
crawlers there. There have been days where several Bing crawlers were
roving over the site at the same time, because the previous crawls were
still ongoing due to the sheer number of links.
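For reference, the kind of robots.txt directive being ignored here is just a blanket disallow (illustrative only; the actual file on devel.blueonyx.it may differ):

```
# Illustrative robots.txt -- asks all crawlers to stay out entirely.
# Well-behaved bots honor this; the Bing crawlers above did not.
User-agent: *
Disallow: /
```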

Based on browser ident strings I now redirect them elsewhere. Same for
the primary mirrors, which sit behind a Varnish cache. If a crawler
doesn't honor robots.txt, Varnish shows it a snappy custom error
message instead and keeps the load off the backend.
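For anyone wanting to do something similar by hand before a GUI option exists, a minimal sketch with Apache's mod_rewrite might look like the following. The ident-string patterns and the redirect target are examples, not the exact rules used on the BlueOnyx servers:

```apache
# Match misbehaving crawlers by User-Agent substring (case-insensitive)
# and redirect them elsewhere instead of letting them hammer Trac.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bingbot|msnbot) [NC]
RewriteRule .* http://www.example.com/ [R=302,L]
```

On the Varnish side (Varnish 3.x syntax of that era), a roughly equivalent check in vcl_recv could serve the custom error straight from the cache layer, again with example patterns:

```vcl
# Sketch: refuse known-bad crawlers with a custom error before
# the request ever reaches the backend.
sub vcl_recv {
    if (req.http.User-Agent ~ "(?i)(bingbot|msnbot)") {
        error 403 "Robots are asked to honor robots.txt";
    }
}
```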

I'll put it onto the feature list, but I agree: It'll be a low priority
item that'll only get tackled after the new BlueOnyx GUI is finished.

-- 
With best regards

Michael Stauber
