[BlueOnyx:14331] Re: php-cgi and web crawlers

Stephanie Sullivan ses at aviaweb.com
Thu Jan 30 13:21:11 -05 2014


> -----Original Message-----
> From: Michael Stauber [mailto:mstauber at blueonyx.it]
> Sent: Thursday, January 30, 2014 10:26 AM
> To: BlueOnyx General Mailing List
> Subject: [BlueOnyx:14329] Re: php-cgi and web crawlers
> 
> Hi Stephanie,
> 
> > As a suggestion for a feature improvement: It would be nice
> > to be able to put a list (one per line like email aliases)
> > of regexes for browser ident strings to block in the GUI.
> 
> Yeah, I have something for that, but it needs some cleaning.
> devel.blueonyx.it uses the Perl based TRAC frontend for the
> BlueOnyx SVN. It never gets many visits, but due to the heavy
> Perl scripting and the many file transactions needed, accesses
> there are a bit CPU-intensive if it gets hit hard.
> 
> That VPS was getting heavily hit by crawlers who try to follow
> every link. We have like 1500 SVN revisions and each page has
> many links. So that's a lot of crawling. And the bloody
> Microsoft search-engine "Bing" of course doesn't honor
> robots.txt, which explicitly prohibits crawlers there. There
> have been days where several Bing crawlers were roving
> over the site at the same time, because the previous crawls
> were still ongoing due to the number of links.
> 
> Based on browser ident strings I now redirect them elsewhere.
> Same for the primary mirrors, which are behind a Varnish cache.
> If a crawler doesn't honor robots.txt, Varnish shows them a
> snappy custom error message instead and keeps the load off
> the backend.
> 
> I'll put it onto the feature list, but I agree: It'll be a
> low priority item that'll only get tackled after the new
> BlueOnyx GUI is finished.
> 
> With best regards
> Michael Stauber

Michael,

Yes, low priority. And this sort of regex feature makes it really easy to
accidentally mess things up badly. It's definitely a double-edged sword.
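
To make the risk concrete, here is a minimal sketch in Python of how a
one-per-line regex list could be matched against browser ident strings.
It is not how BlueOnyx or Varnish does it. The function names, the "#"
comment convention, and the sample blocklist are all assumptions made up
for illustration.

```python
import re


def load_patterns(text):
    """Compile one case-insensitive regex per non-empty, non-comment line.

    Treating lines that start with "#" as comments is an assumption here.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_blocked(user_agent, patterns):
    """Return True if any pattern matches anywhere in the ident string."""
    return any(p.search(user_agent) for p in patterns)


# Hypothetical blocklist, as it might be typed into a GUI text area.
BLOCKLIST = """
bingbot
# older Microsoft crawler ident
msnbot
"""

patterns = load_patterns(BLOCKLIST)

bing = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"

print(is_blocked(bing, patterns))     # True
print(is_blocked(firefox, patterns))  # False

# The double-edged part: one careless, overly broad line blocks everyone,
# because nearly every browser ident string starts with "Mozilla".
careless = load_patterns("moz")
print(is_blocked(firefox, careless))  # True
```

A single short pattern like the last one would lock out ordinary
visitors, which is why such a list would need some sanity checking
before it goes live.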

Thinking about it a little more: often when I'm seeing high CPU, it's someone
with a WordPress or other CMS site using plugins that make the PHP processes
exceed their memory limit. Depending on how bad it is, I'll raise their limit
or encourage them to look for a more reasonable plugin. There are a lot of
poorly written plugins out there. Some are brilliant, but there's a lot of
chaff with that wheat.