[BlueOnyx:14327] Re: php-cgi and web crawlers

Stephanie Sullivan ses at aviaweb.com
Thu Jan 30 09:58:02 -05 2014


> -----Original Message-----
> From: Ernie [mailto:ernie at info.eis.net.au]
> Sent: Thursday, January 30, 2014 12:24 AM
> To: blueonyx at mail.blueonyx.it
> Subject: [BlueOnyx:14321] php-cgi and web crawlers
> 
> I have a site running Joomla that grinds to a halt when web crawlers
> visit the site. I have seen it happen with Wordpress too; it is
> probably something to do with them both needing suPHP.
> 
> The symptoms are dozens of php-cgi processes spawning, causing high
> CPU load and making sendmail stop responding due to the load until
> some time after the web crawler has finished.
> 
> The hardware is quite reasonable: Xeon based, with a lot of RAM, fast
> drives, etc.
> 
> Is there any tuning I can do to prevent so many php-cgi processes
> spawning at once, or to get them over and done with a bit faster?
> 
> - Ernie.

Ernie,

I experience this too from time to time. I have found 2 reasons this
happens. 

One is that if the memory limit for PHP is too low for a site, it can cause a
php-cgi instance to core when it runs out of memory. This uses a lot of CPU
and can easily lock up a server. You should be able to find this in your
/var/log/httpd/error_log file. Search for "memory" or "exceeded" - it's been
a while, so I may not be remembering the terms exactly.
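
A rough way to check (the log path is the stock one mentioned above, and
recent PHP versions word the fatal error as "Allowed memory size of N bytes
exhausted", so grep for a few variants):

    # look for out-of-memory fatals from php-cgi
    grep -iE 'memory size|exhausted|exceeded' /var/log/httpd/error_log | tail -20

If that is what is happening, raising memory_limit in the php.ini that the
affected vhost actually uses (where that lives depends on how suPHP is set
up on your box) is the usual fix, for example:

    memory_limit = 128M

Raising it only for the affected site, rather than globally, keeps one badly
behaved plugin from claiming that much RAM on every site.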

PHP with Joomla, Drupal, WordPress or other frameworks can use an amazing
amount of memory depending on plugins. Some are horrible. I had one site
that exceeded 128MB because of a WordPress membership plugin. It turns out
it suggests using a dedicated and pretty hefty server with more than 500
members. Sheesh!! There was an alternative I suggested to the client (a
couple of years ago) that used a fraction of the resources. Just using that
module *was an attack* on the server!

The other is just what you are suggesting: crawlers/bots:

Legit search engine crawlers generally wait 15 to 30 seconds between page
loads. Those that don't wait may be malicious, or run by someone who does
not follow crawler netiquette.
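
Well-behaved crawlers can also be asked to slow down in robots.txt. The
Crawl-delay directive is non-standard and not every engine honors it
(Google, for one, ignores it), so treat it as a polite request rather than
a guarantee:

    User-agent: *
    Crawl-delay: 15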

With these automated "attacks", badly written crawlers or bots can often be
identified by their ident string in the logs and blocked via robots.txt, or,
if they don't respect robots.txt, in an .htaccess file (or better, the
sitexx.include file, as those settings are already in Apache's memory).
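
As a rough sketch of both approaches, with "BadBot" standing in for
whatever ident string you actually see in the access log: in robots.txt,
for bots that still behave,

    User-agent: BadBot
    Disallow: /

and in .htaccess or the sitexx.include, for bots that don't (this is the
older Apache 2.2 access-control syntax; if your box runs Apache 2.4 you
would use Require instead):

    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot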

The problem arises when the offending crawler or bot uses a real browser
ident string (spoofing a real browser). In my opinion these are probably
brute force password attacks, site content scrapers, or denial of service
attacks. The idea is to break in or take down the site.

For password attacks I have found several password protection plugins that
are free; I expect they also exist for Joomla and Drupal. These block
(temporarily or permanently) access for the IPs of offenders.

To summarize, good crawlers/bots will leave several seconds (preferably 15
to 30) between accesses, respect robots.txt, and have a useful ident string.
Bad crawlers/bots may do any of the following: spoof real browsers' ident
strings, hammer your site with back-to-back accesses, try brute force
password cracking, or try to shut down your site with a DoS.

I hope this is helpful. 

As a suggestion for a feature improvement: It would be nice to be able to
put a list (one per line like email aliases) of regex's for browser ident
strings to block in the gui. This has been very rarely needed in the past
but it has become more useful recently. I'd still rate this as a medium-low
priority.

      -Stephanie



