Spider traffic calming through Apache's .htaccess

Internet spiders are constantly increasing in number.  Some are written well and follow the rules in robots.txt, or use included meta tags, to work out what they are (and aren't) allowed to spider.  There are, however, some that ignore any of those controls or rules and just go for all-out scraping, which can create a number of problems, from DDoS-like effects to reducing your site's capacity to be viewed by humans.

While there are things you can do within the forum software to help alleviate the problems (to my knowledge, clients identified as spiders will be dropped during high traffic before human users start suffering from 503 errors), those controls only kick in once the server is already under load.

This thread/post covers a far simpler approach to calming the traffic of any spiders you deem rogue, through the use of Apache's .htaccess (so it should work with most environments using an Apache webserver; tested on v2.4 and above).

There are some prerequisites you need to know:
  • Which spiders do you want to traffic calm?
  • When is the lowest traffic time period?

Which spiders do you want to traffic calm?
PHP Code:
<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent (roguebot1|roguebot2){1} throttlebot=1
</IfModule>

In this snippet, which can be placed into the .htaccess file, I use the SetEnvIf module to identify which User-Agent strings need to be traffic calmed.

It's either roguebot1 or roguebot2.  If you want to add another option for a roguebot3, just make sure you add a '|' pipe character after roguebot2 and insert roguebot3 there, as shown below.
(You only need the bot name before the / and version number, so Googlebot/2.0 would just be Googlebot.)
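
For example, the same line extended with a hypothetical third bot (roguebot3 is just a placeholder name) would read:
PHP Code:
SetEnvIfNoCase User-Agent (roguebot1|roguebot2|roguebot3){1} throttlebot=1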

The {1} matches the first instance of the mentioned spider name (just in case the User-Agent string contains multiple matches).

throttlebot=1 is how a server environment variable named throttlebot is set with a value of "1".  This variable is what actually identifies what needs throttling, so you could use other methods of setting it, which I'm not going to detail here.
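
One such alternative, for instance, is mod_setenvif's BrowserMatchNoCase directive, which is shorthand for matching against the User-Agent header; a minimal equivalent sketch:
PHP Code:
BrowserMatchNoCase (roguebot1|roguebot2){1} throttlebot=1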

When is the lowest traffic time period?
In my example I have low traffic between 09:00-11:59 and 13:00-15:59 server time, so the intention here is to intermittently limit access for traffic calming outside of those times.  I don't want to block spiders outright in this instance, as that can have repercussions later in both SEO ranking and, of course, server stability once the spider finally gets a chance to catch up on indexing.

This uses mod_rewrite:

PHP Code:
<IfModule mod_rewrite.c>
    # Initialise the Rewrite Engine if not already initialised
    RewriteEngine on
    RewriteBase /

    # If it's between the hours of 00 to 08, equal to 12, or 16 to 23,
    # set an environment variable of trafficcalm
    RewriteCond %{TIME_HOUR} <09
    RewriteRule ^ - [E=trafficcalm:1]

    RewriteCond %{TIME_HOUR} =12
    RewriteRule ^ - [E=trafficcalm:1]

    RewriteCond %{TIME_HOUR} >15
    RewriteRule ^ - [E=trafficcalm:1]

    # If trafficcalm and throttlebot are set, the request isn't for robots.txt,
    # and it's between 10 and 50 seconds of the minute,
    # respond with a 503 and set environment throttled; else set environment normaltraffic
    RewriteCond %{ENV:trafficcalm} =1
    RewriteCond %{ENV:throttlebot} =1
    RewriteCond %{REQUEST_URI} !^/robots\.txt$ [NC]
    RewriteCond %{TIME_SEC} >10
    RewriteCond %{TIME_SEC} <50
    RewriteRule ^ - [E=throttled:1,R=503,L]

    RewriteRule ^ - [E=normaltraffic:1]
</IfModule>
You'll notice that in the last rewrite block there are two RewriteRules.  The first is what occurs when the time is greater than 10 but less than 50 seconds into the minute; the second occurs outside that range (0-10 and 50-59).

During the first rule, a 503 Service Temporarily Unavailable response is returned.  Spiders that are written correctly, when not getting the code they want (a 200, 301, etc.), will usually attempt to get the URL again a little later.

Now, this .htaccess addition is great on its own; however, you can reduce the chance of spiders being unable to index your site at all by making a couple of other additions.

Add the line Crawl-delay: 43 to your robots.txt file, as shown below.
(The idea here is that if the intermittent block lasts around 40 seconds, you want the spider to retry a little more than 40 seconds after it made its current attempt.  This makes sure that it cycles in and out of being retrievable.)
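
A minimal robots.txt carrying that directive might look like this (the User-agent: * line simply applies the delay to every bot that honours it):
PHP Code:
User-agent: *
Crawl-delay: 43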

Create a custom 503 page (503.php), something like:
PHP Code:
<?php
header("HTTP/1.1 503 Service Temporarily Unavailable");
header("Status: 503 Service Temporarily Unavailable");
header("Retry-After: 43");
?>
<!DOCTYPE html>
<html>
<head>
<title>503 Service Temporarily Unavailable</title>
</head>
<body>
<h1>503 Service Temporarily Unavailable</h1>
<p>Don't fret, access should return shortly.  The server just needed a reprieve.</p>
</body>
</html> 

You'll notice the header of "Retry-After: 43" being used here to attempt to convey to spiders how long to wait before retrying.  It's similar to the Crawl-delay, but meant for spiders that ignore the robots.txt file completely.
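
If you would rather serve a plain static error page instead of PHP, the same header can, to the best of my knowledge, be attached conditionally with mod_headers on Apache 2.4.10 and later; a sketch, assuming the expr= conditional form of the Header directive is available on your build:
PHP Code:
<IfModule mod_headers.c>
    Header always set Retry-After "43" "expr=%{REQUEST_STATUS} == 503"
</IfModule>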

Then, to make sure the server uses the custom script, add the following to your .htaccess file:
PHP Code:
ErrorDocument 503 /503.php
This assumes that the 503.php file is in your web root folder (the leading slash makes Apache treat it as a local web path rather than a literal message string).
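
The same directive works with any local web path if you keep error pages elsewhere; the /errors/ folder here is just a hypothetical example:
PHP Code:
ErrorDocument 503 /errors/503.php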

So your .htaccess should look something like this (although you'll need to merge it with what you might already have in your .htaccess):
PHP Code:
ErrorDocument 503 /503.php

<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent (roguebot1|roguebot2){1} throttlebot=1
</IfModule>

<IfModule mod_rewrite.c>
    # Initialise the Rewrite Engine if not already initialised
    RewriteEngine on
    RewriteBase /

    # If it's between the hours of 00 to 08, equal to 12, or 16 to 23,
    # set an environment variable of trafficcalm
    RewriteCond %{TIME_HOUR} <09
    RewriteRule ^ - [E=trafficcalm:1]

    RewriteCond %{TIME_HOUR} =12
    RewriteRule ^ - [E=trafficcalm:1]

    RewriteCond %{TIME_HOUR} >15
    RewriteRule ^ - [E=trafficcalm:1]

    # If trafficcalm and throttlebot are set, the request isn't for robots.txt,
    # and it's between 10 and 50 seconds of the minute,
    # respond with a 503 and set environment throttled; else set environment normaltraffic
    RewriteCond %{ENV:trafficcalm} =1
    RewriteCond %{ENV:throttlebot} =1
    RewriteCond %{REQUEST_URI} !^/robots\.txt$ [NC]
    RewriteCond %{TIME_SEC} >10
    RewriteCond %{TIME_SEC} <50
    RewriteRule ^ - [E=throttled:1,R=503,L]

    RewriteRule ^ - [E=normaltraffic:1]
</IfModule>

When this is used, you will find that your server logs start showing 503 responses every so often for those roguebots you've listed, reducing your overall server load and hopefully increasing your site's accessibility and stability.

www.scivillage.com
