Two such issues, actually.
When I recently installed an SEO plugin, it included a log of all “404” calls. Each “404” is the server’s response to an attempt to reach a non-existent page on the blog.
One of these issues involves a blatant attempt to fish for a specific PHP file (PHP being the scripting language that runs behind the website) with a known security flaw. The file, named “timthumb.php”, is not present in the standard WordPress installation, but it is included in some themes and plugins, where it is used to resize and manipulate image files. The intent is to bypass the website’s security by taking advantage of this file’s ability to write any kind of file into the WordPress directory. From there, the attacker can use that file to gain access to the entire directory system, modify existing PHP files, or install software of their own.
Fortunately, this website is not affected: I don’t have or use any themes or plugins that include that specific file. However, the intermittent, repeated attempts to find it do put some load on the system and are annoying, which is why I am trying to block them any way I can.
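In principle, a rule in the site’s “.htaccess” file could refuse those requests outright, before WordPress goes to the trouble of building a full “404” page. Something like the following rough sketch might do it, assuming the server’s mod_rewrite module is enabled (it normally is, since WordPress permalinks rely on it); I have not yet tested this myself:

# Refuse any request whose URL mentions timthumb.php with a "403 Forbidden"
RewriteEngine On
RewriteRule timthumb\.php - [F,L]

I describe below why my “.htaccess” efforts so far have not gone well, so take this as an idea rather than a fix.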
The other issue involves a “spider” robot, the kind of web crawler that scans websites for information and changes. All the major web search sites, like Google, Bing, and Yahoo, use them, and for the most part they are well-behaved. But there is one that is not, and that’s the Baidu spider. Baidu is the major Chinese web search site. Ever since I installed the “404” monitor, I have seen dozens, if not over a hundred, attempts a day by the Baidu spider to crawl my blog in search of a specific, non-existent file under many different locations. It’s almost as if the spider program is badly designed and doesn’t understand that it is completely missing the picture here.
What links these two issues is that I have not been able to block either one using the two common website controls, “robots.txt” and “.htaccess”. Baidu says that its spider obeys the “robots.txt” file, but other web commentary insists that it doesn’t. The scanner that hunts for the “timthumb.php” file probably doesn’t obey it either. That said, I have set the “robots.txt” file to disallow both of them, without success. This is what I am using:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider/2.0
Disallow: /

User-Agent: PycURL/7.19.7
Disallow: /
The other approach is the “.htaccess” file, a system-level directive that tells the server to turn away these robots based on the user agent name they give when attempting to access the website. Unfortunately, this file is a little more difficult to code. This is what I have been recommended to use:
#Block bad bots
SetEnvIfNoCase User-Agent "^Baidu[Ss]pider" bad_bot=1
SetEnvIfNoCase User-Agent "^PycURL" bad_bot=1
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Unfortunately, it doesn’t appear to work either. I don’t know whether the problem is in how the restrictions are coded or whether the file is not in the correct place; this is an area I have little experience with. My website host’s customer service has not been much help, either.
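One thing I may try next is dropping the “^” anchors from the patterns. If the Baidu spider identifies itself with a longer string such as “Mozilla/5.0 (compatible; Baiduspider/2.0; ...)”, a pattern anchored to the start of the user agent would never match. This is only a guess on my part, but the unanchored version would look like this:

#Block bad bots (patterns match anywhere in the user agent string)
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot=1
SetEnvIfNoCase User-Agent "PycURL" bad_bot=1
Order Allow,Deny
Allow from all
Deny from env=bad_bot

I have also read that the “.htaccess” file has to sit in the website’s root directory (the same folder as the WordPress files) and that the host has to permit such overrides, so the file’s location could just as easily be the culprit.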
If anyone has a suggestion to make, feel free to respond.