Server-Side Anti-Crawling: Blocking Certain User Agents in Apache/Nginx
1. Apache
Method 1: edit the .htaccess file. Open the .htaccess file in the site's web root and add one of the following two snippets (either works). Note that the example list below also blocks major search engines (Googlebot, bingbot, Baiduspider); remove those entries if you still want the site indexed.
Option 1:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (^$|YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot) [NC]
RewriteRule ^(.*)$ - [F]
Option 2:
SetEnvIfNoCase ^User-Agent$ .*(YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot) BADBOT
Order Allow,Deny
Allow from all
Deny from env=BADBOT
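To see what the blocklist match actually does, here is a small Python sketch (illustrative only, not part of Apache) that reproduces the same matching behavior against an abbreviated version of the UA list. The [NC] flag in the RewriteCond corresponds to re.IGNORECASE, and the ^$ alternative additionally catches requests that send an empty User-Agent:

```python
import re

# Abbreviated version of the blocklist above, for illustration only.
# [NC] in Apache <-> re.IGNORECASE; "^$" matches an empty User-Agent.
BAD_AGENTS = re.compile(
    r"(^$|YisouSpider|MJ12bot|BLEXBot|Baiduspider|AhrefsBot|Googlebot"
    r"|SemrushBot|360Spider|python-requests|Python-urllib|HTTrack)",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the request would be answered with 403 Forbidden."""
    return BAD_AGENTS.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; MJ12bot/v1.4.8)"))  # True
print(is_blocked(""))                                           # True (empty UA)
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/115"))  # False
```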
Method 2: edit the httpd.conf configuration file.
Locate the section that looks like the following, add or modify it per the code below, then restart Apache:
DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot)" BADBOT
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
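One caveat: Order/Allow/Deny is Apache 2.2 syntax, and on Apache 2.4 it only works while mod_access_compat is loaded. On 2.4 the native equivalent, reusing the same BADBOT variable (paths and the abbreviated UA list here are just placeholders from the example above), would be:

```apache
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(YisouSpider|MJ12bot|AhrefsBot|SemrushBot)" BADBOT
    <RequireAll>
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```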
2. Nginx
Change into the conf directory under the nginx installation directory and save the following as agent_deny.conf:

cd /www/server/nginx/conf
vim agent_deny.conf

#Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
#Block the listed UAs as well as requests with an empty UA
if ($http_user_agent ~* "YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|^$" ) {
    return 403;
}
#Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
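The three rules fire in order, and the first match short-circuits with 403. As a sketch of that decision logic (illustrative Python, not nginx itself, with the UA list abbreviated), note that nginx's ~* operator is case-insensitive while the method check !~ is case-sensitive:

```python
import re

# nginx "~*" is case-insensitive -> re.IGNORECASE;
# the method check uses "!~" (case-sensitive), so no flag there.
TOOL_UA = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)
BAD_UA = re.compile(r"(YisouSpider|MJ12bot|Baiduspider|AhrefsBot|^$)", re.IGNORECASE)
ALLOWED_METHODS = re.compile(r"^(GET|HEAD|POST)$")

def check(method: str, user_agent: str):
    """Return 403 if any rule matches, else None (request passes)."""
    if TOOL_UA.search(user_agent):
        return 403
    if BAD_UA.search(user_agent):   # "^$" also catches an empty UA
        return 403
    if not ALLOWED_METHODS.match(method):
        return 403
    return None
```

After reloading nginx you can verify the behavior with curl: a request sent with -A "Scrapy" should come back 403 Forbidden, while a normal browser UA gets the page.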
Then insert the include line into the site's configuration (the example below places it at server level; after "location / {" also works):
For example, the configuration for zhaoying.org:

[root@4rdbg /]# cat /www/server/panel/vhost/nginx/zhaoying.org.conf
server
{
    listen 80;
    listen 443 ssl http2;
    server_name zhaoying.org www.zhaoying.org;
    index index.php;
    root /www/wwwroot/zyphoto;

    #anti-crawler
    include agent_deny.conf;
    #anti-crawler
Save the file, then restart nginx with:

/etc/init.d/nginx restart

(This is a full restart; for a truly graceful reload that lets in-flight requests finish, use /etc/init.d/nginx reload or nginx -s reload instead.)