Server Anti-Crawler: Blocking Certain User Agents from Crawling Your Site with Apache/Nginx

Posted on 2019-4-1 13:53:49

1. Apache
    (1) Via the .htaccess file: edit the .htaccess in the site's web root and add one of the following two snippets (either one works):
           Option 1:
# Match an empty UA or any of the listed crawler/tool UAs, case-insensitively ([NC]),
# and answer with 403 Forbidden ([F])
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (^$|YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot) [NC]
RewriteRule ^(.*)$ - [F]


           Option 2:
# Tag requests whose User-Agent matches the list with the BADBOT variable, then deny them
SetEnvIfNoCase ^User-Agent$ .*(YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot) BADBOT
Order Allow,Deny
Allow from all
Deny from env=BADBOT

(Note: Order/Allow/Deny is Apache 2.2 access-control syntax; on Apache 2.4 it requires mod_access_compat, or can be replaced with the equivalent Require directives.)
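The .htaccess rules take effect immediately, so they can be checked with a quick curl test. A minimal sketch, in which example.com stands in for your own domain:

curl -I -A "HTTrack" http://example.com/      # blocked UA: expect HTTP 403 Forbidden
curl -I -A "Mozilla/5.0" http://example.com/  # normal browser UA: expect HTTP 200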


    (2) Via the httpd.conf configuration file:
           Find the section that looks like the following, add/modify the lines as shown, then restart Apache:

DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    # Same idea as above: tag matching User-Agents as BADBOT and deny them
    SetEnvIfNoCase User-Agent ".*(YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|Go-http-client|Cliqzbot)" BADBOT
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
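A minimal restart sketch; the exact command varies by distribution (apachectl on the PATH is an assumption here, and systemctl reload httpd or service apache2 graceful are common equivalents):

apachectl -t            # check the configuration syntax first
apachectl -k graceful   # graceful restart: running requests are allowed to finish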


2. Nginx

Go to the conf directory under the Nginx installation directory and save the following as agent_deny.conf:

cd /www/server/nginx/conf

vim agent_deny.conf

# Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
# Block the listed UAs as well as requests with an empty UA (the ^$ at the end)
if ($http_user_agent ~* "YisouSpider|yisouspider|MJ12bot|BLEXBot|java|Baiduspider|Baiduspider-image+|AhrefsBot|Googlebot|YandexBot|Uptimebot|bingbot|Sogou web spider|Nimbostratus-Bot|python-requests|ips-agent|Researchscan|SemrushBot|360Spider|Python-urllib|zgrab|special_archiver|archive.org_bot|facebookexternalhit|DotBot|Dataprovider.com|nsrbot|panscient.com|HTTrack|Apache-HttpClient|^$" ) {
    return 403;
}
# Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

Then, in the site's configuration, insert the following line right after  location / {  (see the placement sketch below):

include agent_deny.conf;
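As a minimal sketch of that placement (example.com, the root path, and the surrounding directives are placeholders):

server {
    listen 80;
    server_name example.com;
    root /www/wwwroot/example;

    location / {
        include agent_deny.conf;   # the UA filtering rules defined above
        # ... the rest of the location configuration
    }
}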

For example, the configuration of zhaoying.org; here the include sits at the server level instead of inside location / {, which works just as well, since the if rules are valid in both server and location context:

[root@4rdbg /]# cat /www/server/panel/vhost/nginx/zhaoying.org.conf
server
{
    listen 80;
    listen 443 ssl http2;
    server_name zhaoying.org www.zhaoying.org;
    index index.php;
    root /www/wwwroot/zyphoto;

    # anti-crawler
    include agent_deny.conf;
    # anti-crawler

After saving, run the following command to smoothly restart Nginx:

/etc/init.d/nginx restart
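Alternatively, the standard nginx commands below give a syntax check plus a truly graceful reload, and the rules can then be verified with curl; a sketch in which zhaoying.org stands in for your own domain:

nginx -t          # test the configuration before applying it
nginx -s reload   # graceful reload: worker processes finish current requests first

# Blocked and empty UAs should now get 403, a normal browser UA 200:
curl -I -A "Scrapy" https://zhaoying.org/
curl -I -A "" https://zhaoying.org/
curl -I -A "Mozilla/5.0" https://zhaoying.org/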