Setting random request headers with Scrapy middleware (Middleware)

路漫漫` · Published 2020-05-18 21:57:55

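First run scrapy startproject [project name],
cd into the project directory, then run scrapy genspider [spider name] "http://httpbin.org/"

We use this site because its /user-agent endpoint simply echoes back the User-Agent it received, which makes the result easy to verify.

First, look at two methods:
process_request
process_response
They are best read alongside the diagrams of Scrapy's middleware chain (images sourced from the web; they will be removed on request):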

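process_request

Runs before the downloader sends the request; this is usually where you set request headers or a proxy IP.
Takes two arguments: request, spider
Return values:

None: reading the diagram from left to right, if middleware 1 returns None, the request is passed on to middleware 2.
Response: if middleware 1 returns a Response object, the request is not passed to middleware 2 or to the downloader; the response goes through process_response and then on to the engine.
Request: if middleware 1 returns a Request object, Scrapy stops calling the remaining process_request methods and reschedules the returned request, so middleware 2 does not receive it directly.
Raising an exception causes the process_exception methods to be called.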
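
To make the three return values concrete, here is a minimal sketch of the dispatch logic in plain Python. It is a toy model, not Scrapy's actual implementation: ToyRequest, ToyResponse, run_process_request and the three example middlewares are hypothetical names invented for this illustration, and spider is always passed as None.

```python
class ToyRequest:
    """Stand-in for a request; not scrapy.Request."""
    def __init__(self, url):
        self.url = url


class ToyResponse:
    """Stand-in for a response; not scrapy.http.Response."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


def run_process_request(middlewares, request, download):
    """Walk the chain in order and act on each return value."""
    for mw in middlewares:
        result = mw.process_request(request, spider=None)
        if result is None:
            continue  # same request goes on to the next middleware
        if isinstance(result, ToyResponse):
            return "response", result  # skip remaining middlewares and the download
        if isinstance(result, ToyRequest):
            return "reschedule", result  # chain stops; the new request is queued again
        raise TypeError("process_request must return None, a response or a request")
    return "response", download(request)  # every middleware returned None


class PassThrough:
    def process_request(self, request, spider):
        return None


class ServeFromCache:
    def process_request(self, request, spider):
        return ToyResponse(request.url, "cached body")


class RewriteUrl:
    def process_request(self, request, spider):
        return ToyRequest(request.url + "?rewritten=1")


def fake_download(request):
    """Pretend downloader used only by this sketch."""
    return ToyResponse(request.url, "downloaded body")


req = ToyRequest("http://httpbin.org/user-agent")
print(run_process_request([PassThrough(), PassThrough()], req, fake_download)[1].body)
print(run_process_request([PassThrough(), ServeFromCache()], req, fake_download)[1].body)
print(run_process_request([RewriteUrl(), ServeFromCache()], req, fake_download)[0])
```

The three calls cover the three cases: everything returns None and the download happens, a middleware short-circuits with a response, and a middleware replaces the request so that the rest of the chain is skipped.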

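process_response

Runs once the data has been downloaded and is about to be handed to the engine.
Takes three arguments: request, response, spider
Return values:

Response: if middleware 3 returns a Response object, that new object, not the old Response, is passed on to middleware 2.
Request: if middleware 3 returns a Request object, the chain stops and that request is rescheduled and sent to the downloader again.
Raising IgnoreRequest calls the Request's errback method; if nothing handles the exception, it is ignored and not logged.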
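
The response side can be sketched the same way. Again this is only a toy model under stated assumptions: ToyRequest, ToyResponse, run_process_response, TagBody and RetryOn503 are hypothetical names, and the only behavior borrowed from Scrapy is that process_response runs in the reverse order of process_request and must return either a response or a request.

```python
class ToyRequest:
    """Stand-in for a request; not scrapy.Request."""
    def __init__(self, url):
        self.url = url


class ToyResponse:
    """Stand-in for a response; not scrapy.http.Response."""
    def __init__(self, url, status, body):
        self.url = url
        self.status = status
        self.body = body


def run_process_response(middlewares, request, response):
    """Walk the chain from the last middleware back to the first."""
    for mw in reversed(middlewares):
        result = mw.process_response(request, response, spider=None)
        if isinstance(result, ToyRequest):
            return "reschedule", result  # chain halts; request is downloaded again
        if not isinstance(result, ToyResponse):
            raise TypeError("process_response must return a response or a request")
        response = result  # the next middleware sees the new response
    return "response", response


class TagBody:
    """Returns a new response with a marker appended to the body."""
    def __init__(self, tag):
        self.tag = tag

    def process_response(self, request, response, spider):
        return ToyResponse(response.url, response.status, response.body + self.tag)


class RetryOn503:
    """Returns a fresh request when the server answered 503."""
    def process_response(self, request, response, spider):
        if response.status == 503:
            return ToyRequest(request.url)
        return response


chain = [TagBody("-1"), TagBody("-2"), RetryOn503()]  # middleware 1, 2, 3
req = ToyRequest("http://httpbin.org/user-agent")

kind, result = run_process_response(chain, req, ToyResponse(req.url, 200, "ok"))
print(kind, result.body)

kind, result = run_process_response(chain, req, ToyResponse(req.url, 503, ""))
print(kind, result.url)
```

With a 200 response, middleware 3 passes it through, then middleware 2 and middleware 1 each append their tag in that order. With a 503 response, middleware 3 returns a request and neither of the other two is called.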

Code

Spider

# -*- coding: utf-8 -*-
import scrapy
import json

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

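    def parse(self, response):
        # httpbin answers with JSON of the form {"user-agent": "..."}
        user_agent = json.loads(response.text)['user-agent']
        print(response.text)
        # Request the same URL again; dont_filter=True bypasses the duplicate filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)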

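The only thing you need to change is start_urls. Yielding a Request at the end of parse makes the spider keep requesting the same page, and dont_filter=True stops Scrapy from dropping that request as a duplicate.

middlewares.py

In middlewares.py, add a class that implements the method described above: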

import random


class HttprequsetheaderDownloaderMiddleware:
    # Add the User-Agent strings to choose from here
    header = [
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent and set it on the outgoing request
        user_agent = random.choice(self.header)
        request.headers['User-Agent'] = user_agent

Since the goal of this post is only to set a request header, implementing this one method is enough.

settings.py

Add the following here:
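
If you would rather not hard-code the list in the class, one possible variation is to read it from the project settings through from_crawler. The sketch below is an assumption-laden example, not part of the original project: the class name RandomUserAgentMiddleware and the setting name USER_AGENT_LIST are hypothetical, and it raises a plain ValueError on an empty list only to keep the example free of Scrapy imports.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical variant that takes its User-Agent pool from settings."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("USER_AGENT_LIST must contain at least one entry")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting, not a built-in one
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let the request continue down the chain
```

To use it you would define USER_AGENT_LIST as a list of strings in settings.py and register this class in DOWNLOADER_MIDDLEWARES in place of the one above.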

DOWNLOADER_MIDDLEWARES = {
   'HttpRequsetHeader.middlewares.HttprequsetheaderDownloaderMiddleware': 543,
}

Note that the path must end with the name of the class we wrote above.

Results
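
The number 543 is the middleware's order: process_request methods run from lower to higher values. As far as I know, Scrapy's built-in UserAgentMiddleware sits at 500 by default, so at 543 our method runs after it and its assignment wins. If you want to make that explicit, a sketch of an optional variation is to disable the built-in middleware by mapping it to None:

```python
# settings.py (sketch; the second entry is the one from this post)
DOWNLOADER_MIDDLEWARES = {
    # Optional: None disables Scrapy's built-in User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Our middleware; the value is its position in the chain
    "HttpRequsetHeader.middlewares.HttprequsetheaderDownloaderMiddleware": 543,
}
```

This is not required for the example to work; it only removes any doubt about which middleware sets the header.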

{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"
}

2020-05-18 21:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
}

2020-05-18 21:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

2020-05-18 21:22:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
}

2020-05-18 21:22:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
}

2020-05-18 21:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
}

2020-05-18 21:22:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: http://httpbin.org/user-agent)
{
  "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"
}

As the output shows, the User-Agent is picked at random for each request, so the middleware works as intended.
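
Rather than eyeballing the log, you can tally the responses. The sketch below uses the seven bodies from the run above, copied in as strings; in a real check you would collect response.text inside parse instead. The variable names bodies and counts are just illustrative.

```python
import json
from collections import Counter

# The seven response bodies printed in the run above
bodies = [
    '{"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"}',
    '{"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"}',
    '{"user-agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}',
    '{"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"}',
    '{"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"}',
    '{"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"}',
    '{"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"}',
]

# Count how often each User-Agent was echoed back by httpbin
counts = Counter(json.loads(body)["user-agent"] for body in bodies)

for user_agent, n in counts.most_common():
    print(n, user_agent)
```

All five entries of the list appear across the seven requests, two of them twice, which is what you would expect from random.choice on a small pool.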
