
How To Detect Http Response Status Code And Set A Proxy Accordingly In Scrapy?

Is there a way to set a new proxy IP (e.g. from a pool) according to the HTTP response status code? For example, start with an IP from an IP list until it gets a 503 response (o

Solution 1:

Scrapy has a downloader middleware, enabled by default, that handles proxies. It is called HttpProxyMiddleware, and it lets you supply a proxy meta key on your Request so that the request is sent through that proxy.

There are a few ways of doing this. The first and most straightforward is to set the proxy directly in your spider code:

from scrapy import Request

# inside your spider class
def parse(self, response):
    # retry any 4xx/5xx response through the proxy
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # bypass the duplicate filter: this URL was already requested once

Another, more elegant way is to use a custom downloader middleware, which handles this for all callbacks and keeps your spider code cleaner:

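import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        # retry 4xx/5xx through the proxy; skip if the proxy was already used, to avoid an endless retry loop
        if response.status in range(400, 600) and request.meta.get('proxy') != PROXY_URL:
            logging.debug('retrying [{}]{} with proxy: {}'.format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response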
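
For the middleware above to run, it has to be registered in the project settings. A minimal sketch, assuming the class lives in a hypothetical project/middlewares.py module; the priority 543 and the proxy address are placeholder values, not something this answer prescribes:

```python
# settings.py (sketch; module path, priority and proxy address are assumptions)

# The proxy that MyDM imports from project.settings
PROXY_URL = 'http://myproxy:8010'

# Register the custom downloader middleware alongside Scrapy's built-in ones
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 543,
}
```

The number only controls where the middleware sits relative to the built-in ones, so adjust it if it needs to see responses before or after another middleware.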

Note that by default Scrapy does not pass responses with status codes outside the 2xx range to your spider callbacks. Redirect codes (3xx) are handled automatically by the Redirect middleware, and 4xx and 5xx responses are rejected by the HttpError middleware. To handle responses other than 2xx in a callback, you need to either:

Specify that in the request meta:

Request(url, meta={'handle_httpstatus_list': [404, 505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})

Or set a project-wide or spider-wide setting:

HTTPERROR_ALLOW_ALL = True  # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific

as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
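
Coming back to the pool part of the question: the examples here use a single fixed proxy, but the selection logic can be kept in a small helper that moves to the next proxy only when a given status code is seen. This is only a rough sketch; ProxyPool, proxy_for and the proxy addresses are hypothetical names made up for illustration, not part of Scrapy:

```python
import itertools


class ProxyPool:
    """Cycle through a fixed list of proxy URLs (hypothetical helper, not a Scrapy API)."""

    def __init__(self, proxies, rotate_on=(503,)):
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._cycle = itertools.cycle(proxies)
        self.rotate_on = set(rotate_on)
        # start with the first proxy in the list
        self.current = next(self._cycle)

    def proxy_for(self, status):
        """Return the proxy to use after seeing `status`.

        The current proxy is kept unless `status` is one of the codes
        in `rotate_on`, in which case the pool advances to the next one.
        """
        if status in self.rotate_on:
            self.current = next(self._cycle)
        return self.current


# placeholder addresses, assumed for the example
pool = ProxyPool(["http://proxy-a:8010", "http://proxy-b:8010"])
```

In a downloader middleware's process_response you could then pass pool.proxy_for(response.status) as the proxy meta value of the retried Request instead of a fixed URL.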
