How To Detect HTTP Response Status Code And Set A Proxy Accordingly In Scrapy?
Solution 1:
Scrapy has a downloader middleware, enabled by default, that handles proxies. It's called HttpProxyMiddleware, and it allows you to supply a proxy
meta key on your Request
so that proxy is used for that request.
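For example, a minimal self-contained sketch (the spider name and start URL are placeholders; the proxy URL is the same example one used below) that attaches the meta key up front:
import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy_example'               # hypothetical name
    start_urls = ['http://example.com']  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # HttpProxyMiddleware sees the 'proxy' meta key and routes
            # the request through that proxy.
            yield scrapy.Request(url, meta={'proxy': 'http://myproxy:8010'})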
There are a few ways of doing this. The first, straightforward way is to use it directly in your spider code:
def parse(self, response):
    if response.status in range(400, 600):
        # Request is scrapy.Request, imported at the top of your spider module
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # skip the dupefilter: this URL was already requested once
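A variant of the same idea, assuming you want to retry the exact request (same method, headers and callback) rather than build a fresh one, is to copy it with Request.replace():
def parse(self, response):
    if response.status in range(400, 600):
        # Copy the failed request, adding the proxy meta key and
        # disabling the dupefilter so the retry isn't dropped.
        return response.request.replace(
            meta=dict(response.request.meta, proxy='http://myproxy:8010'),
            dont_filter=True)
    # ... normal parsing continues here ...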
Another, more elegant way would be to use a custom downloader middleware, which handles this for multiple callbacks and keeps your spider code cleaner:
import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(
                response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
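For the middleware to actually run it has to be enabled in your project settings. A minimal sketch, assuming the class lives in project/middlewares.py (the module path and the priority 543 are placeholders):
# settings.py
PROXY_URL = 'http://myproxy:8010'  # proxy the middleware retries through

DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 543,  # hypothetical dotted path; pick any free priority
}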
Note that by default Scrapy doesn't let any response codes other than 200 through to your callbacks. It automatically handles redirects (3xx codes)
with the RedirectMiddleware
and raises errors on 4xx and 5xx responses
via the HttpErrorMiddleware. To handle responses other than 200 you need to either:
Specify it in the Request meta:
Request(url, meta={'handle_httpstatus_list': [404, 505]})
# or for all:
Request(url, meta={'handle_httpstatus_all': True})
Set project/spider-wide parameters (a per-spider sketch follows the list):
HTTPERROR_ALLOW_ALL = True            # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific codes
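As a per-spider alternative sketch, the HttpErrorMiddleware also honours a handle_httpstatus_list attribute set directly on the spider (the spider name and codes here are placeholders):
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'                    # placeholder name
    handle_httpstatus_list = [404, 505]  # these codes now reach your callbacks

    # Equivalent via per-spider settings:
    # custom_settings = {'HTTPERROR_ALLOWED_CODES': [404, 505]}

    def parse(self, response):
        if response.status == 404:
            self.logger.debug('got a 404 for %s', response.url)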
as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes