How To Fetch The Response Object Of A Request Synchronously On Scrapy?
Solution 1:
Request objects don't generate anything by themselves. Scrapy uses an asynchronous Downloader engine, which takes these Request objects and generates Response objects. If any method in your spider returns a Request object, it is automatically scheduled in the downloader, and the resulting Response object is passed to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out more at Scrapy's architecture overview.
Depending on when and where you are doing this, you can also schedule requests explicitly by telling the engine to schedule them:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, the spider argument here can most likely be self, and self.crawler is inherited from scrapy.Spider.
Alternatively, you can always block the asynchronous stack by using something like requests, for example:
import requests

def parse(self, response):
    item = {}
    # <img> elements carry their URL in the src attribute, not href
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        image_head = requests.head(response.urljoin(image_url))
        if 'image' in image_head.headers.get('Content-Type', ''):
            item['image'] = image_url
    yield item
It will slow your spider down but it's significantly easier to implement and manage.
Solution 2:
If anyone is still interested in a possible solution: I managed it by making a request with "requests" from inside a Scrapy function, like this:
import requests
import scrapy
request_object = requests.get(the_url_you_like_to_get)
# Build the selector from the body text of the requests response
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
Solution 3:
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders here.
As for checking whether a file exists, your spider can store the relevant information in a database or another data structure while parsing the scraped data in its parse*() methods, and you can query it later from your own code.
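As a rough sketch of the "store it, query it later" idea, an item pipeline could write each item's URL into SQLite. Everything here is hypothetical: the class name, the table layout, the "url" and "content_type" item keys, and the url_was_seen helper are my own placeholders, and the pipeline would still need to be enabled through the ITEM_PIPELINES setting.

```python
import sqlite3


class SeenUrlsPipeline:
    """Hypothetical item pipeline that records each item's URL in SQLite."""

    def __init__(self, db_path="seen_urls.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, content_type TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            "INSERT OR REPLACE INTO seen (url, content_type) VALUES (?, ?)",
            (item["url"], item.get("content_type")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


def url_was_seen(conn, url):
    """Query helper for code that runs outside the spider."""
    row = conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone()
    return row is not None
```

After the crawl, your own code can open the same database file and call url_was_seen(conn, some_url) instead of trying to get a Response back synchronously.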