
How To Fetch The Response Object Of A Request Synchronously On Scrapy?

I believe using the 'callback' method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me. Anyway, I'm trying to make a method to check if

Solution 1:

Request objects don't generate anything by themselves.

Scrapy's asynchronous Downloader engine takes these Request objects and generates Response objects.

If any method in your spider returns (or yields) a Request object, it is automatically scheduled with the downloader, and the resulting Response object is passed to the specified callback (i.e. Request(url, callback=self.my_callback)). Check out more in Scrapy's architecture overview.

Now, depending on when and where you are doing it, you can schedule requests yourself by telling the downloader to queue them:

self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider) 

If you run this from inside a spider, spider here can most likely be self, and self.crawler is inherited from scrapy.Spider.

Alternatively, you can always block the asynchronous stack by using a synchronous library such as requests:

import requests

def parse(self, response):
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        # Blocking call: this pauses Scrapy's event loop while it waits.
        image_head = requests.head(image_url)
        if 'image' in image_head.headers.get('Content-Type', ''):
            item['image'] = image_url  # 'item' is assumed to be defined elsewhere

It will slow your spider down, but it's significantly easier to implement and manage.

Solution 2:

If anyone is still interested in a possible solution: I managed it by making a request with requests "inside" a Scrapy function, like this:

import requests

request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()

and then proceed.

Solution 3:

Scrapy uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.

Unless you are manually using a Downloader, it seems like you're using the framework incorrectly. I'd read a bit more about how to create proper spiders here.

As for checking whether a file exists: your spider can store the relevant information in a database or another data structure while parsing the scraped data in its parse*() methods, and you can query it later from your own code.
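A minimal, hypothetical sketch of that idea using SQLite (the table and helper names are my own, not part of Scrapy): record each scraped URL while parsing, then query the store later to check whether it already exists.

```python
import sqlite3

def init_db(path=":memory:"):
    # One-column table keyed on the URL; duplicates are ignored on insert.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS scraped (url TEXT PRIMARY KEY)")
    return conn

def record_url(conn, url):
    # Call this from the spider's parse*() method for each item scraped.
    conn.execute("INSERT OR IGNORE INTO scraped (url) VALUES (?)", (url,))
    conn.commit()

def url_exists(conn, url):
    # Later, your own code can ask whether a URL was already scraped.
    return conn.execute(
        "SELECT 1 FROM scraped WHERE url = ?", (url,)
    ).fetchone() is not None
```

This keeps the "have I seen this before?" question out of the asynchronous crawl entirely, which is exactly the separation this solution suggests.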
