Tuesday, 19 October 2021

Unable to resend requests correctly after replacing the redirected URL with the original one in a middleware

I've created a script using Scrapy to fetch some fields from a webpage. The URL of the landing page and the URLs of the inner pages get redirected very often, so I created a middleware to handle that redirection. After coming across this post, I understood that I need to return a request from process_request() after replacing the redirected URL with the original one.

Every request sent from the spider carries meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]}.

Since none of the requests are redirected automatically, I tried to replace the redirected URLs inside the _retry() method.

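def process_request(self, request, spider):
    # Rotate the User-Agent; returning None lets Scrapy keep processing the request.
    request.headers['User-Agent'] = self.ua.random

def process_exception(self, request, exception, spider):
    # Returning a Request here reschedules it instead of propagating the error.
    return self._retry(request, spider)

def _retry(self, request, spider):
    # Bypass the duplicate filter so the same URL can be scheduled again.
    request.dont_filter = True
    # redirect_urls is filled in by Scrapy's RedirectMiddleware;
    # its first entry is the originally requested URL.
    if request.meta.get('redirect_urls'):
        redirect_url = request.meta['redirect_urls'][0]
        redirected = request.replace(url=redirect_url)
        redirected.dont_filter = True
        return redirected
    return request

def process_response(self, request, response, spider):
    # Retry on redirect and rate-limit statuses; pass everything else through.
    if response.status in [301, 302, 307, 429]:
        return self._retry(request, spider)
    return response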
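
To make the intent of _retry() easier to check outside a crawl, here is a minimal sketch of just the URL-selection step. It is not a Scrapy API: pick_retry_url is a hypothetical helper, plain dicts stand in for request.meta, and the example.com URLs are placeholders. It assumes the redirect history is recorded the way Scrapy's RedirectMiddleware does it, with the originally requested URL as the first entry of redirect_urls. When dont_redirect is set, that middleware passes the response through without recording anything, so the key may well be missing and the sketch falls back to the current URL.

```python
def pick_retry_url(current_url, meta):
    """Return the URL to retry: the first recorded redirect URL, else current_url.

    Hypothetical helper for illustration only; `meta` stands in for request.meta.
    """
    redirect_urls = meta.get("redirect_urls") or []
    return redirect_urls[0] if redirect_urls else current_url


# A request that was redirected once: the original URL is recorded first.
redirected_meta = {"redirect_urls": ["https://example.com/start"]}
# A request sent with dont_redirect=True: no redirect history is recorded.
blocked_meta = {"dont_redirect": True}

print(pick_retry_url("https://example.com/landing", redirected_meta))
print(pick_retry_url("https://example.com/start", blocked_meta))
```

If the second case is what happens in practice, _retry() would simply return the same request with the same URL, which may be part of what I'm seeing.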

Question: How can I resend a request from the middleware after replacing the redirected URL with the original one?


