Sunday, 27 December 2020

Scrapy crawler on Heroku returning 503 Service Unavailable

I have a Scrapy crawler that scrapes data from a website and uploads it to a remote MongoDB server. I want to host it on Heroku so that it can scrape automatically over a long period. I am using scrapy-user-agents to rotate between different user agents. When I run scrapy crawl <spider> locally on my PC, the spider runs correctly and writes data to the MongoDB database.

However, when I deploy the project on Heroku, I get the following lines in my Heroku logs:

2020-12-22T12:50:21.132731+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://indiankanoon.org/browse/> (failed 1 times): 503 Service Unavailable

2020-12-22T12:50:21.134186+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36

(It fails the same way 9 times, until:)

2020-12-22T12:50:23.594655+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://indiankanoon.org/browse/> (failed 9 times): 503 Service Unavailable

2020-12-22T12:50:23.599310+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://indiankanoon.org/browse/> (referer: None)

2020-12-22T12:50:23.701386+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://indiankanoon.org/browse/>: HTTP status code is not handled or not allowed

2020-12-22T12:50:23.714834+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] INFO: Closing spider (finished)

In summary, the crawler can scrape the data from my local IP address, but it cannot when it runs on Heroku. Can changing something in the settings.py file correct it?
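One way to check whether the 503 depends on where the request comes from, rather than on Scrapy itself, is to send a single plain request from both environments and compare the status codes. The following is only a minimal sketch using the standard library; probe_status is a hypothetical helper name, and the user-agent string is whatever you choose to pass in.

```python
import urllib.error
import urllib.request


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send one GET request and return the HTTP status code.

    Hypothetical diagnostic helper, not part of Scrapy or scrapy-user-agents.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still available.
        return error.code


# Example call (commented out so that importing this file sends no traffic):
# print(probe_status("https://indiankanoon.org/browse/", "Mozilla/5.0"))
```

If the same call returns 200 locally and 503 from a Heroku dyno, that would suggest the site responds differently to Heroku's IP range, which no user-agent setting alone would change. This is an assumption to test, not a confirmed diagnosis.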

My settings.py file:

    BOT_NAME = 'indKanoon'

    SPIDER_MODULES = ['indKanoon.spiders']
    NEWSPIDER_MODULE = 'indKanoon.spiders'
    MONGO_URI = ''
    MONGO_DATABASE = 'casecounts'
    ROBOTSTXT_OBEY = False
    CONCURRENT_REQUESTS = 32
    DOWNLOAD_DELAY = 3
    COOKIES_ENABLED = False
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }
    ITEM_PIPELINES = {
        'indKanoon.pipelines.IndkanoonPipeline': 300,
    }
    RETRY_ENABLED = True
    RETRY_TIMES = 8
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]

