Wednesday, 2 October 2019

Scrapy: 1) What is throttling the spider 2) How to delete aged request objects?

I understand this is inherently a difficult question to answer without knowing what the full code is.

Here is a summary of the code:

  • Starts with many start URLs
  • Saves some data to the request metadata (meta)
  • Has a global dictionary variable, initialized with the current time
  • Follows all links that share the same domain
  • Checks the current time against the initialization time and stops following links when the difference is too large, since some domains have too many links (How to limit scrapy request objects?)
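The per-domain time limit in the last two bullets can be sketched in plain Python. The names here (`domain_started`, `MAX_CRAWL_SECONDS`, `should_follow`) are illustrative assumptions, not the original code:

```python
import time
from urllib.parse import urlparse

MAX_CRAWL_SECONDS = 600  # assumed budget: stop following a domain after 10 minutes

domain_started = {}  # global dict: domain -> first-seen timestamp


def should_follow(url, now=None):
    """Return True while the URL's domain is still within its time budget."""
    now = time.time() if now is None else now
    domain = urlparse(url).netloc
    started = domain_started.setdefault(domain, now)
    return (now - started) <= MAX_CRAWL_SECONDS
```

In the spider's parse callback, links would only be yielded when `should_follow(url)` is true, so domains with huge link graphs are cut off after a fixed wall-clock budget.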

But generally, what causes such a non-smooth request curve? I don't do much processing, so I would have expected the spider to make as many requests as the concurrency settings allow. Furthermore, I have set concurrent requests to 256 and concurrent requests per IP to 1, so the bottleneck shouldn't be any specific website. The other interesting thing is that it starts off fast and slows down. That much I can understand: slow requests stay in the system longer and drag down the average.

Any thoughts would be welcome.

[Plot: request rate over the course of the crawl]

[scrapy.crawler] Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'stack', 'CONCURRENT_REQUESTS': 256, 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS_PER_IP': 1, 'DEPTH_LIMIT': 3, 'DEPTH_PRIORITY': 1, 'DOWNLOAD_DELAY': 1, 'LOG_ENABLED': False, 'LOG_LEVEL': 'INFO', 'MEMUSAGE_LIMIT_MB': 950, 'NEWSPIDER_MODULE': 'stack.spiders', 'REACTOR_THREADPOOL_MAXSIZE': 30, 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SPIDER_MODULES': ['stack.spiders'], 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'TELNETCONSOLE_HOST': '0.0.0.0'}
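Note that with these settings, `CONCURRENT_REQUESTS = 256` is unlikely to be the effective limit: `AUTOTHROTTLE_ENABLED = True`, `DOWNLOAD_DELAY = 1`, and per-domain/per-IP concurrency of 1 together cap each domain at roughly one request per second (AutoThrottle can push the delay even higher), so total throughput is bounded by the number of domains active at any moment. A sketch of settings that relax the per-domain throttle (the values are illustrative assumptions, not a recommendation for any particular site):

```python
# settings.py fragment -- relaxes per-domain throttling
AUTOTHROTTLE_ENABLED = False       # AutoThrottle raises delays dynamically
DOWNLOAD_DELAY = 0                 # per-domain wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0     # 0 disables the per-IP limit
```

When `CONCURRENT_REQUESTS_PER_IP` is non-zero it takes precedence over the per-domain limit, which is why the original configuration effectively serializes requests per IP.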

Update: I tried setting the concurrent requests to inf, which resolved the linearity issue but created a secondary problem.

Based on this update is it more obvious what is throttling the spider?

However, the number of scheduled requests now exceeds the number of completed requests by about 20%, causing memory usage to grow as the program runs. Is there a way to delete aged request objects so that memory usage does not grow over time?
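One way to keep aged requests from accumulating is to stamp each request with a creation time in its `meta` dict and drop it once it exceeds a cutoff (in Scrapy this would live in a downloader middleware that raises `scrapy.exceptions.IgnoreRequest`). A minimal, framework-free sketch of the age check; the names (`stamp`, `is_stale`, `MAX_REQUEST_AGE`) and the cutoff value are assumptions:

```python
import time

MAX_REQUEST_AGE = 300.0  # seconds; illustrative cutoff, tune per crawl


def stamp(meta, now=None):
    """Record a birth time on a request's meta dict if absent."""
    meta.setdefault("born", time.time() if now is None else now)
    return meta


def is_stale(meta, now=None):
    """True once the request has waited longer than MAX_REQUEST_AGE."""
    now = time.time() if now is None else now
    return (now - meta.get("born", now)) > MAX_REQUEST_AGE

# In a downloader middleware, process_request would call stamp() on new
# requests and raise IgnoreRequest when is_stale() is True, so old
# requests are discarded instead of lingering in the scheduler.
```

This bounds how long any single request can sit in the queue, at the cost of silently dropping pages from slow or deeply queued domains.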



from Scrapy: 1) What is throttling the spider 2) How to delete aged request objects?
