With OffsiteMiddleware you can control how to follow external links in Scrapy.
I want the spider to ignore all internal links on a site and follow external links only.
I tried dynamic rules that add the response URL's domain to deny_domains, but that didn't work.
Can you override get_host_regex in OffsiteMiddleware to filter out all onsite links? Any other way?
Clarification: I want the spider to ignore the domains defined in allowed_domains and all internal links on each domain it crawls. That is, the domain of every URL the spider follows must be ignored while the spider is on that URL. In other words: when the crawler reaches a site like example.com, I want it to ignore any links to example.com and only follow external links to sites other than example.com.
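To make the rule above concrete, here is a minimal sketch of the per-page check, written with the standard library only so it does not depend on any Scrapy internals. The names host_of, is_external and external_links are hypothetical helpers, not part of Scrapy. The sketch assumes that "internal" means the same hostname or a subdomain/parent of it, ignoring a leading "www."; comparing true registered domains would need a public suffix list, which is left out here.

```python
from urllib.parse import urljoin, urlparse


def host_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_external(page_url, link_url):
    """True if link_url points to a different site than page_url.

    Assumption: two hosts belong to the same site when they are equal
    or one is a subdomain of the other.
    """
    page_host = host_of(page_url)
    # Resolve relative links against the page they were found on.
    link_host = host_of(urljoin(page_url, link_url))
    if not page_host or not link_host:
        # No hostname (mailto:, javascript:, ...): never follow.
        return False
    same_site = (
        link_host == page_host
        or link_host.endswith("." + page_host)
        or page_host.endswith("." + link_host)
    )
    return not same_site


def external_links(page_url, hrefs):
    """Return absolute URLs for the hrefs that leave the current site."""
    return [urljoin(page_url, h) for h in hrefs if is_external(page_url, h)]
```

In a spider callback, the hrefs extracted from the response could be passed through external_links(response.url, hrefs) before yielding requests. Note that allowed_domains would have to be left unset for this to work, since OffsiteMiddleware drops requests to any domain not listed there.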