So I'm trying to scrape data from archived versions of a website using Scrapy. Here is my code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.utils.project import get_project_settings

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

class VSItem(Item):
    value = Field()

class vsSpider(scrapy.Spider):
    name = "lever"
    start_urls = [
        "https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/default.aspx"
    ]
    rules = (
        Rule(
            LinkExtractor(allow=r"https://web\.archive\.org/web/\d{14}/http://www\.novi\.k12\.mi\.us/.*"),
            callback="parse"
        ),
    )

    def parse(self, response):
        for elem in response.xpath("/html"):
            it = VSItem()
            it["value"] = elem.css("input[name='__VIEWSTATE']").extract()
            yield it

process = CrawlerProcess(get_project_settings())
process.crawl(vsSpider)
process.start()  # the script will block here until the crawling is finished
I set the start_urls to https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/ since that is the earliest archived version of the page. This script extracts the element I want from the page listed, but then stops there.
My question is: how can I automatically crawl every single archive of both the homepage (/default.aspx) and every sub-directory of the main site (e.g. not just /default.aspx but also, for example, /Schools/noviHigh/default.aspx and everything else)? Basically, loop through every possible URL that matches /https:\/\/web.archive.org\/web\/\d{14}\/http:\/\/www.novi.k12.mi.us\/.*/g (the \d{14} is there because the date stamp is in the form YYYYMMDDHHmmSS).
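For reference, here is a minimal sketch of one way this could work. Two Scrapy details matter: rules are only applied by CrawlSpider subclasses (a plain scrapy.Spider silently ignores the attribute), and CrawlSpider uses the parse method internally to implement its logic, so the rule's callback needs a different name. The spider name, the parse_page callback, and the ::attr(value) selector below are illustrative assumptions, not the original author's code; follow=True tells the rule to keep extracting links from every matched page.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Any snapshot of any page under the site: a 14-digit timestamp, then the original URL.
ARCHIVE_RE = r"https://web\.archive\.org/web/\d{14}/http://www\.novi\.k12\.mi\.us/.*"

class VSItem(Item):
    value = Field()

class ArchiveSpider(CrawlSpider):  # hypothetical spider, for illustration
    name = "novi_archive"
    start_urls = [
        "https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/default.aspx"
    ]
    rules = (
        # follow=True keeps following matching links from every page visited,
        # so the crawl spreads to other snapshots and sub-pages on its own.
        Rule(LinkExtractor(allow=ARCHIVE_RE), callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not pass the start page to the rule callback by
        # default; this hook scrapes it as well.
        return self.parse_page(response)

    def parse_page(self, response):
        it = VSItem()
        # Pull the value attribute of the hidden __VIEWSTATE input, if present.
        it["value"] = response.css("input[name='__VIEWSTATE']::attr(value)").get()
        yield it

process = CrawlerProcess()
process.crawl(ArchiveSpider)
process.start()

Scrapy's default duplicate filter should keep the crawl from revisiting the same snapshot URL twice, though visiting every dated capture of every page can still mean a very large number of requests.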
From: Scrapy - crawling archives of website plus all subdirectories