Monday 28 August 2023

Scrapy - crawling archives of website plus all subdirectories

So I'm trying to scrape data from archived versions of a website using Scrapy. Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.utils.project import get_project_settings

try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
    
class VSItem(Item):
    value = Field()

class vsSpider(scrapy.Spider):
    name = "lever"
    start_urls = [
        "https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/default.aspx"
    ]
    rules = (
            Rule(
                LinkExtractor(allow=r"https://web\.archive\.org/web/\d{14}/http://www\.novi\.k12\.mi\.us/.*"),
                callback="parse"
                ),
            )

    def parse(self, response):
        for elem in response.xpath("/html"):
            it = VSItem()
            # grab the hidden ASP.NET __VIEWSTATE input as raw HTML
            it["value"] = elem.css("input[name='__VIEWSTATE']").extract()
            yield it
 
process = CrawlerProcess(get_project_settings())

process.crawl(vsSpider)
process.start() # the script will block here until the crawling is finished

I set the start_urls to https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/ since that is the earliest archived version of the page.

This script extracts the element I want from the page listed, but then stops there; no further pages are crawled.

My question is: how can I automatically crawl every single archive of both the homepage (/default.aspx) and every subdirectory of the main site (not just /default.aspx but also, for example, /Schools/noviHigh/default.aspx and everything else)? Basically, I want to loop through every possible URL that matches https://web\.archive\.org/web/\d{14}/http://www\.novi\.k12\.mi\.us/.* (the \d{14} is there because the date stamp is in the form YYYYMMDDHHmmSS).
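
For what it's worth, here is a minimal sketch of the direction I have been experimenting with, rewritten around CrawlSpider: as far as I can tell, the rules attribute is only honored by CrawlSpider (the base scrapy.Spider ignores it), and CrawlSpider reserves parse for its own link-following machinery, so the callback needs a different name. The parse_page name and vsArchiveSpider class below are just placeholders, and I haven't verified this against the full archive:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings

class VSItem(Item):
    value = Field()

class vsArchiveSpider(CrawlSpider):
    name = "lever"
    # stay on the Wayback Machine; the snapshot URLs embed the original host
    allowed_domains = ["web.archive.org"]
    start_urls = [
        "https://web.archive.org/web/20051120125133/http://www.novi.k12.mi.us/default.aspx"
    ]
    rules = (
        Rule(
            LinkExtractor(allow=r"https://web\.archive\.org/web/\d{14}/http://www\.novi\.k12\.mi\.us/.*"),
            callback="parse_page",  # must not be called "parse" under CrawlSpider
            follow=True,            # keep following matching links from every page crawled
        ),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not pass the start_urls responses to rule callbacks,
        # so handle the very first page here as well
        return self.parse_page(response)

    def parse_page(self, response):
        it = VSItem()
        it["value"] = response.css("input[name='__VIEWSTATE']").extract()
        yield it

process = CrawlerProcess(get_project_settings())
process.crawl(vsArchiveSpider)
process.start()

In theory the follow=True flag should make the spider keep extracting links matching the pattern from every page it visits, which would cover other snapshots of /default.aspx as well as sub-pages like /Schools/noviHigh/default.aspx, but I'm not sure this is the intended way to do it.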



from Scrapy - crawling archives of website plus all subdirectories
