Friday 30 August 2019

Scrapy does not fetch markup on response.css

I've built a simple Scrapy spider that runs on Scrapinghub:

import scrapy
from scrapy_splash import SplashRequest


class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

    def parse(self, response):
        urls = response.css('a.offer-details__title-link::attr(href)').extract()

        print(urls)
        for url in urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        multiple_locs_urls = response.css('a.offer-regions__label::attr(href)').extract()
        print(multiple_locs_urls)        
        for url in multiple_locs_urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        next_page_url = response.css('li.pagination_element--next > a.pagination_trigger::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'title': response.css('#jobTitle').extract_first(),
            'content': response.css('#description').extract_first(),
            'datePosted': response.css('span[itemprop="datePosted"]').extract_first(),
            'address': response.css('span[itemprop="address"]').extract_first(),
        }

The problem I am facing is that the response.css call that fills multiple_locs_urls returns an empty list, even though I can see the matching a.offer-regions__label elements in the markup in the browser.

I checked with scrapy shell, and the markup is missing from the response there as well. My guess is that these elements are rendered by JavaScript after the page loads, so they are not part of the HTML that Scrapy downloads.
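One way to test that guess outside of Scrapy is to run the same class lookup against the raw HTML and against the HTML copied from the browser's inspector. The following is only a sketch using the standard library; the helper names and both sample strings are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def links_with_class(html, css_class):
    """Return the hrefs of all <a class="css_class"> tags found in html."""
    parser = ClassLinkCollector(css_class)
    parser.feed(html)
    return parser.hrefs


# Hypothetical samples: the raw download versus the DOM after JavaScript ran.
raw_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = (
    '<div id="app">'
    '<a class="offer-regions__label" href="/offer/1">Berlin</a>'
    '</div>'
)

print(links_with_class(raw_html, "offer-regions__label"))       # []
print(links_with_class(rendered_html, "offer-regions__label"))  # ['/offer/1']
```

If the lookup only succeeds on the HTML copied from the browser, the links are most likely added client-side, which would match what scrapy shell shows.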

I added Splash, but it does not seem to be applied to the response that parse receives. How can I make Scrapy wait until the page has finished loading before running the query?


