Friday 27 November 2020

Can't fetch all the titles from a webpage

I'm trying to recursively parse all the categories and their nested categories from this webpage. They ultimately lead to such a page and finally to this innermost page, from which I would like to fetch all the product titles.

The script can follow the steps above. However, when it has to fetch all the titles from the result pages by traversing every next page, it collects fewer titles than are actually listed.

This is what I've written:

import scrapy
from urllib.parse import urljoin


class mySpider(scrapy.Spider):
    name = "myspider"

    start_urls = ['https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/subcategory_pages/Cables_P-10/e3a9792d-bafa-4e89-8e3f-8b1a45bd2682']
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

    def parse(self, response):
        # Use the name=value part of the second Set-Cookie header as the cookiejar key.
        cookie = response.headers.getlist('Set-Cookie')[1].decode().split(";")[0]
        # Walk the category links, recursing until a product list page is reached.
        for item in response.xpath("//div[./h3[contains(.,'Category')]]/ul/li/a/@href").getall():
            item_link = response.urljoin(item.strip())
            if "/products/list_pages/" in item_link:
                yield scrapy.Request(item_link, headers=self.headers, meta={'cookiejar': cookie}, callback=self.parse_all_links)
            else:
                yield scrapy.Request(item_link, headers=self.headers, meta={'cookiejar': cookie}, callback=self.parse)

    def parse_all_links(self, response):
        # Request every product page linked from the current result page.
        for item in response.css("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]::attr(href)").getall():
            target_link = response.urljoin(item.strip())
            yield scrapy.Request(target_link, headers=self.headers, meta={'cookiejar': response.meta['cookiejar']}, callback=self.parse_main_content)

        # Follow the pager, resolving its link against the page's <base href>.
        next_page = response.css("a.pxc-pager-next::attr(href)").get()
        if next_page:
            base_url = response.css("base::attr(href)").get()
            next_page_link = urljoin(base_url, next_page)
            yield scrapy.Request(next_page_link, headers=self.headers, meta={'cookiejar': response.meta['cookiejar']}, callback=self.parse_all_links)

    def parse_main_content(self, response):
        # Print the product title found on the detail page.
        item = response.css("h1::text").get()
        print(item)
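
The next-page step in parse_all_links can be checked in isolation, without running the spider. Below is a minimal sketch of that resolution using only the standard library. The helper name resolve_next_page and the example.com URLs are hypothetical, made up for illustration and not taken from the site. The fallback to the page URL when no <base href> is found is also an assumption, not something the script above does.

```python
from urllib.parse import urljoin


def resolve_next_page(base_href, page_url, next_href):
    """Build an absolute next-page URL from a pager link.

    base_href: value of <base href> on the page, or None if absent.
    page_url:  URL of the page the pager link was found on (assumed fallback).
    next_href: raw href of the pager link, possibly relative.
    """
    # Prefer the <base href>; fall back to the page URL when it is missing.
    return urljoin(base_href or page_url, next_href.strip())


if __name__ == "__main__":
    # Hypothetical values, only to show the two cases.
    print(resolve_next_page("https://example.com/online/portal/",
                            "https://example.com/shop/list",
                            " gb?page=2 "))
    print(resolve_next_page(None,
                            "https://example.com/shop/list",
                            "gb?page=2"))
```

If the resolved URLs differ between the two cases for a real page, a missing or unexpected <base href> on some result pages would be one possible place where pagination goes wrong.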

How can I get all the titles available in that category?

The script gets a different number of results every time I run it.
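
To see which titles go missing between runs, rather than only how many, the titles from two runs can be compared as sets. This is a minimal sketch, assuming each run's titles have been saved to a list. The function name compare_runs and the sample titles are hypothetical.

```python
def compare_runs(first_run, second_run):
    """Report which titles appear in only one of two runs.

    Both arguments are iterables of title strings; duplicates are ignored.
    """
    first, second = set(first_run), set(second_run)
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
        "in_both": len(first & second),
    }


if __name__ == "__main__":
    # Made-up titles, only to show the shape of the report.
    run_a = ["Cable A", "Cable B", "Cable C"]
    run_b = ["Cable B", "Cable C", "Cable D"]
    print(compare_runs(run_a, run_b))
```

Titles that show up in one run but not the other point to requests that were dropped or pages that were skipped in that run, which is a narrower thing to investigate than a differing total.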


