Hemant Vishwakarma: Trouble parsing product names out of some links with different depth

Friday, 31 August 2018

I've written a script in Python to reach the target pages of a website, where each category lists its available item names. The script below can get the product names from most of the links (generated by traversing the category links and then the subcategory links).

The script can parse the sub-category links revealed upon clicking the + sign located right next to each category, which are visible in the image below, and then parse all the product names from the target page. This is one such target page.

However, a few of the links do not have the same depth as the others. For example, this link and this one are different from usual links like this one.

How can I get all the product names from all the links irrespective of their different depth?

This is what I've tried so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://www.courts.com.sg/"

res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select(".nav-dropdown li a"):
    href = item.get("href")
    # Skip anchors without a usable href (missing, or "#" placeholders)
    if not href or "#" in href:
        continue
    newlink = urljoin(link, href)
    req = requests.get(newlink)
    sauce = BeautifulSoup(req.text, "lxml")
    # Print every product name listed on the category page
    for elem in sauce.select(".product-item-info .product-item-link"):
        print(elem.get_text(strip=True))
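One direction that might cope with the uneven depth, given only as a rough sketch and not a tested fix: treat every page the same way, collect product names if the page lists any, and otherwise queue the links it exposes and look one level further down. The traversal below is separated from the scraping so it can be tried without touching the site. `collect_product_names`, `read_page` and `max_pages` are hypothetical names introduced for illustration. `read_page(url)` is assumed to return the product names found on a page together with the category links found on it, for example by wrapping `requests.get` and the `.product-item-info .product-item-link` selector from the script above. Which selector yields the deeper category links on this site is an assumption that would need checking against the actual pages.

```python
from collections import deque


def collect_product_names(start_urls, read_page, max_pages=500):
    """Breadth-first walk that does not depend on how deep a category is.

    read_page(url) is a hypothetical callable returning a pair:
    (product names on the page, category links on the page).
    """
    seen = set()                 # URLs already visited, guards against cycles
    queue = deque(start_urls)
    names = []
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        products, children = read_page(url)
        products = list(products)
        if products:
            # Listing page: keep the names and do not descend further
            names.extend(products)
        else:
            # Intermediate page: go one level deeper
            queue.extend(c for c in children if c not in seen)
    return names


# Stand-in for the real site: url -> (product names, child links)
fake_site = {
    "/": ([], ["/tv", "/computers"]),
    "/tv": (["TV A", "TV B"], []),
    "/computers": ([], ["/computers/laptops"]),
    "/computers/laptops": (["Laptop X"], ["/"]),
}

names = collect_product_names(["/"], lambda url: fake_site.get(url, ([], [])))
print(names)  # ['TV A', 'TV B', 'Laptop X']
```

With the real site, the start URLs would be the links gathered from `.nav-dropdown li a`, and the `max_pages` cap keeps a wrong selector from crawling the whole site.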

How to find target links:
