Friday 20 November 2020

Unable to rectify the logic of grabbing next page links to make the execution faster

I'm trying to create a script using concurrent.futures to make the execution faster. The site link that I've used within this script is a placeholder but the logic is the same.

I'm trying to let the script parse the target links from its landing page and then use those newly scraped links to fetch the required information from their inner pages. The landing page has a pagination button that leads to the next page. FYI, the next page button does not expose a highest page number, so I have to follow the next page link one page at a time, as shown below.

The way the following script walks through the next pages is slowing the process down.

Here is what I've tried so far:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary .question-hyperlink"):
        target_link = base.format(item.get("href"))
        yield target_link

    next_page = soup.select_one("a[rel='next']:contains('Next')")
    if next_page:
        next_page_link = base.format(next_page.get("href"))
        yield from get_links(next_page_link)

def get_info(target_link):
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text,"html.parser")
    title = soup.select_one("h1[itemprop='name'] > a").get_text(strip=True)
    user_name = soup.select_one(".user-details[itemprop='author'] > a").get_text(strip=True)
    return user_name,title

if __name__ == '__main__':
    base = "https://stackoverflow.com{}"
    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    with ThreadPoolExecutor(max_workers=6) as executor:
        for r in [executor.submit(get_info, item) for item in get_links(url)]:
            print(r.result())

What type of change should I bring about within the script to make it run faster?
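
One possible direction, sketched under assumptions rather than offered as a definitive fix: each next page link is only known after the previous page has been fetched, so the pagination itself stays sequential. What can change is that the recursion becomes a plain loop and finished results are handed back in completion order while later listing pages are still being fetched. The names `crawl` and `parse_listing` are hypothetical; `parse_listing(url)` is assumed to return the target links found on a listing page together with the next page link (or `None` on the last page), wrapping the same `requests` and `BeautifulSoup` calls used in `get_links` above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, wait


def crawl(start_url, parse_listing, get_info, max_workers=6):
    """Yield get_info results in completion order.

    parse_listing(url) is a hypothetical helper that must return
    (target_links, next_page_url_or_None) for one listing page.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = set()
        page_url = start_url
        # Pagination stays sequential: the next link is only known
        # once the current listing page has been parsed.
        while page_url:
            targets, page_url = parse_listing(page_url)
            # Inner pages are fetched by the pool in the background.
            pending.update(executor.submit(get_info, t) for t in targets)
            # Hand back whatever has already finished, without blocking.
            done, pending = wait(pending, timeout=0)
            for future in done:
                yield future.result()
        # No listing pages left: drain the remaining inner-page tasks.
        for future in as_completed(pending):
            yield future.result()
```

With the script above, this could be consumed as `for user_name, title in crawl(url, parse_listing, get_info): print(user_name, title)`. Results arrive in completion order rather than page order, which is an assumption about what the output needs.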


