Wednesday 10 March 2021

Unable to boost the performance while parsing links from landing pages

I'm trying to speed up the following script using concurrent.futures. The problem is that even with concurrent.futures in place, performance stays the same: it has no visible effect on the execution time, so it fails to boost the performance.

I know that if I create another function and pass the links collected by get_titles() to it, so that it scrapes the titles from their inner pages, I can make concurrent.futures work. However, I wish to get the titles from the landing pages using the single function I've created below.

I used an iterative approach instead of recursion only because the recursive version would raise a RecursionError once more than roughly 1,000 calls are made (Python's default recursion limit).
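For reference, that limit can be read with sys.getrecursionlimit(), and it defaults to 1000 stack frames in CPython. A minimal sketch of the failure mode (the paginate() helper here is hypothetical, standing in for a recursive version of get_titles() where each "next page" becomes another call):

```python
import sys

def paginate(page, last_page):
    # Recursive pagination: every "next page" adds a stack frame.
    if page > last_page:
        return
    paginate(page + 1, last_page)

# CPython's default recursion limit is typically 1000 frames.
print(sys.getrecursionlimit())

try:
    paginate(1, 5000)  # far deeper than the default limit
except RecursionError:
    print("RecursionError: too many pages for a recursive approach")
```

Raising the limit with sys.setrecursionlimit() is possible but risky (it can overflow the C stack), which is why the iterative while-loop below is the safer design.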

This is what I've tried so far (the site link used within the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
link = 'https://stackoverflow.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link, headers=headers)
        soup = BeautifulSoup(res.text, "html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base, post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page:
            return
        link = urljoin(base, next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in [link]}
        futures.as_completed(future_to_url)
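For context on why the pool has no effect in the snippet above: [link] contains a single URL, so only one future is ever submitted, and that one callable walks every page sequentially inside its own while-loop. Only one worker thread is ever busy. A toy sketch illustrating the difference (time.sleep here stands in for network I/O, an assumption, not the actual scraper):

```python
import time
import concurrent.futures as futures

def fetch(page):
    time.sleep(0.2)  # stands in for one HTTP request
    return page

# One submitted callable that loops over all pages itself:
# it runs sequentially on a single worker, like get_titles() above.
start = time.perf_counter()
with futures.ThreadPoolExecutor(max_workers=5) as executor:
    executor.submit(lambda: [fetch(p) for p in range(5)]).result()
sequential = time.perf_counter() - start

# Five separately submitted callables: the workers overlap the waits.
start = time.perf_counter()
with futures.ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(fetch, range(5)))
concurrent_t = time.perf_counter() - start

print(f"one task: {sequential:.2f}s, five tasks: {concurrent_t:.2f}s")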

QUESTION:

How can I improve the performance while parsing links from landing pages?

EDIT: I know I can achieve the same result by taking the route below, but that is not what my initial attempt looks like:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
links = ['https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles, url): url for url in links}
        futures.as_completed(future_to_url)
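One small note on the executor boilerplate shared by both snippets: futures.as_completed() returns an iterator, so calling it without looping over it neither waits for the futures nor surfaces their results or exceptions (the `with` block is what waits here). If get_titles() returned its titles instead of printing them, the usual pattern would look something like this sketch (the stub scraper and URLs below are placeholders, not the real code):

```python
import concurrent.futures as futures

def get_titles(link):
    # Stand-in for the real scraper: return something per link.
    return f"titles from {link}"

links = [
    'https://stackoverflow.com/questions/tagged/web-scraping?page=1',
    'https://stackoverflow.com/questions/tagged/web-scraping?page=2',
]

results = {}
with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(get_titles, url): url for url in links}
    # Iterate over as_completed to consume each result (and re-raise
    # any exception) in whatever order the futures finish.
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        results[url] = future.result()

print(len(results))
```

Iterating the dict through as_completed also lets failed requests raise where they can be handled, instead of failing silently inside the pool.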


from Unable to boost the performance while parsing links from landing pages
