I've written a script in python in combination with selenium to scrape the links of different posts from its landing page and finally get the title of each post by tracking the url leading to its inner page. Although the content I parsed here are static ones, I used selenium to see how it works in multiprocessing.
However, my intention is to do the scraping using multiprocessing. So far I know that selenium doesn't support multiprocessing but it seems I was wrong.
My question: how can I reduce the execution time?
This is my try:
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
def get_links(link):
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
titles = [urljoin(url,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
return titles
def get_title(url):
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chromeOptions)
driver.get(url)
sauce = BeautifulSoup(driver.page_source,"lxml")
item = sauce.select_one("h1 a").text
print(item)
if __name__ == '__main__':
url = "https://stackoverflow.com/questions/tagged/web-scraping"
ThreadPool(5).map(get_title,get_links(url))
from Python selenium multiprocessing
Thanks, Experience with various technologies and businesses this is generally helpful.
ReplyDeleteStill, I followed step-by-step your method in this selenium training
selenium certification
selenium online training Hyderabad
selenium online courses