I've written a script in Python that uses proxies to scrape the links of different posts while traversing the pages of a website. The proxies come from a list: the script is supposed to pick a random proxy from that list, send a request to the website with it, and finally parse the items. If any proxy doesn't work, it should be kicked out of the list.

My script does its job in a faulty way: it keeps parsing the same pages on and on until every proxy in the list has been exhausted, even though the links have already been parsed.

What I'm trying to do is change my script so that it breaks as soon as the links are parsed, no matter how many proxies are still left in the list; otherwise it keeps scraping the same items repeatedly. A rough sketch of the flow I'm after is below the script.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from itertools import cycle

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=15".format(page) for page in range(1,6)]

proxyVault = ['104.248.159.145:8888', '113.53.83.252:54356', '206.189.236.200:80', '218.48.229.173:808', '119.15.90.38:60622', '186.250.176.156:42575']

def make_requests(lead_url):
    while len(proxyVault) > 0:
        pitem = cycle(proxyVault)
        proxy = {'https': 'http://{}'.format(next(pitem))}
        try:
            res = requests.get(lead_url, proxies=proxy)
            soup = BeautifulSoup(res.text, "lxml")
            [get_title(proxy, urljoin(base_url, item.get("href"))) for item in soup.select(".summary .question-hyperlink")]
        except Exception:
            proxyVault.pop(0)

def get_title(proxy, itemlink):
    res = requests.get(itemlink, proxies=proxy)
    soup = BeautifulSoup(res.text, "lxml")
    print(soup.select_one("h1[itemprop='name'] a").text)

if __name__ == '__main__':
    ThreadPool(10).map(make_requests, lead_url)
Btw, the proxies
used above are just placeholders.
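To make the goal concrete, here is a rough sketch of the control flow I'm after for a single page (the function name `fetch_links` and the proxy handling are only illustrative; I haven't managed to fit this cleanly into the threaded script above): keep trying proxies only until the page has been fetched and parsed once, then stop instead of cycling through whatever proxies remain.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'

def fetch_links(url, proxies):
    # Try proxies one by one, but stop as soon as this page has been parsed once.
    while proxies:
        proxy = {'https': 'http://{}'.format(proxies[0])}
        try:
            res = requests.get(url, proxies=proxy, timeout=10)
            soup = BeautifulSoup(res.text, "lxml")
            links = [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
            return links  # done with this page; don't keep cycling through the remaining proxies
        except Exception:
            proxies.pop(0)  # drop the dead proxy and move on to the next one
    return []  # every proxy failed for this page

The idea is that the per-page function returns as soon as one proxy succeeds, so the while loop only keeps running while proxies keep failing. What I can't figure out is how to combine something like this with the ThreadPool and the shared proxyVault in my actual script without the threads interfering with each other.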