Saturday, 29 December 2018

Unable to rectify the logic within my script to make it stop when it's done

I've written a script in Python that uses proxies to scrape the links of different posts while traversing the different pages of a website. I've tried to make use of proxies from a list. The script is supposed to take a random proxy from the list, send a request to that website, and finally parse the items. However, if any proxy doesn't work, it should be kicked out of the list.

My script is doing its job in a faulty way, meaning it just keeps parsing on and on until all the proxies in the list are exhausted, even though the links have already been parsed.

What I'm trying to do is change my script so that it breaks as soon as the links are parsed, no matter whether there are still proxies left in the list; otherwise the script keeps scraping the same items repeatedly.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from itertools import cycle

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=15".format(page) for page in range(1,6)]

proxyVault = ['104.248.159.145:8888', '113.53.83.252:54356', '206.189.236.200:80', '218.48.229.173:808', '119.15.90.38:60622', '186.250.176.156:42575']

def make_requests(lead_url):
    while len(proxyVault) > 0:
        pitem = cycle(proxyVault)
        proxy = {'https': 'http://{}'.format(next(pitem))}
        try:
            res = requests.get(lead_url, proxies=proxy)
            soup = BeautifulSoup(res.text, "lxml")
            for item in soup.select(".summary .question-hyperlink"):
                get_title(proxy, urljoin(base_url, item.get("href")))
        except Exception:
            proxyVault.pop(0)

def get_title(proxy, itemlink):
    res = requests.get(itemlink, proxies=proxy)
    soup = BeautifulSoup(res.text, "lxml")
    print(soup.select_one("h1[itemprop='name'] a").text)

if __name__ == '__main__':
    ThreadPool(10).map(make_requests, lead_url)

Btw, the proxies used above are just placeholders.
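One way to get the stop condition right is to return as soon as a page has been fetched successfully, and to discard only the proxy that actually failed, instead of looping until the vault is empty. Here is a minimal sketch of that retry pattern; `fetch` is a hypothetical stand-in for the `requests.get` call so the control flow is easy to see, and `fetch_with_proxies` is not part of the original script:

```python
import random

def fetch_with_proxies(url, fetch, vault):
    """Try random proxies from `vault` until one succeeds, then stop.

    `fetch(url, proxy)` stands in for requests.get(url, proxies=...);
    it should return the page text or raise on failure. A failing
    proxy is removed from `vault`; None is returned only when the
    vault is exhausted.
    """
    while vault:
        proxy = random.choice(vault)
        try:
            return fetch(url, proxy)   # success: stop immediately
        except Exception:
            vault.remove(proxy)        # drop only the proxy that failed
    return None
```

Applied to the script above, this would mean `make_requests` returns (or breaks out of the `while` loop) right after the page has been parsed, rather than cycling through the remaining proxies again.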



