Sunday 23 August 2020

Can't use https proxies along with reusing the same session within a script built upon asyncio

I'm trying to use an https proxy for async requests made with the asyncio library. For http proxies there are clear instructions, but I get stuck when it comes to https proxies. I would also like to reuse the same session rather than create a new session every time I send a request.

Here is what I've tried so far (the proxies used in the script are taken directly from a free proxy site, so treat them as placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

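async def get_text(url):
    global proxies, proxy_url
    while True:
        check_url = proxy_url
        # Entries in `proxies` already carry the http:// scheme, so use them as-is
        # (prefixing again would produce 'http://http://...').
        proxy = proxy_url
        print("trying using:", check_url)
        # Note: this still opens a new session on every attempt.
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    return await resp.text()
            except Exception:
                # Rotate only if no other task has already switched the proxy.
                if check_url == proxy_url:
                    proxy_url = proxies.pop()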

async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text, 'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main(links):
    # Run all page fetches concurrently.
    await asyncio.gather(*(field_info(url) for url in links))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    # asyncio.run creates and closes the event loop itself.
    asyncio.run(main(links))

How can I use https proxies within the script along with reusing the same session?
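
For reference, here is a minimal sketch of what reusing one session could look like, not a confirmed solution. It assumes aiohttp is installed; the proxy is passed per request, so a single ClientSession can be shared by every task. The names get_text(session, url, proxies) and main(urls, proxies) are made up for this illustration. As far as I know, an http:// proxy URL can still carry https:// target URLs by tunnelling through CONNECT, while support for proxies that are themselves reached over TLS (an https:// proxy URL) depends on the aiohttp version, so that part should be checked against the installed version's documentation.

```python
import asyncio


async def get_text(session, url, proxies):
    """Try each proxy in turn on the shared session; return the first body."""
    last_error = None
    for proxy in proxies:
        try:
            # The proxy is chosen per request, so the session itself is reused.
            async with session.get(url, proxy=proxy) as resp:
                resp.raise_for_status()
                return await resp.text()
        except Exception as exc:  # broad on purpose: free proxies fail in many ways
            last_error = exc
    raise RuntimeError(f"no working proxy for {url}") from last_error


async def main(urls, proxies):
    # Imported here so get_text can be exercised without aiohttp installed.
    import aiohttp

    # One ClientSession for the whole run, shared by every task.
    async with aiohttp.ClientSession() as session:
        tasks = (get_text(session, url, proxies) for url in urls)
        return await asyncio.gather(*tasks)
```

It could be driven with asyncio.run(main(links, proxies)), where links and proxies are the lists from the script above. Each task walks the proxy list independently instead of sharing a global, which avoids tasks racing on proxies.pop().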
