Friday 5 March 2021

Can't figure out any efficient way to avoid being banned while scraping data from a login based site

I'm trying to write a script that parses a few fields from a website without getting blocked. The site I want to get data from requires credentials to access its content. If it weren't for the login requirement, I could have worked around the rate limit by rotating proxies.
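For context, the proxy rotation mentioned above could look roughly like the sketch below. The proxy addresses and the `proxy_cycle` and `fetch_with_rotation` helpers are hypothetical names made up for illustration; the only library behaviour assumed is that the session's `get` accepts `proxies` and `timeout` keyword arguments, as `requests.Session.get` does.

```python
import itertools

# Hypothetical proxy addresses, for illustration only.
PROXIES = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


def proxy_cycle(proxies):
    """Yield requests-style proxy mappings in round-robin order, forever."""
    for address in itertools.cycle(proxies):
        yield {"http": address, "https": address}


def fetch_with_rotation(session, urls, proxies, timeout=10):
    """Fetch each URL through the next proxy in the rotation.

    `session` is anything with a requests-like get(url, proxies=..., timeout=...),
    for example a requests.Session().
    """
    rotation = proxy_cycle(proxies)
    responses = []
    for url in urls:
        responses.append(session.get(url, proxies=next(rotation), timeout=timeout))
    return responses


# Usage (assumes the requests package is installed and the proxies are reachable):
#   import requests
#   with requests.Session() as s:
#       pages = fetch_with_rotation(s, ["https://example.com/page/1"], PROXIES)
```

Rotation like this only varies the outgoing IP address. Every request made after logging in is still tied to the same account, so I don't expect it to be enough on its own for a site that requires credentials.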

Since I'm scraping content from a login-based site, I'm looking for a way to avoid being banned by it. To be specific, my script currently fetches content from the site without any problems, but my IP address gets banned if I keep scraping.

This is what I've written so far (treat the site address below as a placeholder):

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # Load the login page first to pick up cookies and the hidden form token
    req = s.get(url)

    payload = {
        "fkey": BeautifulSoup(req.text,"lxml").select_one("[name='fkey']")["value"],
        "email": "some email",
        "password": "some password",
    }

    # Submit the credentials; the session keeps the login cookies afterwards
    res = s.post(url,data=payload)
    soup = BeautifulSoup(res.text,"lxml")
    # Print the question titles from the page returned after logging in
    for post_title in soup.select(".summary > h3 > a.question-hyperlink"):
        print(post_title.text)

How can I avoid being banned while scraping data from a login-based site?


