I'm trying to scrape this site:
http://www.occeweb.com/MOEAsearch/index.aspx
If I search for "A", I get multiple pages.
I can get the results of the 1st page fine, using:
import requests
from bs4 import BeautifulSoup

url = 'http://www.occeweb.com/MOEAsearch/index.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Hidden ASP.NET state fields from the initial page
vs = soup.find('input', {'id': '__VIEWSTATE'}).attrs['value']
ev = soup.find('input', {'id': '__EVENTVALIDATION'}).attrs['value']
cookies = {
    'ASP.NET_SessionId': 'f1vztt45bdcvzr45jkrbcoru',
}
headers = {
    'Proxy-Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Origin': 'http://www.occeweb.com',
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.143 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer': url,
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
}
data = {
    '__EVENTTARGET': 'gvResults',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': vs,
    '__VIEWSTATEGENERATOR': '2E193097',
    '__EVENTVALIDATION': ev,
    'txtSearch': 'A',
    'StartsEnds': 'rbBeginswith',
    'TxtSearchFirst': '',
    'btnSearch': 'Search',
}
r = requests.post(url, headers=headers, cookies=cookies, data=data)
soup = BeautifulSoup(r.text,'html.parser')
However, when I try to use the same __VIEWSTATE and __EVENTVALIDATION for the 2nd page, it doesn't work.
I have also tried pulling the __VIEWSTATE from the response of the POST request and using that in the subsequent call, no luck.
Note that I can get this to work for the first 11 pages of results by copying the __VIEWSTATE and __EVENTVALIDATION from Chrome dev tools on page 1 and holding them static. For pages after the first, I have to remove 'btnSearch':'Search' from the data for some reason.
However, these static __VIEWSTATE and __EVENTVALIDATION values fail on page 12. When I copy the curl command for page 12, its values work until page 22, and the same happens at pages 32, 42 and so on. So it seems the __VIEWSTATE needs to be updated once every 10 pages or so.
Problem is, the __VIEWSTATE I pull from the result of the POST request does not work, and I can't GET the updated __VIEWSTATE I need.
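The idea described above, reusing the hidden fields from the previous response for the next page request, could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the question and not a confirmed fix: the helper names hidden_fields and build_page_payload are hypothetical, and the 'Page$N' value for __EVENTARGUMENT is an assumption based on the usual ASP.NET GridView paging convention, so it should be checked against the request shown in dev tools. It uses only the standard library so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return all hidden form fields found in the given HTML (hypothetical helper)."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_page_payload(html, page, search="A"):
    """Build POST data for a results page from the previous response's HTML."""
    data = hidden_fields(html)  # __VIEWSTATE, __EVENTVALIDATION, etc.
    data.update({
        "__EVENTTARGET": "gvResults",
        "__EVENTARGUMENT": f"Page${page}",  # assumption: GridView paging argument
        "txtSearch": search,
        "StartsEnds": "rbBeginswith",
        "TxtSearchFirst": "",
        # 'btnSearch' is deliberately left out for pages after the first
    })
    return data
```

With a requests.Session in place of the hard-coded cookie, each request would then post build_page_payload(r.text, n) built from the immediately preceding response, so the state fields are refreshed on every page rather than held static.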
Thanks for your help!
from Python Scraping .aspx multiple pages, multiple __VIEWSTATES