I am trying to read a web page using a GET request in Python. The original URL is given here. I found that the information I am interested in is in a subpage with this URL (I replaced the authenticity token with XXX).
I tried using the second URL in my script, but I get a 406 error. Can you suggest what I am doing wrong? Is the authenticity token there to prevent scraping? If so, can I work around it?
import urllib.request
url = ...
# Send a browser-like User-Agent header with the request
agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = urllib.request.Request(url, headers=agent)
data = urllib.request.urlopen(req)  # this is the call that fails with the 406 error
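To make the failure easier to inspect, the call can be wrapped so that the status code and reason are printed instead of a bare traceback. This is only a minimal sketch: `summarize_http_error` and `fetch` are hypothetical helper names, not part of my original script, and the sketch does nothing to fix the 406 itself.

```python
import urllib.error
import urllib.request


def summarize_http_error(err):
    """Return a short description of an HTTPError, such as '406 Not Acceptable'."""
    return f"{err.code} {err.reason}"


def fetch(url, headers):
    """Return the response body as bytes, or None if the server answers with an HTTP error."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; report the status and move on
        print("Request failed:", summarize_http_error(err))
        return None
```

Calling `fetch(url, agent)` with the values from my script should then print the status line rather than raising.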
Thanks!
PS: This is how I get the URL using Chrome:
First I browse to https://www.goodreads.com/book/show/385228.On_Liberty
Then I open Chrome's developer tools (three dots -> More tools -> Developer tools) and choose the Network tab.
Then I go to the bottom of the page (just after the last review) and click "next".
In the developer tools window, I select the request, and under Headers I get the Request URL: https://www.goodreads.com/book/reviews/385228?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX
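The Request URL above can also be assembled from its query parameters instead of being pasted as one string. The sketch below does that and attaches two extra headers. It makes assumptions throughout: `build_reviews_request` is a hypothetical helper, and the `Accept` and `X-Requested-With` values are guesses at what the browser sends for this kind of request. They should be compared with the Request Headers shown in the developer tools, and I have not confirmed that they make the 406 go away. A 406 (Not Acceptable) generally means the server cannot produce a response matching the request's `Accept` headers, which is why those headers are the ones varied here.

```python
import urllib.request
from urllib.parse import urlencode

# Path taken from the Request URL shown in the developer tools
REVIEWS_URL = "https://www.goodreads.com/book/reviews/385228"


def build_reviews_request(page, token, user_agent):
    """Build (but do not send) a request for one page of reviews."""
    params = {
        "csm_scope": "",
        "hide_last_page": "true",
        "language_code": "en",
        "page": page,
        # Pass the raw token here; urlencode percent-encodes it as needed
        "authenticity_token": token,
    }
    url = REVIEWS_URL + "?" + urlencode(params)
    headers = {
        "User-Agent": user_agent,
        # Assumed values, not taken from the original request: check them
        # against the headers Chrome actually sent
        "Accept": "text/javascript, application/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` in place of `req` in my script.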
from Getting a URL with an authenticity token using python