The requests module has stopped working for me when I try to scrape Amazon. I've tried using cookies, custom headers, and changing IPs, but nothing works other than scraping through a browser. Does anyone know how they're able to detect this, and whether there's a good workaround using requests?
The really odd thing is that the request returns the page when sent through cURL, but when I convert it to Python code it returns a captcha page that I never see in my browser and that doesn't go away even with cookies.
For example, this cURL request returns the Amazon main page, but when turned into Python it returns a captcha page:
curl -L -vvv http://amazon.com -H "User-Agent:Mozilla 5.0"
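For a like-for-like comparison, here is a minimal sketch of roughly the same request in Python: a plain GET with only a User-Agent header, with redirects followed. It uses the standard library's urllib instead of requests so that it has no third-party dependency. The function names `build_curl_like_request` and `send` are made up for this illustration, and I have not verified that Amazon treats this the same way as cURL.

```python
import urllib.request


def build_curl_like_request(url="http://amazon.com", user_agent="Mozilla 5.0"):
    """Build (but do not send) a GET request carrying only a User-Agent header.

    The defaults mirror the cURL command above; both names are hypothetical helpers.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="GET")


def send(request, timeout=10):
    """Send the request and return (status, body).

    urlopen follows redirects by default, similar to curl -L.
    It raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

Calling `send(build_curl_like_request())` performs the actual network request. Note that urllib still adds a few headers of its own (for example Host and Accept-Encoding), so this is close to, but not byte-for-byte identical with, what cURL sends.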
This is my current code. I copied the request from the browser as cURL and converted it to Python, but it still doesn't work:
import requests

cookies = {
    'session-id': '135-4585428-6195300',
    'session-id-time': '2082787201l',
    'i18n-prefs': 'USD',
    'sp-cdn': '"L5Z9:IL"',
    'ubid-main': '132-1503580-7678418',
    'session-token': 'R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L',
    'csm-hit': 'tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
}

headers = {
    'authority': 'www.amazon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'cookie': 'session-id=135-4585428-6195300; session-id-time=2082787201l; i18n-prefs=USD; sp-cdn="L5Z9:IL"; ubid-main=132-1503580-7678418; session-token=R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L; csm-hit=tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
    'device-memory': '8',
    'downlink': '10',
    'dpr': '1',
    'ect': '4g',
    'rtt': '100',
    'sec-ch-device-memory': '8',
    'sec-ch-dpr': '1',
    'sec-ch-ua': '"Chromium";v="116", "Not)A;Brand";v="24", "Microsoft Edge";v="116"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-ch-viewport-width': '1037',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54',
    'viewport-width': '1037',
}

response = requests.get('https://www.amazon.com/dp/B002G9UDYG', cookies=cookies, headers=headers)
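To tell programmatically whether a response is the captcha page instead of the real product page, a small check along these lines might help. This is only a sketch: the marker strings in `DEFAULT_MARKERS` and the helper name `looks_like_captcha` are my own assumptions, so adjust them to whatever the blocked response body actually contains.

```python
# Hypothetical marker strings; replace them with text taken from the
# blocked response you actually receive.
DEFAULT_MARKERS = ("captcha", "enter the characters you see below")


def looks_like_captcha(html, markers=DEFAULT_MARKERS):
    """Return True if any marker occurs in the page body (case-insensitive)."""
    body = html.lower()
    return any(marker.lower() in body for marker in markers)
```

With the code above, this would be called as `looks_like_captcha(response.text)`. Printing `response.status_code` alongside it makes it easier to compare runs with different headers or cookies.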
from Problem scraping Amazon using requests: I get blocked even when using cookie and headers. I can only scrape using a browser. Any solution?