Tuesday 29 August 2023

Problem scraping Amazon using requests: I get blocked even when using cookies and headers. I can only scrape using a browser. Any solution?

The requests module no longer works for me when trying to scrape Amazon. I've tried using cookies, setting headers, and changing IPs, but nothing works other than scraping through a browser. Does anyone know how Amazon detects this, and whether there's a good workaround using requests?

The really odd thing is that the request returns the page when sent through cURL, but when I convert it to Python code it returns a captcha page that I never see in my browser and that doesn't go away even with cookies.

For example, this cURL request returns the Amazon main page, but when turned into Python it returns a captcha page:

curl -L -vvv http://amazon.com -H "User-Agent:Mozilla 5.0"

This is my current code. I copied the request from the browser as a cURL command and converted it to Python, but it still doesn't work:

import requests

cookies = {
    'session-id': '135-4585428-6195300',
    'session-id-time': '2082787201l',
    'i18n-prefs': 'USD',
    'sp-cdn': '"L5Z9:IL"',
    'ubid-main': '132-1503580-7678418',
    'session-token': 'R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L',
    'csm-hit': 'tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
}

headers = {
    'authority': 'www.amazon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'cookie': 'session-id=135-4585428-6195300; session-id-time=2082787201l; i18n-prefs=USD; sp-cdn="L5Z9:IL"; ubid-main=132-1503580-7678418; session-token=R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L; csm-hit=tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
    'device-memory': '8',
    'downlink': '10',
    'dpr': '1',
    'ect': '4g',
    'rtt': '100',
    'sec-ch-device-memory': '8',
    'sec-ch-dpr': '1',
    'sec-ch-ua': '"Chromium";v="116", "Not)A;Brand";v="24", "Microsoft Edge";v="116"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-ch-viewport-width': '1037',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54',
    'viewport-width': '1037',
}

response = requests.get(
    'https://www.amazon.com/dp/B002G9UDYG',
    cookies=cookies,
    headers=headers,
    timeout=30,  # avoid hanging indefinitely if the server stalls
)
print(response.status_code)
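
To compare the two clients like for like, it may help to look at what requests would actually send for the minimal cURL example above. The sketch below only prepares the request and prints its headers; nothing is sent over the network. The function name build_minimal_request is made up for illustration, and the sketch does not claim that any particular header is the reason for the captcha. It only makes visible the defaults that requests adds on top of the single User-Agent header given to cURL.

```python
import requests


def build_minimal_request(url="http://amazon.com", user_agent="Mozilla 5.0"):
    """Prepare (but do not send) the requests counterpart of the cURL example.

    build_minimal_request is a hypothetical helper, not part of requests.
    """
    session = requests.Session()
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    # prepare_request merges the session's default headers into the request,
    # so the result reflects the headers requests would send for this call.
    return session.prepare_request(request)


if __name__ == "__main__":
    prepared = build_minimal_request()
    print(prepared.method, prepared.url)
    for name, value in prepared.headers.items():
        print(f"{name}: {value}")
```

The printed headers can then be checked against the request lines that curl -vvv shows (the lines starting with ">") to see where the two requests differ.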

