Sunday, 2 April 2023

eBay webscraping from different geographical locations returns an inconsistent number of results

I have a small eBay webscraping project in which I collect data on the Pokemon Trading Card Game (TCG). My scraper is friendly: it sends a request only once every 20 seconds or more. I wanted to know whether eBay shows me all available items or only a subset, and unfortunately the latter is the case. I reviewed their documentation and other resources, but could not find in-depth information about their algorithm.

Based on geographical location, eBay filters the resultset and returns only the items it considers relevant to me. In this specific example, some countries receive roughly 40,000 additional items, which is frustrating because I worry about missing listings.

Preferably, I would not retrieve this additional data through a proxy, but directly through an altered URL that returns the complete resultset irrespective of geographical location, so that I can keep using their front-end interface.

For example, I tried adding the Region parameter to my URL, with no marked effect: &Region=Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America Similarly, I tried adding the location preference parameter (worldwide), but for several proxy countries this instead decreased the number of retrieved results: &LH_PrefLoc=2
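One way to rule out encoding mistakes in such URL variants is to build and inspect them programmatically. The following is only a minimal sketch using Python's standard urllib.parse; with_params is a hypothetical helper, not part of any eBay API. Decoding the Region value shows that parts of it are percent-encoded twice (%252C, %2520). Whether eBay expects that form is not something this sketch can confirm.

```python
from urllib.parse import parse_qsl, unquote, urlencode, urlsplit, urlunsplit


def with_params(url, **extra):
    """Return url with extra query parameters added (hypothetical helper)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(extra)
    return urlunsplit(parts._replace(query=urlencode(query)))


base_url = "https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&_sop=5"

# Variant with the worldwide location preference from the question.
worldwide_url = with_params(base_url, LH_PrefLoc="2")

# The Region value exactly as it appears in the question's URL.
region_raw = (
    "Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica"
    "%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America"
)

# One decoding pass still leaves %2C and %20 behind, so those parts were
# encoded twice; a second pass yields the readable region names.
decoded_once = unquote(region_raw)
decoded_twice = unquote(decoded_once)
regions = decoded_twice.split("|")

print(worldwide_url)
print(decoded_once)
print(regions)
```

Printing both decoding passes makes it easy to compare the value against what the browser sends when the same filter is selected by hand in the front-end.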

Is there a way to overcome their resultset limitation? Perhaps by setting specific request headers or different URL modifications? I feel like I'm hitting a wall.

# The snippet below shows the general methodology for retrieving the result counts.
# It iterates over the proxies, requests each eBay URL, and records how many results each proxy sees.
# The target is the Pokemon Trading Card Game (TCG) category.
# ebay_proxies, header_generator, get_location and retrieve_soup are defined elsewhere in the project and not shown here.
from tqdm import tqdm

base_url = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&_sop=5'
adj_url_1 = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&Region=Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America&_sop=5'
adj_url_2 = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&LH_PrefLoc=2&_sop=5'
urls = {"base_url": base_url, "adj_url_1": adj_url_1, "adj_url_2": adj_url_2}
cookies = {}
country_results = []

# Iterate over proxies
for proxy in tqdm(ebay_proxies[1:]):
    try:
        header = header_generator()
        # The proxy address is "ip:port"; the IP is used to look up the proxy's country.
        proxy_ip = proxy[0]["https"].split(":")[0]
        proxy_country = get_location(proxy_ip, header, proxy[0], cookies)["country"]
        SP = {}
        for key, val in urls.items():
            soup, header, session = retrieve_soup(URL=val, proxy=proxy[0], headers=header, cookies=cookies)
            # The result count is shown in the heading above the listing grid.
            results = soup.find('h2', {'class': 'srp-controls__count-heading'}).text
            SP[key] = results
        SP["country_proxy"] = proxy_country
        country_results.append(SP)
    except Exception as exc:
        # Report why the proxy failed instead of silently swallowing every error.
        print(f"proxy failed: {exc}")

After removing duplicate countries, this produced the following result counts:

    base_url            adj_url_1           adj_url_2           country_proxy
0   168,113 Results     168,112 Results     169,021 Results     Germany
1   209,539 Results     209,533 Results     72,828 Results      Finland
13  203.499 resultados  203.502 resultados  25.995 resultados   Argentina
16  205.693 resultados  205.689 resultados  52.273 resultados   Bolivia
17  203.051 resultados  203.053 resultados  30.790 resultados   Peru
19  179,187 Results     179,188 Results     180,092 Results     United States
25  164,360 Results     164,358 Results     165,297 Results     India
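The counts above arrive as localized strings ("168,113 Results", "203.499 resultados"), so they need to be normalized before they can be compared. A minimal sketch, assuming that both "," and "." act only as thousands separators in these headings (they never mark decimals here); parse_count is a hypothetical helper, and the data is the base_url column copied from the table.

```python
import re


def parse_count(text):
    """Extract the integer from a heading such as '168,113 Results'.

    Assumes ',' and '.' are thousands separators only (hypothetical helper).
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(re.sub(r"[.,]", "", match.group()))


# base_url column of the table, keyed by proxy country.
base_counts_raw = {
    "Germany": "168,113 Results",
    "Finland": "209,539 Results",
    "Argentina": "203.499 resultados",
    "Bolivia": "205.693 resultados",
    "Peru": "203.051 resultados",
    "United States": "179,187 Results",
    "India": "164,360 Results",
}

base_counts = {country: parse_count(text) for country, text in base_counts_raw.items()}

# Gap between the country that sees the most items and the one that sees the fewest.
largest = max(base_counts, key=base_counts.get)
smallest = min(base_counts, key=base_counts.get)
spread = base_counts[largest] - base_counts[smallest]

print(f"{largest}: {base_counts[largest]}, {smallest}: {base_counts[smallest]}, spread: {spread}")
```

With the counts as integers, the same comparison can be repeated for adj_url_1 and adj_url_2 to see which URL variant narrows the gap between countries.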

I can't post the full source code, as I do not want to share my proxies for obvious reasons. However, you can test the base_url yourself to see how many results are returned and whether that number can be improved!


