I have a small eBay web-scraping project in which I collect data on Pokemon Trading Card Game (TCG) items. My scraper is a friendly one: it sends a request only once every 20 seconds or more. I wanted to know whether eBay shows me all available items or only a subset, and unfortunately the latter is the case. I reviewed their documentation and other resources, but could not find in-depth information about their algorithm.
eBay filters the result set by geographical location and returns only the items they regard as important to me. In this specific example, some countries receive an additional 40,000 items, which is upsetting and leaves me anxious about missing out.
Preferably, I would retrieve this additional data not through a proxy but directly through an altered URL that returns the complete result set irrespective of geographical location, so that I can keep using their front-end interface.
For example, I have tried adding the Region parameter to my URL, but with no marked effect:
&Region=Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America
Similarly, I have tried adding the location parameter (worldwide), but this instead decreased the number of retrieved results:
&LH_PrefLoc=2
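To make it easier to see what the Region parameter actually sends, here is a minimal sketch that decodes the query string of adj_url_1 with Python's standard urllib.parse. The helper name decode_query is hypothetical and only for illustration. It shows that the region list is separated by "|" and that some names are percent-encoded twice (%252C%2520 decodes first to %2C%20 and only then to ", "). Whether eBay expects this double encoding is an assumption I have not verified.

```python
from urllib.parse import urlsplit, parse_qsl, unquote

# URL copied from the question (adj_url_1), split over lines for readability.
adj_url_1 = (
    "https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258"
    "?LH_Auction=1&rt=nc"
    "&Region=Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica"
    "%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America&_sop=5"
)


def decode_query(url):
    """Hypothetical helper: return the query parameters, percent-decoded once."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


params = decode_query(adj_url_1)

# After one round of decoding the separator "|" is visible,
# but names such as "Australia%2C%20Oceania" are still encoded.
regions_once = params["Region"].split("|")

# A second round of decoding yields the human-readable region names.
regions_twice = [unquote(region) for region in regions_once]

for once, twice in zip(regions_once, regions_twice):
    print(f"{once!r:30} -> {twice!r}")
```

Decoding the URL this way at least confirms that all eight regions are present in the request, so the lack of effect is unlikely to be caused by a malformed region list.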
Is there a way to overcome their result set limitation, perhaps by setting specific request headers or by modifying the URL differently? I feel like I'm hitting a wall.
# Code snippet showing the general method for retrieving the result counts.
# Run over the proxies, retrieve specific eBay URLs, and check how many results are returned per proxy.
# We scrape the Pokemon Trading Card Game (TCG) items.
# Note: ebay_proxies, header_generator, get_location and retrieve_soup are my own helpers (not shown).
from tqdm import tqdm

base_url = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&_sop=5'
adj_url_1 = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&Region=Europe%7CAustralia%252C%2520Oceania%7CAsia%7CAntarctica%7CAfrica%7CMiddle%2520East%7CNorth%2520America%7CSouth%2520America&_sop=5'
adj_url_2 = 'https://www.ebay.com/b/Pokemon-TCG/2536/bn_7117595258?LH_Auction=1&rt=nc&LH_PrefLoc=2&_sop=5'
urls = {"base_url": base_url, "adj_url_1": adj_url_1, "adj_url_2": adj_url_2}

cookies = {}
country_results = []

# Iterate over proxies
for proxy in tqdm(ebay_proxies[1:]):
    try:
        header = header_generator()
        # Look up the country the proxy's IP address belongs to
        proxy_ip = proxy[0]["https"].split(":")[0]
        proxy_country = get_location(proxy_ip, header, proxy[0], cookies)["country"]
        SP = {}
        for key, val in urls.items():
            soup, header, session = retrieve_soup(URL=val, proxy=proxy[0], headers=header, cookies=cookies)
            # The heading holds the result count, e.g. "168,113 Results"
            results = soup.find('h2', {'class': 'srp-controls__count-heading'}).text
            SP[key] = results
        SP["country_proxy"] = proxy_country
        country_results.append(SP)
    except Exception:
        # Any failure (dead proxy, missing heading, ...) skips this proxy
        print("proxy failed")
After filtering out duplicated countries, this gave the following result counts:

    base_url            adj_url_1           adj_url_2           country_proxy
0   168,113 Results     168,112 Results     169,021 Results     Germany
1   209,539 Results     209,533 Results     72,828 Results      Finland
13  203.499 resultados  203.502 resultados  25.995 resultados   Argentina
16  205.693 resultados  205.689 resultados  52.273 resultados   Bolivia
17  203.051 resultados  203.053 resultados  30.790 resultados   Peru
19  179,187 Results     179,188 Results     180,092 Results     United States
25  164,360 Results     164,358 Results     165,297 Results     India
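Because the counts come back as localized strings ("168,113 Results" versus "203.499 resultados"), comparing them numerically needs a small normalization step. The sketch below is one way to do that. The function name parse_result_count is hypothetical, and it assumes the counts are always whole numbers, so that both "," and "." act as thousands separators. The sample values are copied from the table above.

```python
import re


def parse_result_count(text):
    """Extract the integer count from e.g. '168,113 Results' or '203.499 resultados'.

    Assumption: counts are whole numbers, so ',' and '.' are both
    thousands separators and can simply be dropped.
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(re.sub(r"[.,]", "", match.group()))


# (base_url, adj_url_2) counts for a few countries, copied from the table.
sample = {
    "Germany": ("168,113 Results", "169,021 Results"),
    "Finland": ("209,539 Results", "72,828 Results"),
    "Argentina": ("203.499 resultados", "25.995 resultados"),
}

# Change in result count when LH_PrefLoc=2 is added to the base URL.
deltas = {
    country: parse_result_count(worldwide) - parse_result_count(base)
    for country, (base, worldwide) in sample.items()
}

for country, delta in deltas.items():
    print(f"{country}: {delta:+d}")
```

With the counts as integers it becomes straightforward to sort the proxies by result count or to quantify how strongly LH_PrefLoc=2 changes the number of results per country.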
I can't post the full source code, as I do not want to share my proxies, for obvious reasons. However, you can test the base_url yourself to see how many results are returned and whether that can be improved!
from eBay webscraping from different geographical locations is returning inconsistent number of results