Monday 4 January 2021

Can't scrape all the company names from a webpage

I'm trying to parse all the company names from this webpage. There are around 2431 companies in there. However, the way I've tried below can fetches me 1000 results.

This is what I can see about the number of results in response while going through dev tools:

hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431      <------------------------       
nbPages: 1
page: 0

How can I get the rest of the results using requests?

I've tried so far:

import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url,params=params,json=payload)
    print(len(r.json()['results'][0]['hits']))


from Can't scrape all the company names from a webpage

No comments:

Post a Comment