Saturday, 6 February 2021

Web Scraping - Identifying and executing a request; troubleshooting

I am having some trouble scraping data from the following website: https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin

When the page loads, it shows the first ~30 real estate listings in the city of São Paulo. Scrolling down loads more listings.

Usually I would use Selenium to get around this, but I want to learn how to do it properly, which I imagine means replicating the underlying HTTP requests.

Using Inspect in Chrome and watching what happens when I scroll down, I can see a request being made, which I presume is the one that retrieves the new posts.


If I copy the request as cURL, I get the following command:

curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
  -X "OPTIONS" ^
  -H "Connection: keep-alive" ^
  -H "Accept: */*" ^
  -H "Access-Control-Request-Method: GET" ^
  -H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
  -H "Origin: https://www.loft.com.br" ^
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
  -H "Sec-Fetch-Mode: cors" ^
  -H "Sec-Fetch-Site: same-site" ^
  -H "Sec-Fetch-Dest: empty" ^
  -H "Referer: https://www.loft.com.br/" ^
  -H "Accept-Language: en-US,en;q=0.9" ^
  --compressed
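
The URL in this command is hard to read because it was copied in the Windows cmd flavour: `^` is the cmd.exe escape character, and the backslashes before `[` and `]` are presumably there to stop curl from treating the brackets as a glob pattern. A minimal sketch for turning such a query string back into readable parameters, using a shortened copy of the one above (`decode_cmd_query` is a hypothetical helper name, and the naive stripping assumes no value contains a literal `^` or `\`):

```python
from urllib.parse import parse_qsl

# Shortened copy of the query string from the curl command (cmd.exe-escaped).
escaped_query = (
    r"city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo"
    r"^&limit=18^&offset=28^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED"
)


def decode_cmd_query(escaped):
    """Strip cmd.exe '^' escapes and backslash escapes, then parse the query string.

    Assumption: no parameter value contains a literal '^' or backslash.
    """
    cleaned = escaped.replace("^", "").replace("\\", "")
    # parse_qsl percent-decodes keys and values and keeps repeated keys in order.
    return parse_qsl(cleaned, keep_blank_values=True)


for key, value in decode_cmd_query(escaped_query):
    print(f"{key} = {value}")
```

Read this way, the repeated keys are `facetFilters[]`, `orderBy[]` and `status[]`, and `limit` and `offset` look like the paging controls.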

I am unsure of the proper way to convert this into a call to Python's requests module, so I used this website to do it: https://curl.trillworks.com/

The result is:

import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)

However, when I run it, I get a 204 (No Content) response.

So my questions are:

  1. What is the proper/best way to identify requests from this website? Are there any better alternatives to what I did?
  2. Once identified, is copy as curl the best way to replicate the command?
  3. What is the best way to replicate the command in Python?
  4. Why am I getting a 204?
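
On question 4, one detail in the copied command stands out: it uses `-X "OPTIONS"` together with `Access-Control-Request-Method: GET`, which is the shape of a CORS preflight request. A 204 (No Content) is a normal reply to a preflight, and the listings would come from the GET that the browser sends right after it. Below is a minimal sketch of that GET. It assumes the API accepts the request without the `loftUserId`, `orderByStatus` and `originType` parameters and without the custom loft/utm headers, none of which is confirmed. `build_params` and `fetch_page` are hypothetical helper names, and the keys use the `[]` form from the original URL rather than the `/[/]` form in the converted code.

```python
from urllib.parse import urlencode

API_URL = "https://landscape-api.loft.com.br/listing/search"  # taken from the curl command
STATUSES = ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]


def build_params(offset, limit=18):
    """Return the query parameters for one page, using the 'key[]' names from the URL."""
    params = [
        ("city", "São Paulo"),
        ("facetFilters[]", "address.city:São Paulo"),
        ("limit", str(limit)),
        ("limitedColumns", "true"),
        ("offset", str(offset)),
        ("orderBy[]", "rankB"),
        ("q", "pin"),
    ]
    # Repeated keys are sent as status[]=FOR_SALE&status[]=JUST_LISTED&...
    params += [("status[]", status) for status in STATUSES]
    return params


def fetch_page(offset, limit=18):
    """Send the actual GET (not the OPTIONS preflight) and return the response."""
    import requests  # local import so build_params can be used without requests installed

    # Assumption: a browser-like User-Agent and Referer are enough for this endpoint.
    headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://www.loft.com.br/"}
    return requests.get(
        API_URL, headers=headers, params=build_params(offset, limit), timeout=30
    )


if __name__ == "__main__":
    # Print the URL that would be requested, without touching the network.
    print(API_URL + "?" + urlencode(build_params(offset=28)))
```

If the endpoint returns JSON, `fetch_page(28).json()` would be the next thing to try, and increasing `offset` by `limit` on each call should mirror what scrolling does in the browser.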

