I am having some trouble scraping data from the following website: https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin
When the page loads, it shows the first ~30 real-estate listings in the city of São Paulo. Scrolling down loads more listings.
Usually I would use Selenium to get around this, but I want to learn how to do it properly, which I imagine means working directly with the requests module.
Using Chrome's DevTools (Network tab) and watching what happens when I scroll down, I can see a request being made that I presume is what retrieves the new listings.
If I copy it as cURL, I get the following command:
curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
-X "OPTIONS" ^
-H "Connection: keep-alive" ^
-H "Accept: */*" ^
-H "Access-Control-Request-Method: GET" ^
-H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
-H "Origin: https://www.loft.com.br" ^
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
-H "Sec-Fetch-Mode: cors" ^
-H "Sec-Fetch-Site: same-site" ^
-H "Sec-Fetch-Dest: empty" ^
-H "Referer: https://www.loft.com.br/" ^
-H "Accept-Language: en-US,en;q=0.9" ^
--compressed
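I noticed the carets (^) are just cmd.exe escaping added by Chrome on Windows; stripping them, the query string can be decoded with Python's standard library to see the parameters in plain form (a sketch, with the URL shortened to a few of the parameters for readability):

```python
from urllib.parse import urlsplit, parse_qsl

# Captured URL with the cmd.exe carets stripped; shortened to a few
# parameters for readability.
url = ('https://landscape-api.loft.com.br/listing/search'
       '?city=S%C3%A3o%20Paulo'
       '&facetFilters[]=address.city%3AS%C3%A3o%20Paulo'
       '&limit=18&offset=28&orderBy[]=rankB&q=pin'
       '&status[]=FOR_SALE&status[]=JUST_LISTED')

# parse_qsl decodes the percent-escapes and preserves repeated keys.
pairs = parse_qsl(urlsplit(url).query)
for key, value in pairs:
    print(f'{key} = {value}')
```

This makes it clear that `facetFilters[]`, `orderBy[]` and `status[]` are literal bracketed parameter names, and that `status[]` is simply repeated once per value.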
I was unsure of the proper way to convert this into a call using the Python requests module, so I used this website (https://curl.trillworks.com/) to do it.
The result is:
import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)
However, when I run it, I get a 204 response.
So my questions are:
- What is the proper/best way to identify the requests this website makes? Are there better alternatives to what I did?
- Once a request is identified, is "copy as cURL" the best way to replicate it?
- What is the best way to replicate the command in Python?
- Why am I getting a 204?
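For what it's worth, my current hypothesis on the 204 is that the captured command is the OPTIONS preflight the browser sends before the real call (a CORS preflight returns 204 No Content by design), and the listing data comes from the GET that follows it. A sketch of what I believe that GET looks like; the bracketed keys (`status[]` etc.) are my reading of the decoded query string rather than the converter's `/[/]` output, so treat them as an assumption:

```python
import requests

# Parameters recovered from the captured URL. The bracketed keys are
# an assumption based on decoding the query string by hand.
params = {
    'city': 'São Paulo',
    'facetFilters[]': 'address.city:São Paulo',
    'limit': '18',
    'limitedColumns': 'true',
    'loftUserId': '417b37df-19ab-4014-a800-688c5acc039d',
    'offset': '28',
    'orderBy[]': 'rankB',
    'originType': 'LISTINGS_LOAD_MORE',
    'q': 'pin',
    # A list value makes requests repeat the key, matching the original URL.
    'status[]': ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD'],
}

headers = {
    'Accept': '*/*',
    'Origin': 'https://www.loft.com.br',
    'Referer': 'https://www.loft.com.br/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

# Prepare (but do not send) the request, to inspect the encoded URL.
req = requests.Request('GET', 'https://landscape-api.loft.com.br/listing/search',
                       params=params, headers=headers).prepare()
print(req.url)
# To actually fetch: response = requests.Session().send(req)
```

Preparing the request without sending it lets me verify the encoding (the repeated `status[]` pairs, the percent-encoded `São Paulo`) against the URL captured in DevTools before making any live calls.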
from Web Scraping - Identifying and executing a request; troubleshooting
