I'm writing a personal Chrome extension (Note: these can make cross-origin requests. See https://developer.chrome.com/extensions/xhr).
I'm trying to use XMLHttpRequest to fetch a certain website, then extract data from it using JavaScript. My problem is that this website often returns its "robots" page to me instead of the actual page HTML. When I visit the website in my browser, it works fine. Also, if I visit the website in my browser first and THEN make the XHR request, the request works fine too.
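For reference, here is a simplified sketch of the kind of request I am making (the helper names are mine and the URL is a placeholder, not the site's real path):

```javascript
// Helper to check the response type. Note: the site returns
// Content-Type: text/html for BOTH the real page and the "robots"
// page, so this alone cannot distinguish them.
function isLikelyHtml(contentType) {
  return /text\/html/i.test(contentType || "");
}

// Fetch a page with XMLHttpRequest and hand the result to a callback.
function fetchPage(url, onDone) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.onload = function () {
    onDone(null, {
      status: xhr.status,
      contentType: xhr.getResponseHeader("Content-Type"),
      body: xhr.responseText
    });
  };
  xhr.onerror = function () {
    onDone(new Error("network error"));
  };
  xhr.send();
}

// Usage inside the extension (placeholder URL):
// fetchPage("https://www.example.com/XXX", function (err, res) { ... });
```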
I thought the problem might be that my request headers were incorrect, so I modified them to be identical to my browser's (using chrome.webRequest). Unfortunately, this did not work either. One thing I noticed is that my browser sends some cookies in its request headers, which I do not know how to replicate (see below).
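This is roughly how I am rewriting the headers (a simplified sketch: the overrideHeader helper, the User-Agent string, and the URL filter are illustrative; the listener shape is the Manifest V2 blocking webRequest API):

```javascript
// Replace (or add) one header in a webRequest-style header array,
// i.e. an array of { name, value } objects.
function overrideHeader(headers, name, value) {
  const out = headers.filter(function (h) {
    return h.name.toLowerCase() !== name.toLowerCase();
  });
  out.push({ name: name, value: value });
  return out;
}

// Register the listener only in an extension context.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      // Example: force the browser's User-Agent (placeholder value).
      const headers = overrideHeader(
        details.requestHeaders, "User-Agent", "Mozilla/5.0 ..."
      );
      return { requestHeaders: headers };
    },
    { urls: ["https://www.example.com/*"] },   // placeholder filter
    ["blocking", "requestHeaders"]
  );
}
```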
Therefore, my question is: how can I solve or debug this problem? Is there a way to find out WHY the site is delivering its "robots" page to me? If I look at its robots.txt file, I am not breaking any obvious rules. I am pretty new to JavaScript and web programming, so sorry if this is a basic question.
Here is an example of my browser request headers:
GET /XXX/XXX HTTP/1.1
Host: www.example.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.example.com/XXX
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: D_IID=XXX-XXX-XXX-XXX-XXX; D_UID=XXX-XXX-XXX-XXX-XXX; D_ZID=XXX-XXX-XXX-XXX-XXX; D_ZUID=XXX-XXX-XXX-XXX-XXX; D_HID=XXX-XXX-XXX-XXX-XXX; D_SID=XXX/XXX/XXX
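One thing I have found for inspecting the cookie situation: an extension with the "cookies" permission (plus a host permission for the domain) can read the site's cookies with chrome.cookies. This is only a sketch for listing whatever cookies exist — the D_* names above are set by the site itself, not something I create:

```javascript
// Join an array of chrome.cookies Cookie objects into a
// "name=value; name=value" string, i.e. Cookie-header format.
function formatCookieHeader(cookies) {
  return cookies.map(function (c) {
    return c.name + "=" + c.value;
  }).join("; ");
}

// Only runs in an extension context with the "cookies" permission.
if (typeof chrome !== "undefined" && chrome.cookies) {
  chrome.cookies.getAll({ domain: "www.example.com" }, function (cookies) {
    console.log("Cookie: " + formatCookieHeader(cookies));
  });
}
```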
UPDATE
I am also including my "General" headers defined in Chrome:
Request URL: https://www.example.com/XXX
Request Method: GET
Status Code: 200 OK
Remote Address: XXX
Referrer Policy: no-referrer-when-downgrade
And my response headers:
Cache-Control: private, no-cache, no-store, must-revalidate
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html
Date: Wed, 06 Feb 2019 XXX GMT
Edge-Control: no-store, bypass-cache
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Server: XXX
Surrogate-Control: no-store, bypass-cache
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-DB: 0
X-DW: 0
X-DZ: XXX
UPDATE 2
After looking at the response HTML, I am not sure what it is. I originally thought it was some kind of ROBOTS response because it contains META NAME="ROBOTS", but now I am less sure. Here is the general structure of the HTML:
<!DOCTYPE html>
<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 XXX GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=example.com" />
<script type="text/javascript">
// SOME JAVASCRIPT
</script>
<script type="text/javascript" src="/example.js" defer></script></head>
<body>
<div id="XXX"> </div>
</body>
</html>
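Based on the markers visible above, I can at least detect programmatically when I have received this page instead of the real one. This is only a heuristic sketch — the marker strings are taken from this single response, so they may need adjusting:

```javascript
// Heuristic: the interstitial page carries a ROBOTS meta tag with
// NOINDEX, NOFOLLOW, which the real page presumably does not.
function looksLikeInterstitial(html) {
  const upper = html.toUpperCase();
  return upper.indexOf('NAME="ROBOTS"') !== -1 &&
         upper.indexOf("NOINDEX, NOFOLLOW") !== -1;
}
```

With a check like this I could at least log or retry when the XHR comes back with the interstitial instead of the real content.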
from Why is my XMLHttpRequest getting a robots.txt response