Monday, 11 February 2019

Why is my XMLHttpRequest getting a robots.txt response

I'm writing a personal Chrome extension. (Note: extensions can make cross-origin requests; see https://developer.chrome.com/extensions/xhr.)
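For context, cross-origin XHR from an extension only works when the target origin is listed under the manifest's permissions. A minimal sketch of the relevant manifest fields (the extension name and match pattern here are placeholders, not my real values):

```json
{
  "name": "my-scraper-extension",
  "version": "1.0",
  "manifest_version": 2,
  "permissions": [
    "webRequest",
    "webRequestBlocking",
    "https://www.example.com/*"
  ],
  "background": { "scripts": ["background.js"] }
}
```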

I'm trying to use XMLHttpRequest to fetch a certain website, then extract data from it with JavaScript. My problem is that this website often returns its "robots" page to me instead of the HTML. Of course, when I visit the website in my browser, it works fine. Also, if I visit the website in my browser and THEN make the XHR request, it also works fine.

I thought the problem might be that my request headers were incorrect, so I used chrome.webRequest to rewrite them to be identical to the ones my browser sends. Unfortunately, this did not work either. One thing I noticed is that my browser's requests carry some cookies that I do not know how to replicate (see below).
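To be concrete about what I mean by rewriting headers: in a blocking onBeforeSendHeaders listener I merge my desired values over the outgoing ones. The merge itself is plain JavaScript; the commented-out registration below is a sketch that assumes the webRequest and webRequestBlocking permissions, and the header values are just examples:

```javascript
// Merge desired header values over an outgoing header list.
// `headers` is the requestHeaders array that chrome.webRequest
// provides: [{ name, value }, ...]. Header-name matching is
// case-insensitive, per HTTP.
function overrideHeaders(headers, overrides) {
  const out = headers.filter(
    (h) => !Object.keys(overrides).some(
      (k) => k.toLowerCase() === h.name.toLowerCase()
    )
  );
  for (const [name, value] of Object.entries(overrides)) {
    out.push({ name, value });
  }
  return out;
}

// In the extension's background script this would be wired up as:
// chrome.webRequest.onBeforeSendHeaders.addListener(
//   (details) => ({
//     requestHeaders: overrideHeaders(details.requestHeaders, {
//       "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) ...",
//       "Accept-Language": "en-US,en;q=0.9",
//     }),
//   }),
//   { urls: ["https://www.example.com/*"] },
//   ["blocking", "requestHeaders"]
// );
```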

Therefore, my question is: how can I solve or debug this problem? Is there a way to find out WHY the site is serving me its "robots" page? If I read its robots.txt file, I am not breaking any obvious rules. I am pretty new to JavaScript and web programming, so apologies if this is a basic question.


Here is an example of my browser request headers:

GET /XXX/XXX HTTP/1.1
Host: www.example.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.example.com/XXX
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: D_IID=XXX-XXX-XXX-XXX-XXX; D_UID=XXX-XXX-XXX-XXX-XXX; D_ZID=XXX-XXX-XXX-XXX-XXX; D_ZUID=XXX-XXX-XXX-XXX-XXX; D_HID=XXX-XXX-XXX-XXX-XXX; D_SID=XXX/XXX/XXX


UPDATE

I am also including my "General" headers defined in Chrome:

Request URL: https://www.example.com/XXX
Request Method: GET
Status Code: 200 OK
Remote Address: XXX
Referrer Policy: no-referrer-when-downgrade

And my response headers:

Cache-Control: private, no-cache, no-store, must-revalidate
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html
Date: Wed, 06 Feb 2019 XXX GMT
Edge-Control: no-store, bypass-cache
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Server: XXX
Surrogate-Control: no-store, bypass-cache
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-DB: 0
X-DW: 0
X-DZ: XXX


UPDATE 2

After looking at the response HTML, I am not sure what it is. I originally thought it was some kind of robots response because it contains META NAME="ROBOTS", but now I am less sure. Here is the general structure of the HTML:

<!DOCTYPE html>
<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 XXX GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=example.com" />
<script type="text/javascript">
// SOME JAVASCRIPT
</script>
<script type="text/javascript" src="/example.js" defer></script></head>
<body>
<div id="XXX">&nbsp;</div>
</body>
</html>
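The structure above is distinctive enough to recognize programmatically. Here is a small heuristic check I can run on the response text to tell this interstitial apart from the real page (the marker strings are taken from the HTML above; it is only a heuristic, since a legitimate page could in principle contain the same tags):

```javascript
// Heuristic: does a response body look like the interstitial page
// above rather than the real document? Checks for the ROBOTS meta
// tag and the meta refresh that the interstitial carries.
function looksLikeRobotsInterstitial(html) {
  const upper = html.toUpperCase();
  const hasRobotsMeta =
    upper.includes('META NAME="ROBOTS"') &&
    upper.includes('CONTENT="NOINDEX, NOFOLLOW"');
  const hasMetaRefresh = upper.includes('HTTP-EQUIV="REFRESH"');
  return hasRobotsMeta && hasMetaRefresh;
}
```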


from Why is my XMLHttpRequest getting a robots.txt response
