Wednesday, 1 May 2019

Can't get desired results using try/except clause within scrapy

I've written a script in scrapy that makes proxied requests using proxies freshly generated by the get_proxies() method. I used the requests module to fetch the proxies so that I can reuse them in the script. What I'm trying to do is parse all the movie links from the site's landing page and then fetch the name of each movie from its target page. The script below rotates through the proxies.

I know there is an easier way to change proxies, as described under HttpProxyMiddleware, but I would still like to stick to the approach I'm trying here.

website link

This is my current attempt (it keeps using new proxies to fetch a valid response, but every time it gets 503 Service Unavailable):

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():
    # Scrape ip:port pairs from the proxy table, keeping rows whose text contains "yes".
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    return [
        ':'.join([row.select_one("td").text, row.select_one("td:nth-of-type(2)").text])
        for row in soup.select("table.table tbody tr")
        if "yes" in row.text
    ]

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

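    def start_requests(self):
        # Shuffle, then take the first entry: in effect a random pick from the pool.
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        # Note: Scrapy's HttpProxyMiddleware reads request.meta['proxy']; a key named
        # 'https_proxy' is carried along in meta but is not used to route the request.
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self, response):
        print(response.meta)
        # .get() returns None when nothing matches, so default to "" to keep the
        # membership test from raising TypeError on a normal (non-Cloudflare) page.
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get(default=""):
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request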

        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink,callback=self.parse_details)

    def parse_details(self,response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name":name}

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
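
A side note on the rotation used in start_requests and parse above: next(cycle(...)) builds a fresh iterator on every call, so it always returns the first element of the list it is given. Combined with the shuffle, that behaves like a random pick rather than a round-robin. A minimal sketch of the difference, assuming made-up proxy addresses (the addresses and the `rotation` name are illustrative and not part of the script):

```python
from itertools import cycle

# Hypothetical proxy addresses, for illustration only.
proxies = ["10.0.0.1:3128", "10.0.0.2:8080", "10.0.0.3:80"]

# A new cycle object starts from the beginning each time,
# so this always yields the first element of the list.
first_picks = [next(cycle(proxies)) for _ in range(3)]

# Keeping a single cycle object around gives round-robin rotation.
rotation = cycle(proxies)
round_robin = [next(rotation) for _ in range(4)]

print(first_picks)
print(round_robin)
```

If round-robin order matters, the cycle object would need to be created once (for example as a class attribute) and reused across requests.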

To check whether the request is being proxied, I printed response.meta and got results like this: {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.

Because I've overused the link while checking how proxied requests work within scrapy, I'm now getting a 503 Service Unavailable error, and I can see the text DDoS protection by Cloudflare in the response. However, I get a valid response when I apply the same logic with the requests module.

My earlier question: why can't I get a valid response when (I suppose) I'm using the proxies in the right way? [solved]

Bounty question: how can I define a try/except clause within my script so that it retries with a different proxy once a connection error is thrown with a certain proxy?
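
For context on the bounty question: Scrapy downloads asynchronously, so a connection error is not raised at the point where a request is yielded. It is passed instead to the callable given as the errback argument of scrapy.Request, as a Twisted Failure whose request attribute is the request that failed. The retry decision therefore comes down to some bookkeeping: forget the proxy that failed and pick another. Below is a minimal sketch of that bookkeeping only, in plain Python; the ProxyPool class, its method names, and the addresses are hypothetical and not part of Scrapy or of the script above.

```python
import random


class ProxyPool:
    """Hypothetical helper: hands out proxies and forgets the ones that fail."""

    def __init__(self, proxies):
        # De-duplicate while keeping the original order.
        self._proxies = list(dict.fromkeys(proxies))

    def pick(self):
        """Return a random proxy, or None when the pool is exhausted."""
        return random.choice(self._proxies) if self._proxies else None

    def discard(self, proxy):
        """Drop a proxy that produced a connection error (no-op if unknown)."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)

    def replacement_for(self, failed_proxy):
        """Discard the failed proxy and return a different one, or None."""
        self.discard(failed_proxy)
        return self.pick()

    def __len__(self):
        return len(self._proxies)


# Made-up addresses, for illustration only.
pool = ProxyPool(["10.0.0.1:3128", "10.0.0.2:8080", "10.0.0.3:80"])
failed = pool.pick()
retry_with = pool.replacement_for(failed)
print(failed, "->", retry_with, f"({len(pool)} proxies left)")
```

In a spider, an errback along these lines could read the failed proxy from failure.request.meta, call replacement_for, and yield failure.request.replace(...) with the new proxy in meta and dont_filter=True. That wiring is left out here and would need checking against the Scrapy version in use.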



