Sunday, 7 February 2021

How to organise a Scrapy crawler of a JavaScript website that should loop over a list of elements

I'm currently learning web scraping with Scrapy, and I wanted to scrape data from a website that uses JavaScript. I tried to integrate Scrapy with Selenium, but I would also be grateful for a solution that uses Splash, for example.

I can't share the website I'd like to crawl (it requires logging in anyway), but I'll try to describe the setup as accurately as I can.

To access the data I want to scrape, one needs to perform the following steps:

  1. Open the website
  2. Log in
  3. Choose an element from the JavaScript-powered list of elements
  4. Choose another element from another JavaScript-powered list of elements
  5. Then we reach another list of clickable elements, and the goal is to scrape all of them. Unfortunately, to access the data one needs to click each element and then choose the appropriate tab. This has to be repeated for every element with id='element-3': after opening the data tab and scraping it for one element-3 item, we need to go back, click the next element-3 item, open its data tab, scrape it, go back, and so on (see the loop sketch right after this list).
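
What I imagine for step 5 is a loop roughly like the sketch below. This is just my guess, not working code: the helper name is made up, the XPaths use the placeholder ids from the list above, and it assumes a Selenium driver that is already logged in and past steps 1-4:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions

    def scrape_all_element3_items(driver):
        """Hypothetical helper: assumes `driver` is already logged in and past steps 1-4."""
        element3_xpath = "//div[@id='element-3']"
        n_elements = len(driver.find_elements_by_xpath(element3_xpath))

        all_rows = []
        for i in range(1, n_elements + 1):
            # Re-locate the i-th item on every pass (XPath positions are 1-based),
            # because going back re-renders the list and stales old references
            WebDriverWait(driver, 10).until(
                expected_conditions.element_to_be_clickable(
                    (By.XPATH, f"({element3_xpath})[{i}]"))).click()

            # Open the tab holding the data table (placeholder XPath)
            WebDriverWait(driver, 10).until(
                expected_conditions.element_to_be_clickable(
                    (By.XPATH, "//div[@id='data-table']"))).click()

            # Grab the row text right away, before navigating away stales the elements
            all_rows.extend(
                row.text for row in driver.find_elements_by_xpath("//tbody/tr[position()>1]"))

            # Return to the list before the next iteration
            driver.back()

        return all_rows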

In a tutorial I followed, the Selenium actions were specified in the __init__ method and only driver.page_source was stored and passed later to the parse method. However, the HTML content changes with each action, and I can't really access the data without clicking on each element in the list. Therefore, I probably need a loop somewhere, and I have no idea how to structure my scraper so that it'll be efficient.
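
As far as I understand it, the tutorial's structure boils down to something like the simplified sketch below: a single page_source snapshot is captured in __init__ and then parsed in parse with a Scrapy Selector (the spider name and URL here are placeholders, not the tutorial's actual code):

    import scrapy
    from scrapy import Selector
    from selenium import webdriver

    class TutorialStyleSpider(scrapy.Spider):
        name = 'tutorial_style'
        start_urls = ['https://example.com']  # placeholder; the real content comes from Selenium

        def __init__(self):
            super().__init__()
            driver = webdriver.Chrome(executable_path="./chromedriver")
            driver.get('https://example.com')
            # ... log in and click through the lists with Selenium here ...
            self.html = driver.page_source  # a single snapshot of the final page
            driver.quit()

        def parse(self, response):
            # The downloaded response is ignored; the Selenium snapshot is parsed instead
            for row in Selector(text=self.html).xpath("//tbody/tr[position()>1]"):
                yield {'row': row.get()}

In my case, though, a single self.html snapshot can't cover all the element-3 items, which is why I'm unsure where the loop should live.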

The pseudo code (not a real example!) that mimics all the operations needed to access the data is below. Please note that I extract only the 10th element of the list that should be looped over, whereas in my desired solution I want to scrape the data from all the list elements.

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions

    class ExampleSpider(scrapy.Spider):
        name = 'example'

        def __init__(self):
            super().__init__()

            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_experimental_option("detach", True)

            # Pass the configured options to the driver
            driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
            driver.get('https://example.com')  # This is not a real URL

            # Logging in to the site (omitted for brevity)

            # Step 3: choose an element from the first JavaScript-powered list
            element1_xpath = "//div[@id='element-1']"
            WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
                .until(expected_conditions.presence_of_element_located((By.XPATH, element1_xpath))).click()

            # Step 4: choose an element from the second JavaScript-powered list
            element2_xpath = "//div[@id='element-2']"
            WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
                .until(expected_conditions.presence_of_element_located((By.XPATH, element2_xpath))).click()

            # Step 5: here, I select only the 10th element, but the goal is to
            # iterate over all elements found
            element3_xpath = "(//div[@id='element-3'])[10]"
            WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
                .until(expected_conditions.presence_of_element_located((By.XPATH, element3_xpath))).click()

            # Open the tab that holds the data table
            data_xpath = "//div[@id='data-table']"
            WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
                .until(expected_conditions.presence_of_element_located((By.XPATH, data_xpath))).click()

            # Skip the header row and print the data rows
            rows = driver.find_elements_by_xpath("//tbody/tr[position()>1]")
            for row in rows:
                print(row)

            self.html = driver.page_source

        def parse(self, response):
            pass
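
My best guess at the overall structure, then, is to run the whole Selenium loop in __init__, keep one page_source snapshot per element-3 item, and let parse iterate over the stored snapshots with Scrapy selectors. A rough sketch of what I mean (spider name, URL and selectors are placeholders, and explicit waits are omitted; I doubt this is the efficient way to do it):

    import scrapy
    from scrapy import Selector
    from selenium import webdriver

    class LoopingExampleSpider(scrapy.Spider):
        name = 'looping_example'
        start_urls = ['https://example.com']  # placeholder; used only to trigger parse()

        def __init__(self):
            super().__init__()
            self.snapshots = []  # one page_source per element-3 item

            driver = webdriver.Chrome(executable_path="./chromedriver")
            driver.get('https://example.com')
            # ... log in, then click element-1 and element-2 as in the pseudo code above ...
            # (explicit waits omitted here to keep the sketch short)

            n_elements = len(driver.find_elements_by_xpath("//div[@id='element-3']"))
            for i in range(1, n_elements + 1):
                # Click the i-th element-3 item, open its data tab, snapshot, go back
                driver.find_element_by_xpath(f"(//div[@id='element-3'])[{i}]").click()
                driver.find_element_by_xpath("//div[@id='data-table']").click()
                self.snapshots.append(driver.page_source)
                driver.back()

            driver.quit()

        def parse(self, response):
            # Parse each stored Selenium snapshot with normal Scrapy selectors
            for html in self.snapshots:
                for row in Selector(text=html).xpath("//tbody/tr[position()>1]"):
                    yield {'row': row.get()}

Is something like this a sensible way to organise the spider, or is there a more idiomatic Scrapy + Selenium (or Splash) structure for this kind of click-and-go-back loop?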


