I'm currently learning web scraping with Scrapy, and I want to scrape data from a website that relies on JavaScript. I tried to integrate Scrapy with Selenium, but I would also be grateful for a solution that uses Splash, for example.
I can't share the website I'd like to crawl (it requires logging in anyway), but I'll try to describe the setup as accurately as I can.
To access the data I want to scrape, one needs to perform the following steps:
- Open the website
- Log in
- Choose an element from the JavaScript-powered list of elements
- Choose another element from another JavaScript-powered list of elements
- Then we hit another list of clickable elements, and the goal is to scrape all of them. Unfortunately, to access the data, one needs to click each element and choose the appropriate tab. The operation has to be repeated for every element with id='element-3': after accessing the data and scraping it for one such element, we need to go back, click the next element with id='element-3', open the data tab, scrape it, go back, and so on (a rough sketch of this loop follows the list).
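To be concrete about that last step, below is roughly the loop I imagine is needed. It is only a sketch: the XPaths are the same placeholders as in the pseudo code further down, and scrape_all_element3 is just a name I made up. I re-find the i-th entry on every pass, since going back seems to invalidate the previously located elements.

from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions


def scrape_all_element3(driver):
    """Assumes the driver is already logged in and element-1/element-2 were clicked."""
    wait = WebDriverWait(
        driver, 10,
        ignored_exceptions=(NoSuchElementException, StaleElementReferenceException))
    element3_xpath = "//div[@id='element-3']"

    # Count the entries once, then re-locate by index on every pass,
    # because navigating back invalidates previously found WebElements.
    n_elements = len(driver.find_elements(By.XPATH, element3_xpath))
    all_rows = []
    for i in range(1, n_elements + 1):  # XPath positions are 1-based
        wait.until(expected_conditions.element_to_be_clickable(
            (By.XPATH, f"({element3_xpath})[{i}]"))).click()
        # Open the tab that holds the data table.
        wait.until(expected_conditions.presence_of_element_located(
            (By.XPATH, "//div[@id='data-table']"))).click()
        rows = driver.find_elements(By.XPATH, "//tbody/tr[position()>1]")
        all_rows.append([row.text for row in rows])
        driver.back()  # return to the element-3 list before the next pass
    return all_rows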
In a tutorial I followed, the Selenium actions were specified in the __init__ method, and only driver.page_source was stored and later passed to the parse method. However, the HTML content changes with each action, and I can't really access the data without clicking on each element of the list. Therefore, I probably need a loop somewhere, and I have no idea how to structure my scraper so that it stays efficient; the best restructuring I've come up with so far is sketched after the pseudo code.
The pseudo code (not a real example!) mimicking all the operations needed to access the data is below. Please note that I extract only the 10th element of the list that should be looped over, while in my desired solution I want to scrape the data from all the list elements.
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        super().__init__()
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_experimental_option("detach", True)
        driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
        driver.get('https://example.com')  # This is not a real URL
        # Logging in to the site (omitted for brevity)

        element1_xpath = "//div[@id='element-1']"
        WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
            .until(expected_conditions.presence_of_element_located((By.XPATH, element1_xpath))).click()

        element2_xpath = "//div[@id='element-2']"
        WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
            .until(expected_conditions.presence_of_element_located((By.XPATH, element2_xpath))).click()

        # Here, I select only the 10th element, but the goal is to iterate over all elements found
        element3_xpath = "(//div[@id='element-3'])[10]"
        WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
            .until(expected_conditions.presence_of_element_located((By.XPATH, element3_xpath))).click()

        data_xpath = "//div[@id='data-table']"
        WebDriverWait(driver, 10, ignored_exceptions=(NoSuchElementException, StaleElementReferenceException)) \
            .until(expected_conditions.presence_of_element_located((By.XPATH, data_xpath))).click()

        rows = driver.find_elements_by_xpath("//tbody/tr[position()>1]")
        for row in rows:
            print(row)

        self.html = driver.page_source

    def parse(self, response):
        pass
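For the overall structure, the best I can come up with so far (and I'm not sure it's idiomatic Scrapy, hence the question) is to let __init__ collect one page_source snapshot per element instead of a single self.html, and then have parse wrap each snapshot in an HtmlResponse so the usual Scrapy selectors can be used. The snippet below is only a sketch of that idea; self.snapshots and the dummy start_urls entry are my own placeholders.

import scrapy
from scrapy.http import HtmlResponse


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # A single dummy request whose only job is to get parse() called once;
    # the real navigation is done by Selenium in __init__.
    start_urls = ['https://example.com']

    def __init__(self):
        super().__init__()
        # Run the Selenium click-through here (login, element-1, element-2,
        # then the loop over the element-3 entries sketched above) and keep
        # one page_source snapshot per element instead of a single self.html.
        self.snapshots = []  # list of HTML strings, one per element-3 entry

    def parse(self, response):
        # Wrap each Selenium snapshot in a Scrapy response so the usual
        # selectors can be used to extract the table rows.
        for html in self.snapshots:
            page = HtmlResponse(url=response.url, body=html, encoding='utf-8')
            for row in page.xpath("//tbody/tr[position()>1]"):
                yield {'row_html': row.get()}

I'm also not sure whether __init__ is the right place for all of the Selenium work, or whether it should go into start_requests instead, so any advice on how to organise this properly would be appreciated.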