Sunday, 20 March 2022

Selenium crashing and trying to troubleshoot

I have a Python script that uses Selenium for web scraping, and it recently started failing. I'm running it on Ubuntu.

Here's how I've set up the webdriver:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--remote-debugging-port=9222')
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location = '/usr/bin/google-chrome-stable'

chrome_driver_binary = '/usr/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)
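As a sanity check (not in the script as it stands), I could print the session capabilities right after the driver starts to confirm exactly which Chrome and ChromeDriver builds the session is actually using. A minimal sketch; the key names below assume a W3C-mode ChromeDriver session:

# sketch: confirm which Chrome/ChromeDriver builds the session picked up
caps = driver.capabilities
print("browser:", caps.get("browserVersion"))
print("driver: ", caps.get("chrome", {}).get("chromedriverVersion"))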

And the web scraper itself, which basically just iterates through a bunch of URLs to grab data:

import time

events = []  # collected event hrefs (base_url is defined earlier in the script, not shown)

for i in range(1, 10):

    time.sleep(1.5)

    # cycle through pages in range
    pageURL = base_url + str(i)
    driver.get(pageURL)

    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=AtIvjk2YjzXSULT1cmVx] a[class^=HsqHp2xM2FkfSdjy1mlU]')

    # collect the href attribute of each event in event_list
    events.extend(event.get_attribute("href") for event in event_list)

print("total events: ", (len(events)))

#GET request user-agent
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}


# iterate through all events and open them.
item = {}
allEvents = []
for event in events:

    driver.get(event)
    currentUrl = driver.current_url

Sometimes the script works; other times it fails with this:

     File "/home/ubuntu/scraper/my_scraper.py", line 136, in <module>
    currentUrl = driver.current_url
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 669, in current_url
    return self.execute(Command.GET_CURRENT_URL)['value']
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=99.0.4844.51)

OK, so it seems it's failing on this line, or the line directly above it:

currentUrl = driver.current_url
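Since the failure is intermittent, one thing I could try is catching the exception, restarting the browser, and retrying the same URL once. This is just a sketch, not my current code; make_driver is a hypothetical helper that repeats the setup at the top of the post:

from selenium.common.exceptions import WebDriverException

for event in events:
    try:
        driver.get(event)
        currentUrl = driver.current_url
    except WebDriverException as e:
        print("tab crashed on", event, "-", e.msg)
        try:
            driver.quit()          # the old session may already be gone
        except WebDriverException:
            pass
        driver = make_driver()     # hypothetical helper repeating the setup at the top
        driver.get(event)          # retry once; a second crash points at the page itself
        currentUrl = driver.current_url

If the retry succeeds, the crash is probably transient resource pressure; if it crashes again on the same URL, the page itself is the likely culprit.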

As mentioned, it sometimes works and other times fails. So far I've tried:

  • Updating ChromeDriver and Chrome to the same version on Ubuntu.

  • Checking the instance's disk space: it has 2.7G available (see the memory-check sketch after this list).

  • Increasing the instance size.
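From what I understand, that "page crash" in headless Chrome is often memory pressure rather than disk, and with --disable-dev-shm-usage Chrome buffers its shared memory in a temp directory on disk instead of /dev/shm, so the 2.7G free could also be a factor. A rough check I could add, just a sketch, is logging MemAvailable before each page load:

def log_memory(tag=""):
    # sketch: print MemAvailable from /proc/meminfo (Linux-only)
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable"):
                print(tag, line.strip())
                return

# e.g. call log_memory("before " + event) just before each driver.get(event)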

How else can I troubleshoot this?


