I have a script that uses Selenium for web scraping, and it recently started failing. I'm running it as a Python script on Ubuntu.
Here's how I've set up the webdriver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--remote-debugging-port=9222')
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = '/usr/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)
And the web-scraper, basically just iterating through a bunch of URLs to grab data:
import time

events = []  # collected event hrefs; base_url is defined earlier in the script (not shown)

for i in range(1, 10):
    time.sleep(1.5)
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=AtIvjk2YjzXSULT1cmVx] a[class^=HsqHp2xM2FkfSdjy1mlU]')
    # collect the href attribute of each event in event_list
    events.extend(event.get_attribute('href') for event in event_list)

print('total events:', len(events))

# User-Agent header for GET requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}

# iterate through all events and open them
item = {}
allEvents = []
for event in events:
    driver.get(event)
    currentUrl = driver.current_url
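Since the failure is intermittent, one way I could keep the scrape alive would be to wrap each page visit in a retry that tears down and rebuilds the session when it dies. This is only a sketch of the pattern: `make_driver()` is a hypothetical factory standing in for the Options setup above, and it's only shown in comments because the retry helper itself is plain, driver-agnostic Python:

```python
# Sketch of a retry-with-recovery pattern for intermittent "tab crashed"
# failures. The helper is generic: it runs an action, and on any exception
# calls a recovery callback (e.g. quit + recreate the driver) and retries.
def run_with_recovery(action, recover, attempts=3):
    """Call action(); if it raises, call recover() and try again.

    Re-raises the last exception once all attempts are exhausted.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # with Selenium, catch WebDriverException here
            last_exc = exc
            recover()
    raise last_exc

# Usage with Selenium would look roughly like this (make_driver() is a
# hypothetical factory wrapping the Options setup above; not runnable here):
#
#   driver = make_driver()
#   def visit():
#       driver.get(event)
#       return driver.current_url
#   def restart():
#       global driver
#       try:
#           driver.quit()
#       except Exception:
#           pass
#       driver = make_driver()
#   currentUrl = run_with_recovery(visit, restart)
```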
Sometimes the script works; other times it fails with this traceback:
File "/home/ubuntu/scraper/my_scraper.py", line 136, in <module>
currentUrl = driver.current_url
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 669, in current_url
return self.execute(Command.GET_CURRENT_URL)['value']
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=99.0.4844.51)
OK, so it seems it's failing on this line, or the line directly above it:
currentUrl = driver.current_url
As mentioned, it sometimes works but other times fails. I've tried:

- Updating Chromedriver and Chrome to the same version on Ubuntu.
- Checking my instance's disk space: it has 2.7G available.
- Increasing my instance size.
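One thing worth noting on the disk-space point: the "tab crashed" error in headless Chrome is often a memory problem rather than a disk one, and `/dev/shm` (Chrome's shared-memory scratch area) is a common culprit even when `--disable-dev-shm-usage` is set. A quick stdlib check along those lines (the paths are the usual Linux defaults, so an assumption about the instance):

```python
import os
import shutil

# Report free space on the root filesystem and, if present, on /dev/shm,
# Chrome's shared-memory scratch area (a common cause of "tab crashed").
for path in ("/", "/dev/shm"):
    if os.path.exists(path):
        usage = shutil.disk_usage(path)
        print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")
```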
How else can I troubleshoot this?
from Selenium crashing and trying to troubleshoot