Friday, 4 February 2022

Downloading images with selenium and requests: why does the .get_attribute() method of a WebElement returns a URL in base64?

I have written a webscraping program that goes to an online marketplace like www.tutti.ch, searches for a category key word, and then downloads all the resulting photos of the search result to a folder.

#! python3
# imageSiteDownloader_stack.py - A program that goes to an online marketplace like
# tutti.ch, searches for a category of photos, and then downloads all the
# resulting images.

import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts cookies terms
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys('Gartenstuhl') # enters search key word
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks submit button
os.makedirs('tuttiBilder', exist_ok=True) # creates new folder
images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
for im in images:
    imageURL = im.get_attribute('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()

My program crashes at line 26, the exception is as follows:

enter image description here

The program downloads the first couple of photos correctly, but then, suddenly, crashes.

Looking for solutions on stackoverflow, I have found this post: Requests : No connection adapters were found for, error in Python3

The answer of the post above suggests that the problem arises because of a newline charachter in the URL.

I checked the source URLs of the photos im the HTML code that couldn't be downloaded. They seem to be OK.

The problem seems to be either the browser.find_elements() method which parses the 'src' attribute values incorrectly or the .get_attribute() method which cannot fetch some of the URLs correctly. Instead of getting something like

https://c.tutti.ch/images/23452346536.jpg

the method gives back strings like

data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

Of course, this is not a valid URL which the requests.get() method can use to download the image. I did some research and I have found out that this might be a base64 string...

Why does the .get_attribute() method return a base64 string in some of the cases? Can I prevent it to do so? Or do I have to convert it to a normal string?

Update: Another approach using beautifulsoup for parsing instead ob WebDriver. (This code is not working yet. The Data URLs are still a problem)

import requests, sys, os, bs4
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click()
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys(sys.argv[1:])
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # https://www.tutorialspoint.com/how-to-locate-element-by-partial-id-match-in-selenium
os.makedirs('tuttiBilder', exist_ok=True)

url = browser.current_url
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

#Check for errors from here
images = soup.select('div[style] > img')

for im in images:
    imageURL = im.get('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()


from Downloading images with selenium and requests: why does the .get_attribute() method of a WebElement returns a URL in base64?

No comments:

Post a Comment