I have written a webscraping program that goes to an online marketplace like www.tutti.ch, searches for a category key word, and then downloads all the resulting photos of the search result to a folder.
#! python3
# imageSiteDownloader_stack.py - A program that goes to an online marketplace like
# tutti.ch, searches for a category of photos, and then downloads all the
# resulting images.
import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts cookies terms
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys('Gartenstuhl') # enters search key word
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks submit button
os.makedirs('tuttiBilder', exist_ok=True) # creates new folder
images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
for im in images:
imageURL = im.get_attribute('src') # get the URL of the image
print('Downloading image %s...' % (imageURL))
res = requests.get(imageURL) # downloads the image
res.raise_for_status()
imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
for chunk in res.iter_content(100000): # writes to the image file
imageFile.write(chunk)
imageFile.close()
print('Done.')
browser.quit()
My program crashes at line 26, the exception is as follows:
The program downloads the first couple of photos correctly, but then, suddenly, crashes.
Looking for solutions on stackoverflow, I have found this post: Requests : No connection adapters were found for, error in Python3
The answer of the post above suggests that the problem arises because of a newline charachter in the URL.
I checked the source URLs of the photos im the HTML code that couldn't be downloaded. They seem to be OK.
The problem seems to be either the browser.find_elements() method which parses the 'src' attribute values incorrectly or the .get_attribute() method which cannot fetch some of the URLs correctly. Instead of getting something like
https://c.tutti.ch/images/23452346536.jpg
the method gives back strings like

Of course, this is not a valid URL which the requests.get() method can use to download the image. I did some research and I have found out that this might be a base64 string...
Why does the .get_attribute() method return a base64 string in some of the cases? Can I prevent it to do so? Or do I have to convert it to a normal string?
Update: Another approach using beautifulsoup for parsing instead ob WebDriver. (This code is not working yet. The Data URLs are still a problem)
import requests, sys, os, bs4
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click()
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys(sys.argv[1:])
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # https://www.tutorialspoint.com/how-to-locate-element-by-partial-id-match-in-selenium
os.makedirs('tuttiBilder', exist_ok=True)
url = browser.current_url
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#Check for errors from here
images = soup.select('div[style] > img')
for im in images:
imageURL = im.get('src') # get the URL of the image
print('Downloading image %s...' % (imageURL))
res = requests.get(imageURL) # downloads the image
res.raise_for_status()
imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
for chunk in res.iter_content(100000): # writes to the image file
imageFile.write(chunk)
imageFile.close()
print('Done.')
browser.quit()
from Downloading images with selenium and requests: why does the .get_attribute() method of a WebElement returns a URL in base64?

No comments:
Post a Comment