I need to create a dataframe with the following columns:
WEB | Country | Organisation
I'm extracting this information from a website; however, for some sites the website has no information at all, which is causing problems when I update the dataframe. Unfortunately, the code can handle only one site at a time; otherwise a captcha appears. The code below gives an idea of the output for an individual site:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

element = []
country = []
organisation = []
x = 'stackoverflow.com'  # or 'livevsfox.ca'; I would suggest trying one first, then the other
frame_dict = {}
element.append(x)  # I am keeping this list just because I'd like to consider a for loop in future

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome('path')
driver.get('website/' + x)  # here x should be stackoverflow.com, then the other site
try:
    wait = WebDriverWait(driver, 30)
    driver.execute_script("window.scrollTo(0, 1000)")
    try:
        # updated after an answer from another post and the comment below
        error = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.selection div.container h2")))
    except Exception:
        pass  # becomes `continue` once this runs inside the for loop
    # Country
    c = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div"))).text
    country.append(c)
    # Organisation
    try:
        org = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Organisation']/../following-sibling::div"))).text
        organisation.append(org)
    except Exception:
        organisation.append("Data not available")
except Exception:
    pass  # becomes `break` once this runs inside the for loop
driver.quit()

frame_dict.update({'WEB': element, 'Organisation': organisation, 'Country': country})
df = pd.DataFrame.from_dict(frame_dict)
The code should do the following:
- for x = stackoverflow.com (this is just an example of a working URL): open Chrome; if there is info, extract the organisation and country; if there is not, add 'Missing' to the dataframe; exit Chrome;
- for x = livevsfox.ca: open Chrome; if there is info, extract the organisation and country; if there is not, add 'Missing' in the Organisation and Country columns; exit Chrome.
The expected output would be, then:
WEB               | Country | Organisation
stackoverflow.com | US      | Stack Exchange, Inc.
livevsfox.ca      | Missing | Missing
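One way to keep the three columns the same length is to build one row per site and fill in 'Missing' for anything that could not be scraped. This is only a minimal sketch: the helper `make_row` is hypothetical (it is not part of the code above), and the values are the sample ones from the expected output.

```python
MISSING = "Missing"  # placeholder value taken from the expected output


def make_row(web, country=None, organisation=None):
    """Return one table row, substituting MISSING for absent values."""
    return {
        "WEB": web,
        "Country": country or MISSING,
        "Organisation": organisation or MISSING,
    }


rows = [
    make_row("stackoverflow.com", "US", "Stack Exchange, Inc."),
    make_row("livevsfox.ca"),  # nothing could be scraped for this site
]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one line per site with the WEB, Country and Organisation columns, because every row carries all three keys.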
livevsfox.ca returns, in fact, the following message:
Sorry, livevsfox.ca could not be found or reached (error code 404)
This message does not appear when I look up stackoverflow.com. Since stackoverflow.com has a Country and an Organisation, I can add this info to the dataframe, but I can't do the same for livevsfox.ca. I think a possible solution could be the following:
- check whether the h2 element contains the message above ("Sorry, x could not be found or reached (error code 404)"): this would mean that no information was found for the site;
- if the site has no information, add Missing (or NA, up to you) to the dataframe;
- otherwise, the site has information (Owner & Country) to be added to the dataframe.
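The check proposed above could be sketched roughly as follows. The helper names `has_no_info` and `lookup_result` are hypothetical, the pattern is modelled on the single message quoted earlier (the exact wording on other pages is an assumption), and the h2 text is assumed to have been read already, for example from the element matched by the CSS selector in the code.

```python
import re

# Pattern modelled on the one message quoted in the question; the wording
# on the real page for other sites or error codes is an assumption.
NOT_FOUND = re.compile(r"could not be found or reached \(error code \d+\)")


def has_no_info(h2_text):
    """True when the heading text looks like the 'not found' message."""
    return bool(h2_text) and NOT_FOUND.search(h2_text) is not None


def lookup_result(web, h2_text, country=None, organisation=None):
    """Build one dataframe row depending on whether the site has info."""
    if has_no_info(h2_text):
        return {"WEB": web, "Country": "Missing", "Organisation": "Missing"}
    return {"WEB": web, "Country": country, "Organisation": organisation}


msg = "Sorry, livevsfox.ca could not be found or reached (error code 404)"
print(lookup_result("livevsfox.ca", msg))
print(lookup_result("stackoverflow.com", None, "US", "Stack Exchange, Inc."))
```

Keeping the message test in its own function means the Selenium part only has to supply the heading text (or None if no heading was found), and the decision about what goes into the dataframe stays in one place.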
I hope you can provide some help.
from If condition for adding information into a dataframe