Friday, 10 September 2021

If condition for adding information into a dataframe

I need to create a dataframe with the following columns:

WEB | Country | Organisation

I'm extracting this information from a website; however, some of the webs I look up have no information on it. This is causing me some issues in updating the dataframe. Unfortunately, the code can only handle one web at a time, otherwise a captcha appears. Please see the code below to get an idea of the output for a single web:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd

    element = []
    country = []
    organisation = []

    x = 'stackoverflow.com'  # or 'livevsfox.ca'; I would suggest trying one first, then the other

    frame_dict = {}

    element.append(x)  # I am keeping the list just because I'd like to consider a for loop in future

    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome('path')

    driver.get('website/' + x)  # here x should be stackoverflow.com, then the other web

    try:
        wait = WebDriverWait(driver, 30)
        driver.execute_script("window.scrollTo(0, 1000)")

        # Heading that holds the "could not be found" message
        # (updated after an answer from another post and the comment below)
        try:
            error = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.selection div.container h2")))
        except TimeoutException:
            error = None  # the heading did not show up within the timeout

        # Country
        c = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div"))).text
        country.append(c)

        # Organisation
        try:
            org = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Organisation']/../following-sibling::div"))).text
            organisation.append(org)
        except TimeoutException:
            organisation.append("Data not available")

    except Exception:
        # Nothing is appended to country/organisation in this case,
        # so the lists end up with different lengths
        pass

    driver.quit()

    frame_dict.update({'WEB': element, 'Organisation': organisation, 'Country': country})
    df = pd.DataFrame.from_dict(frame_dict)

The code should do the following:

  • for x = stackoverflow.com (this is just an example of a working url), open chrome; if there is info, then extract information on organisation and country; if there is not, then add 'Missing' in Organisation and Country columns; exit chrome;
  • for x = livevsfox.ca, open chrome; if there is info, then extract information on organisation and country; if there is not, then add 'Missing' in Organisation and Country columns; exit chrome.

The expected output would be, then:

WEB                      Country      Organisation
stackoverflow.com          US       Stack Exchange, Inc.
livevsfox.ca             Missing       Missing
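
The steps and expected output above could be sketched roughly as follows. This is only a minimal sketch under assumptions: `build_rows`, `lookup` and `fake_lookup` are hypothetical names that are not part of my code, and `fake_lookup` stands in for the Selenium part with hard-coded data. The idea is that every web gets a complete row, with 'Missing' as the default whenever nothing is found.

```python
MISSING = "Missing"


def build_rows(sites, lookup):
    """Return one complete row per site.

    `lookup` is a hypothetical callable: it takes a site name and returns
    a dict with 'Country' and/or 'Organisation' keys, or None if the site
    has no information.
    """
    rows = []
    for site in sites:
        try:
            info = lookup(site) or {}
        except Exception:
            # A failed lookup is treated the same as "no information".
            info = {}
        rows.append({
            "WEB": site,
            "Country": info.get("Country", MISSING),
            "Organisation": info.get("Organisation", MISSING),
        })
    return rows


def fake_lookup(site):
    # Stand-in data for illustration only; the real lookup would drive Selenium.
    known = {
        "stackoverflow.com": {"Country": "US", "Organisation": "Stack Exchange, Inc."},
    }
    return known.get(site)


rows = build_rows(["stackoverflow.com", "livevsfox.ca"], fake_lookup)
```

Since every row has all three keys, the list could then be passed to `pd.DataFrame(rows)` without running into lists of different lengths.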

livevsfox.ca returns, in fact, the following message:

Sorry, livevsfox.ca could not be found or reached (error code 404)

This message does not appear when I look up stackoverflow.com. Since stackoverflow.com has Country and Organisation, I can add this info to the dataframe, but I can't do the same for livevsfox.ca. I think a possible solution could be the following:

  • check if the h2 element contains the message above ("Sorry, x could not be found or reached (error code 404)"): this would mean that no information was detected for the web;
  • if the web has no information, then add Missing (or NA, up to you) in the dataframe;
  • otherwise, the web has information (Organisation & Country) to be added in the dataframe.
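
The check in the first point could look something like the sketch below. It assumes the heading text has already been read from the h2 element (for example through its `.text`) and that the wording is exactly the message quoted above; `is_not_found` and `NOT_FOUND` are hypothetical names used only for illustration.

```python
import re

# Pattern for the message quoted above; the exact wording is an assumption
# based on the single example I have seen.
NOT_FOUND = re.compile(
    r"Sorry, \S+ could not be found or reached \(error code \d+\)"
)


def is_not_found(heading_text):
    """Return True if the heading text is the 'could not be found' message."""
    return NOT_FOUND.fullmatch(heading_text.strip()) is not None


message = "Sorry, livevsfox.ca could not be found or reached (error code 404)"
print(is_not_found(message))         # the quoted error message
print(is_not_found("Company data"))  # any other heading
```

If `is_not_found` returns True, 'Missing' would go into both columns; otherwise the Country and Organisation lookups would run as usual.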

I hope you can provide some help.


