Friday, 27 August 2021

How to make data extraction from a webpage with Selenium more robust and efficient?

I want to extract all of the option chain data from a Yahoo Finance webpage; take the put option chain data for simplicity. First, load all of the packages used in the program:

import time 
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

The function that writes a company's put option chain data to a CSV file in a directory:

def write_option_chain(code):
    browser = webdriver.Chrome()
    browser.maximize_window()
    url = "https://finance.yahoo.com/quote/{}/options?p={}".format(code, code)
    browser.get(url)
    # wait until the expiration-date dropdown is rendered, then give the page extra time
    WebDriverWait(browser, 10).until(
        EC.visibility_of_element_located((By.XPATH, ".//select/option")))
    time.sleep(25)
    date_elem = browser.find_elements(By.XPATH, ".//select/option")
    time_span = len(date_elem)
    print('{} option chains exist in {}'.format(time_span, code))
    df_list = []
    # XPath indices are 1-based, so option[1]..option[time_span] covers every expiration
    for item in range(1, time_span + 1):
        element_date = browser.find_element(By.XPATH, './/select/option[{}]'.format(item))
        print("parsing {}'s put option chain on {} now".format(code, element_date.text))
        element_date.click()
        # wait for the puts table of the newly selected expiration, then pause again
        WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located(
            (By.XPATH, ".//table[@class='puts W(100%) Pos(r) list-options']//td")))
        time.sleep(11)
        put_table = browser.find_element(
            By.XPATH, ".//table[@class='puts W(100%) Pos(r) list-options']")
        put_table_string = put_table.get_attribute('outerHTML')
        # let pandas parse the HTML table into a DataFrame
        df_put = pd.read_html(put_table_string)[0]
        df_list.append(df_put)
    browser.quit()  # quit() closes every window and ends the whole session
    # concatenate the per-expiration tables (DataFrame.append is deprecated)
    df_all = pd.concat(df_list)
    df_all.to_csv('/tmp/{}.csv'.format(code))
    print('{} option chain written into csv file'.format(code))
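
What I am least sure about are the two fixed time.sleep calls. Below is a sketch of an explicit wait that could replace the inner sleep(11); it assumes the page swaps out the puts <table> element when a new expiration date is selected (which I have not verified), and the helper name and 15-second timeout are my own guesses:

def wait_for_new_puts_table(browser, old_table, timeout=15):
    # wait for the previously captured table element to become stale
    # (detached from the DOM), then wait for its replacement to appear
    wait = WebDriverWait(browser, timeout)
    if old_table is not None:
        wait.until(EC.staleness_of(old_table))
    return wait.until(EC.presence_of_element_located(
        (By.XPATH, ".//table[contains(@class, 'puts')]")))

With something like this, each iteration would only wait as long as the page actually needs.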

To test write_option_chain with a list of tickers:

nas_list = ['aapl', 'adbe', 'adi', 'adp', 'adsk']
for item in nas_list:
    try:
        write_option_chain(code=item)
    except:
        # swallow whatever went wrong and move on to the next ticker (see the sketch below)
        print("check what happens to {}".format(item))
        continue
    time.sleep(5)
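
The bare except above hides the real error. On the next run I would try a sketch like the one below, catching the specific Selenium exceptions instead (TimeoutException is already imported; WebDriverException, its parent class, also lives in selenium.common.exceptions):

from selenium.common.exceptions import WebDriverException

for item in nas_list:
    try:
        write_option_chain(code=item)
    except TimeoutException as exc:
        # an explicit wait gave up before the expected element appeared
        print("timed out while parsing {}: {}".format(item, exc))
    except WebDriverException as exc:
        # any other browser/driver-level failure (page not loading, lost session, ...)
        print("webdriver error while parsing {}: {}".format(item, exc))
    time.sleep(5)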

For now, the output of the bare-except loop shows:

# many lines omitted for simplicity
18 option chains exist in aapl
parsing aapl's put option chain on August 27, 2021 now
check what happens to aapl
check what happens to adbe
12 option chains exist in adi
parsing adi's put option chain on December 17, 2021 now
adi option chain written into csv file
11 option chains exist in adp
parsing adp's put option chain on August 27, 2021 now
adp option chain written into csv file
check what happens to adsk

We can summarize the above output:

1. Only adp's and adi's put option chain data was written to the desired directory.
2. Only part of aapl's and adp's option chain data was retrieved.
3. adsk's options webpage couldn't be opened at all.
4. The whole run takes almost 20 minutes to execute (see the sketch below).
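
For point 4, a big part of the run time goes into launching and quitting a fresh Chrome instance for every ticker, on top of the fixed sleeps. A rough sketch of reusing a single browser session; it assumes write_option_chain were split into a hypothetical write_option_chain_with(browser, code) helper that takes an already running driver and never quits it:

browser = webdriver.Chrome()
browser.maximize_window()
try:
    for item in nas_list:
        # hypothetical helper: same parsing/writing logic as write_option_chain,
        # but it reuses the driver passed in instead of creating its own
        write_option_chain_with(browser, item)
finally:
    browser.quit()

I am not sure these are the right fixes, though, so my question stands: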

How to make data extraction from a webpage with Selenium more robust and efficient?


