Friday, 4 October 2019

How to efficiently parse large HTML div-class and span data on Python BeautifulSoup?

The data needed:

I want to scrape two web pages: https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL and https://finance.yahoo.com/quote/AAPL/financials?p=AAPL.

From the first page, I need the 5 values in the row called Total Assets:
365,725,000 375,319,000 321,686,000 290,479,000 231,839,000

I also need the 5 values in the row called Total Current Liabilities:
43,658,000 38,542,000 27,970,000 20,722,000 11,506,000

From the second page, I need the 5 values in the row called Operating Income or Loss:
52,503,000 48,999,000 55,241,000 33,790,000 18,385,000

My code:

This is what I have written so far. I can extract the value from a div if I paste its HTML into a variable, as shown below. However, the page contains thousands of div elements, so how do I loop through them efficiently? In other words, how do I find just the values I am looking for?

# Import libraries (urllib.request and time were imported but never used)
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Hand-copied fragments of the page, used to test the lookup in isolation
soup1 = BeautifulSoup("""<div class="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg" data-test="fin-col"><span>321,686,000</span></div>""", "html.parser")
soup2 = BeautifulSoup("""<span data-reactid="1377">""", "html.parser")

# This works: the full class string matches the div's class attribute exactly
print(soup1.find("div", class_="D(tbc) Ta(end) Pstart(6px) Pend(4px) Bxz(bb) Py(8px) BdB Bdc($seperatorColor) Miw(90px) Miw(110px)--pnclg").text)

# How do I loop through all the relevant divs?
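
One possible direction, offered only as a rough sketch: rather than looping over every div by its long class string, look up the span holding the row label and then collect the data-test="fin-col" cells from the nearest enclosing element. The SAMPLE_HTML markup and the row_values helper below are hypothetical. Only the data-test="fin-col" attribute comes from the snippet above, and the sketch assumes that the label and its value cells share a row container on the real page, which may not hold. The page may also build its tables with JavaScript, in which case response.text would not contain these cells at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
# The real page layout is an assumption and may differ.
SAMPLE_HTML = """
<div class="table">
  <div class="row">
    <div class="title"><span>Total Assets</span></div>
    <div data-test="fin-col"><span>365,725,000</span></div>
    <div data-test="fin-col"><span>375,319,000</span></div>
  </div>
  <div class="row">
    <div class="title"><span>Total Current Liabilities</span></div>
    <div data-test="fin-col"><span>43,658,000</span></div>
    <div data-test="fin-col"><span>38,542,000</span></div>
  </div>
</div>
"""


def row_values(soup, label):
    """Return the integer values in the row whose label span reads `label`."""
    label_span = soup.find("span", string=label)
    if label_span is None:
        return []
    # Climb from the label until an ancestor contains at least one value cell.
    for ancestor in label_span.parents:
        cells = ancestor.find_all("div", attrs={"data-test": "fin-col"})
        if cells:
            # Strip the thousands separators before converting to int.
            return [int(c.get_text(strip=True).replace(",", "")) for c in cells]
    return []


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for name in ("Total Assets", "Total Current Liabilities"):
    print(name, row_values(soup, name))
```

Matching on the data-test attribute instead of the generated class names keeps the lookup short and avoids depending on styling classes. Cells that hold a placeholder such as "-" instead of a number would need extra handling before the int conversion.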


