I'm currently developing code to scrape text from websites. I'm not interested in scraping the entire page, just the sections that contain certain words. I've managed to do this for most URLs using the .find_all("p") method, but it does not work for URLs that point to a PDF.
I cannot seem to find a way to open a PDF as text and then divide that text into paragraphs. This is what I would like to do: 1) open a URL that points to a PDF and read it as text, and 2) divide this text into multiple paragraphs. That way I can scrape only the paragraphs containing certain words.
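For steps 1) and 2), something along these lines is what I have in mind. I'm assuming the pdfminer.six package here, and the blank-line split in step 2) is just a guess on my part, since PDFs have no real paragraph markup:

import io
import re
from urllib.request import Request, urlopen
from pdfminer.high_level import extract_text  # assumption: pdfminer.six is installed

url2 = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"
req = Request(url2, headers={"User-Agent": "Mozilla/5.0"})
pdf_bytes = io.BytesIO(urlopen(req, timeout=5).read())  # download the PDF into memory

# 1) Open the PDF URL as text
text = extract_text(pdf_bytes)

# 2) Divide the text into paragraphs (heuristic: split on blank lines)
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

SearchWords = ["orca", "killer whale", "humpback"]
print([p for p in paragraphs if any(w in p for w in SearchWords)][:3])

I don't know whether pdfminer.six is the right tool for this, or whether splitting on blank lines is a reliable way to recover paragraphs.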
Below is the code I'm currently using to scrape paragraphs containing certain words from "normal" URLs. Any tips to make this work for PDF URLs (such as the variable 'url2' in the code below) are much appreciated!
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
url1 = "https://brainybackpackers.com/best-places-for-whale-watching-in-the-world/"
url2 = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"
url = url1
req = Request(url, headers={"User-Agent": 'Mozilla/5.0'})
page = urlopen(req, timeout=5)  # give up after 5 seconds, so unresponsive websites are skipped
htmlParse = BeautifulSoup(page.read(), 'lxml')
SearchWords = ["orca", "killer whale", "humpback"] # text must contain these words
# Check if the article text mentions the SearchWord(s). If so, continue the analysis.
# Check if the article text mentions the SearchWord(s). If so, continue the analysis.
if any(word in htmlParse.text for word in SearchWords):
    text = ""
    # Look for paragraphs ("p") that contain a SearchWord
    for word in SearchWords:
        print(word)
        for para in htmlParse.find_all("p", string=re.compile(word)):
            text += para.get_text()  # collect the text of each matching paragraph
    print(text)
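In case it helps, my rough plan was to detect PDF URLs by checking the response's Content-Type header and then branch to a PDF-specific path. This sketch assumes the server reports the Content-Type correctly, which may not always hold:

# Rough dispatch sketch: decide HTML vs. PDF from the Content-Type header
def fetch(url):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urlopen(req, timeout=5)
    content_type = response.headers.get("Content-Type", "")
    is_pdf = "application/pdf" in content_type or url.lower().endswith(".pdf")
    return response.read(), is_pdf

body, is_pdf = fetch(url2)
print("PDF" if is_pdf else "HTML", len(body), "bytes")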