Sunday, 2 January 2022

How to download all the href (pdf) inside a class with python beautiful soup?

I have around 900 pages and each page contains 10 buttons (each button has pdf). I want to download all the pdf's - the program should browse to all the pages and download the pdfs one by one.

Code only searching for .pdf but my href does not have .pdf page_no (1 to 900).

https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3

This is the website and below is the link:

BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)


from How to download all the href (pdf) inside a class with python beautiful soup?

No comments:

Post a Comment