The goal is to create a quick overview of a set of opportunities for free volunteering in Europe.
I want to fetch all of the roughly 6,000 target pages, for example https://europa.eu/youth/volunteering/organisation/48592; see below for a description of the goal and the data I want to collect.
We fetch, for example:
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
and so on.
There are more than 6,000 records in total. I do get results, but the script only gives back 20 records, i.e. 20 lines.
My current approach is this mini-script:
import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

# listing pages (paginated overview) and detail pages (one per organisation)
first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    # collect the organisation IDs from the paginated listing
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.append(numbers)
        return numbers


def parse(url):
    # fetch every detail page and write one CSV row per organisation
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)
But this stops after parsing 20 results.
Note: I want to return pages, not numbers, because I want to iterate over all listing pages rather than only the IDs scraped from the last one. My guess is that the catch function contains the mistake: it returns numbers, so when the other function iterates over the result of catch(first), it is not iterating over everything that is wanted. I think the fix is to return pages at the bottom of that function instead of numbers. That said, if I simply change return numbers to return pages, I get no better results.
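For reference, this is the kind of change I have in mind for catch. It is an untested sketch, assuming the problem really is that pages ends up as a list of lists and that only the last batch of numbers is returned; it flattens the IDs from every listing page into one flat list and returns that:

def catch(url):
    # sketch only: flatten the IDs from all 347 listing pages into one list
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0] for a in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.extend(numbers)  # extend instead of append, so pages stays a flat list of IDs
        return pages               # return all accumulated IDs, not just the last page's 20

With extend, each link handed to parse would be a single ID string rather than a whole sub-list, which is what url.format(link) expects.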
Any idea how to get the parser to give back all 6k results?