The goal is to create a quick overview of a set of opportunities for free volunteering in Europe.
I want to fetch all of the roughly 6,000 target pages, for example https://europa.eu/youth/volunteering/organisation/48592; see below for a description of the goal and the data I want to collect.
We fetch, for example:
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
and so on.
There are more than 6,000 records in total. I do get results, but the script only gives back 20 records, i.e. 20 lines.
My current approach is this mini-script:
import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

# listing pages (paginated overview) and detail pages (one per organisation)
first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    # collect the organisation IDs from the paginated listing
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.append(numbers)
        return numbers


def parse(url):
    # fetch every detail page and write one CSV row per organisation
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)
But this stops after parsing 20 results.
Note: I want to return pages, not numbers, because I want to iterate over all listing pages rather than only the IDs scraped from the last one. My guess is that the catch function contains the mistake: it returns numbers, so when the other function iterates over the result of catch(first), it is not iterating over everything that is wanted. I think the fix is to return pages at the bottom of that function instead of numbers. That said, if I simply change return numbers to return pages, I get no better results.
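For reference, this is the kind of change I have in mind for catch. It is an untested sketch, assuming the problem really is that pages ends up as a list of lists and that only the last batch of numbers is returned; it flattens the IDs from every listing page into one flat list and returns that:

def catch(url):
    # sketch only: flatten the IDs from all 347 listing pages into one list
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0] for a in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.extend(numbers)  # extend instead of append, so pages stays a flat list of IDs
        return pages               # return all accumulated IDs, not just the last page's 20

With extend, each link handed to parse would be a single ID string rather than a whole sub-list, which is what url.format(link) expects.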
Any idea how to get the parser to give back all 6k results?