Monday, 29 January 2024

running bs4 scraper needs to be redefined to enrich the dataset - some issues

got a bs4 scraper that works with selenium - see far below:

well - it works fine so far:

see far below my approach to fetch some data form the given page: clutch.co/il/it-services

To enrich the scraped data, with additional information, i tried to modify the scraping-logic to extract more details from each company's page. Here's i have to an updated version of the code that extracts the company's website and additional information:

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_info = soup.select(".directory-list div.provider-info")

data_list = []
for info in company_info:
    company_name = info.select_one(".company_info a").get_text(strip=True)
    location = info.select_one(".locality").get_text(strip=True)
    website = info.select_one(".company_info a")["href"]
    
    # Additional information you want to extract goes here
    # For example, you can extract the description
    description = info.select_one(".description").get_text(strip=True)
    
    data_list.append({
        "Company Name": company_name,
        "Location": location,
        "Website": website,
        "Description": description
    })

df = pd.DataFrame(data_list)
df.index += 1

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data_enriched.csv", index=False)

driver.quit()

ideas to this extended version: well in this code, I added a loop to go through each company's information, extracted the website, and added a placeholder for additional information (in this case, the description). i thougth that i can adapt this loop to extract more data as needed. At least this is the idea.

the working model: i think that the structure of the HTML of course changes here - and therefore in need to adapt the scraping-logik: so i think that i might need to adjust the CSS selectors accordingly based on the current structure of the page. So far so good: Well,i think that we need to make sure to customize the scraping logic based on the specific details we want to extract from each company's page. Conclusio: well i think i am very close: but see what i gotten back: the following

/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/bin/python /home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py
/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
       
 import pandas as pd
Traceback (most recent call last):
 File "/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py", line 29, in <module>
   description = info.select_one(".description").get_text(strip=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code 

and now - see below my allready working model: my approach to fetch some data form the given page: clutch.co/il/it-services

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()

import pandas as pd

+----+-----------------------------------------------------+--------------------------------+
|    | Company Name                                        | Location                       |
|----+-----------------------------------------------------+--------------------------------|
|  1 | Artelogic                                           | L'viv, Ukraine                 |
|  2 | Iron Forge Development                              | Palm Beach Gardens, FL         |
|  3 | Lionwood.software                                   | L'viv, Ukraine                 |
|  4 | Greelow                                             | Tel Aviv-Yafo, Israel          |
|  5 | Ester Digital                                       | Tel Aviv-Yafo, Israel          |
|  6 | Nextly                                              | Vitória, Brazil                |
|  7 | Rootstack                                           | Austin, TX                     |
|  8 | Novo                                                | Dallas, TX                     |
|  9 | Scalo                                               | Tel Aviv-Yafo, Israel          |
| 10 | TLVTech                                             | Herzliya, Israel               |
| 11 | Dofinity                                            | Bnei Brak, Israel              |
| 12 | PURPLE                                              | Petah Tikva, Israel            |
| 13 | Insitu S2 Tikshuv LTD                               | Haifa, Israel                  |
| 14 | Opinov8 Technology Services                         | London, United Kingdom         |
| 15 | Sogo Services                                       | Tel Aviv-Yafo, Israel          |
| 16 | Naviteq LTD                                         | Tel Aviv-Yafo, Israel          |
| 17 | BMT - Business Marketing Tools                      | Ra'anana, Israel               |
| 18 | Profisea                                            | Hod Hasharon, Israel           |
| 19 | MeteorOps                                           | Tel Aviv-Yafo, Israel          |
| 20 | Trivium Solutions                                   | Herzliya, Israel               |
| 21 | Dynomind.tech                                       | Jerusalem, Israel              |
| 22 | Madeira Data Solutions                              | Kefar Sava, Israel             |
| 23 | Titanium Blockchain                                 | Tel Aviv-Yafo, Israel          |
| 24 | Octopus Computer Solutions                          | Tel Aviv-Yafo, Israel          |
| 25 | Reblaze                                             | Tel Aviv-Yafo, Israel          |
| 26 | ELPC Networks Ltd                                   | Rosh Haayin, Israel            |
| 27 | Taldor                                              | Holon, Israel                  |
| 28 | Clarity                                             | Petah Tikva, Israel            |
| 29 | Opsfleet                                            | Kfar Bin Nun, Israel           |
| 30 | Hozek Technologies Ltd.                             | Petah Tikva, Israel            |
| 31 | ERG Solutions                                       | Ramat Gan, Israel              |
| 32 | Komodo Consulting                                   | Ra'anana, Israel               |
| 33 | SCADAfence                                          | Ramat Gan, Israel              |
| 34 | Ness Technologies | נס טכנולוגיות                         | Tel Aviv-Yafo, Israel          |
| 35 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel          |
| 36 | Radware                                             | Tel Aviv-Yafo, Israel          |
| 37 | BigData Boutique                                    | Rishon LeTsiyon, Israel        |
| 38 | NetNUt                                              | Tel Aviv-Yafo, Israel          |
| 39 | Asperii                                             | Petah Tikva, Israel            |
| 40 | PractiProject                                       | Ramat Gan, Israel              |
| 41 | K8Support                                           | Bnei Brak, Israel              |
| 42 | Odix                                                | Rosh Haayin, Israel            |
| 43 | Panaya                                              | Hod Hasharon, Israel           |
| 44 | MazeBolt Technologies                               | Giv'atayim, Israel             |
| 45 | Porat                                               | Tel Aviv-Jaffa, Israel         |
| 46 | MindU                                               | Tel Aviv-Yafo, Israel          |
| 47 | Valinor Ltd.                                        | Petah Tikva, Israel            |
| 48 | entrypoint                                          | Modi'in-Maccabim-Re'ut, Israel |
| 49 | Adelante                                            | Tel Aviv-Yafo, Israel          |
| 50 | Code n' Roll                                        | Haifa, Israel                  |
| 51 | Linnovate                                           | Bnei Brak, Israel              |
| 52 | Viceman Agency                                      | Tel Aviv-Jaffa, Israel         |
| 53 | develeap                                            | Tel Aviv-Yafo, Israel          |
| 54 | Chalir.com                                          | Binyamina-Giv'at Ada, Israel   |
| 55 | WolfCode                                            | Rishon LeTsiyon, Israel        |
| 56 | Penguin Strategies                                  | Ra'anana, Israel               |
| 57 | ANG Solutions                                       | Tel Aviv-Yafo, Israel          |
+----+-----------------------------------------------------+--------------------------------+

what is aimed: i want to to fetch some more data form the given page: clutch.co/il/it-services - eg the website and so on...

update_: The error AttributeError: 'NoneType' object has no attribute 'get_text' indicates that the .select_one(".description") method did not find any HTML element with the class ".description" for the current company information, resulting in None. Therefore, calling .get_text(strip=True) on None raises an AttributeError.

more to follow... later the day.

update2: note: @jakob had a interesting idea - posted here: Selenium in Google Colab without having to worry about managing the ChromeDriver executable - i tried an example using kora.selenium I made Google-Colab-Selenium to solve this problem. It manages the executable and the required Selenium Options for you. - well that sounds very very interesting - at the moment i cannot imagine that we get selenium working on colab in such a way - that the above mentioned scraper works on colab full and well!? - ideas !? would be awesome - ill test it later



from running bs4 scraper needs to be redefined to enrich the dataset - some issues

No comments:

Post a Comment