Monday 29 January 2024

running bs4 scraper needs to be redefined to enrich the dataset - some issues

got a bs4 scraper that works with selenium - see far below:

well - it works fine so far:

see far below my approach to fetch some data form the given page: clutch.co/il/it-services

To enrich the scraped data, with additional information, i tried to modify the scraping-logic to extract more details from each company's page. Here's i have to an updated version of the code that extracts the company's website and additional information:

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_info = soup.select(".directory-list div.provider-info")

data_list = []
for info in company_info:
    company_name = info.select_one(".company_info a").get_text(strip=True)
    location = info.select_one(".locality").get_text(strip=True)
    website = info.select_one(".company_info a")["href"]
    
    # Additional information you want to extract goes here
    # For example, you can extract the description
    description = info.select_one(".description").get_text(strip=True)
    
    data_list.append({
        "Company Name": company_name,
        "Location": location,
        "Website": website,
        "Description": description
    })

df = pd.DataFrame(data_list)
df.index += 1

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data_enriched.csv", index=False)

driver.quit()

ideas to this extended version: well in this code, I added a loop to go through each company's information, extracted the website, and added a placeholder for additional information (in this case, the description). i thougth that i can adapt this loop to extract more data as needed. At least this is the idea.

the working model: i think that the structure of the HTML of course changes here - and therefore in need to adapt the scraping-logik: so i think that i might need to adjust the CSS selectors accordingly based on the current structure of the page. So far so good: Well,i think that we need to make sure to customize the scraping logic based on the specific details we want to extract from each company's page. Conclusio: well i think i am very close: but see what i gotten back: the following

/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/bin/python /home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py
/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
       
 import pandas as pd
Traceback (most recent call last):
 File "/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py", line 29, in <module>
   description = info.select_one(".description").get_text(strip=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code 

and now - see below my allready working model: my approach to fetch some data form the given page: clutch.co/il/it-services

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()

import pandas as pd

+----+-----------------------------------------------------+--------------------------------+
|    | Company Name                                        | Location                       |
|----+-----------------------------------------------------+--------------------------------|
|  1 | Artelogic                                           | L'viv, Ukraine                 |
|  2 | Iron Forge Development                              | Palm Beach Gardens, FL         |
|  3 | Lionwood.software                                   | L'viv, Ukraine                 |
|  4 | Greelow                                             | Tel Aviv-Yafo, Israel          |
|  5 | Ester Digital                                       | Tel Aviv-Yafo, Israel          |
|  6 | Nextly                                              | Vitória, Brazil                |
|  7 | Rootstack                                           | Austin, TX                     |
|  8 | Novo                                                | Dallas, TX                     |
|  9 | Scalo                                               | Tel Aviv-Yafo, Israel          |
| 10 | TLVTech                                             | Herzliya, Israel               |
| 11 | Dofinity                                            | Bnei Brak, Israel              |
| 12 | PURPLE                                              | Petah Tikva, Israel            |
| 13 | Insitu S2 Tikshuv LTD                               | Haifa, Israel                  |
| 14 | Opinov8 Technology Services                         | London, United Kingdom         |
| 15 | Sogo Services                                       | Tel Aviv-Yafo, Israel          |
| 16 | Naviteq LTD                                         | Tel Aviv-Yafo, Israel          |
| 17 | BMT - Business Marketing Tools                      | Ra'anana, Israel               |
| 18 | Profisea                                            | Hod Hasharon, Israel           |
| 19 | MeteorOps                                           | Tel Aviv-Yafo, Israel          |
| 20 | Trivium Solutions                                   | Herzliya, Israel               |
| 21 | Dynomind.tech                                       | Jerusalem, Israel              |
| 22 | Madeira Data Solutions                              | Kefar Sava, Israel             |
| 23 | Titanium Blockchain                                 | Tel Aviv-Yafo, Israel          |
| 24 | Octopus Computer Solutions                          | Tel Aviv-Yafo, Israel          |
| 25 | Reblaze                                             | Tel Aviv-Yafo, Israel          |
| 26 | ELPC Networks Ltd                                   | Rosh Haayin, Israel            |
| 27 | Taldor                                              | Holon, Israel                  |
| 28 | Clarity                                             | Petah Tikva, Israel            |
| 29 | Opsfleet                                            | Kfar Bin Nun, Israel           |
| 30 | Hozek Technologies Ltd.                             | Petah Tikva, Israel            |
| 31 | ERG Solutions                                       | Ramat Gan, Israel              |
| 32 | Komodo Consulting                                   | Ra'anana, Israel               |
| 33 | SCADAfence                                          | Ramat Gan, Israel              |
| 34 | Ness Technologies | נס טכנולוגיות                         | Tel Aviv-Yafo, Israel          |
| 35 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel          |
| 36 | Radware                                             | Tel Aviv-Yafo, Israel          |
| 37 | BigData Boutique                                    | Rishon LeTsiyon, Israel        |
| 38 | NetNUt                                              | Tel Aviv-Yafo, Israel          |
| 39 | Asperii                                             | Petah Tikva, Israel            |
| 40 | PractiProject                                       | Ramat Gan, Israel              |
| 41 | K8Support                                           | Bnei Brak, Israel              |
| 42 | Odix                                                | Rosh Haayin, Israel            |
| 43 | Panaya                                              | Hod Hasharon, Israel           |
| 44 | MazeBolt Technologies                               | Giv'atayim, Israel             |
| 45 | Porat                                               | Tel Aviv-Jaffa, Israel         |
| 46 | MindU                                               | Tel Aviv-Yafo, Israel          |
| 47 | Valinor Ltd.                                        | Petah Tikva, Israel            |
| 48 | entrypoint                                          | Modi'in-Maccabim-Re'ut, Israel |
| 49 | Adelante                                            | Tel Aviv-Yafo, Israel          |
| 50 | Code n' Roll                                        | Haifa, Israel                  |
| 51 | Linnovate                                           | Bnei Brak, Israel              |
| 52 | Viceman Agency                                      | Tel Aviv-Jaffa, Israel         |
| 53 | develeap                                            | Tel Aviv-Yafo, Israel          |
| 54 | Chalir.com                                          | Binyamina-Giv'at Ada, Israel   |
| 55 | WolfCode                                            | Rishon LeTsiyon, Israel        |
| 56 | Penguin Strategies                                  | Ra'anana, Israel               |
| 57 | ANG Solutions                                       | Tel Aviv-Yafo, Israel          |
+----+-----------------------------------------------------+--------------------------------+

what is aimed: i want to to fetch some more data form the given page: clutch.co/il/it-services - eg the website and so on...

update_: The error AttributeError: 'NoneType' object has no attribute 'get_text' indicates that the .select_one(".description") method did not find any HTML element with the class ".description" for the current company information, resulting in None. Therefore, calling .get_text(strip=True) on None raises an AttributeError.

more to follow... later the day.

update2: note: @jakob had a interesting idea - posted here: Selenium in Google Colab without having to worry about managing the ChromeDriver executable - i tried an example using kora.selenium I made Google-Colab-Selenium to solve this problem. It manages the executable and the required Selenium Options for you. - well that sounds very very interesting - at the moment i cannot imagine that we get selenium working on colab in such a way - that the above mentioned scraper works on colab full and well!? - ideas !? would be awesome - ill test it later



from running bs4 scraper needs to be redefined to enrich the dataset - some issues

image reconstruction from predicted array (normalize - unnormalize array?)

I have two images, E1 and E3, and I am training a CNN model.

In order to train the model, I use E1 as train and E3 as y_train.

I extract tiles from these images in order to train the model on tiles.

The model, does not have an activation layer, so the output can take any value.

So, the predictions for example, preds , have values around preds.max() = 2.35 and preds.min() = -1.77.

My problem is that I can't reconstruct the image at the end using preds and I think the problem is the scaling-unscaling of the preds values.

If I just do np.uint8(preds) its is almost full of zeros since preds has small values.

The image should look like as close as possible to E2 image.

import cv2
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, \
    Input, Add
from tensorflow.keras.models import Model
from PIL import Image

CHANNELS = 1
HEIGHT = 32
WIDTH = 32
INIT_SIZE = ((1429, 1416))

def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data) + 1e-6)

def extract_image_tiles(size, im):
    im = im[:, :, :CHANNELS]
    w = h = size
    idxs = [(i, (i + h), j, (j + w)) for i in range(0, im.shape[0], h) for j in range(0, im.shape[1], w)]
    tiles_asarrays = []
    count = 0
    for k, (i_start, i_end, j_start, j_end) in enumerate(idxs):
        tile = im[i_start:i_end, j_start:j_end, ...]
        if tile.shape[:2] != (h, w):
            tile_ = tile
            tile_size = (h, w) if tile.ndim == 2 else (h, w, tile.shape[2])
            tile = np.zeros(tile_size, dtype=tile.dtype)
            tile[:tile_.shape[0], :tile_.shape[1], ...] = tile_
        
        count += 1
        tiles_asarrays.append(tile)
    return np.array(idxs), np.array(tiles_asarrays)


def build_model(height, width, channels):
    inputs = Input((height, width, channels))

    f1 = Conv2D(32, 3, padding='same')(inputs)
    f1 = BatchNormalization()(f1)
    f1 = Activation('relu')(f1)
    
    f2 = Conv2D(16, 3, padding='same')(f1)
    f2 = BatchNormalization()(f2)
    f2 = Activation('relu')(f2)
    
    f3 = Conv2D(16, 3, padding='same')(f2)
    f3 = BatchNormalization()(f3)
    f3 = Activation('relu')(f3)

    addition = Add()([f2, f3])
    
    f4 = Conv2D(32, 3, padding='same')(addition)
    
    f5 = Conv2D(16, 3, padding='same')(f4)
    f5 = BatchNormalization()(f5)
    f5 = Activation('relu')(f5)
   
    f6 = Conv2D(16, 3, padding='same')(f5)
    f6 = BatchNormalization()(f6)
    f6 = Activation('relu')(f6)
   
    output = Conv2D(1, 1, padding='same')(f6)

    model = Model(inputs, output)

    return model

# Load data
img = cv2.imread('E1.tif', cv2.IMREAD_UNCHANGED)
img = cv2.resize(img, (1408, 1408), interpolation=cv2.INTER_AREA)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = np.array(img, np.uint8)
#plt.imshow(img)
img3 = cv2.imread('E3.tif', cv2.IMREAD_UNCHANGED)
img3 = cv2.resize(img3, (1408, 1408), interpolation=cv2.INTER_AREA)
img3 = cv2.cvtColor(img3, cv2.COLOR_BGR2RGB)
img3 = np.array(img3, np.uint8)

# extract tiles from images
idxs, tiles = extract_image_tiles(WIDTH, img)
idxs2, tiles3 = extract_image_tiles(WIDTH, img3)

# split to train and test data
split_idx = int(tiles.shape[0] * 0.9)

train = tiles[:split_idx]
val = tiles[split_idx:]

y_train = tiles3[:split_idx]
y_val = tiles3[split_idx:]

# build model
model = build_model(HEIGHT, WIDTH, CHANNELS)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss = tf.keras.losses.Huber(),
              metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')])

# scale data before training
train  = train / 255.
val = val / 255.

y_train = y_train / 255.
y_val = y_val / 255.

# train
history = model.fit(train, 
                    y_train, 
                    validation_data=(val, y_val),
                    epochs=50)

# predict on E2
img2 = cv2.imread('E2.tif', cv2.IMREAD_UNCHANGED)
img2 = cv2.resize(img2, (1408, 1408), interpolation=cv2.INTER_AREA)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)
img2 = np.array(img2, np.uint8)

# extract tiles from images
idxs, tiles2 = extract_image_tiles(WIDTH, img2)

#scale data
tiles2 = tiles2 / 255.

preds = model.predict(tiles2)
#preds = NormalizeData(preds)
#preds = np.uint8(preds)
# reconstruct predictions
reconstructed = np.zeros((img.shape[0],
                          img.shape[1]),
                          dtype=np.uint8)

# reconstruction process
for tile, (y_start, y_end, x_start, x_end) in zip(preds[:, :, -1], idxs):
    y_end = min(y_end, img.shape[0])
    x_end = min(x_end, img.shape[1])
    reconstructed[y_start:y_end, x_start:x_end] = tile[:(y_end - y_start), :(x_end - x_start)]


im = Image.fromarray(reconstructed)
im = im.resize(INIT_SIZE)
im.show()

You can find the data here

If I use :

def normalize_arr_to_uint8(arr):
  the_min = arr.min()
  the_max = arr.max()
  the_max -= the_min
  arr = ((arr - the_min) / the_max) * 255.
  return arr.astype(np.uint8)


preds = model.predict(tiles2)
preds = normalize_arr_to_uint8(preds)

then, I receive an image which seems right, but with lines all over.



from image reconstruction from predicted array (normalize - unnormalize array?)

Friday 26 January 2024

How can I identify rectangles in an image when they are of different colours, outlines and sometimes very close to the background colour

I'm trying to extract rectangles from an image. These are digital stickies on a digital notepad. They can be any user configurable colour, including transparent with a border. I want to be able to input a jpg/png file and get back a list of each of the rectangles, their coordinates and the colour of the rectangle.

OpenCV with Python is the route that I want to use for this. Below is the example image, the intention is to detect all of the rectangles only and retrieve the above mentioned information.

Example Image for Extraction

I've done quite a lot of reading and been using the find contours method to try and achieve my goal however I'm not getting the desired result.

import cv2

# reading image
img = cv2.imread('images/example_shapes.jpg')

# converting image into grayscale image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# setting threshold of gray image
_, threshold = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# using a findContours() function
contours, _ = cv2.findContours(
    threshold, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

i = 0

# list for storing names of shapes
for contour in contours:

    # here we are ignoring first counter because
    # findcontour function detects whole image as shape
    if i == 0:
        i = 1
        continue

    # cv2.approxPloyDP() function to approximate the shape
    approx = cv2.approxPolyDP(
        contour, 0.01 * cv2.arcLength(contour, True), True)

    if len(approx) == 4:
        cv2.drawContours(img, [contour], 0, (0, 0, 255), 5)

# displaying the image after drawing contours
# img = cv2.resize(img, (500, 500))
cv2.imshow('shapes', img)

cv2.waitKey(0)
cv2.destroyAllWindows()

This would only detect the 2 rectangles in the middle and gave the following: enter image description here

I had then attempted to adjust the threshold to be adaptive thresholding:

threshold = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 13, 7)

which produced the following result enter image description here

Neither approach seems to be able to detect the rectangles who are close together and also a close colour to the backround and neither detects the rectangles with a stroke. The adaptive thresholding also returns a lot of items that are irrelevant.

Any suggestions on how to approach would be very welcome!



from How can I identify rectangles in an image when they are of different colours, outlines and sometimes very close to the background colour