Wednesday, 21 June 2023

Use Python, AutoGPT and ChatGPT to extract data from downloaded HTML page

I already use Scrapy to crawl websites successfully, and I extract specific data from a webpage using CSS selectors. However, that is time-consuming to set up and error-prone. I want to be able to pass the raw HTML to ChatGPT and ask a question like:

"Give me in a JSON object format the price, array of photos, description, key features, street address, and zipcode of the object"
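
To make that request concrete, here is a minimal sketch of how such a prompt could be built and how the reply could be checked, assuming the model answers with a bare JSON object. The key names and the helpers `build_json_prompt` and `parse_listing_reply` are hypothetical and only illustrate the idea; no API call is made here.

```python
import json

# Hypothetical key names mirroring the fields listed in the question above
FIELDS = ["price", "photos", "description", "key_features",
          "street_address", "zipcode"]

def build_json_prompt(page_text):
    """Ask for exactly the keys in FIELDS as a single JSON object."""
    keys = ", ".join(FIELDS)
    return (
        f"{page_text}\n\n"
        f"Return a JSON object with exactly these keys: {keys}. "
        "Use null for anything that is not on the page."
    )

def parse_listing_reply(reply):
    """Parse the model's reply; return None if it is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the expected keys, filling in None for missing ones
    return {key: data.get(key) for key in FIELDS}
```

Returning `None` for anything that is not valid JSON makes it easy to spot the replies that are just free-form text.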

Right now I run into the max chat length of 4096 characters, so I decided to send the page in chunks. However, even with a simple question like "What is the price of this object?", where I'd expect the answer to be "$945,000", I'm just getting a whole bunch of text. I'm wondering what I'm doing wrong. I've heard that AutoGPT offers a new layer of flexibility, so I was also wondering whether it could be a solution here.
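
One thing to keep in mind when sending a page in chunks is that a value such as the price can be cut in half at a chunk boundary. Below is a minimal sketch of chunking with overlap, assuming a character-based limit as described above. The function name and the `overlap` parameter are illustrative and not taken from the code below. Note also that OpenAI model limits are counted in tokens rather than characters, so the character sizes here are only a rough stand-in.

```python
def chunk_with_overlap(text, chunk_size=4096, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a value no longer
    than `overlap` that straddles a boundary still appears whole in the
    following chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```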

My code:

import requests
from bs4 import BeautifulSoup, Comment
import openai
import json

# Set up your OpenAI API key
openai.api_key = "MYKEY"

# Fetch the HTML from the page
url = "https://www.corcoran.com/listing/for-sale/170-west-89th-street-2d-manhattan-ny-10024/22053660/regionId/1"
response = requests.get(url)

# Parse and clean the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Remove script and style tags, which hold no visible page text
for tag in soup(["script", "style"]):
    tag.decompose()

# for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
#     comment.extract()

# Join text nodes with a space so that adjacent elements don't run together
text = soup.get_text(separator=" ", strip=True)

# Divide the cleaned text into chunks of 4096 characters
def chunk_text(text, chunk_size=4096):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i+chunk_size])
    return chunks

print(text)

text_chunks = chunk_text(text)

# Send each text chunk to the OpenAI Completions API and ask the question
def get_price_from_gpt(text_chunks, question):
    for chunk in text_chunks:
        prompt = f"{question}\n\n{chunk}"
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=50,
            n=1,
            stop=None,
            temperature=0.5,
        )

        answer = response.choices[0].text.strip()
        if answer.lower() != "unknown" and len(answer) > 0:
            return answer

    return "Price not found"

question = "What is the price of this object?"
price = get_price_from_gpt(text_chunks, question)
print(price)
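
One detail in the code above: the loop checks for the answer "unknown", but the prompt never asks the model to reply that way. Below is a minimal sketch of a more explicit prompt, assuming a completion-style model that continues whatever text it is given. The helper `build_prompt` and its wording are hypothetical, and this is not a verified fix for the output problem.

```python
def build_prompt(chunk, question):
    """Frame one chunk so that a completion model can answer briefly.

    The page text comes first, clearly delimited, and the prompt ends
    with "Answer:" so that the continuation is the answer itself.
    """
    return (
        "Below is text extracted from a web page.\n"
        '"""\n'
        f"{chunk}\n"
        '"""\n\n'
        f"Question: {question}\n"
        "Reply with the value only. If the text does not contain the "
        'answer, reply with the single word "unknown".\n'
        "Answer:"
    )
```

This would take the place of the f-string assigned to `prompt` inside `get_price_from_gpt`.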


from Use python, AutoGPT and ChatGPT to extract data from downloaded HTML page
