I already use Scrapy to crawl websites successfully, extracting specific data from each page with CSS selectors. However, this is time-consuming to set up and error-prone. I want to be able to pass the raw HTML to ChatGPT and ask a question like:
"Give me, as a JSON object, the price, array of photos, description, key features, street address, and zipcode of the object"
Right now I run into the maximum chat length of 4096 characters, so I decided to send the page in chunks. However, even with a simple question like "What is the price of this object?", where I'd expect the answer "$945,000", I just get a whole bunch of text back. I'm wondering what I'm doing wrong. I've heard that AutoGPT offers a new layer of flexibility, so I'm also wondering whether it could be a solution here.
My code:
import requests
from bs4 import BeautifulSoup, Comment
import openai
import json

# Set up your OpenAI API key
openai.api_key = "MYKEY"

# Fetch the HTML from the page
url = "https://www.corcoran.com/listing/for-sale/170-west-89th-street-2d-manhattan-ny-10024/22053660/regionId/1"
response = requests.get(url)

# Parse and clean the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Remove unnecessary tags, comments, and scripts
for script in soup(["script", "style"]):
    script.extract()
# for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
#     comment.extract()
text = soup.get_text(strip=True)

# Divide the cleaned text into chunks of 4096 characters
def chunk_text(text, chunk_size=4096):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i+chunk_size])
    return chunks

print(text)
text_chunks = chunk_text(text)

# Send text chunks to ChatGPT API and ask for the price
def get_price_from_gpt(text_chunks, question):
    for chunk in text_chunks:
        prompt = f"{question}\n\n{chunk}"
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=50,
            n=1,
            stop=None,
            temperature=0.5,
        )
        answer = response.choices[0].text.strip()
        if answer.lower() != "unknown" and len(answer) > 0:
            return answer
    return "Price not found"

question = "What is the price of this object?"
price = get_price_from_gpt(text_chunks, question)
print(price)
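For the JSON request described at the top, a minimal sketch of the prompt-building and answer-parsing side might look like the following. It makes no API call. The field names and the helpers `build_extraction_prompt`, `parse_json_answer`, and `merge_chunk_results` are hypothetical and not part of the code above. It assumes the model can be told to reply with a JSON object only, using null for fields missing from a chunk; whether a given model actually follows that instruction is not guaranteed.

```python
import json

# Hypothetical field names, taken from the wish list in the question.
FIELDS = ["price", "photos", "description", "key_features",
          "street_address", "zipcode"]


def build_extraction_prompt(chunk, fields=FIELDS):
    """Build a prompt that asks for a JSON object only (null for missing values)."""
    keys = ", ".join(fields)
    return (
        "Extract the following fields from the listing text below.\n"
        f"Fields: {keys}\n"
        "Reply with a single JSON object and nothing else. "
        "Use null for any field that is not present in the text.\n\n"
        f"Listing text:\n{chunk}"
    )


def parse_json_answer(answer, fields=FIELDS):
    """Return a dict with the expected keys, or None if no JSON object is found."""
    # Tolerate extra text around the object by slicing from the first "{"
    # to the last "}".
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {key: data.get(key) for key in fields}


def merge_chunk_results(results, fields=FIELDS):
    """Keep the first non-null value seen for each field across all chunks."""
    merged = {key: None for key in fields}
    for result in results:
        if not result:
            continue
        for key in fields:
            if merged[key] is None and result.get(key) is not None:
                merged[key] = result[key]
    return merged
```

Each chunk from `chunk_text` would be passed through `build_extraction_prompt`, the model's reply through `parse_json_answer`, and the per-chunk dicts through `merge_chunk_results`. Because a chunk that lacks the price yields null instead of free text, the loop can move on to the next chunk rather than returning the first non-empty answer.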
from Use python, AutoGPT and ChatGPT to extract data from downloaded HTML page