Tuesday, 11 July 2023

Adding large chunks of text after embedding into pinecone without openai ratelimit in langchain

I am using LangChain (Python) to read data from a PDF, split it into chunks of text, embed the chunks into vectors, and load them into a vector store, here Pinecone. I am getting a max-retry error.

I guess I am loading all the chunks at once, which may be causing the issue. Is there some function like `add_document` that can be used to load the data/chunks one by one? What am I doing wrong here? I am new to LangChain.

def load_document(file):
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file} ..')
    loader = PyPDFLoader(file)
    # the line below returns a list of LangChain documents, one document per page
    data = loader.load() 
    return data


data=load_document("DATA/capacitance.pdf")
# prints the content of the second page and the metadata of the third
print(data[1].page_content)
print(data[2].metadata)

#chunking
def chunk_data(data,chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=0)
    chunks=text_splitter.split_documents(data)
    print(type(chunks))
    return chunks

chunks=chunk_data(data)
print(len(chunks))

Up to chunking my code works well: it loads the PDF, converts it to text, and chunks the data. When it comes to embedding, I tried both Pinecone and FAISS. For Pinecone I already created an index 'electrostatics':

pinecone.create_index('electrostatics',dimension=1536,metric='cosine')

import os
from dotenv import load_dotenv,find_dotenv
load_dotenv("D:/test/.env")
print(os.environ.get("OPENAI_API_KEY"))

def insert_embeddings(index_name,chunks):
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings

    embeddings=OpenAIEmbeddings()
    pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"),environment=os.environ.get("PINECONE_ENV"))
    vector_store=Pinecone.from_documents(chunks,embeddings,index_name=index_name)
    print("Ok")
    return vector_store

I tried embedding in the following two ways.

index_name='electrostatics'
vector_store=insert_embeddings(index_name,chunks)
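One thing I am considering is inserting the chunks in small batches instead of all at once, pausing between batches to stay under the OpenAI rate limit. Below is a sketch: the `batched` helper is plain Python, but the commented `from_existing_index` / `add_documents` calls and the 20-second pause are my assumptions from the LangChain vector-store interface, not verified against my setup:

```python
import time

def batched(items, batch_size):
    """Yield successive batch_size-sized slices of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Sketch: instead of Pinecone.from_documents(chunks, ...) in one call,
# attach to the existing index once and add documents batch by batch.
# (vector_store.add_documents and the pause length are assumptions.)
#
# vector_store = Pinecone.from_existing_index(index_name, embeddings)
# for batch in batched(chunks, 50):
#     vector_store.add_documents(batch)
#     time.sleep(20)  # back off between batches

print(list(batched([1, 2, 3, 4, 5], 2)))  # → [[1, 2], [3, 4], [5]]
```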

I tried with FAISS as well

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings=OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
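I also wondered about simply retrying with exponential backoff when the rate limit is hit. Here is a generic plain-Python sketch; the wrapper itself and the idea of wrapping the FAISS build in it are my own guesses, not something I have confirmed fixes the error:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Sketch of how it might wrap the FAISS build (an assumption):
# db = with_backoff(lambda: FAISS.from_documents(chunks, embeddings))
```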




from Adding large chunks of text after embedding into pinecone without openai ratelimit in langchain
