I am having trouble encoding a large number of documents (more than a million) with the sentence_transformers library. Given a corpus (a list of very similar strings), when I do:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('msmarco-distilbert-base-v2')
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=False)
After some hours the process seems to be stuck: it never finishes, and when I check the process viewer nothing is running.
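For reference, encode already iterates over its input in internal mini-batches, and it accepts a progress-bar flag and an explicit batch size, so the same call can at least be made observable (the batch_size value below is arbitrary):

corpus_embeddings = embedder.encode(
    corpus,
    batch_size=32,            # arbitrary value; encode's internal mini-batch size
    show_progress_bar=True,   # shows a progress bar, so a stall becomes visible
    convert_to_tensor=False,
)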
As I suspect this is a RAM issue (the GPU board doesn't have enough memory to fit everything in a single step), I tried splitting the corpus into batches, transforming them into NumPy arrays, and concatenating them into a single matrix, as follows:
from itertools import zip_longest
from sentence_transformers import SentenceTransformer, util
import torch
from loguru import logger
import glob
import numpy as np
from natsort import natsorted

# group the corpus into fixed-size chunks, padding the last chunk with NaN
def grouper(iterable, n, fillvalue=np.nan):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

embedder = SentenceTransformer('msmarco-distilbert-base-v2')

for j, e in enumerate(list(grouper(corpus, 3))):
    try:
        # print('------------------')
        # v == v is False for NaN, so this drops the padding values
        for i in filter(lambda v: v == v, e):
            corpus_embeddings = embedder.encode(i, convert_to_tensor=False)
            # torch.save writes a pickle; the .npy extension is only cosmetic here
            torch.save(corpus_embeddings, f'/Users/user/Downloads/embeddings_part_{j}.npy')
    except TypeError:
        print(j, e)
        logger.debug("TypeError in batch {batch_num}", batch_num=j)

# load the saved parts back in order and stack them into one matrix
l = []
for e in natsorted(glob.glob("/Users/user/Downloads/*.npy")):
    l.append(torch.load(e))
corpus_embeddings = np.vstack(l)
corpus_embeddings
Nevertheless, the above procedure doesn't seem to work. When I try it on a small sample of the corpus, the matrices I get with and without the batch approach are different. For example:
Without the batch approach:
array([[-0.6828216 , -0.26541945, 0.31026787, ..., 0.19941986,
0.02366139, 0.4489861 ],
[-0.45781 , -0.02955275, 1.0897563 , ..., -0.20077021,
-0.37821707, 0.2248317 ],
[ 0.8532193 , -0.13642257, -0.8872398 , ..., -0.57482916,
0.12760726, -0.66986346],
...,
[-0.04036704, 0.06745373, -0.6010259 , ..., -0.08174597,
-0.18513843, -0.64744204],
[-0.30782765, -0.04935509, -0.11624689, ..., 0.10423593,
-0.14073376, -0.09206307],
[-0.77139395, -0.08119706, 0.43753916, ..., 0.1653319 ,
0.06861683, -0.16276269]], dtype=float32)
With the batch approach:
array([[ 0.8532191 , -0.13642241, -0.8872397 , ..., -0.5748289 ,
0.12760736, -0.6698637 ],
[ 0.3679317 , -0.21968201, 0.9932826 , ..., -0.86282325,
-0.04683857, 0.18995859],
[ 0.23026675, 0.69587034, -0.8116473 , ..., 0.23903558,
0.413471 , -0.23438476],
...,
[ 0.923319 , 0.4152724 , -0.3153545 , ..., -0.6863369 ,
0.01149149, -0.51300013],
[-0.30782777, -0.04935484, -0.11624689, ..., 0.10423636,
-0.1407339 , -0.09206269],
[-0.77139413, -0.08119693, 0.43753892, ..., 0.16533189,
0.06861652, -0.16276267]], dtype=float32)
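To compare the two outputs beyond eyeballing them, something along these lines could be used (a sketch; full_embeddings and batched_embeddings are placeholder names for the two arrays above):

import numpy as np

# shapes should match before comparing values
print(full_embeddings.shape, batched_embeddings.shape)
# element-wise comparison, with a small tolerance for float32 noise
print(np.allclose(full_embeddings, batched_embeddings, atol=1e-5))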
What is the correct way of doing the above batch procedure?
UPDATE
After inspecting the above batch procedure, I found that I get the same matrix output with and without batching when I set the batch size in the above code to 1 (enumerate(list(grouper(corpus, 1)))). Therefore, my question is: what is the correct way of applying the encoder to a large set of documents?
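I imagine the correct version looks something like the per-chunk sketch below, where every chunk is written to its own file and encode is called on a list so it returns one row per document (chunk_size and the output path are placeholders I chose), but I am not sure whether this is the recommended way:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v2')

chunk_size = 10000  # placeholder; to be tuned to the available memory
parts = []
for start in range(0, len(corpus), chunk_size):
    chunk = corpus[start:start + chunk_size]
    # encoding a list returns a 2-D array with one row per input string
    emb = embedder.encode(chunk, convert_to_tensor=False)
    # one file per chunk, named by its starting offset so names never collide
    np.save(f'/Users/user/Downloads/embeddings_part_{start}.npy', emb)
    parts.append(emb)

corpus_embeddings = np.vstack(parts)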