Wednesday 14 July 2021

What is the correct way of encoding a large batch of documents with sentence transformers/pytorch?

I am having trouble encoding a large number of documents (more than a million) with the sentence_transformers library.

Given a corpus, which is a list of strings, when I do:

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('msmarco-distilbert-base-v2')
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=False)
    

After some hours the process seems to be stuck: it never finishes, and when I check the process viewer nothing is running.
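For what it's worth, a minimal variant I could use to check whether encoding is actually advancing (show_progress_bar and batch_size are standard arguments of encode; corpus is the same list as above):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v2')

# A progress bar per internal mini-batch shows whether the call is still moving,
# and a modest batch_size keeps the per-step memory usage bounded.
corpus_embeddings = embedder.encode(
    corpus,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
)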

As I suspect this is a RAM issue (the GPU board doesn't have enough memory to fit everything in a single step), I tried splitting the corpus into batches, transforming them into numpy arrays and concatenating them into a single matrix as follows:

from itertools import zip_longest
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np
from loguru import logger
import glob
from natsort import natsorted


def grouper(iterable, n, fillvalue=np.nan):
    # Collect the iterable into fixed-length chunks, padding the last one with fillvalue
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

embedder = SentenceTransformer('msmarco-distilbert-base-v2')

for j, e in enumerate(list(grouper(corpus, 3))):
    try:
        # v == v filters out the NaN padding added by grouper
        for i in filter(lambda v: v == v, e):
            corpus_embeddings = embedder.encode(i, convert_to_tensor=False)
            torch.save(corpus_embeddings, f'/Users/user/Downloads/embeddings_part_{j}.npy')
    except TypeError:
        print(j, e)
        logger.debug("TypeError in batch {batch_num}", batch_num=j)

l = []
for e in natsorted(glob.glob("/Users/user/Downloads/*.npy")):
    l.append(torch.load(e))
corpus_embeddings = np.vstack(l)
corpus_embeddings

Nevertheless, the above procedure doesn't seem to work. When I try a small sample of the corpus with and without the batch approach, the matrices I get are different, for example:

Without batch approach:

array([[-0.6828216 , -0.26541945,  0.31026787, ...,  0.19941986,
         0.02366139,  0.4489861 ],
       [-0.45781   , -0.02955275,  1.0897563 , ..., -0.20077021,
        -0.37821707,  0.2248317 ],
       [ 0.8532193 , -0.13642257, -0.8872398 , ..., -0.57482916,
         0.12760726, -0.66986346],
       ...,
       [-0.04036704,  0.06745373, -0.6010259 , ..., -0.08174597,
        -0.18513843, -0.64744204],
       [-0.30782765, -0.04935509, -0.11624689, ...,  0.10423593,
        -0.14073376, -0.09206307],
       [-0.77139395, -0.08119706,  0.43753916, ...,  0.1653319 ,
         0.06861683, -0.16276269]], dtype=float32)

With batch approach:

array([[ 0.8532191 , -0.13642241, -0.8872397 , ..., -0.5748289 ,
         0.12760736, -0.6698637 ],
       [ 0.3679317 , -0.21968201,  0.9932826 , ..., -0.86282325,
        -0.04683857,  0.18995859],
       [ 0.23026675,  0.69587034, -0.8116473 , ...,  0.23903558,
         0.413471  , -0.23438476],
       ...,
       [ 0.923319  ,  0.4152724 , -0.3153545 , ..., -0.6863369 ,
         0.01149149, -0.51300013],
       [-0.30782777, -0.04935484, -0.11624689, ...,  0.10423636,
        -0.1407339 , -0.09206269],
       [-0.77139413, -0.08119693,  0.43753892, ...,  0.16533189,
         0.06861652, -0.16276267]], dtype=float32)

What is the correct way of doing the above batch procedure?

UPDATE

After inspecting the above batch procedure, I found that I get the same matrix output with and without batching when I set the batch size of the above code to 1 (enumerate(list(grouper(corpus, 1)))). Therefore, my question is: what is the correct way of applying the encoder to a large set of documents?
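To make the question concrete, this is a sketch of what I understand the per-chunk version would look like if each chunk were passed to encode() as a whole list (rather than string by string) and saved with np.save; the chunk size and file paths are placeholders, and whether this is the recommended approach is exactly what I am asking:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v2')

chunk_size = 10000  # placeholder; pick whatever fits in memory
paths = []
for j in range(0, len(corpus), chunk_size):
    chunk = corpus[j:j + chunk_size]
    # encode() receives the whole chunk as a list, so the library handles its
    # own internal mini-batching and the row order matches the corpus order
    emb = embedder.encode(chunk, batch_size=32, show_progress_bar=True,
                          convert_to_numpy=True)
    path = f'/Users/user/Downloads/embeddings_part_{j}.npy'
    np.save(path, emb)  # np.save matches the .npy extension
    paths.append(path)

# Reassemble the full matrix in the original corpus order
corpus_embeddings = np.vstack([np.load(p) for p in paths])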



from What is the correct way of encoding a large batch of documents with sentence transformers/pytorch?
