Tuesday, 21 June 2022

Efficiently find the indices of shared values (with repeats) between two large arrays

Problem description

Let's take this simple array set

import numpy as np

            # 0,1,2,3,4
a = np.array([1,1,3,4,6])
b = np.array([6,6,1,3])

From these two arrays I want to get the indices of all possible matches. So for the number 1 we get (0,2) and (1,2), with the complete output looking like:

0,2 # 1
1,2 # 1
2,3 # 3
4,0 # 6
4,1 # 6

Note that the arrays are neither sorted nor limited to unique values - two conditions often assumed in other answers (see bottom). The example above is tiny; in practice I have to apply this to arrays of ~40K elements.


Tried approaches

1. Python loop approach
indx = []
for i, aval in enumerate(a):
    for j, bval in enumerate(b):
        if aval == bval:
            indx.append([i,j])
# [[0, 2], [1, 2], [2, 3], [4, 0], [4, 1]]
2. Python dict approach
from collections import defaultdict

adict = defaultdict(list)
bdict = defaultdict(list)
for i, aval in enumerate(a): adict[aval].append(i)
for j, bval in enumerate(b): bdict[bval].append(j)

for val, a_positions in adict.items():
    for b_position in bdict[val]:
        for a_position in a_positions:
            print(a_position, b_position)
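A lighter variant of the dict approach (my sketch, not from the original) needs only one dict: index b once, then stream over a and collect the pairs instead of printing them.

```python
from collections import defaultdict

import numpy as np

a = np.array([1, 1, 3, 4, 6])
b = np.array([6, 6, 1, 3])

# Index only b, then make one pass over a: a single dict and two linear passes.
bdict = defaultdict(list)
for j, bval in enumerate(b):
    bdict[bval].append(j)

pairs = [(i, j) for i, aval in enumerate(a) for j in bdict[aval]]
print(pairs)  # [(0, 2), (1, 2), (2, 3), (4, 0), (4, 1)]
```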
3. NumPy where

print(np.where(a.reshape(-1,1) == b))
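Note that broadcasting `a.reshape(-1,1) == b` materializes a full len(a) × len(b) boolean matrix, so memory and time are O(N·M) - for two 40K arrays that is 1.6 billion cells, which likely explains the slowdown measured below. The `np.where` result already is the pair list, just split into two arrays:

```python
import numpy as np

a = np.array([1, 1, 3, 4, 6])
b = np.array([6, 6, 1, 3])

# np.where on the broadcast comparison returns (row indices, column indices),
# i.e. the matching positions in a and b respectively, in row-major order.
rows, cols = np.where(a.reshape(-1, 1) == b)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 2), (1, 2), (2, 3), (4, 0), (4, 1)]
```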

4. Polars dataframes

Converting the arrays to DataFrames and then joining them with Polars:

import polars as pl
a = pl.DataFrame( {'x': a, 'apos': list(range(len(a)))} )
b = pl.DataFrame( {'x': b, 'bpos': list(range(len(b)))} )
a.join(b, how='inner', on='x')

"Big data"

On "big" data, Polars is currently the fastest at around 0.02 s. I'm surprised that creating DataFrames first and then joining them beats every other approach I could come up with, so I'm curious whether there is any other way to beat it :)
a = np.random.randint(0,1000, 40000)
b = np.random.randint(0,1000, 40000)

Using the above data:

  1. python loop: 218s
  2. python dict: 0.03s
  3. numpy.where: 4.5s
  4. polars: 0.02s
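For reproducibility, a minimal timing sketch of the dict approach on the same kind of data (the seeded generator and helper function are my additions; absolute numbers depend on hardware):

```python
import timeit
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 40000)
b = rng.integers(0, 1000, 40000)

def dict_pairs(a, b):
    # Index b once, then stream over a collecting matching index pairs.
    bdict = defaultdict(list)
    for j, bval in enumerate(b):
        bdict[bval].append(j)
    return [(i, j) for i, aval in enumerate(a) for j in bdict[aval]]

# Single run; with ~1000 distinct values, roughly 1.6M pairs come out.
t = timeit.timeit(lambda: dict_pairs(a, b), number=1)
print(f"dict approach: {t:.3f}s")
```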

How related questions didn't solve this

As noted above, answers to related questions typically assume the arrays are sorted or contain only unique values, so they don't apply here. Any approach is welcome: Cython, Numba, Pythran, etc.



from Efficiently find the indices of shared values (with repeats) between two large arrays
