Thursday, 8 October 2020

Unpacking sparse matrix performance tuning

I'm using the sparse_dot_topn library, created by the data scientists at ING, to search for near duplicates in a large set of company names (nearly 1.5M records). A recent update of this library makes it possible to use multiple threads to compute the matrix product (which yields the cosine similarities) between the two matrices. I ran a quick benchmark, and the performance improvement is significant (depending on how many cores are available on your machine or remote server):

+-----------+-----------------+
| # threads | time (MM:SS.cc) |
+-----------+-----------------+
| 32        | 03:43.12        |
+-----------+-----------------+
| 16        | 05:16.97        |
+-----------+-----------------+
| 8         | 08:11.69        |
+-----------+-----------------+
| 4         | 13:32.72        |
+-----------+-----------------+
| 2         | 24:02.28        |
+-----------+-----------------+
| 1         | 47:11.30        |
+-----------+-----------------+
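For context, the cosine similarity falls out of the matrix product once both matrices have L2-normalized rows. Below is a minimal, single-threaded SciPy sketch of that step, without the top-n pruning and threading that sparse_dot_topn provides (`cosine_similarity_sparse` is my own name, not part of the library):

import numpy as np
from scipy.sparse import csr_matrix, diags


def cosine_similarity_sparse(A, B):
    """Cosine similarity between every row of A and every row of B."""

    def normalize_rows(M):
        # L2 norm of each row; guard against division by zero for empty rows
        norms = np.sqrt(M.multiply(M).sum(axis=1)).A.ravel()
        norms[norms == 0] = 1.0
        return diags(1.0 / norms) @ M

    # After normalization, A @ B.T holds the cosine similarity of each row pair
    return normalize_rows(A) @ normalize_rows(B).T

With the rows normalized, entry (i, j) of the result is exactly the cosine of the angle between row i of A and row j of B, so the self-similarity diagonal is 1.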

In order to explore the outcome easily, I needed to unpack the resulting sparse matrix. Fortunately, I found the following helper function, written by Chris van den Berg, which does exactly that (described in his blog post):

import numpy as np
import pandas as pd


def get_matches_df(sparse_matrix, name_vector, top=100):
    # Row and column indices of every non-zero entry (i.e., every match)
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # Unpack either the first `top` matches or all of them
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame(
        {"left_side": left_side, "right_side": right_side, "similarity": similarity}
    )

The above function has an optional argument to unpack only the first `top` matches, but I have to run it on the full data. My issue at the moment is that this takes a very long time to complete (roughly one hour).

Q: How can I increase the performance here (if that's possible at all)? Especially since I have quite a few cores that I'm not using for this job.

I'm no expert when it comes to performance tuning. One option I explored is Numba: I decorated the function with @njit(parallel=True) and used Numba's prange instead of range to mark the loop as parallelizable, but this failed. My understanding is that Numba can't handle string values (i.e., the company names in my case).
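Since the loop body does nothing but index into arrays, one alternative that sidesteps Numba entirely is to replace the Python loop with NumPy fancy indexing. A sketch of that idea, assuming a SciPy sparse matrix (`get_matches_df_vectorized` and the `similarity` column name are my own choices):

import numpy as np
import pandas as pd


def get_matches_df_vectorized(sparse_matrix, name_vector, top=None):
    # COO format guarantees that row, col, and data arrays line up one-to-one
    coo = sparse_matrix.tocoo()
    rows, cols, data = coo.row, coo.col, coo.data

    if top:
        rows, cols, data = rows[:top], cols[:top], data[:top]

    # Fancy indexing pulls out all names at once instead of one per iteration
    names = np.asarray(name_vector, dtype=object)
    return pd.DataFrame(
        {"left_side": names[rows], "right_side": names[cols], "similarity": data}
    )

Because the per-element Python loop disappears, this should run in seconds rather than an hour on 1.5M records, and it needs no extra cores at all.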

Any help on possible approaches to increase the performance would be greatly appreciated.

