Hemant Vishwakarma: fast partial match checking in R or python or Julia

Sunday, 3 April 2022

I have two datasets with names, and I need to compare the names in both. I just need to keep the union of the two datasets based on the names. However, a name still counts as 'matched' if it is part of another name, or the other way around, even when it is not a full match. For example, "seb" should match "seb", but also "sebas". I am using str_detect(), but it is too slow, and I am wondering whether there is any way to speed this up. I tried some other packages and functions, but nothing really improved the speed. I am open to any R or Python solution.
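To make the matching rule concrete, here is a minimal sketch in base R. The helper name `is_partial_match()` is made up for illustration, and it assumes the names should be compared as literal strings rather than as regular expressions.

```r
# Hypothetical helper (not from any package): TRUE if either string
# contains the other as a plain substring. fixed = TRUE turns off regex
# interpretation, so the pattern is matched literally.
is_partial_match <- function(x, y) {
  grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE)
}

is_partial_match("seb", "seb")    # TRUE: full match
is_partial_match("seb", "sebas")  # TRUE: "seb" is part of "sebas"
is_partial_match("sebas", "seb")  # TRUE: same pair, other direction
is_partial_match("seb", "bas")    # FALSE: neither contains the other
```

The helper compares a single pair of names, so it only pins down the semantics; it is not meant as a fast way to compare two whole columns.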

Create two dummy datasets

library(dplyr)
library(stringr)

set.seed(1)

# Dataset A: unique random lowercase names plus a running ID
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = TRUE), collapse = "")))) %>%
  mutate(ID_A = 1:n())

# Dataset B: same construction with a different seed
set.seed(2)

data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = TRUE), collapse = "")))) %>%
  mutate(ID_B = 1:n())

Test matching of full matches only

# This is almost instant
data_set_A %>%
  rowwise() %>%
  filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
  ungroup()

# A tibble: 4 x 2
  name   ID_A
  <chr> <int>
1 vnt     112
2 fly     391
3 cug    1125
4 xgv    1280

Include partial matches (This is what I want to optimize)

This, of course, only returns a subset of dataset A, but that is OK.

# This takes way too long
data_set_A %>%
  rowwise() %>%
  filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
  ungroup()

# A tibble: 237 x 2
   name       ID_A
   <chr>     <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100

fuzzyjoin method

This also works, but it is equally slow.

bind_rows(
  fuzzyjoin::fuzzy_inner_join(
    data_set_A,
    data_set_B,
    by = c("name" = "name_2"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A),
  fuzzyjoin::fuzzy_inner_join(
    data_set_B,
    data_set_A,
    by = c("name_2" = "name"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A)
) %>%
  distinct()

data.table solution

Not much faster, unfortunately.

library(data.table)

setDT(data_set_A)
setDT(data_set_B)

data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) | 
                    any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
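One direction not tried above is literal (non-regex) substring matching without `rowwise()`. The sketch below uses only base R. The vectors `names_a` and `names_b` are toy stand-ins for `data_set_A$name` and `data_set_B$name_2`, and the sketch assumes the names never need regex interpretation. Whether this is actually faster on the real data has not been measured here.

```r
# Toy stand-ins for data_set_A$name and data_set_B$name_2 (assumed values).
names_a <- c("seb", "sebastian", "xyz", "tom")
names_b <- c("sebas", "bas", "tomas")

# Direction 1: is each A name contained in at least one B name?
a_in_b <- vapply(
  names_a,
  function(a) any(grepl(a, names_b, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

# Direction 2: does each A name contain at least one B name?
# Each column holds the matches of one B name against all A names.
# (Assumes names_a has more than one element, so vapply returns a matrix.)
b_in_a <- vapply(
  names_b,
  function(b) grepl(b, names_a, fixed = TRUE),
  logical(length(names_a))
)

# Keep an A name if it matches in either direction.
keep <- a_in_b | rowSums(b_in_a) > 0
names_a[keep]
# [1] "seb"       "sebastian" "tom"
```

With the real data, `keep` could be used to subset the rows of `data_set_A` directly. If staying with stringr is preferred, wrapping the pattern in `fixed()` inside `str_detect()` requests the same literal matching.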
