Monday, 18 November 2019

Hausdorff distance for a large dataset in the fastest way

My dataset has more than 500,000 rows. I need the Hausdorff distance between every id and every other id, computed across the whole dataset.

The dataset is huge. Here is a small part of it:

df =

  id_easy  ordinal  latitude  longitude           epoch  day_of_week
0     aaa      1.0   22.0701     2.6685  01-01-11 07:45       Friday
1     aaa      2.0   22.0716     2.6695  01-01-11 07:45       Friday
2     aaa      3.0   22.0722     2.6696  01-01-11 07:46       Friday
3     bbb      1.0   22.1166     2.6898  01-01-11 07:58       Friday
4     bbb      2.0   22.1162     2.6951  01-01-11 07:59       Friday
5     ccc      1.0   22.1166     2.6898  01-01-11 07:58       Friday
6     ccc      2.0   22.1162     2.6951  01-01-11 07:59       Friday

I want to calculate the Hausdorff distance:

import pandas as pd
import numpy as np

from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform

u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # (longitude, latitude) points of `id_easy` `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # (longitude, latitude) points of `id_easy` `bbb`
print(directed_hausdorff(u, v)[0]) # index 0 is the distance; indices 1 and 2 are the contributing point indices

Output is 0.05114626086039758
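
Note that `directed_hausdorff(u, v)` is one-sided: it measures how far the points of `u` are from `v`, and the result can differ when the arguments are swapped. If a symmetric distance is what is wanted for the matrix, it is usually taken as the larger of the two directed values. A minimal sketch under that assumption (the helper name `hausdorff` is hypothetical, not a SciPy function):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

# Same (longitude, latitude) points as in the example above.
u = np.array([(2.6685, 22.0701), (2.6695, 22.0716), (2.6696, 22.0722)])
v = np.array([(2.6898, 22.1166), (2.6951, 22.1162)])

print(directed_hausdorff(u, v)[0])  # u -> v
print(directed_hausdorff(v, u)[0])  # v -> u, generally a different value
print(hausdorff(u, v))              # same result whichever order is used
```

Both versions treat longitude and latitude as plain Euclidean coordinates, so the result is in degrees, not metres.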


Now I want to calculate this distance for the whole dataset, for every pair of id_easy values. The desired output is a matrix with 0 on the diagonal (because the distance between aaa and aaa is 0):

     aaa      bbb    ccc
aaa    0  0.05114   ...
bbb    ...   0
ccc             0
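
One way such a matrix could be built is sketched below. It is a straightforward baseline, not necessarily the fastest approach: it groups the rows by `id_easy`, keeps one coordinate array per id, and fills only the upper triangle, mirroring each value. It assumes the symmetric distance (the maximum of the two directed values) is wanted; the names `tracks`, `ids`, `dist` and `result` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff

# The sample rows from the question (only the columns needed here).
df = pd.DataFrame({
    "id_easy": ["aaa", "aaa", "aaa", "bbb", "bbb", "ccc", "ccc"],
    "latitude": [22.0701, 22.0716, 22.0722, 22.1166, 22.1162, 22.1166, 22.1162],
    "longitude": [2.6685, 2.6695, 2.6696, 2.6898, 2.6951, 2.6898, 2.6951],
})

# One (n_points, 2) array of (longitude, latitude) per id_easy.
tracks = {
    key: group[["longitude", "latitude"]].to_numpy()
    for key, group in df.groupby("id_easy")
}
ids = list(tracks)

n = len(ids)
dist = np.zeros((n, n))  # the diagonal stays 0
for i in range(n):
    for j in range(i + 1, n):  # upper triangle only
        a, b = tracks[ids[i]], tracks[ids[j]]
        d = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
        dist[i, j] = dist[j, i] = d  # mirror into the lower triangle

result = pd.DataFrame(dist, index=ids, columns=ids)
print(result)
```

The cost grows with the square of the number of distinct ids, not with the number of rows, so whether this is fast enough depends on how many ids the 500000+ rows contain. The pairs are independent of each other, so the double loop could be split across processes if needed.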


from Hausdorff distance for large dataset in a fastest way
