Monday, 25 March 2019

How to improve Pairwise Euclidean Distance for Similarity Measure

I am trying to identify the most similar stations between two DataFrames, each structured like the one below:

stations    feature_1    feature_2    feature_3    ...    feature_10
--------------------------------------------------------------------
08GD008     10           1.14         98
08GE002     5            88.67        80
08MC040     8            4.61         17
08FB006     2            13.70        53
08FC003     1            37           49
08LF002     20           2.5          30

I computed the pairwise Euclidean distance and, for each station, selected the candidate with the minimum distance as the most similar one. I standardized my features with (x - mean) / std, so they all have equal weight. I would like to know whether there is a way to improve this method.
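
A minimal sketch of the approach described above, for reference. The split of the six example stations into two frames (`df_a`, `df_b`), the use of only the three features shown, and the choice to standardize both frames with one pooled mean and standard deviation are all assumptions made for illustration; the post does not say how the two DataFrames were standardized.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the three visible features of the example stations,
# split arbitrarily into two frames (this split is an assumption).
df_a = pd.DataFrame(
    {"feature_1": [10, 5, 8], "feature_2": [1.14, 88.67, 4.61], "feature_3": [98, 80, 17]},
    index=["08GD008", "08GE002", "08MC040"],
)
df_b = pd.DataFrame(
    {"feature_1": [2, 1, 20], "feature_2": [13.70, 37.0, 2.5], "feature_3": [53, 49, 30]},
    index=["08FB006", "08FC003", "08LF002"],
)

# Standardize with (x - mean) / std, using statistics pooled over both frames
# so that the two frames end up on the same scale (an assumption).
pooled = pd.concat([df_a, df_b])
mu, sigma = pooled.mean(), pooled.std(ddof=0)
za = ((df_a - mu) / sigma).to_numpy()
zb = ((df_b - mu) / sigma).to_numpy()

# Pairwise Euclidean distances via broadcasting:
# rows are stations in df_a, columns are stations in df_b.
diff = za[:, None, :] - zb[None, :, :]
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)), index=df_a.index, columns=df_b.index
)

# For each station in df_a, the station in df_b at minimum distance.
nearest = dist.idxmin(axis=1)
print(dist.round(3))
print(nearest)
```

Standardizing each frame separately would give every frame its own mean and scale, so distances across frames would no longer compare like with like; that is the reason for pooling here.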

I considered applying PCA (principal component analysis) and building the distance matrix from PC1 and PC2 only, but the first two components explain only 60% of the variability. Is there a way to assign weights to my features (for example, multiplying each feature's standardized values by that feature's coefficient of variation), or some approach similar to PCA that would suit this case?
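
The two ideas raised above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is random placeholder data, the `weights` vector is arbitrary (it is not a recommended weighting), and the 90% variance target is a hypothetical threshold. PCA is done with a plain SVD in NumPy rather than a dedicated library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder standardized feature matrices (stations x 10 features).
za = rng.standard_normal((6, 10))
zb = rng.standard_normal((5, 10))


def pairwise_euclidean(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Option 1: weighted Euclidean distance.
# Scaling column j by sqrt(w_j) yields sqrt(sum_j w_j * (x_j - y_j)^2).
weights = np.linspace(1.0, 2.0, 10)  # arbitrary placeholder weights
weights = weights / weights.sum()
d_weighted = pairwise_euclidean(za * np.sqrt(weights), zb * np.sqrt(weights))

# Option 2: PCA on the pooled data, keeping as many components as needed
# to reach a target share of variance instead of fixing the count at two.
pooled = np.vstack([za, zb])
center = pooled.mean(axis=0)
_, s, vt = np.linalg.svd(pooled - center, full_matrices=False)
explained = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(explained, 0.90) + 1)  # hypothetical 90% target
d_pca = pairwise_euclidean((za - center) @ vt[:k].T, (zb - center) @ vt[:k].T)

print(k, d_weighted.shape, d_pca.shape)
```

One point worth keeping in mind: projecting onto all principal components is only a rotation, so the Euclidean distances are identical to those in the standardized feature space. PCA changes the result only through the components that are dropped.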



