Wednesday, 7 October 2020

Filtering non-'cohorts' from dataset

I am sure this topic has been researched before, I am not sure what it is called or what techniques I should also look into, hence why I am here. I am running this mainly in Python and Pandas but it is not limited to those languages/technologies.

As an example, let's pretend I have this dataset:

| PID | A    | B    | C    |
| --- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 |
| 400 | 0.97 | 0.61 | 0.30 |
| 251 | 0.01 | 0.97 | 0.29 |
| 414 | 0.25 | 0.04 | 0.83 |
| 706 | 0.37 | 0.32 | 0.33 |
| 65  | 0.78 | 0.62 | 0.25 |
| 533 | 0.24 | 0.15 | 0.88 |

PID is a unique ID to that row. A, B and C are some factors (normalized for this example). This datasets could be players in a sports league over history, it could be products in an inventory, it could be voter data. The specific context isn't important.

Now let's say I have some input data:

| A    | B    | C    |
| ---- | ---- | ---- |
| 0.81 | 0.75 | 0.17 |

This input shares the same factors as the original dataset (A, B, C). What I want to do is to find the rows that are similar to my input data (the "cohorts"). What is the best way to approach this?

I thought of clustering, using a kNN algorithm, but the problem is that the number of cohorts is not set. You could have unique input and have few/no "cohorts", or you could have input that is very common and have hundreds of "cohorts".

The solution I next tried was Euclidean Distance. So for this dataset and input I would do something like:

my_cols = ['A', 'B', 'C']

inputdata = pd.Series([0.81, 0.75, 0.17], index=['A', 'B', 'C'])

# df = pandas data frame with above data

df['Dict'] = (df[my_cols] - inputdata).pow(2).sum(1).pow(0.5)

This would create a new column on the dataset like:

| PID | A    | B    | C    | Dist |
| --- | ---- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 | 0.27 |
| 400 | 0.97 | 0.61 | 0.30 | 0.25 |
| 251 | 0.01 | 0.97 | 0.29 | 0.84 |
| 414 | 0.25 | 0.04 | 0.83 | 1.12 |
| 706 | 0.37 | 0.32 | 0.33 | 0.63 |
| 65  | 0.78 | 0.62 | 0.25 | 0.16 |
| 533 | 0.24 | 0.15 | 0.88 | 1.09 |

You can then "filter" out those rows below some threshold.

cohorts = df[df['Dist'] <= THRESHOLD]

The issue then becomes (1) How do you determine that best threshold? and (2) If I add a 4th factor ("D") into the dataset and Euclid calculation, it seems to "break" the results, in that the cohorts no longer make intuitive sense, looking at the results.

So my question is: what are techniques or better ways to filter/select "cohorts" (those rows similar to an input row) ?

Thank you



from Filtering non-'cohorts' from dataset

No comments:

Post a Comment