Saturday, 10 November 2018

Re-assign column values in a pandas df

I am trying to reassign values in a pandas df.

For the df below, I want to reassign values in [Person]. This will be achieved by using various unique values in two separate columns, [Area] and [Place]. The overall aim to use the least amount of individuals as possible.

This is undertaken by trying to assign up to 3 unique values to any one individual.

I have another column labelled [On] that displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,

1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc

Question: Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc.

I want to keep the identical values in [Area] together. So assign all the same values in [Area] to an individual (up to 3). Then, the leftover values in [Area] should be combined to make a group of 3, where possible.

So this is the tricky point. There will be times when the same values in [Area] should be assigned to an individual and other times when it will be a leftover and combined with different values to an individual.

The way I envisage this working is to see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required).

Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.

I understand this is hard to conceptually understand. But I envisage the steps as:

1) Use the [On] Column to determine how many individuals are required when looking into the future for an hour

2) Where there are more than 3 unique values occurring assign by same values in [Area].

3) If there are any leftover values then look to combine anyway possible.

For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] by looking into the future by an hour and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.

import pandas as pd
import numpy as np

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],                 
    'Area' : ['A','B','C','D','E','D','E','F','G'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],   
     })

df = pd.DataFrame(data=d)

This is my attempt:

def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'

    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] + 
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1], 
                leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

df1 = (reduce_df(reduce_df(df)))

This is the Output:

       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 4
4   8:35:00  House 5    E  5  Person 5
5   8:40:00  House 1    D  6  Person 4
6   8:42:00  House 2    E  7  Person 5
7   8:45:00  House 3    F  8  Person 5
8   8:50:00  House 2    G  9  Person 7

This is my Intended Output:

       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 5    E  5  Person 3
5   8:40:00  House 6    D  6  Person 2
6   8:42:00  House 2    E  7  Person 3
7   8:45:00  House 3    F  8  Person 2
8   8:50:00  House 2    G  9  Person 3

Description on how I want to get this output:

Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there are other `E's` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3



from Re-assign column values in a pandas df

No comments:

Post a Comment