I am trying to reassign values in a pandas df.
For the df below, I want to reassign values in [Person]. This will be achieved by using various unique values in two separate columns, [Area] and [Place]. The overall aim to use the least amount of individuals as possible.
This is undertaken by trying to assign up to 3 unique values to any one individual.
I have another column labelled [On] that displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,
1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc
Question: Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc.
I want to keep the identical values in [Area] together. So assign all the same values in [Area] to an individual (up to 3). Then, the leftover values in [Area] should be combined to make a group of 3, where possible.
So this is the tricky point. There will be times when the same values in [Area] should be assigned to an individual and other times when it will be a leftover and combined with different values to an individual.
The way I envisage this working is to see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required).
Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.
I understand this is hard to conceptually understand. But I envisage the steps as:
1) Use the [On] Column to determine how many individuals are required when looking into the future for an hour
2) Where there are more than 3 unique values occurring assign by same values in [Area].
3) If there are any leftover values then look to combine anyway possible.
For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] by looking into the future by an hour and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.
import pandas as pd
import numpy as np
d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})
df = pd.DataFrame(data=d)
This is my attempt:
def reduce_df(df):
values = df['Area'] + df['Place']
df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
person_count = df1.groupby('Person')['Person'].agg('count')
leftover_count = person_count[person_count < 3] # the 'leftovers'
# try merging pairs together
nleft = leftover_count.shape[0]
to_try = np.arange(nleft - 1)
to_merge = (leftover_count.values[to_try] +
leftover_count.values[to_try + 1]) <= 3
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
to_merge = to_try[to_merge]
merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
leftover_count.index.values[to_merge]))
def change_person(p):
if p in merge_dict.keys():
return merge_dict[p]
return p
reduced_df = df.copy()
# update df with the merges you found
reduced_df['Person'] = reduced_df['Person'].apply(change_person)
return reduced_df
df1 = (reduce_df(reduce_df(df)))
This is the Output:
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7
This is my Intended Output:
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3
Description on how I want to get this output:
Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there are other `E's` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
from Re-assign column values in a pandas df
No comments:
Post a Comment