I have a pandas DataFrame of health insurance information - names, addresses, DOBs, etc.
I wrote a function that works on a single row:
def make_hash(partner: str, df: pd.DataFrame) -> str:
    """
    For Partner A, df (pd.DataFrame) must contain:
        health_plan_id: str
        date_of_birth: dt.Timestamp
        first_name: str
    Other partners will have different feature names for hash input and require a new elif block, below.
    """
    if partner == 'Partner A':
        health_plan_id = str(df.loc[:,'ID'].item()).strip().encode()
        date_of_birth = str(dt.date(df.loc[:,'Date of Birth'].item())).encode()
        first_name = str(df.loc[:,'Member Name'].item()).split(",")[1].strip().encode()
        hash_input = health_plan_id + date_of_birth + first_name
        h = hashlib.sha256(string=hash_input).hexdigest()
        print(f"Input: {hash_input}. Result: {h}.\n")
        return h
    else:
        print("No hashing strategy defined for that partner.")
Which outputs (values changed for PII):
make_hash(partner="Partner A", df=df)
Input: b'B88845204081984-06-11MickeyMouse'. Result: 4d578e1acd7c670193448b84362095383cc13a24249f6c8c92816d79ec3c48d8.
Out[60]: '4d578e1acd7c670193448b84362095383cc13a24249f6c8c92816d79ec3c48d8'
But ideally it would also derive a new column (ID) and store that '4d578e1acd...' value in it. If I try to use this function on a DataFrame with more than one row, it gives me the error:
ValueError: can only convert an array of size 1 to a Python scalar
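The error appears to come from .item(), which only returns a value when the Series holds exactly one element. A minimal reproduction, using a made-up two-row frame (the values are hypothetical, not real data):

```python
import pandas as pd

# Made-up IDs purely for illustration.
frame = pd.DataFrame({"ID": ["A001", "B002"]})

# A one-row selection has exactly one element, so .item() returns a scalar.
one_row = frame.iloc[[0]]
single_value = one_row.loc[:, "ID"].item()

# The full column has two elements, so there is no single scalar to return.
try:
    frame.loc[:, "ID"].item()
    raised = False
except ValueError:
    raised = True
```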
I'd like the function to be used in a lambda that can operate on a pd.DataFrame with an arbitrary number of rows, and expect the output to be another pd.DataFrame with the same number of rows, but features + 1 (for the new ID column).
Is that possible? I see several similar questions, but I am not sure I can (or want to?) do this on an entire pd.Series, given that the function above will have some data-cleansing steps dependent on the value of partner...
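One possible shape for this, as a rough sketch rather than a definitive answer: rewrite the hashing step to take a single row (a pd.Series) and call it through DataFrame.apply with axis=1, so the partner-specific cleansing stays in one place. The names make_row_hash, add_hash_column and the output column hash_id are hypothetical (hash_id is used instead of ID only because the Partner A input already has an ID column), and the sample values are made up. It also assumes 'Date of Birth' is already a datetime column and 'Member Name' is formatted as "Last, First".

```python
import hashlib

import pandas as pd


def make_row_hash(row: pd.Series, partner: str) -> str:
    """Hash one row. `row` is a pd.Series, so indexing returns plain scalars."""
    if partner == "Partner A":
        health_plan_id = str(row["ID"]).strip()
        # Assumption: 'Date of Birth' holds pandas Timestamps.
        date_of_birth = row["Date of Birth"].date().isoformat()
        # Assumption: 'Member Name' looks like "Last, First".
        first_name = str(row["Member Name"]).split(",")[1].strip()
    else:
        # Other partners would get their own branch here.
        raise ValueError(f"No hashing strategy defined for partner {partner!r}.")
    hash_input = (health_plan_id + date_of_birth + first_name).encode()
    return hashlib.sha256(hash_input).hexdigest()


def add_hash_column(df: pd.DataFrame, partner: str, column: str = "hash_id") -> pd.DataFrame:
    """Return a copy of df with one extra column holding the per-row hash."""
    hashes = df.apply(lambda row: make_row_hash(row, partner), axis=1)
    return df.assign(**{column: hashes})


# Made-up sample data, not real member records.
sample = pd.DataFrame(
    {
        "ID": ["A001 ", "B002"],
        "Date of Birth": pd.to_datetime(["1984-06-11", "1990-01-02"]),
        "Member Name": ["Mouse, Mickey", "Duck, Donald"],
    }
)
result = add_hash_column(sample, partner="Partner A")
```

Row-wise apply is not the fastest option on large frames, but it keeps the per-partner branching readable. Raising on an unknown partner, instead of printing and returning None, is a design choice in this sketch so that a missing strategy cannot silently fill the column with nulls.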
from Building a function that can derive a new column of hashes based on other pd.DataFrame features