Wednesday 21 July 2021

Building a function that can derive a new column of hashes based on other pd.DataFrame features

I have a pandas DataFrame of health insurance information - names, addresses, DOBs, etc.

I wrote a function that works on a single-row DataFrame:

    
import hashlib

import pandas as pd


def make_hash(partner: str, df: pd.DataFrame) -> str:
  """
  For Partner A, df (pd.DataFrame) must contain exactly one row with the columns:
    'ID': str (health plan ID)
    'Date of Birth': pd.Timestamp
    'Member Name': str, formatted as "Last, First"
  Other partners will have different feature names for hash input and require a new elif block, below.
  """
  if partner == 'Partner A':
    # .item() only succeeds when the column holds exactly one value.
    health_plan_id = str(df.loc[:,'ID'].item()).strip().encode()
    # Keep only the calendar date (YYYY-MM-DD) of the Timestamp.
    date_of_birth = str(df.loc[:,'Date of Birth'].item().date()).encode()
    # "Last, First" -> "First"
    first_name = str(df.loc[:,'Member Name'].item()).split(",")[1].strip().encode()

    hash_input = health_plan_id + date_of_birth + first_name
    h = hashlib.sha256(hash_input).hexdigest()
    print(f"Input: {hash_input}. Result: {h}.\n")
    return h
  else:
    print("No hashing strategy defined for that partner.")

Calling it on a one-row DataFrame gives the following output (values changed to protect PII):

make_hash(partner="Partner A", df=df)

Input: b'B88845204081984-06-11MickeyMouse'. Result: 4d578e1acd7c670193448b84362095383cc13a24249f6c8c92816d79ec3c48d8.
Out[60]: '4d578e1acd7c670193448b84362095383cc13a24249f6c8c92816d79ec3c48d8'

Ideally, though, the function would also add a new column (ID) to the DataFrame and store that '4d578e1acd...' value in it. When I call the function on a DataFrame with more than one row, it raises:

ValueError: can only convert an array of size 1 to a Python scalar
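The error comes from the `.item()` calls: `Series.item()` returns a scalar only when the Series holds exactly one element. A minimal sketch that reproduces this outside the function, using made-up IDs that are not from the real data:

```python
import pandas as pd

# Hypothetical sample frames, for illustration only.
one_row = pd.DataFrame({"ID": ["B8884520408"]})
two_rows = pd.DataFrame({"ID": ["B8884520408", "C1112223334"]})

# With a single row, .item() unwraps the only value in the column.
single_value = one_row.loc[:, "ID"].item()

# With two rows there is no single value to return, so pandas raises ValueError.
try:
    two_rows.loc[:, "ID"].item()
    item_error = None
except ValueError as exc:
    item_error = str(exc)

print(single_value)
print(item_error)
```

So the function is tied to one-row input by construction, not by the hashing step itself.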

I'd like to call the function from a lambda so that it works on a pd.DataFrame with an arbitrary number of rows. The expected output is another pd.DataFrame with the same number of rows and one extra feature (the new ID column).

Is that possible? I have seen several similar questions, but I am not sure I can (or want to) operate on an entire pd.Series, since the function above will have data-cleansing steps that depend on the value of partner.
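One possible approach, sketched here under some assumptions, is to hash one row at a time and let `DataFrame.apply(..., axis=1)` hand each row to the function as a Series. The names `make_row_hash`, `add_hash_column`, and the output column `hash_id` are hypothetical (the frame already has an 'ID' column, so a different name avoids overwriting it), and the sample rows are made up. The partner-specific cleansing stays inside the per-row function, so nothing has to be vectorised over a whole pd.Series.

```python
import hashlib
from typing import Optional

import pandas as pd


def make_row_hash(row: pd.Series, partner: str) -> Optional[str]:
    """Hash a single row; `row` is a Series keyed by column name."""
    if partner == "Partner A":
        health_plan_id = str(row["ID"]).strip().encode()
        # Keep only the calendar date (YYYY-MM-DD) of the birth timestamp.
        date_of_birth = str(pd.Timestamp(row["Date of Birth"]).date()).encode()
        # "Last, First" -> "First"
        first_name = str(row["Member Name"]).split(",")[1].strip().encode()
        hash_input = health_plan_id + date_of_birth + first_name
        return hashlib.sha256(hash_input).hexdigest()
    # No hashing strategy defined for this partner.
    return None


def add_hash_column(df: pd.DataFrame, partner: str,
                    column: str = "hash_id") -> pd.DataFrame:
    """Return a copy of df with one extra column of per-row hashes."""
    out = df.copy()
    out[column] = out.apply(lambda row: make_row_hash(row, partner), axis=1)
    return out


# Hypothetical sample data, for illustration only.
members = pd.DataFrame({
    "ID": ["B8884520408", "C1112223334"],
    "Date of Birth": pd.to_datetime(["1984-06-11", "1990-01-02"]),
    "Member Name": ["Mouse, Mickey", "Duck, Donald"],
})
hashed = add_hash_column(members, partner="Partner A")
print(hashed.shape)
```

Row-wise `apply` is slower than vectorised string operations on large frames, but it keeps the per-partner branching in one place and leaves the input frame untouched because the column is added to a copy.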
