Sunday, 3 July 2022

Compare multiple columns with special characters and merge dataframes

I have one data frame as: (df1)

df1_ID    Col1_df1      Col2_df1    Col3_df1
ABC-001   a.102_103i    k159*       Test1
DEF-002   a.36-89E      k188        Test2
GHI-003   ab.23<<X      e542m       Test3

df2:

df2_ID1         df2_ID2    Count    Count_A  Count_B    To_Check
                ABC-001    10       0        10         FIRSTLINE:a.102_103i:ANYTHING:EXTRA
DEF-002;GHI-003            20       2        18         SECONDLINE:ab.23<<X:ANYTHING:EXTRA
ABC-001;DEF-002            15       3        12         THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA

RESULT (DF3):

df1_ID  Col1_df1    Col2_df1    Col3_df1    df2_ID1 df2_ID2 Count   Count_A Count_B To_Check
ABC-001 a.102_103i  k159*       Test1               ABC-001 10      0       10      FIRSTLINE:a.102_103i:ANYTHING:EXTRA:k159*
DEF-002 a.36-89     k188        Test2       ABC-001;DEF-002 15      3       12      THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA
GHI-003 ab.23<<X    e542m       Test3       DEF-002;GHI-003 20      2       18      SECONDLINE:ab.23<<X:ANYTHING:EXTRA

I want to check if the values of Col1_df1 and Col2_df1 are present in the df2 column of To_Check. If the value exists from either Col1_df1 and Col2_df1 AND df1_ID is present either in the df2_ID1 or in df2_ID2, then merge that row of df2 to df1. If it doesn't match, then it should be BLANK.

An extension of this question:

Vlookup function / merge Pandas but not exact match

But in this question, we were just dealing with strings. In my data, we have special characters also.

This syntax also doesn't seem to work when try to find the value that exists within df2:

df1 = df1.assign(result=df1['Col1_df1'].isin(df2['To_Check']))

Wrote another syntax also but not working either:

output = open("output.csv", "a")
with open("df1.csv", "r") as df1:
    first_line = df2.readline()
    output.write(first_line)
    with open("df2.csv", "r") as df2:
        second_first = df2.readline()
        output.write(second_first)
        for line_df1 in df1:
            df1_names = [x for x in line_df1.split(',')]
            for line_df2 in df2:
                df2_names = [x for x in line_df2.split(',')]
                check1 = any(df1_names[1] in string for string in df2_names[6])
                print(check1)

The check1 is always False although the value exists.

Thanks for your help in advance.

*UPDATE

data_1={'df1_ID':['ABC-001','DEF-002','GHI-003']
      ,'Col1_df1':['a.102_103i','a.36-89E','ab.23<<X']
      ,'Col2_df1':['k159*','k188','e542m']
      ,'Col3_df1':['Test1','Test2','Test3']}

data_2={'df2_ID1':['','DEF-002;GHI-003','ABC-001;DEF-002']
      ,'df2_ID2':['ABC-001','','']
      ,'Count':['10','20','15']
      ,'Count_A':['0','2','3']
        ,'To_Check':['FIRSTLINE:a.102_103i:ANYTHING:EXTRA','SECONDLINE:ab.23<<X:ANYTHING:EXTRA','THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA']}


from Compare multiple columns with special characters and merge dataframes

No comments:

Post a Comment