Hemant Vishwakarma: How to use pandas .replace() with list of regexs while honoring list order?

Wednesday, 10 July 2019

How to use pandas .replace() with list of regexs while honoring list order?

I have two dataframes: one (A) with whitelisted hostnames in regex form (e.g. (.*)microsoft.com, (.*)go.microsoft.com...) and another (B) with the actual full hostnames of sites. I want to add a new column to the second dataframe containing the matching regex text from the whitelist (first) dataframe. However, it appears that pandas' .replace() method doesn't care what order the items are in for its to_replace and value args.

My data looks like this:

In [1]: A
Out[1]: 
                                  wildcards  \
42   (.*)activation.playready.microsoft.com   
35    (.*)v10.vortex-win.data.microsoft.com   
40      (.*)settings-win.data.microsoft.com   
43            (.*)smartscreen.microsoft.com   
39             (.*).playready.microsoft.com   
38                     (.*)go.microsoft.com   
240                     (.*)i.microsoft.com   
238                       (.*)microsoft.com   
                                                 regex  
42   re.compile('^(.*)activation.playready.microsof...  
35   re.compile('^(.*)v10.vortex-win.data.microsoft...  
40   re.compile('^(.*)settings-win.data.microsoft.c...  
43       re.compile('^(.*)smartscreen.microsoft.com$')  
39        re.compile('^(.*).playready.microsoft.com$')  
38                re.compile('^(.*)go.microsoft.com$')  
240                re.compile('^(.*)i.microsoft.com$')  
238                  re.compile('^(.*)microsoft.com$')  


In [2]: B.head()
Out[2]: 
                       server_hostname
146     mobile.pipe.aria.microsoft.com
205    settings-win.data.microsoft.com
341      nav.smartscreen.microsoft.com
406  v10.vortex-win.data.microsoft.com
667                  www.microsoft.com

Notice that A has a regex column of compiled regexes that mirror the wildcards column, anchored with ^ and $. I want to add a wildcards column to B like this:

B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())

The problem is that all of B's wildcards values become (.*)microsoft.com, no matter what order A's wildcards values are in. It appears that .replace() applies the to_replace regexes shortest first rather than in the order provided.
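One possible explanation that does not involve length ordering, offered as an assumption rather than confirmed pandas behavior: if the replacements are applied one after another, a value already replaced by a specific wildcard can be matched again by the broadest pattern, because the wildcard strings themselves end in microsoft.com. The sketch below uses only Python's re module to check that the replacement text re-matches; it does not call pandas at all.

```python
import re

# Broadest pattern and one specific wildcard, both taken from dataframe A.
broad = re.compile('^(.*)microsoft.com$')
specific_wildcard = '(.*)settings-win.data.microsoft.com'

# The replacement text itself matches the broad pattern...
rematches = broad.match(specific_wildcard) is not None

# ...so a later substitution pass would overwrite the specific wildcard.
overwritten = broad.sub('(.*)microsoft.com', specific_wildcard)

print(rematches)
print(overwritten)
```

If this is what happens, reordering A would not help, since the broadest pattern matches every other wildcard string regardless of where it sits in the list.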

How can I provide a list of to_replace values so that each server_hostname value in B ends up with the most detailed (most specific) matching wildcards value?
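A minimal sketch of one way to honor list order, assuming the patterns are listed most specific first as in A: skip .replace() and return the wildcard of the first pattern that matches each hostname. The helper name first_matching_wildcard and the hard-coded lists are hypothetical stand-ins for the question's dataframes, and the patterns keep the unescaped dots shown in the question.

```python
import re

# Wildcards in the same order as dataframe A (most specific first).
wildcards = [
    '(.*)activation.playready.microsoft.com',
    '(.*)v10.vortex-win.data.microsoft.com',
    '(.*)settings-win.data.microsoft.com',
    '(.*)smartscreen.microsoft.com',
    '(.*).playready.microsoft.com',
    '(.*)go.microsoft.com',
    '(.*)i.microsoft.com',
    '(.*)microsoft.com',
]

# Assumption: the regex column is the wildcard wrapped in ^ and $.
patterns = [(re.compile('^' + w + '$'), w) for w in wildcards]


def first_matching_wildcard(hostname, default=None):
    """Return the wildcard of the first pattern that matches, in list order."""
    for regex, wildcard in patterns:
        if regex.match(hostname):
            return wildcard
    return default


# Hostnames copied from B.head() in the question.
hostnames = [
    'mobile.pipe.aria.microsoft.com',
    'settings-win.data.microsoft.com',
    'nav.smartscreen.microsoft.com',
    'v10.vortex-win.data.microsoft.com',
    'www.microsoft.com',
]

result = {h: first_matching_wildcard(h) for h in hostnames}
for hostname, wildcard in result.items():
    print(hostname, '->', wildcard)
```

With the real dataframes, the same function could be applied with B['wildcards'] = B['server_hostname'].map(first_matching_wildcard), building patterns from zip(A['regex'], A['wildcards']) instead of the hard-coded list. Because each hostname is matched once against the original string, an earlier result is never re-matched by a broader pattern.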

from How to use pandas .replace() with list of regexs while honoring list order?
