I have 2 dataframes: one (A) with some whitelist hostnames in regex form (ie (.*)microsoft.com, (*.)go.microsoft.com...) and another (B) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe with the regex text of the Whitelist (1st) dataframe. However, it appears that Pandas' .replace() method doesn't care about what order items are in for its to_replace and value args.
My data looks like this:
In [1] A
Out[1]:
wildcards \
42 (.*)activation.playready.microsoft.com
35 (.*)v10.vortex-win.data.microsoft.com
40 (.*)settings-win.data.microsoft.com
43 (.*)smartscreen.microsoft.com
39 (.*).playready.microsoft.com
38 (.*)go.microsoft.com
240 (.*)i.microsoft.com
238 (.*)microsoft.com
regex
42 re.compile('^(.*)activation.playready.microsof...
35 re.compile('^(.*)v10.vortex-win.data.microsoft...
40 re.compile('^(.*)settings-win.data.microsoft.c...
43 re.compile('^(.*)smartscreen.microsoft.com$')
39 re.compile('^(.*).playready.microsoft.com$')
38 re.compile('^(.*)go.microsoft.com$')
240 re.compile('^(.*)i.microsoft.com$')
238 re.compile('^(.*)microsoft.com$')
In [2] B.head()
Out[2]:
server_hostname
146 mobile.pipe.aria.microsoft.com
205 settings-win.data.microsoft.com
341 nav.smartscreen.microsoft.com
406 v10.vortex-win.data.microsoft.com
667 www.microsoft.com
Notice that A has a column of compiled regexes in similar form to the wildcards column. I want to add a wildcard column to B like this:
B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())
But the problem is, all of B's wildcard values become (.*)microsoft.com. This happens no matter the order of A's wildcard values. It appears .replace() aims to use the to_replace regex's by shortest value first rather than the order provided.
How can I provide a list of to_replace values so that I ultimately get the most details hostname wildcards value associated with B's server_hostname values?
from How to use pandas .replace() with list of regexs while honoring list order?
No comments:
Post a Comment