Hemant Vishwakarma: Converting pseudo algorithm to python -> pandas code

I am trying to convert pseudocode into pandas code and would appreciate any help or guidance.

The general idea is to write a function f that selects rows from a toy dataset with 100 rows and 5 columns ["X", "Y", "Z", "F", "V"], randomly filled with integers in [0, 500]. Besides the data, the function takes a second argument, cols_to_use, listing the columns to use in the selection; by default it uses all of them.

Description. The goal is to select 10 rows from the sample dataset. There are 5 possibilities for the second argument of the function: the selection can be based on 1, 2, 3, 4 or 5 columns.

If all 5 columns are used, we select 2 rows per column: the rows holding the top 2 values of each column. Some rows may be selected by more than one column during this initial selection. Let's call this an overlap1 event. When overlap1 happens, we randomly pick one of the columns involved to keep the overlapping row(s), and for the other column(s) we add the row with the 3rd-highest value instead. These replacement rows can in turn coincide with rows that are already selected; call this an overlap2 event. When overlap2 happens, we move on to the 4th-highest, 5th-highest, etc. row for that column. On average there is a probability of about .25 that at least one overlap occurs during the initial selection, so this is important to account for. The final selection must consist of 10 unique rows.
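One way to read the overlap rules above is sketched below. It is only an illustration under stated assumptions: the helper name resolve_overlaps, its quotas argument (column name mapped to the number of rows that column must contribute) and the seed parameter are all hypothetical, and the sketch assumes each column has enough rows left to find a replacement.

```python
import random

import pandas as pd


def resolve_overlaps(df, quotas, seed=None):
    """Hypothetical helper: return {column: [row labels]} with no shared rows."""
    rng = random.Random(seed)
    # Row labels of each column, ordered from highest to lowest value.
    ranking = {c: df[c].sort_values(ascending=False).index.tolist() for c in quotas}
    # Initial selection: the top-q rows of every column.
    picks = {c: ranking[c][:q] for c, q in quotas.items()}
    # Position in each ranking where the search for replacements starts.
    next_rank = dict(quotas)

    # Record which columns picked each row.
    owners = {}
    for c, rows in picks.items():
        for r in rows:
            owners.setdefault(r, []).append(c)
    taken = set(owners)

    for r, cols in owners.items():
        if len(cols) < 2:
            continue  # no overlap1 event for this row
        keeper = rng.choice(cols)  # random column that keeps the row
        for c in cols:
            if c == keeper:
                continue
            picks[c].remove(r)
            # overlap2: skip candidates that are already selected elsewhere.
            while ranking[c][next_rank[c]] in taken:
                next_rank[c] += 1
            replacement = ranking[c][next_rank[c]]
            picks[c].append(replacement)
            taken.add(replacement)
    return picks


# Two identical columns force an overlap on both of their top-2 rows.
demo = pd.DataFrame({"X": range(10, 0, -1), "Y": range(10, 0, -1)})
picks = resolve_overlaps(demo, {"X": 2, "Y": 2}, seed=0)
selected = demo.loc[picks["X"] + picks["Y"]]
```

Because a replacement is only accepted when no column has selected it yet, a single pass over the overlapping rows is enough to end up with unique rows.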

If there are 4 columns to base the selection on, we select the rows with the top 2 values of each column and resolve any overlap1 event. That gives 8 rows, so we still need 2 more. We randomly draw 2 of the 4 columns and select one additional row for each: the 3rd-highest, or the 4th-highest and so on when overlap2 happens.

If there are 3 columns, we select 3 rows per column by the same rule, resolve overlap1 if it occurs, then randomly pick one column to supply the remaining row, resolving overlap2 as needed.

When 2 columns are used, we select 5 rows per column and resolve overlap1 and overlap2 events.
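The per-column counts in these cases happen to match integer division of 10 by the number of columns, with the remainder handed to randomly drawn columns. A minimal sketch of that bookkeeping follows; the function name rows_per_column and its total and seed parameters are hypothetical. Note that it assigns the extra rows up front rather than after overlap1 has been resolved, which is a simplification of the order described above.

```python
import random


def rows_per_column(cols, total=10, seed=None):
    """Hypothetical helper: how many rows each column should contribute."""
    rng = random.Random(seed)
    cols = list(cols)
    # Every column gets the same base count; `extra` rows are left over.
    base, extra = divmod(total, len(cols))
    quotas = {c: base for c in cols}
    # Randomly drawn columns each supply one of the leftover rows.
    for c in rng.sample(cols, extra):
        quotas[c] += 1
    return quotas


quotas_4 = rows_per_column(["X", "Y", "Z", "F"], seed=1)  # two columns get 3, two get 2
quotas_3 = rows_per_column(["X", "Y", "Z"], seed=1)       # one column gets 4, two get 3
```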

When only 1 column is used, we select the 10 rows with the highest values in that column.
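The single-column case needs no overlap handling, so it can probably be written directly with DataFrame.nlargest. The snippet below is a small sketch on a seeded copy of the toy dataset; the seed and the choice of column "X" are arbitrary.

```python
import numpy as np
import pandas as pd

# Seeded toy data so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 501, size=(100, 5)), columns=list("XYZFV"))

# The 10 rows with the highest values in column "X"; ties keep the first rows found.
top10 = sample.nlargest(10, "X")
```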

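# sample dataset to work with
import random

import numpy as np
import pandas as pd

# randint's upper bound is exclusive, so 501 gives values in [0, 500]
sample = pd.DataFrame(np.random.randint(0, 501, size=(100, 5)))
sample.columns = "X Y Z F V".split()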

# the function I have written so far
def f(df, cols_to_use=("X", "Y", "Z", "F", "V")):
    # a tuple default avoids sharing one mutable list between calls
    cols_to_use = list(cols_to_use)

    # base number of rows per column, keyed by the number of columns used
    how_many_per_feature = {5: 2, 4: 2, 3: 3, 2: 5, 1: 10}
    n_per_group = how_many_per_feature[len(cols_to_use)]

    # columns to randomly choose when adding extra options
    # could not find a proper way to implement this
    randomly_selected_columns = []
    if len(cols_to_use) == 4:
        randomly_selected_columns = random.sample(cols_to_use, 2)
    elif len(cols_to_use) == 3:
        randomly_selected_columns = random.sample(cols_to_use, 1)

    # first I filter the dataframe on columns I need
    filtered_df = df[cols_to_use]

    # using pandas melt to select top n_per_group values per column;
    # ignore_index=False keeps the original row labels in the index
    result = (
        filtered_df.melt(var_name="feature", ignore_index=False)
        .groupby("feature")["value"]
        .nlargest(n_per_group)
    )

    # here supposed to handle overlap1 events

    # here overlap2 events

    # the second index level holds the original row labels
    index = result.index.get_level_values(1)

    # label-based lookup; overlapping rows still appear more than once here
    return df.loc[index, :]

I could not implement the dynamic selection that the handling of overlap events requires.
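For comparison, here is a simpler end-to-end variant. It is not the procedure described above but a swapped-in greedy rule: the columns are processed in random order, and each column takes its highest-ranked rows that no earlier column has taken. Overlaps are then avoided by construction, at the cost of choosing the "keeper" per column rather than per overlapping row. The name select_rows and the total and seed parameters are hypothetical.

```python
import random

import numpy as np
import pandas as pd


def select_rows(df, cols_to_use=("X", "Y", "Z", "F", "V"), total=10, seed=None):
    """Greedy sketch: returns `total` unique rows, assuming df has enough rows."""
    rng = random.Random(seed)
    cols = list(cols_to_use)

    # Rows per column: equal share, leftovers go to randomly drawn columns.
    base, extra = divmod(total, len(cols))
    quotas = {c: base for c in cols}
    for c in rng.sample(cols, extra):
        quotas[c] += 1

    # Random processing order decides which column wins a contested row.
    rng.shuffle(cols)
    taken = []
    for c in cols:
        ranked = df[c].sort_values(ascending=False).index
        # Highest-ranked rows of this column that are still free.
        fresh = [r for r in ranked if r not in taken][: quotas[c]]
        taken.extend(fresh)
    return df.loc[taken]


sample = pd.DataFrame(
    np.random.default_rng(0).integers(0, 501, size=(100, 5)),
    columns=list("XYZFV"),
)
picked = select_rows(sample, ["X", "Y", "Z", "F"], seed=42)
```

Whether this is acceptable depends on how strictly the random choice per overlapping row has to be followed.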


Sunday 24 October 2021
